Logic Regression

Logic regression is a (generalized) regression methodology that is primarily applied when most of the covariates in the data to be analyzed are binary. The goal of logic regression is to find predictors that are Boolean (logical) combinations of the original predictors.

On this page you can find information about downloading the software for logic regression, along with the basic information you need to run it. Logic regression is available as an R package. R is a freely available statistical software environment. The core of the logic regression code has been translated into Fortran, which is called from the R command line.

The R version of the software is now available from CRAN [LogicReg]. For the S version of the software, follow the link below.

The Logic Regression methodology was developed by Ingo Ruczinski, Charles Kooperberg, and Michael LeBlanc at the Fred Hutchinson Cancer Research Center in Seattle. The software was implemented by Ingo Ruczinski and Charles Kooperberg. Logic Regression is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form.

The current version of the code is 1.3.1 dated December 15, 2004 [S-Plus] and 1.4.12 dated February 14, 2012 [R].


Introduction to Logic Regression

Logic Regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates.

In most regression problems a model is developed that relates only the main effects (the predictors or transformations thereof) to the response. Although interactions between predictors are sometimes considered as well, those interactions are usually kept very simple (two- to three-way interactions at most). But often, especially when all predictors are binary, the interaction between many predictors is what causes the differences in response. This issue often arises in the analysis of SNP microarray data or in data mining problems.

Given a set of binary predictors X, we try to create new, better predictors for the response by considering combinations of those binary predictors. For example, if the response is binary as well (which is not required in general), we attempt to find decision rules such as "if X1, X2, X3 and X4 are true", or "X5 or X6 but not X7 are true", then the response is more likely to be in class 0. In other words, we try to find Boolean statements involving the binary predictors that enhance the prediction for the response.

In more specific terms: let X1,…,Xk be binary predictors, and let Y be a response variable. We try to fit regression models of the form g(E[Y]) = b0 + b1 L1 + … + bn Ln, where Lj is a Boolean expression of the predictors X, such as Lj = [(X2 or X4^c) and X7], with X4^c denoting the complement of X4 ("not X4"). This framework includes many forms of regression, such as linear regression (g(E[Y]) = E[Y]) and logistic regression (g(E[Y]) = log(E[Y]/(1-E[Y]))).

For every model type, we define a score function that reflects the "quality" of the model under consideration. For example, for linear regression the score could be the residual sum of squares and for logistic regression the score could be the deviance. We try to find the Boolean expressions in the regression model that minimize the scoring function associated with this model type, estimating the parameters bj simultaneously with the Boolean expressions Lj. In general, any type of model can be considered, as long as a scoring function can be defined. For example, we also implemented the Cox proportional hazards model, using the partial likelihood as the score.
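To make the model form concrete, here is a minimal sketch in base R. It does not use the LogicReg package: the toy data, the variable names, and the choice of the expression (X2 or not X4) and X7 are assumptions made purely for illustration. It builds one Boolean predictor by hand, fits the identity-link model, and computes the residual sum of squares that would serve as the score for that candidate expression.

```r
# Toy data: 200 cases and 7 binary predictors (all names are illustrative).
set.seed(1)
n <- 200
X <- matrix(rbinom(n * 7, size = 1, prob = 0.5), nrow = n,
            dimnames = list(NULL, paste0("X", 1:7)))

# One Boolean expression of the predictors: L1 = (X2 or not X4) and X7.
L1 <- as.integer((X[, "X2"] == 1 | X[, "X4"] == 0) & X[, "X7"] == 1)

# Simulate a continuous response that depends on L1 only.
y <- 1 + 2 * L1 + rnorm(n)

# Linear regression with identity link: E[Y] = b0 + b1 * L1.
fit <- lm(y ~ L1)

# Residual sum of squares: the score of this candidate expression.
rss <- sum(residuals(fit)^2)

# For comparison, the score of a model using the main effect X7 alone.
rss_main <- sum(residuals(lm(y ~ X[, "X7"]))^2)
```

Comparing rss with rss_main shows how a score lets competing candidate predictors be ranked; the search algorithms described next automate that comparison over a very large space of expressions.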

Since the number of possible Logic Models we can construct for a given set of predictors is huge, we have to rely on search algorithms to help us find the best scoring models. We define the move set by a set of standard operations such as splitting and pruning the tree (similar to the terminology introduced by Breiman et al.). We implemented two types of algorithms: a greedy and a simulated annealing algorithm. While the greedy algorithm is very fast, it does not always find a good scoring model. The simulated annealing algorithm usually does, but it is computationally more expensive. Since we have to be certain to find good scoring models, we usually carry out simulated annealing for our case studies.

However, as usual, the best scoring model generally over-fits the data, and methods to separate signal and noise are needed. We implemented methods for candidate models found by either the greedy or simulated annealing algorithms. In the latter case, a definition of model size was needed, and a technique was implemented to find the best scoring model of a particular size. For the model selection itself we developed and implemented randomization tests and tests using cross-validation. If sufficient data is available, an analysis using a training and a test set can also be carried out. These tests are rather complicated, so I will not go into detail here and refer you to the Logic Regression manuscript instead.
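The simulated annealing idea can be sketched in a few lines of base R. This is a deliberately simplified illustration and not the package's algorithm: the state is a single two-leaf expression (Xi and Xj, or Xi or Xj) rather than a set of trees, the move set only changes one variable or flips the operator, and the cooling schedule and all function names (evaluate, score, propose, anneal) are assumptions.

```r
set.seed(42)
n <- 300
k <- 10
X <- matrix(rbinom(n * k, 1, 0.5), nrow = n)
y <- 3 + 2 * as.integer(X[, 1] | X[, 2]) + rnorm(n)

# Evaluate a state (two variable indices and an operator) as a 0/1 vector.
evaluate <- function(s) {
  if (s$op == "and") as.integer(X[, s$i] & X[, s$j])
  else as.integer(X[, s$i] | X[, s$j])
}

# Score: residual sum of squares of y regressed on the Boolean predictor.
# With one binary predictor the fitted values are just the group means.
score <- function(s) {
  L <- evaluate(s)
  sum((y - ave(y, L))^2)
}

# Move set: replace one of the two variables, or flip the operator.
propose <- function(s) {
  what <- sample(c("i", "j", "op"), 1)
  if (what == "op") {
    s$op <- if (s$op == "and") "or" else "and"
  } else {
    s[[what]] <- sample.int(k, 1)
  }
  s
}

anneal <- function(iter = 2000, t_start = 10, t_end = 0.01) {
  # Geometric cooling from t_start down to t_end.
  temps <- exp(seq(log(t_start), log(t_end), length.out = iter))
  cur <- list(i = 5, j = 6, op = "and")
  cur_score <- score(cur)
  best <- cur
  best_score <- cur_score
  for (temp in temps) {
    cand <- propose(cur)
    cand_score <- score(cand)
    # Always accept improvements; accept a worse move with probability
    # exp(-(increase in score) / temperature).
    if (cand_score <= cur_score ||
        runif(1) < exp((cur_score - cand_score) / temp)) {
      cur <- cand
      cur_score <- cand_score
    }
    if (cur_score < best_score) {
      best <- cur
      best_score <- cur_score
    }
  }
  list(state = best, score = best_score)
}

result <- anneal()
```

Inspecting result$state shows which expression the search settled on. A greedy search corresponds to never accepting a worse move, which is faster but can get stuck in a poor local optimum, as noted above.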

Methodology Publications

The main publication from JCGS
Ruczinski I, Kooperberg C, LeBlanc ML (2003): Logic Regression, Journal of Computational and Graphical Statistics, 12 (3), pp 475-511.

An even longer description
Selected chapters from the dissertation of Ingo Ruczinski.

An application to single nucleotide polymorphism data
Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L (2001): Sequence Analysis using Logic Regression, Genetic Epidemiology, 21:S626-S631.

Monte Carlo Logic regression
Kooperberg C, Ruczinski I (2005): Identifying interacting SNPs using Monte Carlo Logic regression, Genetic Epidemiology, 28: 157-170.
A method to obtain more than one model, and measures of variable importance.


Software Information

Logic regression is now a package for R, a freely available statistical software environment, and for Splus, a commercial package distributed by Tibco. The core of the logic regression code has been translated into Fortran77, which is called from the R or Splus command line. The logic regression package can be downloaded here. It can be used to fit a single logic regression model, to fit logic regression models of prespecified sizes, to carry out cross-validation, and to do various randomization tests. See the documentation (in particular the article from JCGS) for more information.

The help files for the packages can be downloaded here. The format of the output on your screen (if you choose to display it) is described in the helpfile of the function logreg(). The format of stored trees is explained here using gifs, and in the helpfile of the function logregtree() (using ascii characters only). Currently the Logic Regression methodology has scoring functions for linear regression (residual sum of squares), logistic regression (deviance), classification (misclassification), and proportional hazards models (partial likelihood). A feature of the Logic Regression methodology is that it is easy to extend: if you need a different scoring function, you can write your own. We describe here how to do that.
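As a rough illustration of three of the scores named above, the following base R sketch computes the residual sum of squares, the binomial deviance, and the misclassification count for one hand-built Boolean predictor. These are textbook definitions written as hypothetical helper functions (score_rss, score_deviance, score_misclass); the package computes its scores in Fortran, and its internal scaling may differ.

```r
# Textbook versions of three scores (helper names are illustrative).
score_rss <- function(y, fitted) sum((y - fitted)^2)

score_deviance <- function(y, p) {
  # Guard against log(0) when a fitted probability is exactly 0 or 1.
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

score_misclass <- function(y, pred) sum(y != pred)

# Toy data with one Boolean predictor L = (x1 or x2).
set.seed(7)
n <- 400
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
L <- as.integer(x1 | x2)
y_num <- 3 + L + rnorm(n)                       # continuous response
y_bin <- rbinom(n, 1, ifelse(L == 1, 0.8, 0.2)) # binary response

lin <- lm(y_num ~ L)
logi <- glm(y_bin ~ L, family = binomial)

scores <- c(rss = score_rss(y_num, fitted(lin)),
            deviance = score_deviance(y_bin, fitted(logi)),
            misclass = score_misclass(y_bin, L))
```

In all three cases a smaller value indicates a better fit, which is what allows one search routine to be shared across model types.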


Download CRAN Package

Routines for fitting Logic Regression models. Logic Regression is described in Ruczinski, Kooperberg, and LeBlanc (2003). Monte Carlo Logic Regression is described in Kooperberg and Ruczinski (2005).

Version: 1.6.2
Depends: R (≥ 2.10), survival
Imports: stats, graphics, utils, grDevices
Published: 2019-12-07
Author: Charles Kooperberg and Ingo Ruczinski
Maintainer: Charles Kooperberg
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Needs Compilation: yes
In views: Survival
CRAN checks: LogicReg results

Downloads:

Reference manual: LogicReg.pdf
Package source: LogicReg_1.6.2.tar.gz
Windows binaries: r-devel: LogicReg_1.6.2.zip, r-devel-UCRT: LogicReg_1.6.2.zip, r-release: LogicReg_1.6.2.zip, r-oldrel: LogicReg_1.6.2.zip
macOS binaries: r-release: LogicReg_1.6.2.tgz, r-oldrel: LogicReg_1.6.2.tgz
Old sources: LogicReg archive

Reverse dependencies:

Reverse depends: logicFS
Reverse imports: trio
Reverse suggests: fscaret, SuperLearner

Linking:

Please use the canonical form https://CRAN.R-project.org/package=LogicReg to link to this page.


Write your own scoring function

Logic regression comes with several off-the-shelf scoring functions: misclassification, least squares, deviance, and partial likelihood for classification, linear regression, logistic regression, and proportional hazards models respectively. But you can write your own scoring function! This may be useful if you have a model other than those which we already programmed in. You need to provide two routines in the file My_own_scoring.f: a routine My_own_fitting, which fits your model, and a routine My_own_scoring, which (given the betas) provides the score of your model. The details are spelled out in the helpfile for logreg.myown. In this helpfile we give an example, using the file condlogic.ff (available in the source code) to show how things work for conditional logistic regression.
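The division of labour between the two routines can be pictured with an R analogue. This is only a conceptual sketch: the real routines are written in Fortran with the argument lists given in the logreg.myown helpfile, and the R function names (my_fitting, my_scoring), their arguments, and the choice of a Poisson model are all assumptions made for illustration.

```r
# Hypothetical R analogue of the fit/score split (not the package's API).
# 'trees' is an n x m matrix holding the 0/1 values of the m logic trees.

# Fitting step: estimate the betas (intercept plus one per tree).
my_fitting <- function(y, trees) {
  design <- cbind(1, trees)
  fit <- glm.fit(design, y, family = poisson())
  unname(fit$coefficients)
}

# Scoring step: given the betas, return the score (here Poisson deviance).
my_scoring <- function(y, trees, betas) {
  mu <- exp(drop(cbind(1, trees) %*% betas))
  2 * sum(ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
}

# Toy data with two logic trees.
set.seed(3)
n <- 250
x <- matrix(rbinom(n * 4, 1, 0.5), nrow = n)
trees <- cbind(L1 = as.integer(x[, 1] | x[, 2]),
               L2 = as.integer(x[, 3] & x[, 4]))
y <- rpois(n, exp(0.5 + 0.7 * trees[, "L1"] - 0.4 * trees[, "L2"]))

betas <- my_fitting(y, trees)
score <- my_scoring(y, trees, betas)
```

Keeping fitting and scoring separate means the search only needs a single number back for each candidate set of trees, regardless of how the betas were obtained.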


Documentation for package ‘LogicReg’ version 1.4.12

Download the help files as a PDF document

cumhaz Cumulative hazard transformation
eval.logreg Evaluate a Logic Regression tree
frame.logreg Constructs a data frame for one or more Logic Regression models
logreg Logic Regression
logreg.anneal.control Control for Logic Regression
logreg.mc.control Control for Logic Regression
logreg.myown Writing your own Logic Regression scoring function
logreg.savefit1 Sample results for Logic Regression
logreg.testdat Test data for Logic Regression
logreg.tree.control Control for logreg
logregmodel Format of class logregmodel
logregtree Format of class logregtree
plot.logreg Plots for Logic Regression
plot.logregmodel Plots for Logic Regression
plot.logregtree A plot of one Logic Regression tree.
predict.logreg Predicted values Logic Regression
print.logreg Prints Logic Regression Output
print.logregmodel Prints Logic Regression Formula
print.logregtree Prints Logic Regression Formula

Examples

I ran the currently available options for logic regression on a very simple simulated data set. The data have 500 cases and 20 binary variables. Each predictor k is simulated as an independent Bernoulli(pk) random variable, with success probabilities pk between 0.1 and 0.9. The response variable is simulated from the model

Y = 3 + 1 L1 - 2 L2 + N(0,1),

where L1=(X1 or X2) and L2=(X3 or X4). The data are available in the R logic regression package as logreg.testdat. So the task is to use the linear model in the logic regression framework to find L1 and L2. Follow the script below to see how we ran this example. The first two steps are to verify that there is signal in the data. Then we find the best scoring models for various sizes, and carry out the two options for model selection. It should be obvious that the best model has two trees, and a total of four leaves. In between we added some lines that are not really necessary, but illustrate how to use the code and grab the objects it generates. If you use R, we recommend using the option paper="letter" if you create a postscript file and do not want the default 'A4' size.
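For readers who want to experiment with data of the same shape, the simulation described above can be approximated in base R as follows. This is a sketch and will not reproduce logreg.testdat itself: the seed and the way the pk are drawn (uniformly on [0.1, 0.9]) are assumptions, since the text only states the range.

```r
set.seed(2024)
n <- 500   # cases
k <- 20    # binary predictors

# Assumption: success probabilities drawn uniformly between 0.1 and 0.9.
p <- runif(k, min = 0.1, max = 0.9)

# Each column is an independent Bernoulli(pk) sample.
X <- sapply(p, function(pk) rbinom(n, size = 1, prob = pk))
colnames(X) <- paste0("X", seq_len(k))

# The two Boolean terms of the true model.
L1 <- as.integer(X[, "X1"] | X[, "X2"])
L2 <- as.integer(X[, "X3"] | X[, "X4"])

# Y = 3 + 1 L1 - 2 L2 + N(0,1)
y <- 3 + 1 * L1 - 2 * L2 + rnorm(n)

# Response in column 1, predictors in columns 2 to 21.
sim <- data.frame(y = y, X)

# Fitting the true terms directly gives a reference for the coefficients.
ref <- lm(y ~ L1 + L2)
```

The column layout (response first, then the 20 predictors) matches the indexing used in the script that follows, so sim could be substituted for logreg.testdat there.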

Make sure to check out the documents about the methodology, so you know what's going on here.


### LOAD THE LOGIC REGRESSION PACKAGE ################################
library(LogicReg)    # for Splus, use the attach() function



### FIND THE BEST SCORING MODEL (BIG SIZE) ##############################

# SET THE ANNEALING PARAMETERS
myanneal <- logreg.anneal.control(start=-1,end=-4,iter=25000,update=1000)

# FIND THE BEST SCORING MODEL USING UP TO TWO TREES AND EIGHT LEAVES
# PER TREE (CHECK THE HELP FILES FOR THE DEFAULT SETTINGS)
fit1 <- logreg(resp=logreg.testdat[,1],bin=logreg.testdat[,2:21],
               type=2,select=1,ntrees=2,anneal.control=myanneal)

# GENERATE POSTSCRIPT FILES OF THE TWO TREES
plot(fit1,pscript=TRUE)

# CUSTOMIZE THE PLOT OF THE SECOND TREE
plot.logregtree(fit1$model$trees[[2]],info=TRUE,coef=FALSE,nms=LETTERS)



### NULL MODEL TEST FOR SIGNAL IN THE DATA ##############################

# TURN OFF THE ITERATION UPDATES ON THE SCREEN
myanneal2 <- logreg.anneal.control(start=-1,end=-4,iter=25000,update=0)

# A PERMUTATION TEST FOR SIGNAL IN THE DATA, 20 REPLICATES
fit4 <- logreg(select=4,anneal.control=myanneal2,oldfit=fit1,nrep=20)

# GENERATE A POSTSCRIPT FILE OF THE REFERENCE DISTRIBUTION
plot(fit4,pscript=TRUE)



### FIND THE BEST SCORING MODEL FOR VARIOUS SIZES #######################

# FIND THE BEST SCORING MODELS OF VARIOUS SIZES, ALLOWING ONE OR TWO
# TREES, AND BETWEEN ONE AND SEVEN LEAVES PER TREE
fit2 <- logreg(resp=logreg.testdat[,1],bin=logreg.testdat[,2:21],
               type=2,select=2,ntrees=c(1,2),nleaves=c(1,7),
               anneal.control=myanneal2)

# AN EASIER WAY (EQUIVALENT TO THE ABOVE)
fit2 <- logreg(oldfit=fit1,select=2,ntrees=c(1,2),nleaves=c(1,7),
               anneal.control=myanneal2)

# LOOK AT THE SCORES
plot(fit2)

# GENERATE POSTSCRIPT FILES OF THE SCORES AND ALL TREES
plot(fit2,pscript=TRUE)



### CROSS VALIDATION ####################################################

# 10-FOLD CROSS-VALIDATION ON THE TREES IN FIT2
fit3 <- logreg(select=3,oldfit=fit2)

# GENERATE POSTSCRIPT FILES OF THE TRAINING AND TEST SCORES
plot(fit3,pscript=TRUE)

# PRINT A TABLE WITH THE SCORES (EVERY 10TH ROW, COLUMNS 1, 2, 6 AND 8)
n1 <- seq(10,dim(fit3$cvscores)[1],10)
n2 <- c(1,2,6,8)
print(round(fit3$cvscores[n1,n2],3))



### RANDOMIZATION TEST FOR MODEL SELECTION ############################

# PERMUTATION TEST FOR MODEL SELECTION, 10 REPLICATES
fit5 <- logreg(select=5,oldfit=fit2,nrep=10)

# GENERATE A POSTSCRIPT FILE OF THE REFERENCE DISTRIBUTIONS
plot(fit5,pscript=TRUE)



### SUMMARY ############################################################

# print all scores
print(fit2$allscores)

# plot the trees for the best model - two trees, four leaves
postscript("final.ps")
par(mfrow=c(1,2),mar=rep(0,4))
plot.logregtree(fit2$alltrees[[10]]$trees[[1]],
                indents=c(0.5,0.2,0.2,0.2),coef.rd=2)
plot.logregtree(fit2$alltrees[[10]]$trees[[2]],
                indents=c(0.5,0.2,0.2,0.2),coef.rd=2)
dev.off()