Logic regression is a (generalized) regression methodology that is primarily applied when most of the covariates in the data to be analyzed are binary. The goal of logic regression is to find predictors that are Boolean (logical) combinations of the original predictors.

On this page you can find information about downloading the software for logic regression, along with the basic information you need to run it. Logic regression is available as an R package; R is freely available statistical software. The core of the logic regression code has been translated into Fortran, which is called from the R command line.

**The R version of the software is now available from CRAN [LogicReg]. For the S version of the software, follow the link below.**

The Logic Regression methodology was developed by Ingo Ruczinski, Charles Kooperberg, and Michael LeBlanc at the Fred Hutchinson Cancer Research Center in Seattle. The software was implemented by Ingo Ruczinski and Charles Kooperberg. Logic Regression is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form.

*The current version of the code is 1.3.1, dated December 15, 2004 [S-Plus], and 1.4.12, dated February 14, 2012 [R].*


Logic Regression is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of binary covariates.

In most regression problems a model is developed that relates only the main effects (the predictors, or transformations thereof) to the response. Although interactions between predictors are sometimes considered as well, those interactions are usually kept very simple (two- to three-way interactions at most). But often, especially when all predictors are binary, it is the interaction between many predictors that causes the differences in response. This issue often arises in the analysis of SNP microarray data or in data mining problems.

Given a set of binary predictors X, we try to create new, better predictors for the response by considering combinations of those binary predictors. For example, if the response is also binary (which is not required in general), we attempt to find decision rules such as ``if X_{1}, X_{2}, X_{3} and X_{4} are true'' or ``if X_{5} or X_{6}, but not X_{7}, is true'', then the response is more likely to be in class 0. In other words, we try to find Boolean statements involving the binary predictors that enhance the prediction of the response.

In more specific terms: let X_{1},…,X_{k} be binary predictors, and let Y be a response variable. We try to fit regression models of the form g(E[Y]) = b_{0} + b_{1}L_{1} + … + b_{n}L_{n}, where L_{j} is a Boolean expression of the predictors X, such as L_{j} = [(X_{2} or X_{4}^{c}) and X_{7}]. This framework includes many forms of regression, such as linear regression (g(E[Y]) = E[Y]) and logistic regression (g(E[Y]) = log(E[Y]/(1-E[Y]))). For every model type, we define a score function that reflects the ``quality'' of the model under consideration. For example, for linear regression the score could be the residual sum of squares, and for logistic regression the score could be the deviance. We try to find the Boolean expressions in the regression model that minimize the scoring function associated with this model type, estimating the parameters b_{j} simultaneously with the Boolean expressions L_{j}.
In general, any type of model can be considered, as long as a scoring function can be defined. For example, we also implemented the Cox proportional hazards model, using the partial likelihood as the score.
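As a toy illustration of the scoring idea (this is not the LogicReg search itself, and the variable names are made up for the example): once a candidate Boolean expression L is fixed, scoring the model reduces to ordinary regression.

```r
# Illustrative sketch: score one candidate Boolean predictor by
# ordinary linear regression (identity link, g(E[Y]) = E[Y]).
set.seed(1)
n <- 200
X <- matrix(rbinom(n * 7, 1, 0.5), n, 7)       # seven binary predictors
L1 <- as.numeric((X[, 2] | !X[, 4]) & X[, 7])  # L1 = (X2 or X4^c) and X7
y <- 2 + 3 * L1 + rnorm(n)                     # simulated response
fit <- lm(y ~ L1)
score <- sum(residuals(fit)^2)                 # residual sum of squares as score
```

The logic regression search explores many such expressions L and keeps the ones with the lowest score.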

Since the number of possible logic models we can construct for a given set of predictors is huge, we have to rely on search algorithms to help us find the best scoring models. We define the move set by a set of standard operations, such as splitting and pruning the tree (similar to the terminology introduced by Breiman et al.). We implemented two types of algorithms: a greedy algorithm and a simulated annealing algorithm. While the greedy algorithm is very fast, it does not always find a good scoring model. The simulated annealing algorithm usually does, but it is computationally more expensive. Since we have to be certain to find good scoring models, we usually carry out simulated annealing for our case studies. However, as usual, the best scoring model generally over-fits the data, and methods to separate signal from noise are needed. We implemented such methods for candidate models found by either the greedy or the simulated annealing algorithm. In the latter case, a definition of model size was needed, and a technique was implemented to find the best scoring model of a particular size. For the model selection itself we developed and implemented randomization tests and tests using cross-validation. If sufficient data is available, an analysis using a training and a test set can also be carried out. These tests are rather involved, so we will not go into detail here and instead refer you to the Logic Regression manuscript.
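The simulated annealing accept/reject rule can be sketched as follows (a minimal illustration only; the actual implementation works on tree moves in Fortran, and the function name here is made up):

```r
# Minimal sketch of the simulated-annealing acceptance rule: always accept
# a better (lower) score; accept a worse score with a probability that
# shrinks as the temperature drops.
anneal_step <- function(current_score, new_score, temperature) {
  if (new_score <= current_score) return(TRUE)   # improvements always accepted
  runif(1) < exp(-(new_score - current_score) / temperature)
}
```

At high temperatures the chain moves freely through the model space; as the temperature decreases it settles into a good scoring model.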

The main publication from JCGS

Ruczinski I, Kooperberg C, LeBlanc ML (2003): Logic Regression, *Journal of Computational and Graphical Statistics*, 12 (3), pp 475-511.

An even longer description

Selected chapters from the dissertation of Ingo Ruczinski.

An application to single nucleotide polymorphism data

Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L (2001): Sequence Analysis using Logic Regression, *Genetic Epidemiology*, 21:S626-S631.

Monte Carlo Logic Regression

Kooperberg C, Ruczinski I (2005): Identifying interacting SNPs using Monte Carlo Logic regression, *Genetic Epidemiology*, 28: 157-170.

A method to obtain more than one model, and measures of variable importance.

Logic regression is now available as a package for R, freely available statistical software, and for S-Plus, commercial software distributed by TIBCO. The core of the logic regression code has been translated into Fortran 77, which is called from the R or S-Plus command line. The logic regression package can be downloaded here. It can be used to fit a single logic regression model, to fit logic regression models of prespecified sizes, to carry out cross-validation, and to run various randomization tests. See the documentation (in particular the article from JCGS) for more information.

The help files for the packages can be downloaded here. The format of the on-screen output (if you choose to display it) is described in the help file for the function logreg(). The format of stored trees is explained here using gifs, and in the help file for the function logregtree() (using ASCII characters only). Currently the Logic Regression methodology has scoring functions for linear regression (residual sum of squares), logistic regression (deviance), classification (misclassification), and proportional hazards models (partial likelihood). A feature of the methodology is that it can easily be extended: if you have a different model, you can write your own scoring function. We describe here how to do that.
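To make the built-in scores concrete, here is a hedged sketch that computes two of them by hand on toy data (illustrative only; inside LogicReg these scores are computed in the Fortran core):

```r
# Hand-computed versions of two built-in scores on toy data.
set.seed(2)
x <- rbinom(100, 1, 0.5)

# Linear regression: residual sum of squares
y1 <- 1 + x + rnorm(100)
rss <- sum(residuals(lm(y1 ~ x))^2)

# Logistic regression: deviance, plus the misclassification rate
y2 <- rbinom(100, 1, plogis(-1 + 2 * x))
glmfit <- glm(y2 ~ x, family = binomial)
dev <- deviance(glmfit)
mis <- mean((fitted(glmfit) > 0.5) != y2)
```

Lower values of each score correspond to better-fitting models, which is why the search algorithms minimize them.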

Routines for fitting Logic Regression models. Logic Regression is described in Ruczinski, Kooperberg, and LeBlanc (2003). Monte Carlo Logic Regression is described in Kooperberg and Ruczinski (2005).

| Field | Value |
| --- | --- |
| Version | 1.6.2 |
| Depends | R (≥ 2.10), survival |
| Imports | stats, graphics, utils, grDevices |
| Published | 2019-12-07 |
| Author | Charles Kooperberg and Ingo Ruczinski |
| Maintainer | Charles Kooperberg |
| License | GPL-2 \| GPL-3 [expanded from: GPL (≥ 2)] |
| Needs compilation | yes |
| In views | Survival |
| CRAN checks | LogicReg results |
| Reference manual | LogicReg.pdf |
| Package source | LogicReg_1.6.2.tar.gz |
| Windows binaries | r-devel: LogicReg_1.6.2.zip, r-devel-UCRT: LogicReg_1.6.2.zip, r-release: LogicReg_1.6.2.zip, r-oldrel: LogicReg_1.6.2.zip |
| macOS binaries | r-release: LogicReg_1.6.2.tgz, r-oldrel: LogicReg_1.6.2.tgz |
| Old sources | LogicReg archive |
| Reverse depends | logicFS |
| Reverse imports | trio |
| Reverse suggests | fscaret, SuperLearner |

Please use the canonical form https://CRAN.R-project.org/package=LogicReg to link to this page.

Logic regression comes with several off-the-shelf scoring functions: misclassification, least squares, deviance, and partial likelihood, for classification, linear regression, logistic regression, and proportional hazards models, respectively. But you can also write your own scoring function, which may be useful if your model is not among those already programmed in. You need to provide two routines in the file **My_own_scoring.f**: a routine My_own_fitting, which fits your model, and a routine My_own_scoring, which (given the betas) provides the score of your model. The details are spelled out in the help file for logreg.myown. There we give an example, using the file condlogic.ff (available in the source code), to show how things work for conditional logistic regression.

| Function | Description |
| --- | --- |
| cumhaz | Cumulative hazard transformation |
| eval.logreg | Evaluate a Logic Regression tree |
| frame.logreg | Constructs a data frame for one or more Logic Regression models |
| logreg | Logic Regression |
| logreg.anneal.control | Control for Logic Regression |
| logreg.mc.control | Control for Logic Regression |
| logreg.myown | Writing your own Logic Regression scoring function |
| logreg.savefit1 | Sample results for Logic Regression |
| logreg.testdat | Test data for Logic Regression |
| logreg.tree.control | Control for logreg |
| logregmodel | Format of class logregmodel |
| logregtree | Format of class logregtree |
| plot.logreg | Plots for Logic Regression |
| plot.logregmodel | Plots for Logic Regression |
| plot.logregtree | A plot of one Logic Regression tree |
| predict.logreg | Predicted values Logic Regression |
| print.logreg | Prints Logic Regression Output |
| print.logregmodel | Prints Logic Regression Formula |
| print.logregtree | Prints Logic Regression Formula |

We ran the currently available options for logic regression on a very simple simulated data set. The data have 500 cases and 20 binary variables. Each predictor X_{k} is simulated as an independent Bernoulli(p_{k}) random variable, with success probabilities p_{k} between 0.1 and 0.9. The response variable is simulated from the model

Y = 3 + 1·L_{1} - 2·L_{2} + N(0,1),

where L_{1} = (X_{1} or X_{2}) and L_{2} = (X_{3} or X_{4}). The data are available in the R logic regression package as *logreg.testdat*. The task is thus to use the linear model in the logic regression framework to find L_{1} and L_{2}. Follow the script below to see how we ran this example. The first two steps verify that there is signal in the data. Then we find the best scoring models for various sizes, and carry out the two options for model selection. It should be obvious that the best model has two trees and a total of four leaves. In between we added some lines that are not strictly necessary, but that illustrate how to use the code and grab the objects it generates. If you use R, we recommend using the option *paper="letter"* if you create a postscript file and do not want the default 'A4' size.
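Data of this form could be simulated along the following lines (a sketch only; *logreg.testdat* itself ships with the package, and the seed and exact success probabilities here are arbitrary):

```r
# Simulate data with the same structure as logreg.testdat:
# 500 cases, 20 independent Bernoulli predictors, and a response
# driven by two Boolean combinations L1 and L2.
set.seed(13)
n <- 500; k <- 20
p <- runif(k, 0.1, 0.9)                        # success probabilities
X <- sapply(p, function(pk) rbinom(n, 1, pk))  # 500 x 20 binary matrix
L1 <- as.numeric(X[, 1] | X[, 2])
L2 <- as.numeric(X[, 3] | X[, 4])
Y <- 3 + 1 * L1 - 2 * L2 + rnorm(n)
dat <- data.frame(Y, X)                        # column 1: response
```

With data in this layout, the calls below (e.g. `resp=logreg.testdat[,1], bin=logreg.testdat[,2:21]`) pick out the response and the binary predictors.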

Make sure to check out the documents about the methodology, so you know what's going on here.

```r
### LOAD THE LOGIC REGRESSION PACKAGE ##################################
library(LogicReg)   # for S-Plus, use the attach() function

### FIND THE BEST SCORING MODEL (BIG SIZE) #############################
# SET THE ANNEALING PARAMETERS
myanneal <- logreg.anneal.control(start=-1, end=-4, iter=25000, update=1000)
# FIND THE BEST SCORING MODEL USING UP TO TWO TREES AND EIGHT LEAVES
# PER TREE (CHECK THE HELP FILES FOR THE DEFAULT SETTINGS)
fit1 <- logreg(resp=logreg.testdat[,1], bin=logreg.testdat[,2:21],
               type=2, select=1, ntrees=2, anneal.control=myanneal)
# GENERATE POSTSCRIPT FILES OF THE TWO TREES
plot(fit1, pscript=T)
# CUSTOMIZE THE PLOT OF THE SECOND TREE
plot.logregtree(fit1$model$trees[[2]], info=T, coef=F, nms=LETTERS)

### NULL MODEL TEST FOR SIGNAL IN THE DATA #############################
# TURN OFF THE ITERATION UPDATES ON THE SCREEN
myanneal2 <- logreg.anneal.control(start=-1, end=-4, iter=25000, update=0)
# A PERMUTATION TEST FOR SIGNAL IN THE DATA, 20 REPLICATES
fit4 <- logreg(select=4, anneal.control=myanneal2, oldfit=fit1, nrep=20)
# GENERATE A POSTSCRIPT FILE OF THE REFERENCE DISTRIBUTION
plot(fit4, pscript=T)

### FIND THE BEST SCORING MODEL FOR VARIOUS SIZES ######################
# FIND THE BEST SCORING MODELS OF VARIOUS SIZES, ALLOWING ONE OR TWO
# TREES, AND BETWEEN ONE AND SEVEN LEAVES PER TREE
fit2 <- logreg(resp=logreg.testdat[,1], bin=logreg.testdat[,2:21],
               type=2, select=2, ntrees=c(1,2), nleaves=c(1,7),
               anneal.control=myanneal2)
# AN EASIER WAY (EQUIVALENT TO THE ABOVE)
fit2 <- logreg(oldfit=fit1, select=2, ntrees=c(1,2), nleaves=c(1,7),
               anneal.control=myanneal2)
# LOOK AT THE SCORES
plot(fit2)
# GENERATE POSTSCRIPT FILES OF THE SCORES AND ALL TREES
plot(fit2, pscript=T)

### CROSS VALIDATION ###################################################
# 10-FOLD CROSS-VALIDATION ON THE TREES IN FIT2
fit3 <- logreg(select=3, oldfit=fit2)
# GENERATE POSTSCRIPT FILES OF THE TRAINING AND TEST SCORES
plot(fit3, pscript=T)
# PRINT A TABLE WITH THE SCORES
n1 <- seq(10, dim(fit3$cvscores)[1], 10); n2 <- c(1,2,6,8)
print(round(fit3$cvscores[n1,n2], 3))

### RANDOMIZATION TEST FOR MODEL SELECTION #############################
# PERMUTATION TEST FOR MODEL SELECTION, 10 REPLICATES
fit5 <- logreg(select=5, oldfit=fit2, nrep=10)
# GENERATE A POSTSCRIPT FILE OF THE REFERENCE DISTRIBUTIONS
plot(fit5, pscript=T)

### SUMMARY ############################################################
# print all scores
print(fit2$allscores)
# plot the trees for the best model - two trees, four leaves
postscript("final.ps")
par(mfrow=c(1,2), mar=rep(0,4))
plot.logregtree(fit2$alltrees[[10]]$trees[[1]],
                indents=c(0.5,0.2,0.2,0.2), coef.rd=2)
plot.logregtree(fit2$alltrees[[10]]$trees[[2]],
                indents=c(0.5,0.2,0.2,0.2), coef.rd=2)
dev.off()
```

Fred Hutchinson Cancer Center | 1100 Fairview Ave. N., Seattle, WA 98109

© 2022 Fred Hutchinson Cancer Center, a 501(c)(3) nonprofit organization.
