C# .NET Algorithm for Variable Selection Based on the Mallows’ Cp Criterion

Jessie Chen, MEng.
Massachusetts Institute of Technology, Cambridge, MA
jic@mit.edu


Abstract: Variable selection techniques are important in statistical modeling because they seek to reduce the risk of overfitting while also minimizing the effects of omission bias. The linear, or ordinary least squares, regression model is particularly useful in variable selection because of its association with certain optimality criteria. One of these is the Mallows’ Cp criterion, which evaluates the fit of a regression model by the squared distance between its predictions and the true values. The first part of this project implements an algorithm in C# .NET for variable selection using the Mallows’ Cp criterion and tests whether a greedy version of the algorithm is a viable way to reduce computational cost. The second part aims to verify the results of the algorithm through logistic regression. The results supported the use of a greedy algorithm, and the logistic regression models confirmed the Mallows’ Cp results. However, further study of the details of the Mallows’ Cp algorithm, a calibrated logistic regression modeling process, and perhaps the incorporation of techniques such as cross-validation would be useful before drawing final conclusions about the reliability of the implemented algorithm.

Keywords: variable selection; overfitting; omission bias; linear least squares regression; Mallows’ Cp; logistic regression; C-Index
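
For reference, the criterion named in the abstract is usually written as below. This is the standard textbook form and may differ in detail from the expression used in the paper. The notation is assumed here and does not come from the text above: n is the number of observations, p is the number of fitted parameters in the candidate model (including the intercept), SSE_p is that model's residual sum of squares, and \hat{\sigma}^2 is the error variance estimated from the full model with k parameters.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p,
\qquad
\hat{\sigma}^2 = \frac{\mathrm{SSE}_k}{n - k}
```

Under this definition the full model always gives C_k = (n - k) - n + 2k = k, and a candidate subset with little bias is expected to have C_p close to p, so subsets with small C_p near p are the ones usually preferred.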


Paper (pdf)


Appendices:

A: C# Code Implementation

B: Pima Indian Dataset

C: CpTable.txt

D: CpTableAll.txt

E: Logistic Regression Results