Video 4: Logistic Regression in R
In this video, we’ll be using the dataset quality (CSV) to build a logistic regression model in R. Please download this file to follow along.
An R script file with all of the commands used in this lecture can be downloaded here: Unit3_ModelingExpert (R).
The variables in the dataset quality.csv are as follows:
- MemberID numbers the patients from 1 to 131, and is just an identifying number.
- InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.
- ERVisits is the number of times the patient visited the emergency room.
- OfficeVisits is the number of times the patient visited any doctor’s office.
- Narcotics is the number of prescriptions the patient had for narcotics.
- DaysSinceLastERVisit is the number of days between the patient’s last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
- Pain is the number of visits for which the patient complained about pain.
- TotalVisits is the total number of times the patient visited any healthcare provider.
- ProviderCount is the number of providers that served the patient.
- MedicalClaims is the number of days on which the patient had a medical claim.
- ClaimLines is the total number of medical claims.
- StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
- AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.
- PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.
In this video we learned how to use the sample.split() function from the caTools package to split data for a classification problem, balancing the positive and negative observations in the training and testing sets.
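As a rough sketch of that step, the snippet below applies sample.split() to a small synthetic data frame so that it runs without the downloaded file. The data frame demo, its class counts, the seed values, and the 0.75 split ratio are all illustrative assumptions, not values taken from this page; with the real data you would load quality.csv with read.csv() and pass its PoorCare column instead.

```r
# install.packages("caTools")  # run once if the package is not installed
library(caTools)

# Synthetic stand-in for quality.csv (assumption): 131 rows with a 0/1 outcome.
# The class counts below are illustrative only.
set.seed(1)
demo <- data.frame(
  MemberID = 1:131,
  OfficeVisits = rpois(131, lambda = 13),
  PoorCare = sample(rep(c(0, 1), times = c(98, 33)))
)

# sample.split() takes the outcome vector and returns a logical vector:
# TRUE marks a training row, FALSE marks a testing row.
set.seed(88)  # arbitrary seed so the split is reproducible
spl <- sample.split(demo$PoorCare, SplitRatio = 0.75)
demoTrain <- subset(demo, spl == TRUE)
demoTest <- subset(demo, spl == FALSE)

# The share of PoorCare == 1 should be nearly identical in both sets.
mean(demoTrain$PoorCare)
mean(demoTest$PoorCare)
```

Because the split is made within each outcome class, both sets keep roughly the same proportion of poor-care patients, which a plain random split does not guarantee on a small dataset.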
If you instead want to split a data frame called data whose dependent variable is continuous (as was the case for all the datasets we used last week), there are no outcome classes to balance, so you can use the sample() function. Here is how to select 70% of the observations for the training set (called “train”) and the remaining 30% for the testing set (called “test”):
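# Randomly pick 70% of the row indices (rounded down) for the training set
spl = sample(1:nrow(data), size = floor(0.7 * nrow(data)))
train = data[spl,]   # rows selected for training
test = data[-spl,]   # all remaining rows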
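To see the split in action, here is a minimal check on mtcars, a dataset that ships with R and has a continuous outcome (mpg). The choice of mtcars and the seed value are assumptions made for illustration; neither appears in the lecture.

```r
# mtcars is built into R and has 32 rows.
data <- mtcars

set.seed(123)  # arbitrary seed (assumption), only to make the split reproducible
spl <- sample(1:nrow(data), size = floor(0.7 * nrow(data)))
train <- data[spl, ]
test <- data[-spl, ]

nrow(train)                                          # 22
nrow(test)                                           # 10
length(intersect(rownames(train), rownames(test)))   # 0, no row is in both sets
```

Negative indexing with -spl drops exactly the rows chosen for training, so the two sets never overlap and together cover every row.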