Project: Logistic Regression and Random Forests¶

For this project we create different models analyzing a Heart Disease data set using Logistic Regression and Random Forests.

Scenario¶

For this project we are researching risk factors for heart disease at a university hospital. We have access to a large set of historical data that you can use to analyze patterns between different health indicators (e.g. fasting blood sugar, maximum heart rate, etc.) and the presence of heart disease. We will create different logistic regression models that predict whether or not a person is at risk for heart disease. A model like this could eventually be used to evaluate medical records and look for risks that might not be obvious to human doctors. We create a classification random forest model to predict the risk of heart disease and a regression random forest model to predict the maximum heart rate achieved.

There are several variables in this data set, but you will be working with the following important variables:

Variable	What does it represent?
age	The person's age in years
sex	The person's sex (1 = male, 0 = female)
cp	The type of chest pain experienced (0=no pain, 1=typical angina, 2=atypical angina, 3=non-anginal pain)
trestbps	The person's resting blood pressure
chol	The person's cholesterol measurement in mg/dl
fbs	The person's fasting blood sugar is greater than 120 mg/dl (1 = true, 0 = false)
restecg	Resting electrocardiographic measurement (0=normal, 1=having ST-T wave abnormality, 2=showing probable or definite left ventricular hypertrophy by Estes' criteria)
thalach	The person's maximum heart rate achieved
exang	Exercise-induced angina (1=yes, 0=no)
oldpeak	ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
slope	The slope of the peak exercise ST segment (1=upsloping, 2=flat, 3=downsloping)
ca	The number of major vessels (0-3)
target	Heart disease (0=no, 1=yes)

Install Libraries¶

In the following code block, we installed the appropriate libraries to use in this project.

Click the Run button on the toolbar to run this code.

install.packages("ResourceSelection")
install.packages("pROC")
install.packages("rpart.plot")

Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4'
(as 'lib' is unspecified)
Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4'
(as 'lib' is unspecified)
Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4'
(as 'lib' is unspecified)

Prepare the Data Set¶

In the following code block, we used R code to prepare our data set.

Click the Run button on the toolbar to run this code.

heart_data <- read.csv(file="heart_disease.csv", header=TRUE, sep=",")

# Converting appropriate variables to factors  
heart_data <- within(heart_data, {
   target <- factor(target)
   sex <- factor(sex)
   cp <- factor(cp)
   fbs <- factor(fbs)
   restecg <- factor(restecg)
   exang <- factor(exang)
   slope <- factor(slope)
   ca <- factor(ca)
   thal <- factor(thal)
})

head(heart_data, 10)

print("Number of variables")
ncol(heart_data)

print("Number of rows")
nrow(heart_data)

[1] "Number of variables"

[1] "Number of rows"

Model #1 - First Logistic Regression Model¶

We created a logistic regression model for heart disease (target) using the variables age (age), resting blood pressure (trestbps), and maximum heart rate achieved (thalach).

We used scripts to get the outputs of our regression analysis.

logit <- glm(target ~ age + trestbps + thalach , data = heart_data, family = "binomial")

summary(logit)

Call:
glm(formula = target ~ age + trestbps + thalach, family = "binomial", 
    data = heart_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0257  -1.0069   0.5688   0.9203   2.0476  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.576198   1.633928  -2.189   0.0286 *  
age         -0.009424   0.016080  -0.586   0.5578    
trestbps    -0.016019   0.007767  -2.063   0.0392 *  
thalach      0.042697   0.006950   6.144 8.06e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 417.64  on 302  degrees of freedom
Residual deviance: 353.28  on 299  degrees of freedom
AIC: 361.28

Number of Fisher Scoring iterations: 3

conf_int <- confint.default(logit, level=0.95)
round(conf_int,4)

library(ResourceSelection)


print("Hosmer-Lemeshow Goodness of Fit Test")
hl = hoslem.test(logit$y, fitted(logit), g=50)
hl

ResourceSelection 0.3-5 	 2019-07-22

[1] "Hosmer-Lemeshow Goodness of Fit Test"

	Hosmer and Lemeshow goodness of fit (GOF) test

data:  logit$y, fitted(logit)
X-squared = 41.978, df = 48, p-value = 0.7168

# Predict default or no_default for the data set using the model
default_model_data <- heart_data[c('age', 'trestbps', 'thalach')]
pred <- predict(logit, newdata=default_model_data, type='response')

# If the predicted probability of default is >=0.50 then predict credit default (default='1'), otherwise predict no credit 
# default (default='0') 
depvar_pred = as.factor(ifelse(pred >= 0.5, '1', '0'))

# This creates the confusion matrix
conf.matrix <- table(heart_data$target, depvar_pred)[c('0','1'),c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": default=")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": default=")

# Print nicely formatted confusion matrix
print("Confusion Matrix")
format(conf.matrix,justify="centre",digit=2)

[1] "Confusion Matrix"

library(pROC)

labels <- heart_data$target
predictions <- logit$fitted.values

roc <- roc(labels ~ predictions)

print("Area Under the Curve (AUC)")
round(auc(roc),4)

print("ROC Curve")
# True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity)
plot(roc, legacy.axes = TRUE)

Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var

Setting levels: control = 0, case = 1
Setting direction: controls < cases

[1] "Area Under the Curve (AUC)"

[1] "ROC Curve"

print("Prediction: age (age='50'), trestbps is 122, thalach is 140 ")
newdata1 <- data.frame(age=50, trestbps=122, thalach=140)
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)

print("Prediction: age (age='50'), trestbps is 140, thalach is 170 ")
newdata1 <- data.frame(age=50, trestbps=140, thalach=170)
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)

[1] "Prediction: age (age='50'), trestbps is 122, thalach is 140 "

[1] "Prediction: age (age='50'), trestbps is 140, thalach is 170 "

Model #2 - Second Logistic Regression Model¶

We are asked to create a logistic regression model for heart disease (target) using the variables maximum heart rate achieved (thalach), age of the individual (age), sex of the individual (sex), exercise-induced angina (exang), and type of chest pain (cp). We also include the quadratic term for age and the interaction term between age and maximum heart rate achieved.

We used scripts to get the outputs of your analysis.

logit <- glm(target ~ thalach + age + sex + exang + cp + I(age^2) + age:thalach , data = heart_data, family = "binomial")

summary(logit)

Call:
glm(formula = target ~ thalach + age + sex + exang + cp + I(age^2) + 
    age:thalach, family = "binomial", data = heart_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4225  -0.6167   0.2083   0.6646   2.5398  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.634e+01  1.205e+01  -1.356 0.175117    
thalach      1.390e-01  5.701e-02   2.438 0.014760 *  
age          2.049e-01  3.112e-01   0.658 0.510325    
sex1        -1.709e+00  3.590e-01  -4.762 1.91e-06 ***
exang1      -9.348e-01  3.586e-01  -2.607 0.009133 ** 
cp1          1.766e+00  4.821e-01   3.663 0.000249 ***
cp2          1.820e+00  3.844e-01   4.734 2.21e-06 ***
cp3          1.674e+00  5.764e-01   2.904 0.003684 ** 
I(age^2)     4.921e-04  2.054e-03   0.240 0.810599    
thalach:age -2.017e-03  9.999e-04  -2.017 0.043666 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 417.64  on 302  degrees of freedom
Residual deviance: 263.42  on 293  degrees of freedom
AIC: 283.42

Number of Fisher Scoring iterations: 5

conf_int <- confint.default(logit, level=0.95)
round(conf_int,4)

library(ResourceSelection)


print("Hosmer-Lemeshow Goodness of Fit Test")
hl = hoslem.test(logit$y, fitted(logit), g=50)
hl

[1] "Hosmer-Lemeshow Goodness of Fit Test"

	Hosmer and Lemeshow goodness of fit (GOF) test

data:  logit$y, fitted(logit)
X-squared = 60.596, df = 48, p-value = 0.1048

# Predict default or no_default for the data set using the model
default_model_data <- heart_data[c('age', 'sex', 'exang', 'cp', 'thalach')]
pred <- predict(logit, newdata=default_model_data, type='response')

# If the predicted probability of default is >=0.50 then predict credit default (default='1'), otherwise predict no credit 
# default (default='0') 
depvar_pred = as.factor(ifelse(pred >= 0.5, '1', '0'))

# This creates the confusion matrix
conf.matrix <- table(heart_data$target, depvar_pred)[c('0','1'),c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": default=")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": default=")

# Print nicely formatted confusion matrix
print("Confusion Matrix")
format(conf.matrix,justify="centre",digit=2)

[1] "Confusion Matrix"

library(pROC)

labels <- heart_data$target
predictions <- logit$fitted.values

roc <- roc(labels ~ predictions)

print("Area Under the Curve (AUC)")
round(auc(roc),4)

print("ROC Curve")
# True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity)
plot(roc, legacy.axes = TRUE)

Setting levels: control = 0, case = 1
Setting direction: controls < cases

[1] "Area Under the Curve (AUC)"

[1] "ROC Curve"

print("Prediction: age (age='30'), thalach is 145, sex = 1, exang=1, cp=0  ")
newdata1 <- data.frame(age=30,thalach=145, sex='1', exang='1', cp='0')
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)

print("Prediction: age (age='30'), thalach is 145, sex = 1, exang=0, cp=2  ")
newdata1 <- data.frame(age=30,thalach=145, sex='1', exang='0', cp='2')
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)

[1] "Prediction: age (age='30'), thalach is 145, sex = 1, exang=1, cp=0  "

[1] "Prediction: age (age='30'), thalach is 145, sex = 1, exang=0, cp=2  "

Random Forest Classification Model¶

You have been asked to create a random forest classification model for the presence of heart disease (target) using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), slope of peak exercise (slope), and number of major vessels (ca). Before writing any code, review Section 5 of the Summary Report template to see the questions you will be answering about your model.

Run your scripts to get the outputs of your regression analysis. Then use the outputs to answer the questions in your summary report.

Note: Use the + (plus) button to add new code blocks, if needed.

set.seed(511038)

# Partition the data set into training and testing data
samp.size = floor(0.80*nrow(heart_data))

# Training set
print("Number of rows for the training set")
train_ind = sample(seq_len(nrow(heart_data)), size = samp.size)
train.data = heart_data[train_ind,]
nrow(train.data)

# Testing set 
print("Number of rows for the testing set")
test.data = heart_data[-train_ind,]
nrow(test.data)

[1] "Number of rows for the training set"

[1] "Number of rows for the testing set"

set.seed(511038)
library(randomForest)
model_rf1 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 2)

# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 2 trees')
train.data.predict <- predict(model_rf1, train.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print a nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)


print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 2 trees')
test.data.predict <- predict(model_rf1, test.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(test.data$target, test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print a nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)

[1] "======================================================================================================================"
[1] "Confusion Matrix: TRAINING set based on random forest model built using 2 trees"

[1] "======================================================================================================================"
[1] "Confusion Matrix: TESTING set based on random forest model built using 2 trees"

set.seed(511038)
library(randomForest)
model_rf2 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 5)

# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 5 trees')
train.data.predict <- predict(model_rf2, train.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)


print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 5 trees')
test.data.predict <- predict(model_rf2, test.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(test.data$target , test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)

[1] "======================================================================================================================"
[1] "Confusion Matrix: TRAINING set based on random forest model built using 5 trees"

[1] "======================================================================================================================"
[1] "Confusion Matrix: TESTING set based on random forest model built using 5 trees"

set.seed(511038)
library(randomForest)

# checking
#=====================================================================
train = c()
test = c()
trees = c()

for(i in seq(from=1, to=200, by=1)) {
    #print(i)
    
    trees <- c(trees, i)
    
    model_rf3 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = i)
    
    train.data.predict <- predict(model_rf3, train.data, type = "class")
    conf.matrix1 <- table(train.data$target, train.data.predict)
    train_error = 1-(sum(diag(conf.matrix1)))/sum(conf.matrix1)
    train <- c(train, train_error)
    
    test.data.predict <- predict(model_rf3, test.data, type = "class")
    conf.matrix2 <- table(test.data$target, test.data.predict)
    test_error = 1-(sum(diag(conf.matrix2)))/sum(conf.matrix2)
    test <- c(test, test_error)
}
 
#matplot (trees, cbind (train, test), ylim=c(0,0.5) , type = c("l", "l"), lwd=2, col=c("red","blue"), ylab="Error", xlab="number of trees")
#legend('topright',legend = c('training set','testing set'), col = c("red","blue"), lwd = 2 )

plot(trees, train,type = "l",ylim=c(0,1.0),col = "red", xlab = "Number of Trees", ylab = "Classification Error")
lines(test, type = "l", col = "blue")
legend('topright',legend = c('training set','testing set'), col = c("red","blue"), lwd = 2 )

set.seed(511038)
library(randomForest)
model_rf4 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 20)

# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 20 trees')
train.data.predict <- predict(model_rf4, train.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)


print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 20 trees')
test.data.predict <- predict(model_rf4, test.data, type = "class")

# Construct the confusion matrix
conf.matrix <- table(test.data$target, test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")

# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)

[1] "======================================================================================================================"
[1] "Confusion Matrix: TRAINING set based on random forest model built using 20 trees"

[1] "======================================================================================================================"
[1] "Confusion Matrix: TESTING set based on random forest model built using 20 trees"

Random Forest Regression Model¶

You have been asked to create a random forest regression model for maximum heart rate achieved using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), slope of peak exercise (slope), and number of major vessels (ca). Before writing any code, review Section 6 of the Summary Report template to see the questions you will be answering about your model.

Run your scripts to get the outputs of your analysis. Then use the outputs to answer the questions in your summary report.