For this project we create different models analyzing a Heart Disease data set using Logistic Regression and Random Forests.
For this project we are researching risk factors for heart disease at a university hospital. We have access to a large set of historical data that you can use to analyze patterns between different health indicators (e.g. fasting blood sugar, maximum heart rate, etc.) and the presence of heart disease. We will create different logistic regression models that predict whether or not a person is at risk for heart disease. A model like this could eventually be used to evaluate medical records and look for risks that might not be obvious to human doctors. We create a classification random forest model to predict the risk of heart disease and a regression random forest model to predict the maximum heart rate achieved.
There are several variables in this data set, but you will be working with the following important variables:
Variable |
What does it represent?
|
|---|---|
age |
The person's age in years
|
sex |
The person's sex (1 = male, 0 = female)
|
cp |
The type of chest pain experienced (0=no pain, 1=typical
angina, 2=atypical angina, 3=non-anginal pain)
|
trestbps |
The person's resting blood pressure
|
chol |
The person's cholesterol measurement in mg/dl
|
fbs |
The person's fasting blood sugar is greater than 120
mg/dl (1 = true, 0 = false)
|
restecg |
Resting electrocardiographic measurement (0=normal,
1=having ST-T wave abnormality, 2=showing probable or
definite left ventricular hypertrophy by Estes'
criteria)
|
thalach |
The person's maximum heart rate achieved
|
exang |
Exercise-induced angina (1=yes, 0=no)
|
oldpeak |
ST depression induced by exercise relative to rest ('ST'
relates to positions on the ECG plot)
|
slope |
The slope of the peak exercise ST segment (1=upsloping,
2=flat, 3=downsloping)
|
ca |
The number of major vessels (0-3)
|
target |
Heart disease (0=no, 1=yes)
|
In the following code block, we installed the appropriate libraries to use in this project.
Click the Run button on the toolbar to run this code.
install.packages("ResourceSelection")
install.packages("pROC")
install.packages("rpart.plot")
In the following code block, we used R code to prepare our data set.
Click the Run button on the toolbar to run this code.
heart_data <- read.csv(file="heart_disease.csv", header=TRUE, sep=",")
# Converting appropriate variables to factors
heart_data <- within(heart_data, {
target <- factor(target)
sex <- factor(sex)
cp <- factor(cp)
fbs <- factor(fbs)
restecg <- factor(restecg)
exang <- factor(exang)
slope <- factor(slope)
ca <- factor(ca)
thal <- factor(thal)
})
head(heart_data, 10)
print("Number of variables")
ncol(heart_data)
print("Number of rows")
nrow(heart_data)
We created a logistic regression model for heart disease (target) using the variables age (age), resting blood pressure (trestbps), and maximum heart rate achieved (thalach).
We used scripts to get the outputs of our regression analysis.
logit <- glm(target ~ age + trestbps + thalach , data = heart_data, family = "binomial")
summary(logit)
conf_int <- confint.default(logit, level=0.95)
round(conf_int,4)
library(ResourceSelection)
print("Hosmer-Lemeshow Goodness of Fit Test")
hl = hoslem.test(logit$y, fitted(logit), g=50)
hl
# Predict default or no_default for the data set using the model
default_model_data <- heart_data[c('age', 'trestbps', 'thalach')]
pred <- predict(logit, newdata=default_model_data, type='response')
# If the predicted probability of default is >=0.50 then predict credit default (default='1'), otherwise predict no credit
# default (default='0')
depvar_pred = as.factor(ifelse(pred >= 0.5, '1', '0'))
# This creates the confusion matrix
conf.matrix <- table(heart_data$target, depvar_pred)[c('0','1'),c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": default=")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": default=")
# Print nicely formatted confusion matrix
print("Confusion Matrix")
format(conf.matrix,justify="centre",digit=2)
library(pROC)
labels <- heart_data$target
predictions <- logit$fitted.values
roc <- roc(labels ~ predictions)
print("Area Under the Curve (AUC)")
round(auc(roc),4)
print("ROC Curve")
# True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity)
plot(roc, legacy.axes = TRUE)
print("Prediction: age (age='50'), trestbps is 122, thalach is 140 ")
newdata1 <- data.frame(age=50, trestbps=122, thalach=140)
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)
print("Prediction: age (age='50'), trestbps is 140, thalach is 170 ")
newdata1 <- data.frame(age=50, trestbps=140, thalach=170)
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)
We are asked to create a logistic regression model for heart disease (target) using the variables maximum heart rate achieved (thalach), age of the individual (age), sex of the individual (sex), exercise-induced angina (exang), and type of chest pain (cp). We also include the quadratic term for age and the interaction term between age and maximum heart rate achieved.
We used scripts to get the outputs of your analysis.
logit <- glm(target ~ thalach + age + sex + exang + cp + I(age^2) + age:thalach , data = heart_data, family = "binomial")
summary(logit)
conf_int <- confint.default(logit, level=0.95)
round(conf_int,4)
library(ResourceSelection)
print("Hosmer-Lemeshow Goodness of Fit Test")
hl = hoslem.test(logit$y, fitted(logit), g=50)
hl
# Predict default or no_default for the data set using the model
default_model_data <- heart_data[c('age', 'sex', 'exang', 'cp', 'thalach')]
pred <- predict(logit, newdata=default_model_data, type='response')
# If the predicted probability of default is >=0.50 then predict credit default (default='1'), otherwise predict no credit
# default (default='0')
depvar_pred = as.factor(ifelse(pred >= 0.5, '1', '0'))
# This creates the confusion matrix
conf.matrix <- table(heart_data$target, depvar_pred)[c('0','1'),c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": default=")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": default=")
# Print nicely formatted confusion matrix
print("Confusion Matrix")
format(conf.matrix,justify="centre",digit=2)
library(pROC)
labels <- heart_data$target
predictions <- logit$fitted.values
roc <- roc(labels ~ predictions)
print("Area Under the Curve (AUC)")
round(auc(roc),4)
print("ROC Curve")
# True Positive Rate (Sensitivity) and False Positive Rate (1 - Specificity)
plot(roc, legacy.axes = TRUE)
print("Prediction: age (age='30'), thalach is 145, sex = 1, exang=1, cp=0 ")
newdata1 <- data.frame(age=30,thalach=145, sex='1', exang='1', cp='0')
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)
print("Prediction: age (age='30'), thalach is 145, sex = 1, exang=0, cp=2 ")
newdata1 <- data.frame(age=30,thalach=145, sex='1', exang='0', cp='2')
pred1 <- predict(logit, newdata1, type='response')
round(pred1, 4)
You have been asked to create a random forest classification model for the presence of heart disease (target) using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), slope of peak exercise (slope), and number of major vessels (ca). Before writing any code, review Section 5 of the Summary Report template to see the questions you will be answering about your model.
Run your scripts to get the outputs of your regression analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
set.seed(511038)
# Partition the data set into training and testing data
samp.size = floor(0.80*nrow(heart_data))
# Training set
print("Number of rows for the training set")
train_ind = sample(seq_len(nrow(heart_data)), size = samp.size)
train.data = heart_data[train_ind,]
nrow(train.data)
# Testing set
print("Number of rows for the testing set")
test.data = heart_data[-train_ind,]
nrow(test.data)
set.seed(511038)
library(randomForest)
model_rf1 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 2)
# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 2 trees')
train.data.predict <- predict(model_rf1, train.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print a nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 2 trees')
test.data.predict <- predict(model_rf1, test.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(test.data$target, test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print a nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
set.seed(511038)
library(randomForest)
model_rf2 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 5)
# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 5 trees')
train.data.predict <- predict(model_rf2, train.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 5 trees')
test.data.predict <- predict(model_rf2, test.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(test.data$target , test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
set.seed(511038)
library(randomForest)
# checking
#=====================================================================
train = c()
test = c()
trees = c()
for(i in seq(from=1, to=200, by=1)) {
#print(i)
trees <- c(trees, i)
model_rf3 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = i)
train.data.predict <- predict(model_rf3, train.data, type = "class")
conf.matrix1 <- table(train.data$target, train.data.predict)
train_error = 1-(sum(diag(conf.matrix1)))/sum(conf.matrix1)
train <- c(train, train_error)
test.data.predict <- predict(model_rf3, test.data, type = "class")
conf.matrix2 <- table(test.data$target, test.data.predict)
test_error = 1-(sum(diag(conf.matrix2)))/sum(conf.matrix2)
test <- c(test, test_error)
}
#matplot (trees, cbind (train, test), ylim=c(0,0.5) , type = c("l", "l"), lwd=2, col=c("red","blue"), ylab="Error", xlab="number of trees")
#legend('topright',legend = c('training set','testing set'), col = c("red","blue"), lwd = 2 )
plot(trees, train,type = "l",ylim=c(0,1.0),col = "red", xlab = "Number of Trees", ylab = "Classification Error")
lines(test, type = "l", col = "blue")
legend('topright',legend = c('training set','testing set'), col = c("red","blue"), lwd = 2 )
set.seed(511038)
library(randomForest)
model_rf4 <- randomForest(target ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 20)
# Confusion matrix
print("======================================================================================================================")
print('Confusion Matrix: TRAINING set based on random forest model built using 20 trees')
train.data.predict <- predict(model_rf4, train.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
print("======================================================================================================================")
print('Confusion Matrix: TESTING set based on random forest model built using 20 trees')
test.data.predict <- predict(model_rf4, test.data, type = "class")
# Construct the confusion matrix
conf.matrix <- table(test.data$target, test.data.predict)[,c('0','1')]
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ": ")
colnames(conf.matrix) <- paste("Prediction", colnames(conf.matrix), sep = ": ")
# Print nicely formatted confusion matrix
format(conf.matrix,justify="centre",digit=2)
You have been asked to create a random forest regression model for maximum heart rate achieved using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), slope of peak exercise (slope), and number of major vessels (ca). Before writing any code, review Section 6 of the Summary Report template to see the questions you will be answering about your model.
Run your scripts to get the outputs of your analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
set.seed(511038)
# Partition the data set into training and testing data
samp.size = floor(0.80*nrow(heart_data))
# Training set
print("Number of rows for the training set")
train_ind = sample(seq_len(nrow(heart_data)), size = samp.size)
train.data = heart_data[train_ind,]
nrow(train.data)
# Testing set
print("Number of rows for the testing set")
test.data = heart_data[-train_ind,]
nrow(test.data)
set.seed(511038)
library(randomForest)
# Root mean squared error
RMSE = function(pred, obs) {
return(sqrt( sum( (pred - obs)^2 )/length(pred) ) )
}
# checking
#=====================================================================
train = c()
test = c()
trees = c()
for(i in seq(from=1, to=80, by=1)) {
trees <- c(trees, i)
model_rf7 <- randomForest(thalach ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = i)
pred <- predict(model_rf7, newdata=train.data, type='response')
rmse_train <- RMSE(pred, train.data$thalach)
train <- c(train, rmse_train)
pred <- predict(model_rf7, newdata=test.data, type='response')
rmse_test <- RMSE(pred, test.data$thalach)
test <- c(test, rmse_test)
}
plot(trees, train,type = "l",ylim=c(0,60),col = "red", xlab = "Number of Trees", ylab = "Root Mean Squared Error")
lines(test, type = "l", col = "blue")
legend('topright',legend = c('training set','testing set'), col = c("red","blue"), lwd = 2 )
set.seed(511038)
library(randomForest)
model_rf5 <- randomForest(thalach ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 2)
# Root mean squared error
RMSE = function(pred, obs) {
return(sqrt( sum( (pred - obs)^2 )/length(pred) ) )
}
print("======================================================================================================================")
print('Root Mean Squared Error: TRAINING set based on random forest regression model built using 2 trees')
pred <- predict(model_rf5, newdata=train.data, type='response')
RMSE(pred, train.data$thalach)
print("======================================================================================================================")
print('Root Mean Squared Error: TESTING set based on random forest regression model built using 2 trees')
pred <- predict(model_rf5, newdata=test.data, type='response')
RMSE(pred, test.data$thalach)
set.seed(511038)
library(randomForest)
model_rf5 <- randomForest(thalach ~age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree = 17)
# Root mean squared error
RMSE = function(pred, obs) {
return(sqrt( sum( (pred - obs)^2 )/length(pred) ) )
}
print("======================================================================================================================")
print('Root Mean Squared Error: TRAINING set based on random forest regression model built using 17 trees')
pred <- predict(model_rf5, newdata=train.data, type='response')
RMSE(pred, train.data$thalach)
print("======================================================================================================================")
print('Root Mean Squared Error: TESTING set based on random forest regression model built using 17 trees')
pred <- predict(model_rf5, newdata=test.data, type='response')
RMSE(pred, test.data$thalach)