1. Introduction
As analysts, we have been given a historical data set on heart disease. The data set contains many variables that can be used to build different models, and the variables chosen determine what type of regression model is appropriate for a given scenario. In this case study we develop logistic regression models that predict whether or not a person is at risk for heart disease. Such models are essential to public health and to health-industry analysts, because they may surface risks that are not obvious to human doctors and can help prepare patients for a crisis before it occurs. Data sets are essential to developing regression models, and with so much data being collected around us today, establishing the validity of a model is more critical than ever.
2. Data Preparation
The data set provided is heart_disease.csv. It contains many variables related to the risk of heart-related issues, organized as 13 columns and 303 rows: each column is a particular variable, and each row holds one individual's recorded values for those variables. This gives a data analyst many potentially important variables from which to develop a regression model.
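A minimal sketch of this preparation step in Python, assuming heart_disease.csv sits in the working directory (the path is an assumption). To keep the sketch runnable without the real file, a tiny stand-in frame with a few of the 13 columns is built inline; the values are illustrative, not the data set's.

```python
import pandas as pd

# Stand-in for pd.read_csv("heart_disease.csv"); values are illustrative.
df = pd.DataFrame({
    "age":      [63, 37, 41, 56],
    "trestbps": [145, 130, 130, 120],
    "thalach":  [150, 187, 172, 178],
    "target":   [1, 1, 1, 0],
})

print(df.shape)                      # (rows, columns)
print(df.dtypes)                     # each column's type
print(df["target"].value_counts())   # class balance of the response
```

With the real file, the first statement would simply be `df = pd.read_csv("heart_disease.csv")` and `df.shape` should report 303 rows and 13 columns.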
3. Model #1 - First Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for heart disease (target) using the variables age (age), resting blood pressure (trestbps), and maximum heart rate achieved (thalach).

The general form of this logistic regression model is:

P(y = 1) = π = e^(β0 + β1x1 + β2x2 + β3x3) / (1 + e^(β0 + β1x1 + β2x2 + β3x3))

where
y = heart disease (target)
x1 = age
x2 = resting blood pressure (trestbps)
x3 = maximum heart rate achieved (thalach)
Substituting the estimated coefficients from the regression output gives the fitted model:

π̂ = e^(-3.575 - 0.009425x1 - 0.016022x2 + 0.04269x3) / (1 + e^(-3.575 - 0.009425x1 - 0.016022x2 + 0.04269x3))
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

ln(π / (1 - π)) = β0 + β1x1 + β2x2 + β3x3

3.) The left side of the equation above is the natural log of the odds, so this can be written as:

ln(odds) = β0 + β1x1 + β2x2 + β3x3

where the odds are the odds of having heart disease, that is, of target = 1.
From the general form of the model above, what does π mean in terms of heart disease (target)?
a.
π = P(Y = 1) = e^(β0 + β1x1 + β2x2 + β3x3) / (1 + e^(β0 + β1x1 + β2x2 + β3x3)), where Y is the binary response variable indicating heart disease. Then π is the proportion, the probability from the logistic regression model expressed in terms of the beta values and the independent variables. π is the probability that the event occurs, and it depends on the beta values of the regression model. We can therefore say that π is the probability that an individual will get heart disease: π = P(target = 1).
b.
Since π is the probability that the event occurs, 1 - π is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not occur:

odds = π / (1 - π)

These odds depend on the beta coefficients of the regression model. We can therefore say that π/(1 - π) is the odds of an individual getting heart disease, where the event is target = 1.
Interpret the estimated coefficients.
The estimated coefficient for age is -0.009425, which means that, on average, the log odds of heart disease change by -0.009425 for each one-year increase in age, given that all other variables are held constant. Expressed in terms of odds, e^(-0.009425) = 0.9906, so the odds of getting heart disease decrease by about 0.94 percent for each additional year of age, all other variables held constant.
The estimated coefficient for resting blood pressure (trestbps) is -0.016022, which means that, on average, the log odds of heart disease change by -0.016022 for each one-unit increase in trestbps, all other variables held constant. In terms of odds, e^(-0.016022) = 0.9841, so the odds of getting heart disease decrease by about 1.59 percent for each one-unit increase in resting blood pressure.
The estimated coefficient for maximum heart rate achieved (thalach) is 0.04269, which means that, on average, the log odds of heart disease change by 0.04269 for each one-unit increase in thalach, all other variables held constant. In terms of odds, e^(0.04269) = 1.0436, so the odds of getting heart disease increase by about 4.36 percent for each one-unit increase in maximum heart rate achieved.
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability, we need a cutoff point at which a predicted value is labeled true or false. A confusion matrix can evaluate a logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix

             Prediction = 0    Prediction = 1
Actual = 0   TN                FP
Actual = 1   FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Once again, a logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. The accuracy computed from our confusion matrix is close to 1. When assessing the classification model's performance, an accuracy near one is exceptionally good: it means the model distinguishes the binary values well, correctly labeling what is a 0 and what is a 1. We should therefore expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our precision is also close to 1. A precision near one means that when the model predicts a 1, it is usually correct.
C. Recall is the ratio of correct positive predictions to the total actual positives.

Recall = TP / (TP + FN)

Our recall is also close to 1. A recall near one means the model finds most of the actual positive cases.
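The three metrics above can be computed directly from confusion-matrix counts. The counts below are illustrative placeholders, not the report's actual values (those did not survive extraction).

```python
# Illustrative confusion-matrix counts (placeholders, not the report's values).
TP, TN, FP, FN = 90, 80, 20, 13

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all observations
precision = TP / (TP + FP)                   # correct positives / predicted positives
recall    = TP / (TP + FN)                   # correct positives / actual positives

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```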
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?
H0: The model fits the data.
Ha: The model does not fit the data.
The Hosmer-Lemeshow test groups the observations by predicted probability and compares the observed and expected numbers of events in each group; under the null hypothesis, the model fits the data.
X-squared = 41.978
P-value = 0.7168
P-value > α = 0.05
The null hypothesis is not rejected. Since the p-value of 0.7168 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
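The grouping-and-comparison step just described can be sketched directly. This is a hand-rolled Hosmer-Lemeshow statistic on simulated inputs (the function name and data are ours, not the report's); with the real model you would pass its fitted probabilities and the observed target.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: g groups by predicted probability."""
    order = np.argsort(p)
    y, p = y[order], p[order]
    groups = np.array_split(np.arange(len(y)), g)  # g near-equal groups
    stat = 0.0
    for idx in groups:
        obs = y[idx].sum()   # observed events in the group
        exp = p[idx].sum()   # expected events in the group
        n_k = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_k))
    pval = chi2.sf(stat, g - 2)  # reference distribution: chi-square, df = g - 2
    return stat, pval

# Simulated stand-in: y is generated from p, so the "model" is well calibrated.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 500)
y = rng.binomial(1, p)
stat, pval = hosmer_lemeshow(y, p)
print(stat, pval)
```

A large p-value, as in the report's 0.7168, means the observed and expected group counts agree and we do not reject the null hypothesis that the model fits.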
Which terms are significant in the model based on Wald's test? Use a 5% level of
significance.
Age:
The confidence interval for the age slope parameter is (-0.0409, 0.0221).
The null hypothesis for the Wald test of the age parameter is H0: β1 = 0.
Age: Z-value = -0.586, P-value = 0.5578, level of significance α = 0.05.
Based on the 95% confidence interval for age (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age is not significant in the model, since the p-value is greater than the level of significance of 0.05.
TRESTBPS:
The confidence interval for the trestbps slope parameter is (-0.0312, -0.0008).
The null hypothesis for the Wald test of the trestbps parameter is H0: β2 = 0.
trestbps: Z-value = -2.063, P-value = 0.0392, level of significance α = 0.05.
Based on the 95% confidence interval for trestbps and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that trestbps is significant in the model, since the p-value is less than the level of significance of 0.05.
THALACH:
The confidence interval for the thalach slope parameter is (0.0291, 0.0563).
The null hypothesis for the Wald test of the thalach parameter is H0: β3 = 0.
thalach: Z-value = 6.144, P-value = 8.06E-10, level of significance α = 0.05.
Based on the 95% confidence interval for thalach and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that thalach is significant in the model, since the p-value is less than the level of significance of 0.05.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and
explain what it illustrates.
A ROC curve measures the performance of a classifier at various threshold settings: it plots the true positive rate against the false positive rate as the cutoff varies. The area under the curve (AUC) indicates how well the model distinguishes between Y = 0 and Y = 1. The closer the curve hugs the top-left corner (the larger the area under it), the better the model separates the two classes when predicting the binary response.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: there is complete separability between Y = 1 and Y = 0.
A model with AUC = 0.5 has no separation between Y = 1 and Y = 0; it cannot distinguish 0 from 1 any better than chance.
A model with AUC = 0 has the worst possible separability: it labels every 0 as a 1 and every 1 as a 0, so the classifier is unusable as-is.
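The curve and its area can be obtained with scikit-learn. The scores below are simulated (positives tend to score higher than negatives); with the real model, you would pass its fitted probabilities for the heart-disease data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Simulated labels and informative-but-noisy scores (stand-ins, not the report's).
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.5, 300)
scores = y * 0.8 + rng.normal(0, 0.5, 300)

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y, scores)
print(f"AUC = {auc:.3f}")  # 0.5 = chance-level separation, 1.0 = perfect
```

Plotting `fpr` against `tpr` reproduces the ROC curve discussed above.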
Relation between Sensitivity and Specificity
Accuracy = (TP + TN) / (TP + TN + FP + FN); this is the probability of correctly classifying both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = 0.7696, the ability of the test to correctly identify individuals who have heart disease.
Specificity = TN / (TN + FP) = 0.60144, the ability of the test to correctly identify individuals who do not have heart disease.
Making Predictions Using Model
The logistic regression model predicts the probability of y=1. The probability of an
individual who is 50 years old, has a resting blood pressure of 122, and has a maximum heart
rate of 140 having heart disease percent chance is 0.4939, which would be a label = 0.
The logistic regression model predicts the probability of y=1. The probability of an
individual who is 50 years old, has a resting blood pressure of 140, and has a maximum heart
rate of 170 with heart disease is 0.7248, a label = 1.
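These predictions simply plug the new values into the fitted equation. In the sketch below, the slope values are the estimates discussed earlier in this section, while the intercept b0 = -3.575 is back-solved from the reported probabilities, so treat it as a reconstruction rather than the report's printed output.

```python
import math

def predict_prob(age, trestbps, thalach,
                 b0=-3.575, b1=-0.009425, b2=-0.016022, b3=0.04269):
    """pi = 1 / (1 + e^-(b0 + b1*age + b2*trestbps + b3*thalach)).
    Intercept b0 is a reconstruction; slopes are the report's estimates."""
    logit = b0 + b1 * age + b2 * trestbps + b3 * thalach
    return 1 / (1 + math.exp(-logit))

p1 = predict_prob(50, 122, 140)   # first individual
p2 = predict_prob(50, 140, 170)   # second individual
label1 = 1 if p1 >= 0.5 else 0    # 0.5 cutoff turns a probability into a label
label2 = 1 if p2 >= 0.5 else 0
print(round(p1, 4), label1)       # close to the report's 0.4939, label 0
print(round(p2, 4), label2)       # close to the report's 0.7248, label 1
```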
4. Model #2 - Second Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for heart disease (target) using the variables maximum heart rate achieved (thalach), age of the individual (age), sex of the individual (sex), exercise-induced angina (exang), and type of chest pain (cp), including the quadratic term for age and the interaction term between age and the maximum heart rate achieved.

The general form of this logistic regression model is:

P(y = 1) = π = e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9) / (1 + e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9))

where
y = heart disease (target)
x1 = thalach
x2 = age
x3 = sex
x4 = exang
x5, x6, x7 = dummy variables for cp
x8 = age^2 (quadratic term for age)
x9 = age × thalach (interaction term between age and thalach)
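The dummy, quadratic, and interaction terms above can all be specified in a statsmodels formula: `C(cp)` dummy-codes chest pain, `I(age**2)` adds the quadratic term, and `age:thalach` adds the interaction. The data are simulated stand-ins; only the formula structure mirrors the model described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in columns (ranges are assumptions, not the real data).
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    "thalach": rng.integers(71, 202, n),
    "age": rng.integers(29, 78, n),
    "sex": rng.integers(0, 2, n),
    "exang": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),     # 4 chest-pain types -> 3 dummies
})
df["target"] = rng.binomial(1, 1 / (1 + np.exp(-(0.03 * df["thalach"] - 4))))

fit = smf.logit(
    "target ~ thalach + age + sex + exang + C(cp) + I(age**2) + age:thalach",
    data=df,
).fit(disp=0)
print(fit.params.index.tolist())  # intercept + 9 terms = 10 coefficients
```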
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

ln(π / (1 - π)) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9

3.) The left side of the equation above is the natural log of the odds, so this can be written as:

ln(odds) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9

where the odds are the odds of having heart disease, that is, of target = 1.
As with the first model, we use a cutoff point to turn the predicted probabilities into 0/1 labels and evaluate the model's performance with a confusion matrix on the dataset used to create the model (Chan, 2020).
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy of the second model is close to 1, which again means the model does a good job of distinguishing the binary values, correctly labeling what is a 0 and what is a 1.

B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our precision is likewise close to 1: when the model predicts a 1, it is usually correct.

C. Recall is the ratio of correct positive predictions to the total actual positives.

Recall = TP / (TP + FN)

Our recall is also close to 1: the model finds most of the actual positive cases.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?
H0: The model fits the data.
Ha: The model does not fit the data.
As before, the test compares the observed and expected numbers of events within groups of similar predicted probability; under the null hypothesis, the model fits the data.
X-squared = 60.596
P-value = 0.1048
P-value > α = 0.05
The null hypothesis is not rejected. Since the p-value of 0.1048 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of
significance.
THALACH:
The confidence interval for the thalach slope parameter is (0.0273, 0.2507).
The null hypothesis for the Wald test of the thalach parameter is H0: β1 = 0.
thalach: Z-value = 2.438, P-value = 0.014760, level of significance α = 0.05.
Based on the 95% confidence interval for thalach and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that thalach is significant in the model, since the p-value is less than the level of significance of 0.05.
AGE:
The confidence interval for the age slope parameter is (-0.04051, 0.8148).
The null hypothesis for the Wald test of the age parameter is H0: β2 = 0.
age: Z-value = 0.658, P-value = 0.510325, level of significance α = 0.05.
Based on the 95% confidence interval for age (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age is not significant in the model, since the p-value is greater than the level of significance of 0.05.
SEX:
The confidence interval for the sex slope parameter is (-2.4130, -1.0059).
The null hypothesis for the Wald test of the sex parameter is H0: β3 = 0.
sex: Z-value = -4.762, P-value = 1.91E-06, level of significance α = 0.05.
Based on the 95% confidence interval for sex and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that sex is significant in the model, since the p-value is less than the level of significance of 0.05.
EXANG:
The confidence interval for the exang slope parameter is (-1.6377, -0.2320).
The null hypothesis for the Wald test of the exang parameter is H0: β4 = 0.
exang: Z-value = -2.607, P-value = 0.009133, level of significance α = 0.05.
Based on the 95% confidence interval for exang and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that exang is significant in the model, since the p-value is less than the level of significance of 0.05.
CP1:
The confidence interval for the first cp dummy's slope parameter is (0.8209, 2.7106).
The null hypothesis for the Wald test of the parameter is H0: β5 = 0.
cp1: Z-value = 3.663, P-value = 0.000249, level of significance α = 0.05.
Based on the 95% confidence interval for cp1 and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that cp1 is significant in the model, since the p-value is less than the level of significance of 0.05.
CP2:
The confidence interval for the second cp dummy's slope parameter is (1.0662, 2.5732).
The null hypothesis for the Wald test of the parameter is H0: β6 = 0.
cp2: Z-value = 4.734, P-value = 2.21E-06, level of significance α = 0.05.
Based on the 95% confidence interval for cp2 and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that cp2 is significant in the model, since the p-value is less than the level of significance of 0.05.
AGE^2:
The confidence interval for the age^2 slope parameter is (-0.0035, 0.0045).
The null hypothesis for the Wald test of the age^2 parameter is H0: β8 = 0.
age^2: Z-value = 0.240, P-value = 0.810599, level of significance α = 0.05.
Based on the 95% confidence interval for age^2 (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age^2 is not significant in the model, since the p-value is greater than the level of significance of 0.05.
THALACH:AGE:
The confidence interval for the thalach:age slope parameter is (-0.0040, -0.0001).
The null hypothesis for the Wald test of the interaction parameter is H0: β9 = 0.
thalach:age: Z-value = -2.017, P-value = 0.043666, level of significance α = 0.05.
Based on the 95% confidence interval for thalach:age and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that the thalach:age interaction is significant in the model, since the p-value is less than the level of significance of 0.05.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and
explain what it illustrates.
A ROC curve again measures the classifier's performance at various threshold settings; the area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the closer the curve is to the top-left corner, the better the separation.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1 (complete separability between Y = 1 and Y = 0). A model with AUC = 0.5 has no separation and cannot distinguish 0 from 1 any better than chance. A model with AUC = 0 has the worst separability, labeling every 0 as a 1 and every 1 as a 0.
Relation between Sensitivity and Specificity
Accuracy = (TP + TN) / (TP + TN + FP + FN); this is the probability of correctly classifying both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = 0.8363, the ability of the test to correctly identify individuals who have heart disease.
Specificity = TN / (TN + FP) = 0.7976, the ability of the test to correctly identify individuals who do not have heart disease.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. The predicted probability of heart disease for a 30-year-old male who has a maximum heart rate of 145, experiences exercise-induced angina, and does not experience chest pain related to typical angina, atypical angina, or non-anginal pain is 0.2654, which would be labeled 0.
For a 30-year-old male who has a maximum heart rate of 145, does not experience exercise-induced angina, and experiences typical angina, the predicted probability of heart disease is 0.8502, which would be labeled 1.
5. Random Forest Classification Model
Reporting Results
What is the training set and testing set? The training set is the data used to train the decision tree model; once the model is trained, we use the testing set to evaluate it on data it has not seen. Why do we want to split the original data into training and testing sets? The main reason is that decision trees can overfit: a model can fit the training data so well that it becomes biased toward it and does not perform well on future data. To get around this overfitting issue and check that the model generalizes, we train on the training set and evaluate on the testing set. We are asked to split the heart disease data set into training and validation sets using an 80%/20% split.
In the original heart_disease.csv data set there are 303 observations. This gives 242 rows for the training set and 61 rows for the validation set.
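A sketch of the 80/20 split using scikit-learn. The arrays are stand-ins for the 303 observations; with 303 rows, a 20% test fraction rounds up to 61 validation rows and leaves 242 training rows, matching the counts above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(303).reshape(-1, 1)   # stand-in for the 303 observations
y = np.zeros(303)                    # stand-in response

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)  # random_state fixes the shuffle
print(len(X_train), len(X_test))             # 242 and 61
```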
What is the use of training and testing sets when creating a random forest model?
Random forest is a popular machine learning algorithm for making predictions on classification and regression problems. The critical idea behind a random forest is that the algorithm uses not just one but multiple learners to obtain better predictive performance. We fit multiple decision trees on the training data and combine them into one model, which we then use to make predictions on the testing set. How are the predictions made? For a classification problem with a binary variable (for example, yes or no), the random forest takes a majority vote across its decision trees. Using separate training and testing sets is extremely important when creating a random forest model, because the testing set tells us how the combined model performs on unseen data.
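The majority-vote idea above can be sketched with scikit-learn's RandomForestClassifier, here with five trees as in the next subsection. The data are simulated stand-ins, not the heart-disease columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated stand-in data with a learnable signal in the first two features.
rng = np.random.default_rng(5)
X = rng.normal(size=(303, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 303) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Five trees; each predicts, and the forest returns the majority class.
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_tr, y_tr)
print("train accuracy:", rf.score(X_tr, y_tr))
print("test accuracy:",  rf.score(X_te, y_te))
```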
We create the same random forest classification model using five decision trees. Notice the slight improvement in true positive and true negative counts.
Confusion Matrix, Training Set Using 5 Trees
A. Accuracy = (TP + TN) / (TP + TN + FP + FN), the ratio of the number of correct predictions to the total number of observations.
B. Precision = TP / (TP + FP), the ratio of correct positive predictions to the total predicted positives.
C. Recall = TP / (TP + FN), the ratio of correct positive predictions to the total actual positives.
Confusion Matrix, Testing Set Using 5 Trees
A. Accuracy = (TP + TN) / (TP + TN + FP + FN).
B. Precision = TP / (TP + FP).
C. Recall = TP / (TP + FN).
We are asked to graph the training and testing error against the number of trees using a
classification random forest model for the presence of heart disease (target) using variables age
(age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement
(chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), the
slope of peak exercise (slope), and a number of major vessels (ca). Use a maximum of 200 trees.
We are interested in the point at which the error levels off, particularly on the testing side. Both errors drop at first, and the training set error usually keeps dropping until it is very close to zero; as the number of trees grows, the training classification error decreases toward zero. The testing error initially shows a similar decreasing pattern, although in our case there is noise in the plotted graph that obscures the overall picture. We can still see that the testing error decreases as the number of trees grows, then levels off, and after some number of trees it may start moving upward again. That point is where the model begins overfitting the training data: beyond it, even though the training error keeps dropping, the testing error starts moving upward, so adding more trees no longer improves the model. The model is then doing an excellent job classifying the training set but an increasingly poor job on the testing set, which is what we mean by overfitting the training set. We therefore stop at the smallest number of trees that attains the minimum testing error; finding that number of trees is what lets us produce an effective model.
What is the optimal number of trees for the random forest model? We look for the point beyond which more trees no longer help: even though the training error keeps dropping, the testing error stops improving. Looking at the graph, we can see that after 20 trees the training error stays near zero and the testing error has leveled off. Another critical factor is computation; more trees take more processing power, so there is little reason to go beyond 20 trees when the error will not change significantly. For this reason, 20 trees is the ideal number.
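The error-versus-trees curve described above can be produced by refitting the forest with increasing n_estimators and recording the training and testing error each time. Simulated stand-in data again; with the real data, you would plot the two error lists against n_values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(303, 9))                              # stand-in features
y = (X[:, 0] - X[:, 1] + rng.normal(0, 1.0, 303) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

n_values = [1, 5, 10, 20, 50, 100, 200]                    # up to 200 trees
train_err, test_err = [], []
for n in n_values:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    train_err.append(1 - rf.score(X_tr, y_tr))  # training classification error
    test_err.append(1 - rf.score(X_te, y_te))   # testing classification error
print(list(zip(n_values, train_err, test_err)))
```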
Evaluating Utility of Model
Confusion Matrix Training set using 20 trees
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Like the logistic regression models, the random forest's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our training accuracy is 0.9958. An accuracy this close to one means the model distinguishes the binary values on the training set almost perfectly.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our training precision is 0.9924: when the model predicts a 1 on the training set, it is almost always correct.
C. Recall is the ratio of correct positive predictions to the total positive examples.

Recall = TP / (TP + FN) = 1.00

We can see that our recall is 1.00, meaning the model identified every positive example
in the training set.
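The three metrics above can be computed directly from confusion-matrix counts. As a
minimal Python sketch (the counts here are hypothetical, chosen only for illustration,
not taken from the report's actual confusion matrix):

```python
# Accuracy, precision, and recall from confusion-matrix counts.
# The counts passed in below are hypothetical examples.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all
    precision = tp / (tp + fp)                  # correct positives / predicted positives
    recall = tp / (tp + fn)                     # correct positives / actual positives
    return accuracy, precision, recall

acc, prec, rec = classification_metrics(tp=131, tn=107, fp=1, fn=0)
print(round(acc, 4), round(prec, 4), round(rec, 4))  # -> 0.9958 0.9924 1.0
```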
Confusion Matrix Testing set using 20 trees
A. Accuracy is the ratio of the number of correct predictions to the total number of
observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN) = .7541

Once again, the classification model's goal is to predict whether the binary response Y
takes on a value of 0 or 1. Our testing accuracy is .7541, noticeably lower than the
near-perfect training accuracy. The model still distinguishes 0 from 1 reasonably well
on unseen data, but the gap between training and testing performance points to some
overfitting of the training set.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP) = .7777

Our testing precision is .7777: of the observations the model labels as 1, about 78%
truly are 1. This is solid, though well below the training precision of .9924.
C. Recall is the ratio of correct positive predictions to the total positive examples.

Recall = TP / (TP + FN) = 0.80

Our testing recall is 0.80: the model finds 80% of the true positives in the testing set,
compared with 100% on the training set.
Making a Comparison of the Model using 5 trees vs 20 trees
Confusion Matrix Testing set using 5 trees
Accuracy = .6593
Precision = .6951
Recall = .7143
Confusion Matrix Testing set using 20 trees
Accuracy = .7541
Precision = .7777
Recall = .80
Model using 5 trees vs 20 trees (difference):
Accuracy: .7541 - .6593 = .0948
Precision: .7777 - .6951 = .0826
Recall: .80 - .7143 = .0857
As we can see, when the number of trees increased, accuracy, precision, and recall all
improved. The closer each number is to 1, the better our model performs. However, the
critical point is finding the number of trees beyond which accuracy no longer improves
and the model does not overfit, and 20 trees is a good number to use for this random
forest model.
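As a hedged sketch of this comparison (assuming Python with scikit-learn, which the
report does not name, and a synthetic data set standing in for heart_disease.csv), the
5-tree vs 20-tree evaluation could look like:

```python
# Compare random forests with 5 vs 20 trees on a held-out testing set.
# make_classification is a synthetic stand-in for the heart disease data;
# the real study would load heart_disease.csv instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=317, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

for n_trees in (5, 20):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    print(n_trees,
          round(accuracy_score(y_test, pred), 4),
          round(precision_score(y_test, pred), 4),
          round(recall_score(y_test, pred), 4))
```

With real data the exact scores will differ, but the pattern the report describes is the
usual one: the larger forest tends to score at least as well on all three metrics.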
6. Random Forest Regression Model
Reporting Results
What are the training set and testing set? The training set is the data used to train the
decision tree model; once the model is trained, we use the testing set to evaluate it.
Why do we want to split the original data into training and testing sets? The main reason
is that a key disadvantage of decision trees is that they can overfit. If the full data
set is used to train the model, it can produce a decision tree that fits the training
data extremely well; however, a model that fits the training data that closely is biased
and does not perform well on future data. Such a decision tree does not generalize. To
work around this overfitting issue, we use only the training set to train the model. We
are asked to split the heart disease data set into training and validation sets using an
80%/20% split.
What is the use of training and testing sets when creating a random forest model? Random
forest is a very popular machine learning algorithm for making predictions on
classification and regression problems. The critical feature of the random forest is
that the algorithm uses not just one but multiple learners to obtain better predictive
performance. We fit multiple decision trees on the training data, and the resulting
model is then used to make predictions on the testing set. How are those predictions
made? For classification problems, where we predict a binary variable (for example, yes
or no), the random forest applies a majority rule across its decision trees. Each tree
is grown on a different sample of the training data, so the ensemble captures the data
better than any single tree would. Using the training and testing sets correctly is
extremely important in creating a random forest model.
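The majority-rule idea described above can be sketched in a few lines of Python (the
per-tree votes below are hypothetical, used only to illustrate the voting step):

```python
# Majority-rule voting: each tree in the forest predicts a class for a
# patient, and the forest's prediction is the most common vote.
from collections import Counter

def majority_vote(votes):
    # votes: one predicted class (0 or 1) per tree
    return Counter(votes).most_common(1)[0][0]

tree_votes = [1, 0, 1, 1, 0]      # five hypothetical trees voting on one patient
print(majority_vote(tree_votes))  # -> 1 (three of five trees voted 1)
```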
We are asked to graph the mean squared error against the number of trees for a random
forest regression model for maximum heart rate achieved (thalach) using age (age), sex
(sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement
(chol), resting electrocardiographic measurement (restecg), exercise-induced angina
(exang), slope of peak exercise (slope), and number of major vessels (ca).
We are interested in the point at which the error levels off, particularly on the
testing set side. Both errors drop at first, and the training set error usually keeps
dropping until it is very close to zero. The testing error behaves similarly at first:
as the number of trees grows, it also decreases. In our case there is noise in the
plotted graph that obscures the overall picture, but we can still see that the error
decreases as the number of trees grows. The training error will keep dropping, while the
testing error drops at first and then, after some number of trees, starts moving upwards
again. That inflection point is where the model begins overfitting the training data
set. Beyond that point, additional trees are not needed: even though the training error
keeps dropping, the testing error starts moving upwards. The model is doing an excellent
job on the training set, but with more trees it will start doing a worse job on the
testing set. That is what we call overfitting the training set; past that point, adding
trees will not improve the model. We stop at the number of trees that gives the minimum
testing error; that is the criterion. Finding the error and the number of trees that
best fit the data set will help us produce an effective regression model.
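A minimal sketch of this sweep, assuming Python with scikit-learn and synthetic
regression data standing in for the heart disease predictors of maximum heart rate
(the report does not specify its tooling):

```python
# Record testing-set MSE for random forests of increasing size, then look
# for the point where the testing error stops improving (the "elbow").
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=317, n_features=9, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

test_mse = {}
for n_trees in range(1, 31):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    test_mse[n_trees] = mean_squared_error(y_test, rf.predict(X_test))

# Plotting test_mse against n_trees reproduces the graph described above;
# here we just print the values to inspect where the curve flattens.
for n_trees, mse in sorted(test_mse.items()):
    print(n_trees, round(mse, 2))
```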
What is the optimal number of trees for the random forest model? We are looking for the
inflection point at which the model begins overfitting the training data set. Beyond
that point, additional trees are not needed because even though the training error keeps
dropping, the testing error starts moving upwards. Looking at the graph, we can see that
after 16 trees the training error stays near zero. Another critical factor is
optimization; more trees take more processing power, so we do not want to go beyond 16
trees, since the near-zero training error will not change significantly. For this
reason, 16 trees would be the ideal number.
Evaluating Utility of Model
RMSE is given by the formula:

RMSE = sqrt( (1/n) * Σ (y_i - ŷ_i)² )

where y_i is the observed value and ŷ_i is the predicted value.
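A minimal Python sketch of this formula (the observed and predicted values below are
hypothetical, used only to show the computation):

```python
# RMSE: the square root of the mean squared residual between observed
# and predicted values.
import math

def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical maximum-heart-rate values: observed vs predicted.
print(round(rmse([150, 160, 170], [148, 165, 168]), 4))  # -> 3.3166
```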
What is the root mean squared error for the training set?
RMSE is the standard deviation of the residuals, and it can be used with a tree
regressor to evaluate the model's performance (Chan, 2020). To see how accurately our
model predicts, we take the root mean square of the error between the observed values
and the predicted values. It measures the difference between what the model predicts and
what is observed, and it serves as a reference for how accurate the model is. In our
case, the training RMSE is 10.2138. The closer the predicted and observed values are to
each other, the smaller the RMSE. On average, our model's predictions of maximum heart
rate differ from the observed training values by about 10.2 beats per minute, which is
not bad for this model.
Moreover, small residuals also correspond to a strong linear relationship between the
predicted and observed values. We develop the model on the training data; however, the
model is meant to be used on new, real-world data, so we also want to know the error
between the predicted values and the testing values.
What is the root mean squared error for the testing set?
Again, RMSE is the standard deviation of the residuals and evaluates the model's
performance (Chan, 2020). For the testing set, the RMSE is 18.052: on average, the
model's predictions of maximum heart rate differ from the observed testing values by
about 18 beats per minute. This is noticeably larger than the training RMSE of 10.2138,
which is expected, since the model was fit to the training data. The testing error is
the better indicator of how the model will perform in real-world applications.
When plotting the root mean squared error versus the number of trees, the training and
testing errors never come close to each other, which means our model's testing RMSE is
the better estimate of the actual RMSE.
7. Conclusion
When analyzing a large amount of data, it is essential to find relationships between the
different variables in the data set.
Which of the two logistic regression models would you choose to predict heart disease?
We used a confusion matrix to evaluate the performance of each model:
Model #1 - First Logistic Regression Model
Relation between Sensitivity and Specificity
Accuracy = .6930. This is the probability of correctly predicting both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = .7696, the ability of the test to correctly identify
patients with heart disease.
Specificity = TN / (TN + FP) = .6978, the ability of the test to correctly identify
patients without heart disease.
Model #2 - Second Logistic Regression Model
Relation between Sensitivity and Specificity
Accuracy = . This is the probability of correctly predicting both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = .8363, the ability of the test to correctly identify
patients with heart disease.
Specificity = TN / (TN + FP) = .7976, the ability of the test to correctly identify
patients without heart disease.
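Both metrics can be computed from confusion-matrix counts. As a minimal Python sketch
(the counts below are hypothetical, not the report's actual matrices):

```python
# Sensitivity (true-positive rate) and specificity (true-negative rate)
# from hypothetical confusion-matrix counts.
def sensitivity_specificity(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)  # share of diseased patients correctly flagged
    specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=46, tn=29, fp=7, fn=9)
print(round(sens, 4), round(spec, 4))  # -> 0.8364 0.8056
```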
Looking at accuracy, sensitivity, and specificity, we can see that overall the second
model better predicts heart disease. Based on the results, I would recommend model 2 for
predicting heart disease. Based on the relationship between sensitivity and specificity,
the model correctly identifies heart disease 83.63% of the time, which makes it
extremely effective at distinguishing between 1 and 0 for the binary values.
The random forest classification model is better than the logistic regression model. The
critical feature of the random forest is that the algorithm uses not just one but
multiple learners to obtain better predictive performance. The more random samples and
the more learners in the algorithm that generates the forest, the better the accuracy of
our model. The more samples we can take, the better the outcome we will get from them.
This matters in real-world analysis, when there are thousands of records to compare:
multiple forest trees can analyze numerous samples and reach a better outcome. A single
decision tree given a large number of samples must be evaluated against the training and
testing sets on its own, which takes longer, and it can have a larger error because
outliers in the data may affect the overall model.
8. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).