1. Introduction
As analysts, we have been given a historical data set on heart disease. The data set contains many variables that can be used to build different models, and the variables chosen determine what type of regression model is appropriate for a given scenario. In this case study we develop logistic regression models that predict whether or not a person is at risk for heart disease. Such models are essential to public health and to health-industry analysts, because they may surface risks that are not obvious to human doctors and can help prepare patients for a crisis before it occurs. Data sets are essential to developing regression models, and with so much data being collected around us today, establishing the validity of a model is more critical than ever.
2. Data Preparation
The data set provided is heart_disease.csv. It contains many variables related to the risk of heart-related issues, organized as 13 columns and 303 rows: each column is a particular variable, and each row holds one individual's recorded values for those variables. This gives a data analyst many potentially important variables from which to develop a regression model.
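A minimal sketch of this preparation step in Python, assuming heart_disease.csv sits in the working directory (the path is an assumption). To keep the sketch runnable without the real file, a tiny stand-in frame with a few of the 13 columns is built inline; the values are illustrative, not the data set's.

```python
import pandas as pd

# Stand-in for pd.read_csv("heart_disease.csv"); values are illustrative.
df = pd.DataFrame({
    "age":      [63, 37, 41, 56],
    "trestbps": [145, 130, 130, 120],
    "thalach":  [150, 187, 172, 178],
    "target":   [1, 1, 1, 0],
})

print(df.shape)                      # (rows, columns)
print(df.dtypes)                     # each column's type
print(df["target"].value_counts())   # class balance of the response
```

With the real file, the first statement would simply be `df = pd.read_csv("heart_disease.csv")` and `df.shape` should report 303 rows and 13 columns.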
3. Model #1 - First Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for heart disease (target) using the variables age (age), resting blood pressure (trestbps), and maximum heart rate achieved (thalach).

The general form of this logistic regression model is:

P(y = 1) = π = e^(β0 + β1x1 + β2x2 + β3x3) / (1 + e^(β0 + β1x1 + β2x2 + β3x3))

where
y = heart disease (target)
x1 = age
x2 = resting blood pressure (trestbps)
x3 = maximum heart rate achieved (thalach)
Substituting the estimated coefficients from the regression output gives the fitted model:

π̂ = e^(-3.575 - 0.009425x1 - 0.016022x2 + 0.04269x3) / (1 + e^(-3.575 - 0.009425x1 - 0.016022x2 + 0.04269x3))
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

ln(π / (1 - π)) = β0 + β1x1 + β2x2 + β3x3

3.) The left side of the equation above is the natural log of the odds, so this can be written as:

ln(odds) = β0 + β1x1 + β2x2 + β3x3

where the odds are the odds of having heart disease, that is, of target = 1.
From the general form of the model above, what does π mean in terms of heart disease (target)?
a.
π = P(Y = 1) = e^(β0 + β1x1 + β2x2 + β3x3) / (1 + e^(β0 + β1x1 + β2x2 + β3x3)), where Y is the binary response variable indicating heart disease. Then π is the proportion, the probability from the logistic regression model expressed in terms of the beta values and the independent variables. π is the probability that the event occurs, and it depends on the beta values of the regression model. We can therefore say that π is the probability that an individual will get heart disease: π = P(target = 1).
b.
Since π is the probability that the event occurs, 1 - π is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not occur:

odds = π / (1 - π)

These odds depend on the beta coefficients of the regression model. We can therefore say that π/(1 - π) is the odds of an individual getting heart disease, where the event is target = 1.
Interpret the estimated coefficients.
The estimated coefficient for age is -0.009425, which means that, on average, the log odds of heart disease change by -0.009425 for each one-year increase in age, given that all other variables are held constant. Expressed in terms of odds, e^(-0.009425) = 0.9906, so the odds of getting heart disease decrease by about 0.94 percent for each additional year of age, all other variables held constant.
The estimated coefficient for resting blood pressure (trestbps) is -0.016022, which means that, on average, the log odds of heart disease change by -0.016022 for each one-unit increase in trestbps, all other variables held constant. In terms of odds, e^(-0.016022) = 0.9841, so the odds of getting heart disease decrease by about 1.59 percent for each one-unit increase in resting blood pressure.
The estimated coefficient for maximum heart rate achieved (thalach) is 0.04269, which means that, on average, the log odds of heart disease change by 0.04269 for each one-unit increase in thalach, all other variables held constant. In terms of odds, e^(0.04269) = 1.0436, so the odds of getting heart disease increase by about 4.36 percent for each one-unit increase in maximum heart rate achieved.
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability, we need a cutoff point at which a predicted value is labeled true or false. A confusion matrix can evaluate a logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix

             Prediction = 0    Prediction = 1
Actual = 0   TN                FP
Actual = 1   FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Once again, a logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. The accuracy computed from our confusion matrix is close to 1. When assessing the classification model's performance, an accuracy near one is exceptionally good: it means the model distinguishes the binary values well, correctly labeling what is a 0 and what is a 1. We should therefore expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our precision is also close to 1. A precision near one means that when the model predicts a 1, it is usually correct.
C. Recall is the ratio of correct positive predictions to the total actual positives.

Recall = TP / (TP + FN)

Our recall is also close to 1. A recall near one means the model finds most of the actual positive cases.
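The three metrics above can be computed directly from confusion-matrix counts. The counts below are illustrative placeholders, not the report's actual values (those did not survive extraction).

```python
# Illustrative confusion-matrix counts (placeholders, not the report's values).
TP, TN, FP, FN = 90, 80, 20, 13

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all observations
precision = TP / (TP + FP)                   # correct positives / predicted positives
recall    = TP / (TP + FN)                   # correct positives / actual positives

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```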
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?
H0: The model fits the data.
Ha: The model does not fit the data.
The Hosmer-Lemeshow test groups the observations by predicted probability and compares the observed and expected numbers of events in each group; under the null hypothesis, the model fits the data.
X-squared = 41.978
P-value = 0.7168
P-value > α = 0.05
The null hypothesis is not rejected. Since the p-value of 0.7168 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
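The grouping-and-comparison step just described can be sketched directly. This is a hand-rolled Hosmer-Lemeshow statistic on simulated inputs (the function name and data are ours, not the report's); with the real model you would pass its fitted probabilities and the observed target.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic: g groups by predicted probability."""
    order = np.argsort(p)
    y, p = y[order], p[order]
    groups = np.array_split(np.arange(len(y)), g)  # g near-equal groups
    stat = 0.0
    for idx in groups:
        obs = y[idx].sum()   # observed events in the group
        exp = p[idx].sum()   # expected events in the group
        n_k = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_k))
    pval = chi2.sf(stat, g - 2)  # reference distribution: chi-square, df = g - 2
    return stat, pval

# Simulated stand-in: y is generated from p, so the "model" is well calibrated.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 500)
y = rng.binomial(1, p)
stat, pval = hosmer_lemeshow(y, p)
print(stat, pval)
```

A large p-value, as in the report's 0.7168, means the observed and expected group counts agree and we do not reject the null hypothesis that the model fits.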
Which terms are significant in the model based on Wald's test? Use a 5% level of
significance.
Age:
The confidence interval for the age slope parameter is (-0.0409, 0.0221).
The null hypothesis for the Wald test of the age parameter is H0: β1 = 0.
Age: Z-value = -0.586, P-value = 0.5578, level of significance α = 0.05.
Based on the 95% confidence interval for age (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age is not significant in the model, since the p-value is greater than the level of significance of 0.05.
TRESTBPS:
The confidence interval for the trestbps slope parameter is (-0.0312, -0.0008).
The null hypothesis for the Wald test of the trestbps parameter is H0: β2 = 0.
trestbps: Z-value = -2.063, P-value = 0.0392, level of significance α = 0.05.
Based on the 95% confidence interval for trestbps and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that trestbps is significant in the model, since the p-value is less than the level of significance of 0.05.
THALACH:
The confidence interval for the thalach slope parameter is (0.0291, 0.0563).
The null hypothesis for the Wald test of the thalach parameter is H0: β3 = 0.
thalach: Z-value = 6.144, P-value = 8.06E-10, level of significance α = 0.05.
Based on the 95% confidence interval for thalach and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that thalach is significant in the model, since the p-value is less than the level of significance of 0.05.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and
explain what it illustrates.
A ROC curve measures the performance of a classifier at various threshold settings: it plots the true positive rate against the false positive rate as the cutoff varies. The area under the curve (AUC) indicates how well the model distinguishes between Y = 0 and Y = 1. The closer the curve hugs the top-left corner (the larger the area under it), the better the model separates the two classes when predicting the binary response.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: there is complete separability between Y = 1 and Y = 0.
A model with AUC = 0.5 has no separation between Y = 1 and Y = 0; it cannot distinguish 0 from 1 any better than chance.
A model with AUC = 0 has the worst possible separability: it labels every 0 as a 1 and every 1 as a 0, so the classifier is unusable as-is.
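The curve and its area can be obtained with scikit-learn. The scores below are simulated (positives tend to score higher than negatives); with the real model, you would pass its fitted probabilities for the heart-disease data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Simulated labels and informative-but-noisy scores (stand-ins, not the report's).
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.5, 300)
scores = y * 0.8 + rng.normal(0, 0.5, 300)

fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y, scores)
print(f"AUC = {auc:.3f}")  # 0.5 = chance-level separation, 1.0 = perfect
```

Plotting `fpr` against `tpr` reproduces the ROC curve discussed above.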
Relation between Sensitivity and Specificity
Accuracy = (TP + TN) / (TP + TN + FP + FN); this is the probability of correctly classifying both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = 0.7696, the ability of the test to correctly identify individuals who have heart disease.
Specificity = TN / (TN + FP) = 0.60144, the ability of the test to correctly identify individuals who do not have heart disease.
Making Predictions Using Model
The logistic regression model predicts the probability of y=1. The probability of an
individual who is 50 years old, has a resting blood pressure of 122, and has a maximum heart
rate of 140 having heart disease percent chance is 0.4939, which would be a label = 0.
The logistic regression model predicts the probability of y=1. The probability of an
individual who is 50 years old, has a resting blood pressure of 140, and has a maximum heart
rate of 170 with heart disease is 0.7248, a label = 1.
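These predictions simply plug the new values into the fitted equation. In the sketch below, the slope values are the estimates discussed earlier in this section, while the intercept b0 = -3.575 is back-solved from the reported probabilities, so treat it as a reconstruction rather than the report's printed output.

```python
import math

def predict_prob(age, trestbps, thalach,
                 b0=-3.575, b1=-0.009425, b2=-0.016022, b3=0.04269):
    """pi = 1 / (1 + e^-(b0 + b1*age + b2*trestbps + b3*thalach)).
    Intercept b0 is a reconstruction; slopes are the report's estimates."""
    logit = b0 + b1 * age + b2 * trestbps + b3 * thalach
    return 1 / (1 + math.exp(-logit))

p1 = predict_prob(50, 122, 140)   # first individual
p2 = predict_prob(50, 140, 170)   # second individual
label1 = 1 if p1 >= 0.5 else 0    # 0.5 cutoff turns a probability into a label
label2 = 1 if p2 >= 0.5 else 0
print(round(p1, 4), label1)       # close to the report's 0.4939, label 0
print(round(p2, 4), label2)       # close to the report's 0.7248, label 1
```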
4. Model #2 - Second Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for heart disease (target) using the variables maximum heart rate achieved (thalach), age of the individual (age), sex of the individual (sex), exercise-induced angina (exang), and type of chest pain (cp), including the quadratic term for age and the interaction term between age and the maximum heart rate achieved.

The general form of this logistic regression model is:

P(y = 1) = π = e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9) / (1 + e^(β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9))

where
y = heart disease (target)
x1 = thalach
x2 = age
x3 = sex
x4 = exang
x5, x6, x7 = dummy variables for cp
x8 = age^2 (quadratic term for age)
x9 = age × thalach (interaction term between age and thalach)
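The dummy, quadratic, and interaction terms above can all be specified in a statsmodels formula: `C(cp)` dummy-codes chest pain, `I(age**2)` adds the quadratic term, and `age:thalach` adds the interaction. The data are simulated stand-ins; only the formula structure mirrors the model described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in columns (ranges are assumptions, not the real data).
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    "thalach": rng.integers(71, 202, n),
    "age": rng.integers(29, 78, n),
    "sex": rng.integers(0, 2, n),
    "exang": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),     # 4 chest-pain types -> 3 dummies
})
df["target"] = rng.binomial(1, 1 / (1 + np.exp(-(0.03 * df["thalach"] - 4))))

fit = smf.logit(
    "target ~ thalach + age + sex + exang + C(cp) + I(age**2) + age:thalach",
    data=df,
).fit(disp=0)
print(fit.params.index.tolist())  # intercept + 9 terms = 10 coefficients
```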
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

ln(π / (1 - π)) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9

3.) The left side of the equation above is the natural log of the odds, so this can be written as:

ln(odds) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9

where the odds are the odds of having heart disease, that is, of target = 1.
As with the first model, we use a cutoff point to turn the predicted probabilities into 0/1 labels and evaluate the model's performance with a confusion matrix on the dataset used to create the model (Chan, 2020).
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy of the second model is close to 1, which again means the model does a good job of distinguishing the binary values, correctly labeling what is a 0 and what is a 1.

B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our precision is likewise close to 1: when the model predicts a 1, it is usually correct.

C. Recall is the ratio of correct positive predictions to the total actual positives.

Recall = TP / (TP + FN)

Our recall is also close to 1: the model finds most of the actual positive cases.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?
H0: The model fits the data.
Ha: The model does not fit the data.
As before, the test compares the observed and expected numbers of events within groups of similar predicted probability; under the null hypothesis, the model fits the data.
X-squared = 60.596
P-value = 0.1048
P-value > α = 0.05
The null hypothesis is not rejected. Since the p-value of 0.1048 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of
significance.
THALACH:
The confidence interval for the thalach slope parameter is (0.0273, 0.2507).
The null hypothesis for the Wald test of the thalach parameter is H0: β1 = 0.
thalach: Z-value = 2.438, P-value = 0.014760, level of significance α = 0.05.
Based on the 95% confidence interval for thalach and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that thalach is significant in the model, since the p-value is less than the level of significance of 0.05.
AGE:
The confidence interval for the age slope parameter is (-0.04051, 0.8148).
The null hypothesis for the Wald test of the age parameter is H0: β2 = 0.
age: Z-value = 0.658, P-value = 0.510325, level of significance α = 0.05.
Based on the 95% confidence interval for age (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age is not significant in the model, since the p-value is greater than the level of significance of 0.05.
SEX:
The confidence interval for the sex slope parameter is (-2.4130, -1.0059).
The null hypothesis for the Wald test of the sex parameter is H0: β3 = 0.
sex: Z-value = -4.762, P-value = 1.91E-06, level of significance α = 0.05.
Based on the 95% confidence interval for sex and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that sex is significant in the model, since the p-value is less than the level of significance of 0.05.
EXANG:
The confidence interval for the exang slope parameter is (-1.6377, -0.2320).
The null hypothesis for the Wald test of the exang parameter is H0: β4 = 0.
exang: Z-value = -2.607, P-value = 0.009133, level of significance α = 0.05.
Based on the 95% confidence interval for exang and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that exang is significant in the model, since the p-value is less than the level of significance of 0.05.
CP1:
The confidence interval for the first cp dummy's slope parameter is (0.8209, 2.7106).
The null hypothesis for the Wald test of the parameter is H0: β5 = 0.
cp1: Z-value = 3.663, P-value = 0.000249, level of significance α = 0.05.
Based on the 95% confidence interval for cp1 and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that cp1 is significant in the model, since the p-value is less than the level of significance of 0.05.
CP2:
The confidence interval for the second cp dummy's slope parameter is (1.0662, 2.5732).
The null hypothesis for the Wald test of the parameter is H0: β6 = 0.
cp2: Z-value = 4.734, P-value = 2.21E-06, level of significance α = 0.05.
Based on the 95% confidence interval for cp2 and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that cp2 is significant in the model, since the p-value is less than the level of significance of 0.05.
AGE^2:
The confidence interval for the age^2 slope parameter is (-0.0035, 0.0045).
The null hypothesis for the Wald test of the age^2 parameter is H0: β8 = 0.
age^2: Z-value = 0.240, P-value = 0.810599, level of significance α = 0.05.
Based on the 95% confidence interval for age^2 (which contains zero) and the p-value, the best conclusion is: do not reject the null hypothesis at α = 0.05 and conclude that age^2 is not significant in the model, since the p-value is greater than the level of significance of 0.05.
THALACH:AGE:
The confidence interval for the thalach:age slope parameter is (-0.0040, -0.0001).
The null hypothesis for the Wald test of the interaction parameter is H0: β9 = 0.
thalach:age: Z-value = -2.017, P-value = 0.043666, level of significance α = 0.05.
Based on the 95% confidence interval for thalach:age and the p-value, the best conclusion is: reject the null hypothesis at α = 0.05 and conclude that the thalach:age interaction is significant in the model, since the p-value is less than the level of significance of 0.05.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and
explain what it illustrates.
A ROC curve again measures the classifier's performance at various threshold settings; the area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the closer the curve is to the top-left corner, the better the separation.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1 (complete separability between Y = 1 and Y = 0). A model with AUC = 0.5 has no separation and cannot distinguish 0 from 1 any better than chance. A model with AUC = 0 has the worst separability, labeling every 0 as a 1 and every 1 as a 0.
Relation between Sensitivity and Specificity
Accuracy = (TP + TN) / (TP + TN + FP + FN); this is the probability of correctly classifying both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = 0.8363, the ability of the test to correctly identify individuals who have heart disease.
Specificity = TN / (TN + FP) = 0.7976, the ability of the test to correctly identify individuals who do not have heart disease.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. The predicted probability of heart disease for a 30-year-old male who has a maximum heart rate of 145, experiences exercise-induced angina, and does not experience chest pain related to typical angina, atypical angina, or non-anginal pain is 0.2654, which would be labeled 0.
For a 30-year-old male who has a maximum heart rate of 145, does not experience exercise-induced angina, and experiences typical angina, the predicted probability of heart disease is 0.8502, which would be labeled 1.
5. Random Forest Classification Model
Reporting Results
What is the training set and testing set? The training set is the data used to train the decision tree model; once the model is trained, we use the testing set to evaluate it on data it has not seen. Why do we want to split the original data into training and testing sets? The main reason is that decision trees can overfit: a model can fit the training data so well that it becomes biased toward it and does not perform well on future data. To get around this overfitting issue and check that the model generalizes, we train on the training set and evaluate on the testing set. We are asked to split the heart disease data set into training and validation sets using an 80%/20% split.
In the original heart_disease.csv data set there are 303 observations. This gives 242 rows for the training set and 61 rows for the validation set.
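A sketch of the 80/20 split using scikit-learn. The arrays are stand-ins for the 303 observations; with 303 rows, a 20% test fraction rounds up to 61 validation rows and leaves 242 training rows, matching the counts above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(303).reshape(-1, 1)   # stand-in for the 303 observations
y = np.zeros(303)                    # stand-in response

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)  # random_state fixes the shuffle
print(len(X_train), len(X_test))             # 242 and 61
```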
What is the use of training and testing sets when creating a random forest model?
Random forest is a popular machine learning algorithm for making predictions on classification and regression problems. The critical idea behind a random forest is that the algorithm uses not just one but multiple learners to obtain better predictive performance. We fit multiple decision trees on the training data and combine them into one model, which we then use to make predictions on the testing set. How are the predictions made? For a classification problem with a binary variable (for example, yes or no), the random forest takes a majority vote across its decision trees. Using separate training and testing sets is extremely important when creating a random forest model, because the testing set tells us how the combined model performs on unseen data.
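The majority-vote idea above can be sketched with scikit-learn's RandomForestClassifier, here with five trees as in the next subsection. The data are simulated stand-ins, not the heart-disease columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated stand-in data with a learnable signal in the first two features.
rng = np.random.default_rng(5)
X = rng.normal(size=(303, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 303) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Five trees; each predicts, and the forest returns the majority class.
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_tr, y_tr)
print("train accuracy:", rf.score(X_tr, y_tr))
print("test accuracy:",  rf.score(X_te, y_te))
```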
We create the same random forest classification model using five decision trees. Notice the slight improvement in true positive and true negative counts.
Confusion Matrix, Training Set Using 5 Trees
A. Accuracy = (TP + TN) / (TP + TN + FP + FN), the ratio of the number of correct predictions to the total number of observations.
B. Precision = TP / (TP + FP), the ratio of correct positive predictions to the total predicted positives.
C. Recall = TP / (TP + FN), the ratio of correct positive predictions to the total actual positives.
Confusion Matrix, Testing Set Using 5 Trees
A. Accuracy = (TP + TN) / (TP + TN + FP + FN).
B. Precision = TP / (TP + FP).
C. Recall = TP / (TP + FN).
We are asked to graph the training and testing error against the number of trees using a
classification random forest model for the presence of heart disease (target) using variables age
(age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement
(chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), the
slope of peak exercise (slope), and a number of major vessels (ca). Use a maximum of 200 trees.
We are interested in the point at which the error levels off, particularly on the testing side. Both errors drop at first, and the training set error usually keeps dropping until it is very close to zero; as the number of trees grows, the training classification error decreases toward zero. The testing error initially shows a similar decreasing pattern, although in our case there is noise in the plotted graph that obscures the overall picture. We can still see that the testing error decreases as the number of trees grows, then levels off, and after some number of trees it may start moving upward again. That point is where the model begins overfitting the training data: beyond it, even though the training error keeps dropping, the testing error starts moving upward, so adding more trees no longer improves the model. The model is then doing an excellent job classifying the training set but an increasingly poor job on the testing set, which is what we mean by overfitting the training set. We therefore stop at the smallest number of trees that attains the minimum testing error; finding that number of trees is what lets us produce an effective model.
What is the optimal number of trees for the random forest model? We look for the point beyond which more trees no longer help: even though the training error keeps dropping, the testing error stops improving. Looking at the graph, we can see that after 20 trees the training error stays near zero and the testing error has leveled off. Another critical factor is computation; more trees take more processing power, so there is little reason to go beyond 20 trees when the error will not change significantly. For this reason, 20 trees is the ideal number.
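The error-versus-trees curve described above can be produced by refitting the forest with increasing n_estimators and recording the training and testing error each time. Simulated stand-in data again; with the real data, you would plot the two error lists against n_values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(303, 9))                              # stand-in features
y = (X[:, 0] - X[:, 1] + rng.normal(0, 1.0, 303) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

n_values = [1, 5, 10, 20, 50, 100, 200]                    # up to 200 trees
train_err, test_err = [], []
for n in n_values:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    train_err.append(1 - rf.score(X_tr, y_tr))  # training classification error
    test_err.append(1 - rf.score(X_te, y_te))   # testing classification error
print(list(zip(n_values, train_err, test_err)))
```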
Evaluating Utility of Model
Confusion Matrix Training set using 20 trees
A. Accuracy is the ratio of the number of correct predictions to the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Like the logistic regression models, the random forest's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our training accuracy is 0.9958. An accuracy this close to one means the model distinguishes the binary values on the training set almost perfectly.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP)

Our training precision is 0.9924: when the model predicts a 1 on the training set, it is almost always correct.
C. Recall is the ratio of correct positive predictions to the total positive examples.

Recall = TP / (TP + FN) = 1.00

We can see that our recall is 1.00, meaning the model identified every positive example
in the training set.
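The three metrics above can be computed directly from confusion-matrix counts. As a
minimal Python sketch (the counts here are hypothetical, chosen only for illustration,
not taken from the report's actual confusion matrix):

```python
# Accuracy, precision, and recall from confusion-matrix counts.
# The counts passed in below are hypothetical examples.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all
    precision = tp / (tp + fp)                  # correct positives / predicted positives
    recall = tp / (tp + fn)                     # correct positives / actual positives
    return accuracy, precision, recall

acc, prec, rec = classification_metrics(tp=131, tn=107, fp=1, fn=0)
print(round(acc, 4), round(prec, 4), round(rec, 4))  # -> 0.9958 0.9924 1.0
```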
Confusion Matrix Testing set using 20 trees
A. Accuracy is the ratio of the number of correct predictions to the total number of
observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN) = .7541

Once again, the classification model's goal is to predict whether the binary response Y
takes on a value of 0 or 1. Our testing accuracy is .7541, noticeably lower than the
near-perfect training accuracy. The model still distinguishes 0 from 1 reasonably well
on unseen data, but the gap between training and testing performance points to some
overfitting of the training set.
B. Precision is the ratio of correct positive predictions to the total predicted positives.

Precision = TP / (TP + FP) = .7777

Our testing precision is .7777: of the observations the model labels as 1, about 78%
truly are 1. This is solid, though well below the training precision of .9924.
C. Recall is the ratio of correct positive predictions to the total positive examples.

Recall = TP / (TP + FN) = 0.80

Our testing recall is 0.80: the model finds 80% of the true positives in the testing set,
compared with 100% on the training set.
Making a Comparison of the Model using 5 trees vs 20 trees
Confusion Matrix Testing set using 5 trees
Accuracy = .6593
Precision = .6951
Recall = .7143
Confusion Matrix Testing set using 20 trees
Accuracy = .7541
Precision = .7777
Recall = .80
Model using 5 trees vs 20 trees (difference):
Accuracy: .7541 - .6593 = .0948
Precision: .7777 - .6951 = .0826
Recall: .80 - .7143 = .0857
As we can see, when the number of trees increased, accuracy, precision, and recall all
improved. The closer each number is to 1, the better our model performs. However, the
critical point is finding the number of trees beyond which accuracy no longer improves
and the model does not overfit, and 20 trees is a good number to use for this random
forest model.
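As a hedged sketch of this comparison (assuming Python with scikit-learn, which the
report does not name, and a synthetic data set standing in for heart_disease.csv), the
5-tree vs 20-tree evaluation could look like:

```python
# Compare random forests with 5 vs 20 trees on a held-out testing set.
# make_classification is a synthetic stand-in for the heart disease data;
# the real study would load heart_disease.csv instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=317, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

for n_trees in (5, 20):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    print(n_trees,
          round(accuracy_score(y_test, pred), 4),
          round(precision_score(y_test, pred), 4),
          round(recall_score(y_test, pred), 4))
```

With real data the exact scores will differ, but the pattern the report describes is the
usual one: the larger forest tends to score at least as well on all three metrics.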
6. Random Forest Regression Model
Reporting Results
What are the training set and testing set? The training set is the data used to train the
decision tree model; once the model is trained, we use the testing set to evaluate it.
Why do we want to split the original data into training and testing sets? The main reason
is that a key disadvantage of decision trees is that they can overfit. If the full data
set is used to train the model, it can produce a decision tree that fits the training
data extremely well; however, a model that fits the training data that closely is biased
and does not perform well on future data. Such a decision tree does not generalize. To
work around this overfitting issue, we use only the training set to train the model. We
are asked to split the heart disease data set into training and validation sets using an
80%/20% split.
What is the use of training and testing sets when creating a random forest model? Random
forest is a very popular machine learning algorithm for making predictions on
classification and regression problems. The critical feature of the random forest is
that the algorithm uses not just one but multiple learners to obtain better predictive
performance. We fit multiple decision trees on the training data, and the resulting
model is then used to make predictions on the testing set. How are those predictions
made? For classification problems, where we predict a binary variable (for example, yes
or no), the random forest applies a majority rule across its decision trees. Each tree
is grown on a different sample of the training data, so the ensemble captures the data
better than any single tree would. Using the training and testing sets correctly is
extremely important in creating a random forest model.
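The majority-rule idea described above can be sketched in a few lines of Python (the
per-tree votes below are hypothetical, used only to illustrate the voting step):

```python
# Majority-rule voting: each tree in the forest predicts a class for a
# patient, and the forest's prediction is the most common vote.
from collections import Counter

def majority_vote(votes):
    # votes: one predicted class (0 or 1) per tree
    return Counter(votes).most_common(1)[0][0]

tree_votes = [1, 0, 1, 1, 0]      # five hypothetical trees voting on one patient
print(majority_vote(tree_votes))  # -> 1 (three of five trees voted 1)
```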
We are asked to graph the mean squared error against the number of trees for a random
forest regression model for maximum heart rate achieved (thalach) using age (age), sex
(sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement
(chol), resting electrocardiographic measurement (restecg), exercise-induced angina
(exang), slope of peak exercise (slope), and number of major vessels (ca).
We are interested in the point at which the error levels off, particularly on the
testing set side. Both errors drop at first, and the training set error usually keeps
dropping until it is very close to zero. The testing error behaves similarly at first:
as the number of trees grows, it also decreases. In our case there is noise in the
plotted graph that obscures the overall picture, but we can still see that the error
decreases as the number of trees grows. The training error will keep dropping, while the
testing error drops at first and then, after some number of trees, starts moving upwards
again. That inflection point is where the model begins overfitting the training data
set. Beyond that point, additional trees are not needed: even though the training error
keeps dropping, the testing error starts moving upwards. The model is doing an excellent
job on the training set, but with more trees it will start doing a worse job on the
testing set. That is what we call overfitting the training set; past that point, adding
trees will not improve the model. We stop at the number of trees that gives the minimum
testing error; that is the criterion. Finding the error and the number of trees that
best fit the data set will help us produce an effective regression model.
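A minimal sketch of this sweep, assuming Python with scikit-learn and synthetic
regression data standing in for the heart disease predictors of maximum heart rate
(the report does not specify its tooling):

```python
# Record testing-set MSE for random forests of increasing size, then look
# for the point where the testing error stops improving (the "elbow").
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=317, n_features=9, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

test_mse = {}
for n_trees in range(1, 31):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    test_mse[n_trees] = mean_squared_error(y_test, rf.predict(X_test))

# Plotting test_mse against n_trees reproduces the graph described above;
# here we just print the values to inspect where the curve flattens.
for n_trees, mse in sorted(test_mse.items()):
    print(n_trees, round(mse, 2))
```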
What is the optimal number of trees for the random forest model? We are looking for the
inflection point at which the model begins overfitting the training data set. Beyond
that point, additional trees are not needed because even though the training error keeps
dropping, the testing error starts moving upwards. Looking at the graph, we can see that
after 16 trees the training error stays near zero. Another critical factor is
optimization; more trees take more processing power, so we do not want to go beyond 16
trees, since the near-zero training error will not change significantly. For this
reason, 16 trees would be the ideal number.
Evaluating Utility of Model
RMSE is given by the formula:

RMSE = sqrt( (1/n) * Σ (y_i - ŷ_i)² )

where y_i is the observed value and ŷ_i is the predicted value.
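A minimal Python sketch of this formula (the observed and predicted values below are
hypothetical, used only to show the computation):

```python
# RMSE: the square root of the mean squared residual between observed
# and predicted values.
import math

def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical maximum-heart-rate values: observed vs predicted.
print(round(rmse([150, 160, 170], [148, 165, 168]), 4))  # -> 3.3166
```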
What is the root mean squared error for the training set?
RMSE is the standard deviation of the residuals, and it can be used with a tree
regressor to evaluate the model's performance (Chan, 2020). To see how accurately our
model predicts, we take the root mean square of the error between the observed values
and the predicted values. It measures the difference between what the model predicts and
what is observed, and it serves as a reference for how accurate the model is. In our
case, the training RMSE is 10.2138. The closer the predicted and observed values are to
each other, the smaller the RMSE. On average, our model's predictions of maximum heart
rate differ from the observed training values by about 10.2 beats per minute, which is
not bad for this model.
Moreover, small residuals also correspond to a strong linear relationship between the
predicted and observed values. We develop the model on the training data; however, the
model is meant to be used on new, real-world data, so we also want to know the error
between the predicted values and the testing values.
What is the root mean squared error for the testing set?
Again, RMSE is the standard deviation of the residuals and evaluates the model's
performance (Chan, 2020). For the testing set, the RMSE is 18.052: on average, the
model's predictions of maximum heart rate differ from the observed testing values by
about 18 beats per minute. This is noticeably larger than the training RMSE of 10.2138,
which is expected, since the model was fit to the training data. The testing error is
the better indicator of how the model will perform in real-world applications.
When plotting the root mean squared error versus the number of trees, the training and
testing errors never come close to each other, which means our model's testing RMSE is
the better estimate of the actual RMSE.
7. Conclusion
When analyzing a large amount of data, it is essential to find relationships between the
different variables in the data set.
Which of the two logistic regression models would you choose to predict heart disease?
We used a confusion matrix to evaluate the performance of each model:
Model #1 - First Logistic Regression Model
Relation between Sensitivity and Specificity
Accuracy = .6930. This is the probability of correctly predicting both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = .7696, the ability of the test to correctly identify
patients with heart disease.
Specificity = TN / (TN + FP) = .6978, the ability of the test to correctly identify
patients without heart disease.
Model #2 - Second Logistic Regression Model
Relation between Sensitivity and Specificity
Accuracy = . This is the probability of correctly predicting both Y = 1 and Y = 0.
Sensitivity = TP / (TP + FN) = .8363, the ability of the test to correctly identify
patients with heart disease.
Specificity = TN / (TN + FP) = .7976, the ability of the test to correctly identify
patients without heart disease.
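Both metrics can be computed from confusion-matrix counts. As a minimal Python sketch
(the counts below are hypothetical, not the report's actual matrices):

```python
# Sensitivity (true-positive rate) and specificity (true-negative rate)
# from hypothetical confusion-matrix counts.
def sensitivity_specificity(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)  # share of diseased patients correctly flagged
    specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=46, tn=29, fp=7, fn=9)
print(round(sens, 4), round(spec, 4))  # -> 0.8364 0.8056
```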
Looking at accuracy, sensitivity, and specificity, we can see that overall the second
model better predicts heart disease. Based on the results, I would recommend model 2 for
predicting heart disease. Based on the relationship between sensitivity and specificity,
the model correctly identifies heart disease 83.63% of the time, which makes it
extremely effective at distinguishing between 1 and 0 for the binary values.
The random forest classification model is better than the logistic regression model. The
critical feature of the random forest is that the algorithm uses not just one but
multiple learners to obtain better predictive performance. The more random samples and
the more learners in the algorithm that generates the forest, the better the accuracy of
our model. The more samples we can take, the better the outcome we will get from them.
This matters in real-world analysis, when there are thousands of records to compare:
multiple forest trees can analyze numerous samples and reach a better outcome. A single
decision tree given a large number of samples must be evaluated against the training and
testing sets on its own, which takes longer, and it can have a larger error because
outliers in the data may affect the overall model.
8. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).