1. Introduction
As analysts, we have access to a unique set of historical data on the relationship between customer characteristics and whether a customer is likely to default on their credit. The dataset contains the variables the company needs to estimate the risk that a customer will default, and the variables used determine what type of regression model should be developed for the specific scenario. In this case study, we develop such a regression model.
Data analysis can help an organization prepare for different scenarios, much as it does for predicting housing prices in a given area. Good datasets are essential for developing regression models, and with so much data being collected around us, establishing validity is more critical than ever. The model developed here can help assess whether an individual is likely to default, and this kind of regression modeling is broadly important wherever such case studies arise.
2. Data Preparation
The data come from credit_card_default.csv, which contains variables related to the risk that a customer will default on their credit. The dataset consists of 8 columns and roughly 601 rows: each column is one variable, and each row is one customer's historical record. The variables are age, sex, education, marriage, assets, missed_payment, credit_utilize, and default. With this many variables, a data analyst has plenty to work with; our goal is to find the regression model that best fits this real-world dataset for predicting whether a customer defaults on their credit.
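As a sketch of the expected layout, the rows and values below are invented for illustration only (the column names come from the description above; nothing here is taken from the actual credit_card_default.csv):

```python
import pandas as pd

# Toy rows mirroring the credit_card_default.csv layout described above
# (the real file has 8 columns and roughly 601 rows); values are invented.
df = pd.DataFrame({
    "age": [34, 51, 27],
    "sex": ["M", "F", "F"],
    "education": [2, 3, 1],
    "marriage": [1, 0, 0],
    "assets": [1, 3, 0],
    "missed_payment": [1, 0, 1],
    "credit_utilize": [0.62, 0.18, 0.87],
    "default": [1, 0, 1],
})

print(df.shape)              # (rows, columns) of the toy frame
print(df["default"].mean())  # sample default rate in the toy data
```

Inspecting the shape and the default rate this way is a quick sanity check before fitting any model.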
3. First Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for defaulting on credit, using credit utilization and missed payments as independent variables.

The general form of this logistic regression model is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$

where
y = 1 for defaulting on credit and 0 for not defaulting
$x_1$ = credit utilization
$x_2$ = dummy variable for a missed payment

Substituting the estimated coefficients $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ for the beta terms gives the fitted model in the same form.
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
3.) The left side of the equation above is the natural log of the odds, so this can be written as:

$$\ln(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad \text{odds} = \frac{p}{1-p}$$

where the odds are the odds of defaulting, i.e., of default = 1.
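The equivalence between the probability form and the log-odds form can be checked numerically; the coefficients below are illustrative only, not the fitted values:

```python
import math

def logistic(z):
    """Probability of default, p = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """Natural log of the odds, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical linear predictor z = b0 + b1*x1 + b2*x2 (coefficients invented)
b0, b1, b2 = 1.0, -10.0, 1.5
x1, x2 = 0.32, 1          # 32% utilization, payment missed (dummy = 1)
z = b0 + b1 * x1 + b2 * x2
p = logistic(z)

# The two transformations are inverses: the log-odds of p recovers z
print(abs(log_odds(p) - z) < 1e-9)
```

This is exactly why the model is "linear in the beta terms" on the log-odds scale.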
What do the following terms, from the general form of the model above, mean in terms of an
individual defaulting on their credit?
a. p
We know that

$$p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$

where Y is the binary response variable for defaulting on credit. Thus P(Y = 1) is the probability produced by the logistic regression model, expressed in terms of the beta values and the independent variables. In other words, p is the probability that the event (defaulting) occurs, and its value depends on the beta coefficients of the regression model.
b. odds
Since p is the probability that the event occurs, 1 − p is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not:

$$\text{odds} = \frac{p}{1-p}$$

Here we take the odds of defaulting, i.e., of default = 1. Because p is determined by the beta terms, the log of these odds is linear in the coefficients of the regression model.
Interpret the estimated coefficient of credit utilization.
The estimated coefficient $\hat{\beta}_1$ for credit utilization means that, on average, the change in the log odds of defaulting is $\hat{\beta}_1$ for each one-unit increase in credit utilization, holding all other variables constant. This can be expressed in terms of odds by exponentiating the coefficient: holding all other variables constant, the odds of defaulting increase by 36.63 percent for each percentage-point increase in credit utilization.
Similarly, the estimated coefficient $\hat{\beta}_2$ for missed payment means that, on average, the change in the log odds of defaulting is $\hat{\beta}_2$ for an individual who has missed a payment relative to one who has not, holding all other variables constant. Expressed in terms of odds, the odds of defaulting increase by 1.5 percent for an individual with a recurring missed payment, holding all other variables constant.
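The "expressed in terms of odds" step is just exponentiation of the estimated coefficient; a minimal sketch with a hypothetical coefficient value (not the fitted value from this model):

```python
import math

# Hypothetical estimated coefficient; substitute the model's fitted value
b1 = 0.3122

odds_ratio = math.exp(b1)           # odds multiply by this factor per unit of x1
pct_change = (odds_ratio - 1.0) * 100  # percent change in the odds

print(odds_ratio, pct_change)
```

A positive coefficient gives an odds ratio above 1 (odds increase); a negative one gives a ratio below 1 (odds decrease).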
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability for the dependent variable Y, we need a cutoff point above which a predicted value is treated as 1 (True) and below which it is treated as 0 (False). A confusion matrix can then evaluate the logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
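All of the metrics discussed in this report (accuracy, precision, recall/sensitivity, specificity) come straight from these four cells; a sketch with hypothetical counts (the report's actual cell counts are not reproduced here):

```python
# Hypothetical confusion-matrix counts for illustration only
TP, TN, FP, FN = 101, 82, 7, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # correct / all observations
precision   = TP / (TP + FP)                   # correct positives / predicted positives
recall      = TP / (TP + FN)                   # correct positives / actual positives (sensitivity)
specificity = TN / (TN + FP)                   # correct negatives / actual negatives

print(accuracy, precision, recall, specificity)
```

Swapping in the real TP/TN/FP/FN counts from the fitted model reproduces the values reported below.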
A. Accuracy is the ratio of the number of correct predictions to the total number of observations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Once again, the logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is close to 1, which is exceptionally good when assessing a classification model's performance: the model distinguishes well between the binary values 0 and 1, so we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Our precision is 0.932308, which is close to 1. Since precision near 1 indicates few false positives, the model does an excellent job of labeling the binary values 0 and 1.
C. Recall is the ratio of correct positive predictions to the total actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Our recall is 0.935185, which is close to 1. Since recall near 1 indicates few false negatives, the model again distinguishes the binary values 0 and 1 very well.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?

H0: The model fits the data.
Ha: The model does not fit the data.

X-squared = 49.076
P-value = 0.4298
P-value > α = 0.05

The null hypothesis is not rejected. Since the p-value of 0.4298 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of significance.

Credit Utilization
The 95% confidence interval for the credit utilization slope parameter is (-12.4188, -8.3365). The null hypothesis for the Wald test of the credit utilization parameter is H0: β1 = 0.
Credit_Utilize: Z-value = -9.965, P-value < 2e-16, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that credit utilization is significant in the model, since the p-value is less than the level of significance.
Missed Payment
The 95% confidence interval for the missed payment slope parameter is (0.8235, 2.1548). The null hypothesis for the Wald test of the missed payment parameter is H0: β2 = 0.
Missed_Payment: Z-value = 9.737, P-value < 2e-16, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that missed payment is significant in the model, since the p-value is less than the level of significance.
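The Wald tests above can be sketched in a few lines; the estimate and standard error below are hypothetical stand-ins, not the fitted values from this model:

```python
import math

def wald_test(beta_hat, se):
    """Wald z statistic and two-sided p-value for H0: beta = 0.

    In practice beta_hat and se come from the fitted model's summary;
    the inputs used below are invented for illustration.
    """
    z = beta_hat / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P(|Z| > |z|) for standard normal
    return z, p

z, p = wald_test(beta_hat=1.489, se=0.153)  # hypothetical estimate and SE
print(z, p, p < 0.05)
```

A large |z| pushes the p-value toward 0, which is why both reported terms are significant at the 5% level.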
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
A ROC curve measures the performance of a classifier across threshold settings by plotting the true positive rate against the false positive rate. The area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the curve is a useful way to visualize that separation: the closer it hugs the upper-left corner (the more it resembles a step up to 1), the better the model separates the two classes, with a perfect classifier distinguishing 1 from 0 exactly.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: the model completely separates Y = 1 from Y = 0. A model with AUC = 0.5 has no separation between Y = 1 and Y = 0 and has a hard time distinguishing the two. A model with AUC = 0 has the worst possible separability: it is unusable, predicting 0 where the truth is 1 and 1 where the truth is 0.
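AUC can be computed directly from its probabilistic definition: the chance that a randomly chosen positive case is scored above a randomly chosen negative case (ties count one half). A self-contained sketch on toy labels and scores:

```python
def auc(labels, scores):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                # toy actual outcomes
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]    # toy predicted probabilities
print(auc(labels, scores))                 # 8/9, i.e. about 0.889
```

An AUC of 1 means every positive outscores every negative; 0.5 means the rankings are no better than chance.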
Relation between Sensitivity and Specificity
Accuracy is the probability of correctly classifying both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.91049, the ability of the test to correctly identify individuals who default on their credit.

Specificity = TN / (TN + FP) = 0.89130, the ability of the test to correctly identify individuals who do not default on their credit.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. For an individual with a credit utilization of 32% who has missed payments in the past three months, the predicted probability of defaulting on credit is 0.75, which yields a predicted label of 1.
For an individual with a credit utilization of 32% who has not missed payments in the past three months, the predicted probability of defaulting is 0.4035, which yields a predicted label of 0.
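The predictions above come from plugging the individual's values into the fitted model; a sketch with hypothetical coefficients (the fitted values are not reproduced here):

```python
import math

def predict_default(b, x):
    """Predicted probability of default from a logistic model.

    b = (b0, b1, b2): intercept and coefficients (hypothetical stand-ins below);
    x = (x1, x2): credit utilization and missed-payment dummy.
    """
    z = b[0] + b[1] * x[0] + b[2] * x[1]
    return math.exp(z) / (1.0 + math.exp(z))

b = (2.0, -10.4, 1.5)                        # hypothetical fitted coefficients
p_missed    = predict_default(b, (0.32, 1))  # 32% utilization, missed payments
p_no_missed = predict_default(b, (0.32, 0))  # 32% utilization, no missed payments

label_missed = 1 if p_missed >= 0.5 else 0   # 0.5 cutoff for the predicted label
print(p_missed, p_no_missed, label_missed)
```

The 0.5 cutoff is one common choice; the threshold can be tuned using the ROC curve discussed above.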
4. Second Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for defaulting on credit, using credit utilization, assets, and education as independent variables.

The general form of this logistic regression model is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$

where
y = 1 for defaulting on credit and 0 for not defaulting
$x_1$ = credit utilization
$x_2$, $x_3$, $x_4$ = dummy variables for assets
$x_5$, $x_6$ = dummy variables for education

Substituting the estimated coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_6$ for the beta terms gives the fitted model in the same form.
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$$
3.) The left side of the equation above is the natural log of the odds, so this can be written as:

$$\ln(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6, \qquad \text{odds} = \frac{p}{1-p}$$

where the odds are the odds of defaulting, i.e., of default = 1.
What do the following terms, from the general form of the model above, mean in terms of
an individual defaulting on their credit?
a. p
We know that

$$p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$

where Y is the binary response variable for defaulting on credit. Thus P(Y = 1) is the probability produced by the logistic regression model, expressed in terms of the beta values and the independent variables. In other words, p is the probability that the event (defaulting) occurs, and its value depends on the beta coefficients of the regression model.
b. odds
Since p is the probability that the event occurs, 1 − p is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not:

$$\text{odds} = \frac{p}{1-p}$$

Here we take the odds of defaulting, i.e., of default = 1. Because p is determined by the beta terms, the log of these odds is linear in the coefficients of the regression model.
Interpret the estimated coefficient of credit utilization.
The estimated coefficient $\hat{\beta}_1$ for credit utilization ($x_1$) means that, on average, the change in the log odds of defaulting is $\hat{\beta}_1$ for each one-unit increase in credit utilization, holding all other variables constant.
The estimated coefficient for assets1 ($x_2$) is -0.1523, which means that, on average, the log odds of defaulting change by -0.1523 for individuals in the assets1 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for assets2 ($x_3$) is -2.9292, which means that, on average, the log odds of defaulting change by -2.9292 for individuals in the assets2 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for assets3 ($x_4$) is -2.9292, which means that, on average, the log odds of defaulting change by -2.9292 for individuals in the assets3 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for education2 ($x_5$) is -1.9324, which means that, on average, the log odds of defaulting change by -1.9324 for individuals in the education2 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for education3 ($x_6$) is -4.7503, which means that, on average, the log odds of defaulting change by -4.7503 for individuals in the education3 category relative to the baseline category, holding all other variables constant.
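Exponentiating the estimated dummy coefficients reported above gives the multiplicative change in the odds of defaulting for each category relative to its baseline:

```python
import math

# Estimated dummy-variable coefficients reported above (model 2)
coefs = {
    "assets1":    -0.1523,
    "assets2":    -2.9292,
    "assets3":    -2.9292,
    "education2": -1.9324,
    "education3": -4.7503,
}

# exp(beta) is the factor by which the odds of defaulting change for the
# dummy's category relative to the baseline, holding other variables fixed.
for name, b in coefs.items():
    print(name, math.exp(b))
```

All of these odds ratios are below 1, meaning each category is associated with lower odds of defaulting than its baseline.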
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability for the dependent variable Y, we need a cutoff point above which a predicted value is treated as 1 (True) and below which it is treated as 0 (False). A confusion matrix can then evaluate the logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of observations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Once again, the logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is close to 1, which is exceptionally good when assessing a classification model's performance: the model distinguishes well between the binary values 0 and 1, so we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Our precision is 0.9688, which is close to 1. Since precision near 1 indicates few false positives, the model does a good job of labeling the binary values 0 and 1.
C. Recall is the ratio of correct positive predictions to the total actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Our recall is 0.95779, which is close to 1. Since recall near 1 indicates few false negatives, the model again distinguishes the binary values 0 and 1 very well.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?

H0: The model fits the data.
Ha: The model does not fit the data.

X-squared = 18.423
P-value = 1.0
P-value > α = 0.05

The null hypothesis is not rejected. Since the p-value of 1.0 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of significance.

Credit Utilization
The 95% confidence interval for the credit utilization slope parameter is (-10.6516, -4.8412). The null hypothesis for the Wald test of the credit utilization parameter is H0: β1 = 0.
Credit_Utilize: Z-value = 6.954, P-value = 3.56e-12, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that credit utilization is significant in the model, since the p-value is less than the level of significance.
Assets 1
The 95% confidence interval for the assets1 slope parameter is (-1.2937, 0.9891). The null hypothesis for the Wald test of the assets1 parameter is H0: β2 = 0.
Assets 1: Z-value = -0.262, P-value = 0.79365, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we do not reject the null hypothesis at α = 0.05 and conclude that assets1 is not significant in the model, since the p-value is greater than the level of significance.
Assets 2
The 95% confidence interval for the assets2 slope parameter is (-4.4293, 0.9891). The null hypothesis for the Wald test of the assets2 parameter is H0: β3 = 0.
Assets 2: Z-value = -3.827, P-value = 0.00013, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that assets2 is significant in the model, since the p-value is less than the level of significance.
Assets 3
The 95% confidence interval for the assets3 slope parameter is (-5.1277, -2.3809). The null hypothesis for the Wald test of the assets3 parameter is H0: β4 = 0.
Assets 3: Z-value = -5.358, P-value = 8.43e-08, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that assets3 is significant in the model, since the p-value is less than the level of significance.
Education 2
The 95% confidence interval for the education2 slope parameter is (-3.1085, -0.7563). The null hypothesis for the Wald test of the education2 parameter is H0: β5 = 0.
Education 2: Z-value = -3.220, P-value = 0.00128, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that education2 is significant in the model, since the p-value is less than the level of significance.
Education 3
The 95% confidence interval for the education3 slope parameter is (-6.2954, -3.2052). The null hypothesis for the Wald test of the education3 parameter is H0: β6 = 0.
Education 3: Z-value = -6.026, P-value = 1.68e-09, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that education3 is significant in the model, since the p-value is less than the level of significance.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
A ROC curve measures the performance of a classifier across threshold settings by plotting the true positive rate against the false positive rate. The area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the curve is a useful way to visualize that separation: the closer it hugs the upper-left corner, the better the model separates the two classes, with a perfect classifier distinguishing 1 from 0 exactly.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: the model completely separates Y = 1 from Y = 0. A model with AUC = 0.5 has no separation between Y = 1 and Y = 0 and has a hard time distinguishing the two. A model with AUC = 0 has the worst possible separability: it is unusable, predicting 0 where the truth is 1 and 1 where the truth is 0.
Relation between Sensitivity and Specificity
Accuracy is the probability of correctly classifying both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.959876, the ability of the test to correctly identify individuals who default on their credit.

Specificity = TN / (TN + FP) = 0.96376, the ability of the test to correctly identify individuals who do not default on their credit.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. For an individual with a credit utilization of 43% who owns a car and a house and has attained a high school diploma, the predicted probability of defaulting on credit is 0.9929, which yields a predicted label of 1.
For an individual with a credit utilization of 43% who owns a car and a house and has attained a postgraduate degree, the predicted probability of defaulting is 0.5478, which yields a predicted label of 0.
5. Conclusion
When analyzing a large amount of data, it is essential to find the relationships between the different variables in the dataset. The two models developed in this assignment are very similar in both their values and their graphs.
The general form of the first logistic regression model for defaulting on credit, using credit utilization and missed payments as independent variables, is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$
For this model, sensitivity = TP / (TP + FN) = 0.91049 (the ability of the test to correctly identify those who default) and specificity = TN / (TN + FP) = 0.89130 (the ability of the test to correctly identify those who do not default).
The general form of the second logistic regression model for defaulting on credit, using credit utilization, assets, and education as independent variables, is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$
For this model, sensitivity = TP / (TP + FN) = 0.959876 and specificity = TN / (TN + FP) = 0.96376.
Comparing the two models, the accuracies differ by about 0.06, the sensitivities by about 0.049, and the specificities by about 0.072.
However, model 2 did not make a huge difference in the overall ability to calculate the risk that a customer will default on their credit. More variables should be analyzed and evaluated to determine which ones produce the regression model that best fits the situation. The overall objective is to develop a regression model that is customized to the scenario: the key is to understand what a dataset contains and what type of regression model to use.
6. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).