1. Introduction
As analysts, we have access to a unique set of historical data on the relationship between customer characteristics and whether a customer is likely to default on their credit. The dataset contains the variables the company needs to estimate the risk that a customer will default, and the variables used determine what type of regression model should be developed for the specific scenario. In this case study, we develop such a regression model.
Data analysis can help an organization prepare for different scenarios, much as it does for predicting housing prices in a given area. Good datasets are essential for developing regression models, and with so much data being collected around us, establishing validity is more critical than ever. The model developed here can help assess whether an individual is likely to default, and this kind of regression modeling is broadly important wherever such case studies arise.
2. Data Preparation
The data come from credit_card_default.csv, which contains variables related to the risk that a customer will default on their credit. The dataset consists of 8 columns and roughly 601 rows: each column is one variable, and each row is one customer's historical record. The variables are age, sex, education, marriage, assets, missed_payment, credit_utilize, and default. With this many variables, a data analyst has plenty to work with; our goal is to find the regression model that best fits this real-world dataset for predicting whether a customer defaults on their credit.
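As a sketch of the expected layout, the rows and values below are invented for illustration only (the column names come from the description above; nothing here is taken from the actual credit_card_default.csv):

```python
import pandas as pd

# Toy rows mirroring the credit_card_default.csv layout described above
# (the real file has 8 columns and roughly 601 rows); values are invented.
df = pd.DataFrame({
    "age": [34, 51, 27],
    "sex": ["M", "F", "F"],
    "education": [2, 3, 1],
    "marriage": [1, 0, 0],
    "assets": [1, 3, 0],
    "missed_payment": [1, 0, 1],
    "credit_utilize": [0.62, 0.18, 0.87],
    "default": [1, 0, 1],
})

print(df.shape)              # (rows, columns) of the toy frame
print(df["default"].mean())  # sample default rate in the toy data
```

Inspecting the shape and the default rate this way is a quick sanity check before fitting any model.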
3. First Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for defaulting on credit, using credit utilization and missed payments as independent variables.

The general form of this logistic regression model is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$

where
y = 1 for defaulting on credit and 0 for not defaulting
$x_1$ = credit utilization
$x_2$ = dummy variable for a missed payment

Substituting the estimated coefficients $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ for the beta terms gives the fitted model in the same form.
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
3.) The left side of the equation above is the natural log of the odds, so this can be written as:

$$\ln(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \qquad \text{odds} = \frac{p}{1-p}$$

where the odds are the odds of defaulting, i.e., of default = 1.
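The equivalence between the probability form and the log-odds form can be checked numerically; the coefficients below are illustrative only, not the fitted values:

```python
import math

def logistic(z):
    """Probability of default, p = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """Natural log of the odds, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical linear predictor z = b0 + b1*x1 + b2*x2 (coefficients invented)
b0, b1, b2 = 1.0, -10.0, 1.5
x1, x2 = 0.32, 1          # 32% utilization, payment missed (dummy = 1)
z = b0 + b1 * x1 + b2 * x2
p = logistic(z)

# The two transformations are inverses: the log-odds of p recovers z
print(abs(log_odds(p) - z) < 1e-9)
```

This is exactly why the model is "linear in the beta terms" on the log-odds scale.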
What do the following terms, from the general form of the model above, mean in terms of an
individual defaulting on their credit?
a. p
We know that

$$p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$

where Y is the binary response variable for defaulting on credit. Thus P(Y = 1) is the probability produced by the logistic regression model, expressed in terms of the beta values and the independent variables. In other words, p is the probability that the event (defaulting) occurs, and its value depends on the beta coefficients of the regression model.
b. odds
Since p is the probability that the event occurs, 1 − p is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not:

$$\text{odds} = \frac{p}{1-p}$$

Here we take the odds of defaulting, i.e., of default = 1. Because p is determined by the beta terms, the log of these odds is linear in the coefficients of the regression model.
Interpret the estimated coefficient of credit utilization.
The estimated coefficient $\hat{\beta}_1$ for credit utilization means that, on average, the change in the log odds of defaulting is $\hat{\beta}_1$ for each one-unit increase in credit utilization, holding all other variables constant. This can be expressed in terms of odds by exponentiating the coefficient: holding all other variables constant, the odds of defaulting increase by 36.63 percent for each percentage-point increase in credit utilization.
Similarly, the estimated coefficient $\hat{\beta}_2$ for missed payment means that, on average, the change in the log odds of defaulting is $\hat{\beta}_2$ for an individual who has missed a payment relative to one who has not, holding all other variables constant. Expressed in terms of odds, the odds of defaulting increase by 1.5 percent for an individual with a recurring missed payment, holding all other variables constant.
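The "expressed in terms of odds" step is just exponentiation of the estimated coefficient; a minimal sketch with a hypothetical coefficient value (not the fitted value from this model):

```python
import math

# Hypothetical estimated coefficient; substitute the model's fitted value
b1 = 0.3122

odds_ratio = math.exp(b1)           # odds multiply by this factor per unit of x1
pct_change = (odds_ratio - 1.0) * 100  # percent change in the odds

print(odds_ratio, pct_change)
```

A positive coefficient gives an odds ratio above 1 (odds increase); a negative one gives a ratio below 1 (odds decrease).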
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability for the dependent variable Y, we need a cutoff point above which a predicted value is treated as 1 (True) and below which it is treated as 0 (False). A confusion matrix can then evaluate the logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
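All of the metrics discussed in this report (accuracy, precision, recall/sensitivity, specificity) come straight from these four cells; a sketch with hypothetical counts (the report's actual cell counts are not reproduced here):

```python
# Hypothetical confusion-matrix counts for illustration only
TP, TN, FP, FN = 101, 82, 7, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # correct / all observations
precision   = TP / (TP + FP)                   # correct positives / predicted positives
recall      = TP / (TP + FN)                   # correct positives / actual positives (sensitivity)
specificity = TN / (TN + FP)                   # correct negatives / actual negatives

print(accuracy, precision, recall, specificity)
```

Swapping in the real TP/TN/FP/FN counts from the fitted model reproduces the values reported below.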
A. Accuracy is the ratio of the number of correct predictions to the total number of observations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Once again, the logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is close to 1, which is exceptionally good when assessing a classification model's performance: the model distinguishes well between the binary values 0 and 1, so we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Our precision is 0.932308, which is close to 1. Since precision near 1 indicates few false positives, the model does an excellent job of labeling the binary values 0 and 1.
C. Recall is the ratio of correct positive predictions to the total actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Our recall is 0.935185, which is close to 1. Since recall near 1 indicates few false negatives, the model again distinguishes the binary values 0 and 1 very well.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?

H0: The model fits the data.
Ha: The model does not fit the data.

X-squared = 49.076
P-value = 0.4298
P-value > α = 0.05

The null hypothesis is not rejected. Since the p-value of 0.4298 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of significance.

Credit Utilization
The 95% confidence interval for the credit utilization slope parameter is (-12.4188, -8.3365). The null hypothesis for the Wald test of the credit utilization parameter is H0: β1 = 0.
Credit_Utilize: Z-value = -9.965, P-value < 2e-16, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that credit utilization is significant in the model, since the p-value is less than the level of significance.
Missed Payment
The 95% confidence interval for the missed payment slope parameter is (0.8235, 2.1548). The null hypothesis for the Wald test of the missed payment parameter is H0: β2 = 0.
Missed_Payment: Z-value = 9.737, P-value < 2e-16, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that missed payment is significant in the model, since the p-value is less than the level of significance.
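The Wald tests above can be sketched in a few lines; the estimate and standard error below are hypothetical stand-ins, not the fitted values from this model:

```python
import math

def wald_test(beta_hat, se):
    """Wald z statistic and two-sided p-value for H0: beta = 0.

    In practice beta_hat and se come from the fitted model's summary;
    the inputs used below are invented for illustration.
    """
    z = beta_hat / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P(|Z| > |z|) for standard normal
    return z, p

z, p = wald_test(beta_hat=1.489, se=0.153)  # hypothetical estimate and SE
print(z, p, p < 0.05)
```

A large |z| pushes the p-value toward 0, which is why both reported terms are significant at the 5% level.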
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
A ROC curve measures the performance of a classifier across threshold settings by plotting the true positive rate against the false positive rate. The area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the curve is a useful way to visualize that separation: the closer it hugs the upper-left corner (the more it resembles a step up to 1), the better the model separates the two classes, with a perfect classifier distinguishing 1 from 0 exactly.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: the model completely separates Y = 1 from Y = 0. A model with AUC = 0.5 has no separation between Y = 1 and Y = 0 and has a hard time distinguishing the two. A model with AUC = 0 has the worst possible separability: it is unusable, predicting 0 where the truth is 1 and 1 where the truth is 0.
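AUC can be computed directly from its probabilistic definition: the chance that a randomly chosen positive case is scored above a randomly chosen negative case (ties count one half). A self-contained sketch on toy labels and scores:

```python
def auc(labels, scores):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                # toy actual outcomes
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]    # toy predicted probabilities
print(auc(labels, scores))                 # 8/9, i.e. about 0.889
```

An AUC of 1 means every positive outscores every negative; 0.5 means the rankings are no better than chance.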
Relation between Sensitivity and Specificity
Accuracy is the probability of correctly classifying both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.91049, the ability of the test to correctly identify individuals who default on their credit.

Specificity = TN / (TN + FP) = 0.89130, the ability of the test to correctly identify individuals who do not default on their credit.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. For an individual with a credit utilization of 32% who has missed payments in the past three months, the predicted probability of defaulting on credit is 0.75, which yields a predicted label of 1.
For an individual with a credit utilization of 32% who has not missed payments in the past three months, the predicted probability of defaulting is 0.4035, which yields a predicted label of 0.
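The predictions above come from plugging the individual's values into the fitted model; a sketch with hypothetical coefficients (the fitted values are not reproduced here):

```python
import math

def predict_default(b, x):
    """Predicted probability of default from a logistic model.

    b = (b0, b1, b2): intercept and coefficients (hypothetical stand-ins below);
    x = (x1, x2): credit utilization and missed-payment dummy.
    """
    z = b[0] + b[1] * x[0] + b[2] * x[1]
    return math.exp(z) / (1.0 + math.exp(z))

b = (2.0, -10.4, 1.5)                        # hypothetical fitted coefficients
p_missed    = predict_default(b, (0.32, 1))  # 32% utilization, missed payments
p_no_missed = predict_default(b, (0.32, 0))  # 32% utilization, no missed payments

label_missed = 1 if p_missed >= 0.5 else 0   # 0.5 cutoff for the predicted label
print(p_missed, p_no_missed, label_missed)
```

The 0.5 cutoff is one common choice; the threshold can be tuned using the ROC curve discussed above.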
4. Second Logistic Regression Model
Reporting Results
1.) The general form of a logistic regression model for defaulting on credit, using credit utilization, assets, and education as independent variables.

The general form of this logistic regression model is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$

where
y = 1 for defaulting on credit and 0 for not defaulting
$x_1$ = credit utilization
$x_2$, $x_3$, $x_4$ = dummy variables for assets
$x_5$, $x_6$ = dummy variables for education

Substituting the estimated coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_6$ for the beta terms gives the fitted model in the same form.
2.) The general form of this logistic regression model can be converted to a model that is linear in the beta terms:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$$
3.) The left side of the equation above is the natural log of the odds, so this can be written as:

$$\ln(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6, \qquad \text{odds} = \frac{p}{1-p}$$

where the odds are the odds of defaulting, i.e., of default = 1.
What do the following terms, from the general form of the model above, mean in terms of
an individual defaulting on their credit?
a. p
We know that

$$p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$

where Y is the binary response variable for defaulting on credit. Thus P(Y = 1) is the probability produced by the logistic regression model, expressed in terms of the beta values and the independent variables. In other words, p is the probability that the event (defaulting) occurs, and its value depends on the beta coefficients of the regression model.
b. odds
Since p is the probability that the event occurs, 1 − p is the probability that the event does not occur. The odds of a binary event are the ratio of the probability that the event occurs to the probability that it does not:

$$\text{odds} = \frac{p}{1-p}$$

Here we take the odds of defaulting, i.e., of default = 1. Because p is determined by the beta terms, the log of these odds is linear in the coefficients of the regression model.
Interpret the estimated coefficient of credit utilization.
The estimated coefficient $\hat{\beta}_1$ for credit utilization ($x_1$) means that, on average, the change in the log odds of defaulting is $\hat{\beta}_1$ for each one-unit increase in credit utilization, holding all other variables constant.
The estimated coefficient for assets1 ($x_2$) is -0.1523, which means that, on average, the log odds of defaulting change by -0.1523 for individuals in the assets1 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for assets2 ($x_3$) is -2.9292, which means that, on average, the log odds of defaulting change by -2.9292 for individuals in the assets2 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for assets3 ($x_4$) is -2.9292, which means that, on average, the log odds of defaulting change by -2.9292 for individuals in the assets3 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for education2 ($x_5$) is -1.9324, which means that, on average, the log odds of defaulting change by -1.9324 for individuals in the education2 category relative to the baseline category, holding all other variables constant.
The estimated coefficient for education3 ($x_6$) is -4.7503, which means that, on average, the log odds of defaulting change by -4.7503 for individuals in the education3 category relative to the baseline category, holding all other variables constant.
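Exponentiating the estimated dummy coefficients reported above gives the multiplicative change in the odds of defaulting for each category relative to its baseline:

```python
import math

# Estimated dummy-variable coefficients reported above (model 2)
coefs = {
    "assets1":    -0.1523,
    "assets2":    -2.9292,
    "assets3":    -2.9292,
    "education2": -1.9324,
    "education3": -4.7503,
}

# exp(beta) is the factor by which the odds of defaulting change for the
# dummy's category relative to the baseline, holding other variables fixed.
for name, b in coefs.items():
    print(name, math.exp(b))
```

All of these odds ratios are below 1, meaning each category is associated with lower odds of defaulting than its baseline.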
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1 (Chan, 2020). Predicting the category of a categorical response is known as classification. Because the model outputs a probability for the dependent variable Y, we need a cutoff point above which a predicted value is treated as 1 (True) and below which it is treated as 0 (False). A confusion matrix can then evaluate the logistic regression model's performance on the dataset used to create the model. The table's rows represent the actual outcomes, while the columns represent the predicted outcomes (Chan, 2020).
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of observations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Once again, the logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is close to 1, which is exceptionally good when assessing a classification model's performance: the model distinguishes well between the binary values 0 and 1, so we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Our precision is 0.9688, which is close to 1. Since precision near 1 indicates few false positives, the model does a good job of labeling the binary values 0 and 1.
C. Recall is the ratio of correct positive predictions to the total actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Our recall is 0.95779, which is close to 1. Since recall near 1 indicates few false negatives, the model again distinguishes the binary values 0 and 1 very well.
Evaluating Model Significance
Using the Hosmer-Lemeshow goodness of fit test, is the model appropriate at a 5% level of significance?

H0: The model fits the data.
Ha: The model does not fit the data.

X-squared = 18.423
P-value = 1.0
P-value > α = 0.05

The null hypothesis is not rejected. Since the p-value of 1.0 is greater than the level of significance α = 0.05, we do not reject the null hypothesis and conclude that the regression model fits the data.
Which terms are significant in the model based on Wald's test? Use a 5% level of significance.

Credit Utilization
The 95% confidence interval for the credit utilization slope parameter is (-10.6516, -4.8412). The null hypothesis for the Wald test of the credit utilization parameter is H0: β1 = 0.
Credit_Utilize: Z-value = 6.954, P-value = 3.56e-12, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that credit utilization is significant in the model, since the p-value is less than the level of significance.
Assets 1
The 95% confidence interval for the assets1 slope parameter is (-1.2937, 0.9891). The null hypothesis for the Wald test of the assets1 parameter is H0: β2 = 0.
Assets 1: Z-value = -0.262, P-value = 0.79365, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we do not reject the null hypothesis at α = 0.05 and conclude that assets1 is not significant in the model, since the p-value is greater than the level of significance.
Assets 2
The 95% confidence interval for the assets2 slope parameter is (-4.4293, 0.9891). The null hypothesis for the Wald test of the assets2 parameter is H0: β3 = 0.
Assets 2: Z-value = -3.827, P-value = 0.00013, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that assets2 is significant in the model, since the p-value is less than the level of significance.
Assets 3
The 95% confidence interval for the assets3 slope parameter is (-5.1277, -2.3809). The null hypothesis for the Wald test of the assets3 parameter is H0: β4 = 0.
Assets 3: Z-value = -5.358, P-value = 8.43e-08, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that assets3 is significant in the model, since the p-value is less than the level of significance.
Education 2
The 95% confidence interval for the education2 slope parameter is (-3.1085, -0.7563). The null hypothesis for the Wald test of the education2 parameter is H0: β5 = 0.
Education 2: Z-value = -3.220, P-value = 0.00128, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that education2 is significant in the model, since the p-value is less than the level of significance.
Education 3
The 95% confidence interval for the education3 slope parameter is (-6.2954, -3.2052). The null hypothesis for the Wald test of the education3 parameter is H0: β6 = 0.
Education 3: Z-value = -6.026, P-value = 1.68e-09, level of significance α = 0.05.
Based on the 95% confidence interval and the p-value, we reject the null hypothesis at α = 0.05 and conclude that education3 is significant in the model, since the p-value is less than the level of significance.
Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
A ROC curve measures the performance of a classifier across threshold settings by plotting the true positive rate against the false positive rate. The area under the curve indicates how well the model distinguishes between Y = 0 and Y = 1, and the curve is a useful way to visualize that separation: the closer it hugs the upper-left corner, the better the model separates the two classes, with a perfect classifier distinguishing 1 from 0 exactly.
What is the value of AUC? Interpret what this value represents.
A perfect model has AUC = 1: the model completely separates Y = 1 from Y = 0. A model with AUC = 0.5 has no separation between Y = 1 and Y = 0 and has a hard time distinguishing the two. A model with AUC = 0 has the worst possible separability: it is unusable, predicting 0 where the truth is 1 and 1 where the truth is 0.
Relation between Sensitivity and Specificity
Accuracy is the probability of correctly classifying both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.959876, the ability of the test to correctly identify individuals who default on their credit.

Specificity = TN / (TN + FP) = 0.96376, the ability of the test to correctly identify individuals who do not default on their credit.
Making Predictions Using Model
The logistic regression model predicts the probability that y = 1. For an individual with a credit utilization of 43% who owns a car and a house and has attained a high school diploma, the predicted probability of defaulting on credit is 0.9929, which yields a predicted label of 1.
For an individual with a credit utilization of 43% who owns a car and a house and has attained a postgraduate degree, the predicted probability of defaulting is 0.5478, which yields a predicted label of 0.
5. Conclusion
When analyzing a large amount of data, it is essential to find the relationships between the different variables in the dataset. The two models developed in this assignment are very similar in both their values and their graphs.
The general form of the first logistic regression model for defaulting on credit, using credit utilization and missed payments as independent variables, is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}$$
For this model, sensitivity = TP / (TP + FN) = 0.91049 (the ability of the test to correctly identify those who default) and specificity = TN / (TN + FP) = 0.89130 (the ability of the test to correctly identify those who do not default).
The general form of the second logistic regression model for defaulting on credit, using credit utilization, assets, and education as independent variables, is:

$$p = P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6}}$$
For this model, sensitivity = TP / (TP + FN) = 0.959876 and specificity = TN / (TN + FP) = 0.96376.
Comparing the two models, the accuracies differ by about 0.06, the sensitivities by about 0.049, and the specificities by about 0.072.
However, model 2 did not make a huge difference in the overall ability to calculate the risk that a customer will default on their credit. More variables should be analyzed and evaluated to determine which ones produce the regression model that best fits the situation. The overall objective is to develop a regression model that is customized to the scenario: the key is to understand what a dataset contains and what type of regression model to use.
6. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).