1. Introduction
We are analysts with access to two unique historical data sets. The first examines the relationships between customer characteristics and whether a customer is likely to default on their credit; the second describes the economy. The credit data set contains many variables needed to calculate the risk that a customer will default on their credit, and the variables available determine what type of regression model should be developed for the scenario. In this case study we develop such models. The economic data set can help us predict when a recession will occur and prepare for it, which is essential for government agencies and anyone, such as a financial analyst, preparing for a crisis. With the credit data set we can analyze financial risk and better prepare for different scenarios. Data sets are essential to developing regression models, and nowadays, with everything around us collecting data, establishing validity is more critical than ever. The credit model can help someone see whether they are likely to default, and regression models like these, used in case studies and elsewhere, are very important in general.
2. Data Preparation
The first data set we were given is credit_card_default.csv. It contains many variables related to the risk that a customer will default on their credit, and it consists of 8 columns and around 601 rows: each column is a particular variable, and each row is a historical observation of a customer against those variables. With this many variables, a data analyst can develop a regression model based on the data.
The variables used to assess default risk are age, sex, education, marriage, assets, missed_payment, credit_utilize, and default. We want to find the model that best fits this real-world data set and predicts whether customers will default on their credit.
The second data set we were given is economic.csv. It contains variables related to wage growth and consists of 6 columns and around 49 rows: each column is a particular variable, and each row is a historical observation of the economy against those variables. We were asked to evaluate wage growth over these historical entries and develop a regression model for predicting it.
The response variable is the wage growth rate; the predictors are inflation, the unemployment rate, whether the economy is in recession, and the GDP growth rate.
3. Classification Decision Tree
Reporting Results
What are the training set and testing set? The training set is the data used to train the decision tree model; once the model is trained, we use the testing set to evaluate it. Why do we want to split the original data into training and testing sets? The main disadvantage of decision trees is that they can overfit: if the whole data set is used for training, we can produce a tree that fits the training data extremely well, but a tree that fits too well is biased toward that data and does not perform well on future data. Such a tree does not generalize. To get around this overfitting issue, we use the training set to fit the model and hold out the rest for validation. We were asked to split the credit card default data set into training and validation sets using a 70%/30% split.
The original credit_card_default.csv data set has 601 rows; the split gives 420 rows for the training set and 180 rows for the validation set.
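The split itself is not shown in the report; a minimal Python sketch with scikit-learn, on synthetic stand-in data since credit_card_default.csv is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 601-row credit_card_default.csv
# (real predictors include missed_payment, credit_utilize, assets).
rng = np.random.default_rng(0)
X = rng.random((601, 3))
y = rng.integers(0, 2, 601)  # default: 0 = no, 1 = yes

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(len(X_train), len(X_val))
```

Note that scikit-learn rounds the test size up, so 601 rows split 70/30 gives 420 training rows and 181 validation rows, one more validation row than the 180 quoted above.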
When constructing a classification decision tree, it is extremely important to know which variables the algorithm should use. We were asked to create a classification decision tree for credit default using missed_payment, credit_utilize, and assets as predictors.
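The report does not state which tool built the tree; a hypothetical scikit-learn version of the fitting step, on synthetic stand-in data with an invented default rule, looks like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split of credit_card_default.csv.
rng = np.random.default_rng(0)
X_train = rng.random((420, 3))                 # missed_payment, credit_utilize, assets
y_train = (X_train[:, 1] > 0.35).astype(int)   # toy rule: default if utilization is high

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))  # → 1.0 on this noiseless toy data
```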
Next we examine the cross-validation errors against the cost-complexity parameter (CP). The main idea is to find the largest CP value at which the cross-validation error is still below the reference line: we stop at the largest value whose error is at or below that line, which is the selection criterion. Finding the cross-validation error and cost complexity that best fit the data set will help us produce an effective model.
Taking these two columns, cross-validation error and cost complexity, we plot them so we can see how the decision on the CP value will be made. The plot shows the cross-validation error on the y-axis against the cost complexity on the x-axis, and from it we select an appropriate model by pruning the tree. The idea of pruning is to give preference to smaller trees over larger ones, which helps avoid overfitting the data; at the same time, we do not want to choose a CP giving so small a tree that the cross-validation error is significant. In the graph we can see a horizontal red line. In general, the largest CP value that achieves a cross-validation error below the red line is chosen to prune the tree. Below the red line we have two candidate values of CP, and we pick the larger one: CP = 0.28.
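The selection rule described above can be sketched in Python: scikit-learn's `ccp_alpha` plays the role of CP, and a tolerance just above the minimum cross-validation error stands in for the red line (this threshold is a simplifying assumption, not necessarily the exact rule used in the report's plot):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that pruning actually matters.
rng = np.random.default_rng(0)
X = rng.random((600, 3))
y = (X[:, 1] + 0.1 * rng.standard_normal(600) > 0.35).astype(int)

# Candidate complexity values (scikit-learn's analogue of rpart's CP table).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Cross-validated error for each alpha; the "red line" is a tolerance
# just above the minimum error.
errors = np.array([
    1 - cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                        X, y, cv=5).mean()
    for a in alphas])
line = errors.min() + 0.01
best_alpha = alphas[errors <= line].max()  # largest alpha below the line
print(round(float(best_alpha), 4))
```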
Using this CP to prune the tree in our model, we repeat step four and obtain a new classification tree diagram, in which row 3 has a CP value of 0.039.
In the plotted classification decision tree, each node shows the predicted class (default or no default), the predicted probability of the event (defaulting), and the percentage of observations in the node.
Node 1:
"Yes" means that the majority of observations in this node are default = Yes, i.e., the individual has defaulted. The value 0.56 is the probability of defaulting in that node, and 100% means that 100% of the observations in the training set fall in that node.
Node 2:
"Yes" means that the majority of observations in this node are default = Yes, i.e., the individual has defaulted. The value 0.88 is the probability of defaulting in that node, and 63% means that 63 percent of the observations in the training set fall in that node.
Node 3:
"Yes" means that most of the observations in this node are default = Yes, which means the individual has defaulted. The value 0.58 is the probability of defaulting in that node, and 19% means that 19 percent of the observations in the training set fall in that node.
The tree summary shows that, on average, a person with credit utilization below 27 percent has a probability of defaulting of 0.033, given that all other variables are constant. On average, a person with credit utilization below 35 percent and no assets has a probability of defaulting of 0.22, given that all other variables are constant. A person with credit utilization below 35 percent who has assets such as a car or house has, on average, a probability of defaulting of 0.83. Finally, a person with credit utilization greater than 35 percent has, on average, a probability of defaulting of 1.00, given that all other variables are constant.
Evaluating Utility of Model
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of
observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.894444
Once again, the classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is 0.894444: the ratio of the number of correct predictions to the total number of observations. When assessing the classification model's performance, an accuracy near one is exceptionally good. It means our model does an excellent job distinguishing the binary values, telling what is a 0 and what is a 1, and we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives.
Precision = TP / (TP + FP) = 0.94382
The classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our precision is 0.94382: the ratio of correct positive predictions to the total predicted positives. When assessing the classification model's performance, a precision near one is exceptionally good; it means our model does an excellent job distinguishing what is a 0 and what is a 1.
C. Recall is the ratio of correct positive predictions to the total positive examples.
Recall = TP / (TP + FN) = 0.95454
The classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our recall is 0.95454: the ratio of correct positive predictions to the total positive examples. When assessing the classification model's performance, a recall near one is exceptionally good; it means our model does an excellent job distinguishing what is a 0 and what is a 1.
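The raw confusion-matrix counts are not printed in the report; the counts below are an assumption, reverse-engineered so that they reproduce the reported precision (0.94382), recall (0.95454), and specificity (0.94565) on the 180-row validation set:

```python
# Assumed counts (not given in the report): TP = 84, TN = 87, FP = 5, FN = 4.
TP, TN, FP, FN = 84, 87, 5, 4

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)

print(round(accuracy, 4), round(precision, 5), round(recall, 5))
# → 0.95 0.94382 0.95455
```

Note that these counts imply an accuracy of 0.95, the figure used in the conclusion.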
Making Predictions Using Model
For an individual who has a credit utilization of 30%, has not missed payments, and owns a car and a house, the predicted label for defaulting on credit is no. Keep in mind the difference from the logistic regression model built on the same data: there we obtained values on a scale between a definite 0 and a definite 1, which then had to be thresholded into a yes or no label. Here the classification tree tells us directly either yes, they will default, or no, they will not.
For an individual who has a credit utilization of 30%, has missed payments, and owns no assets, the predicted label for defaulting on credit is also no.
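Scoring the two individuals can be sketched as follows, again on a synthetic stand-in model; the feature encoding here (including the 0–3 assets code) is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in training data; the toy rule makes utilization decisive.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 420),   # missed_payment: 0/1
    rng.random(420),           # credit_utilize: 0-1
    rng.integers(0, 4, 420),   # assets code: 0 = none ... 3 = car and house
])
y = (X[:, 1] > 0.35).astype(int)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Individual 1: no missed payments, 30% utilization, owns car and house.
# Individual 2: missed payments, 30% utilization, no assets.
people = np.array([[0, 0.30, 3],
                   [1, 0.30, 0]])
print(clf.predict(people))  # → [0 0]: label "no" (no default) for both
```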
4. Regression Decision Tree
Reporting Results
As before, the training set is the data used to train the decision tree model, and once the model is trained, we use the testing set to evaluate it. We split the original data because decision trees can overfit: a tree that fits the training data extremely well is biased toward it and does not perform well on future data. Training on one portion of the data and validating on the rest lets us check how well the tree generalizes. We were asked to split the wage growth data set into training and validation sets using an 80%/20% split.
The original economic.csv data set has 101 rows; the split gives 79 rows for the training set and 20 rows for the validation set.
Again, when constructing a decision tree it is extremely important to know which variables the algorithm should use. We were asked to create a regression decision tree for wage growth using economy, unemployment, and gdp as predictors.
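Fitting the regression tree is not shown in the report; a minimal sketch with scikit-learn's DecisionTreeRegressor on synthetic stand-in data (the wage-growth formula below is invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the economic.csv training split.
rng = np.random.default_rng(0)
n = 79
X = np.column_stack([
    rng.integers(0, 2, n),     # economy: 1 = recession
    rng.uniform(2, 10, n),     # unemployment rate (%)
    rng.uniform(-2, 5, n),     # GDP growth rate (%)
])
y = 10.0 - 0.8 * X[:, 1] + 0.5 * X[:, 2]  # hypothetical wage-growth signal

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.tree_.max_depth)  # → 2
```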
Next we examine the cross-validation errors against the cost-complexity parameter (CP). As before, the main idea is to find the largest CP value at which the cross-validation error is still below the reference line; we stop at the largest value whose error is at or below that line. Finding the cross-validation error and cost complexity that best fit the data set will help us produce an effective model.
Taking the cross-validation error and cost complexity columns, we plot them so we can see how the decision on the CP value will be made: the cross-validation error on the y-axis against the cost complexity on the x-axis. We select an appropriate model by pruning the tree. Pruning essentially gives preference to smaller trees over larger trees, which helps avoid overfitting the data, but we do not want to choose a CP giving so small a tree that the cross-validation error is significant. In the graph we can see a horizontal red line; in general, the largest CP value that achieves a cross-validation error below the red line is chosen to prune the tree. Below the red line we have two values of CP, and we pick the larger one: CP = 0.35.
Using this CP to prune the tree in our model, we repeat step four and obtain a new regression tree diagram, in which row 3 has a CP value of 0.035.
In the plotted regression decision tree, each node shows the predicted wage growth and the percentage of observations in the node.
Node 1:
7.1 means that the predicted wage growth in that node is 7.1 percent, and 100% means that 100% of the observations in the training set fall in that node.
Node 2:
4.4 means that the predicted wage growth in that node is 4.4 percent, and 41% means that 41% of the observations in the training set fall in that node.
Node 3:
8.9 means that the predicted wage growth in that node is 8.9 percent, and 59% means that 59% of the observations in the training set fall in that node.
The tree summary shows that, on average, when unemployment is >= 5.6 percent, wage growth is 2.6 percent, given that all other variables are constant. When unemployment is not >= 5.6 percent, average wage growth is 5.8 percent. When unemployment is >= 2.3 percent, average wage growth is 7.8 percent, and when unemployment is not >= 2.3 percent, average wage growth is 9.6 percent, given that all other variables are constant.
Evaluating Utility of Model
RMSE is given by the formula:

RMSE = sqrt( (1/n) Σ (y_i − ŷ_i)² )

where y_i are the observed values, ŷ_i the predicted values, and n the number of observations.
RMSE is the standard deviation of the residuals, and it can be used with a decision tree regressor to evaluate the model's performance (Chan, 2020). To see how accurate our model is in predicting values, we take the root mean square of the error between the test values and the predicted values; it is the difference between what is predicted and what is observed, expressed in the units of the response, and it is a reference for how accurate the model is. In our case, the RMSE is 0.8386. The closer the predicted and observed values are to each other, the closer the RMSE is to 0, so on average our model's wage growth predictions are off by only about 0.84 percentage points, which is not bad. The model is suitable for predicting wage growth.
Moreover, residuals that are small and close to each other also correspond to a strong relationship between the predicted and observed value sets. We build the model from the training data and use the test data to measure the error between the tested and the predicted values; the resulting model can then be used in real-world applications. A strong relationship between the predicted and observed values gives an RMSE close to 0.
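As a concrete illustration of the RMSE formula, a small Python sketch with made-up observed and predicted wage-growth values (the real validation-set values are not reproduced in the report):

```python
import numpy as np

# Hypothetical observed vs. predicted wage growth (percent) on a validation set.
observed  = np.array([7.0, 4.5, 8.9, 5.8, 2.6])
predicted = np.array([7.8, 4.4, 8.1, 5.0, 3.4])

# RMSE = sqrt( mean of squared residuals ), in the units of the response.
rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(round(float(rmse), 4))  # → 0.7169
```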
Making Predictions Using Model
Given our model, we want to predict wage growth if the economy is not in recession,
unemployment is at 3.4 percent, and the GDP growth rate is 3.5 percent. Placing these
parameters in our model, we get that the wage growth given this scenario would be
7.924 percent.
Given our model, we want to predict wage growth if the economy is in recession,
unemployment at 7.4 percent, and the GDP growth rate at 1.5 percent. Placing these parameters
in our model, we get that the wage growth given this scenario would be 7.924 percent.
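Scoring the two scenarios can be sketched on a synthetic stand-in model; the fitted values it returns are illustrative, not the report's 7.924:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in; the real tree was fit to economy, unemployment, gdp.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 79),    # economy: 1 = recession
    rng.uniform(2, 10, 79),    # unemployment rate (%)
    rng.uniform(-2, 5, 79),    # GDP growth rate (%)
])
y = 10.0 - 0.8 * X[:, 1] + 0.5 * X[:, 2]  # invented wage-growth signal
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Scenario 1: no recession, unemployment 3.4%, GDP growth 3.5%.
# Scenario 2: recession, unemployment 7.4%, GDP growth 1.5%.
scenarios = np.array([[0, 3.4, 3.5],
                      [1, 7.4, 1.5]])
pred = reg.predict(scenarios)
print(pred[0] > pred[1])  # lower unemployment should predict higher wage growth
```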
5. Conclusion
When analyzing a huge amount of data, it is essential to find relationships between the different variables in the data sets. The two models developed in this homework assignment are different: one deals with wage growth and the other with credit default. The classification decision tree for credit default uses missed payment, credit utilization, and assets as predictors, and we used a confusion matrix to evaluate the performance of the model:
Relation between Sensitivity and Specificity

Accuracy = 0.95; this is the probability of correctly predicting both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.95454, the ability of the test to detect a credit default.

Specificity = TN / (TN + FP) = 0.94565, the ability of the test to correctly identify those who do not default.
Based on the relationship between sensitivity and specificity, we can see that the model's accuracy for predicting credit default is 95%, which is extremely good at distinguishing between 1 and 0 in the binary response.
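A quick check of these two figures, using assumed confusion-matrix counts inferred from the reported values (the actual counts are not printed in the report):

```python
# Assumed counts inferred from the reported sensitivity (.95454)
# and specificity (.94565) on the 180-row validation set.
TP, FN = 84, 4
TN, FP = 87, 5

sensitivity = TP / (TP + FN)   # true-positive rate (= recall)
specificity = TN / (TN + FP)   # true-negative rate

print(round(sensitivity, 5), round(specificity, 5))
# → 0.95455 0.94565
```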
Then we were asked to create a regression decision tree for wage growth using economy, unemployment, and gdp as predictors. To see how accurate that model is in predicting values, we used the root mean square error between the test values and the predicted values; in our case, the RMSE is 0.8386.
Looking at two different types of models, one dealing with classification probabilities and the other with regression values, we were able to use various measures to determine each model's accuracy. The credit default model is 95% accurate, far better than our wage growth model. However, I think more variables need to be analyzed and evaluated to determine which variables will produce the model that best fits the situation. The whole objective is to develop a model that is custom to the scenario; the key is to understand what kind of data set you have and what type of model to use.
6. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).