1. Introduction
We are analysts with access to two unique historical data sets. The first examines the relationships between customer characteristics and whether a customer is likely to default on their credit; the second describes the economy. The credit data set contains many variables needed to calculate the risk that a customer will default on their credit, and the variables available determine what type of regression model should be developed for the scenario. In this case study we develop such models. The economic data set can help us predict when a recession will occur and prepare for it, which is essential for government agencies and anyone, such as a financial analyst, preparing for a crisis. With the credit data set we can analyze financial risk and better prepare for different scenarios. Data sets are essential to developing regression models, and nowadays, with everything around us collecting data, establishing validity is more critical than ever. The credit model can help someone see whether they are likely to default, and regression models like these, used in case studies and elsewhere, are very important in general.
2. Data Preparation
The first data set we were given is credit_card_default.csv. It contains many variables related to the risk that a customer will default on their credit, and it consists of 8 columns and around 601 rows: each column is a particular variable, and each row is a historical observation of a customer against those variables. With this many variables, a data analyst can develop a regression model based on the data.
The variables used to assess default risk are age, sex, education, marriage, assets, missed_payment, credit_utilize, and default. We want to find the model that best fits this real-world data set and predicts whether customers will default on their credit.
The second data set we were given is economic.csv. It contains variables related to wage growth and consists of 6 columns and around 49 rows: each column is a particular variable, and each row is a historical observation of the economy against those variables. We were asked to evaluate wage growth over these historical entries and develop a regression model for predicting it.
The response variable is the wage growth rate; the predictors are inflation, the unemployment rate, whether the economy is in recession, and the GDP growth rate.
3. Classification Decision Tree
Reporting Results
What are the training set and testing set? The training set is the data used to train the decision tree model; once the model is trained, we use the testing set to evaluate it. Why do we want to split the original data into training and testing sets? The main disadvantage of decision trees is that they can overfit: if the whole data set is used for training, we can produce a tree that fits the training data extremely well, but a tree that fits too well is biased toward that data and does not perform well on future data. Such a tree does not generalize. To get around this overfitting issue, we use the training set to fit the model and hold out the rest for validation. We were asked to split the credit card default data set into training and validation sets using a 70%/30% split.
The original credit_card_default.csv data set has 601 rows; the split gives 420 rows for the training set and 180 rows for the validation set.
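The split itself is not shown in the report; a minimal Python sketch with scikit-learn, on synthetic stand-in data since credit_card_default.csv is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 601-row credit_card_default.csv
# (real predictors include missed_payment, credit_utilize, assets).
rng = np.random.default_rng(0)
X = rng.random((601, 3))
y = rng.integers(0, 2, 601)  # default: 0 = no, 1 = yes

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(len(X_train), len(X_val))
```

Note that scikit-learn rounds the test size up, so 601 rows split 70/30 gives 420 training rows and 181 validation rows, one more validation row than the 180 quoted above.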
When constructing a classification decision tree, it is extremely important to know which variables the algorithm should use. We were asked to create a classification decision tree for credit default using missed_payment, credit_utilize, and assets as predictors.
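The report does not state which tool built the tree; a hypothetical scikit-learn version of the fitting step, on synthetic stand-in data with an invented default rule, looks like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split of credit_card_default.csv.
rng = np.random.default_rng(0)
X_train = rng.random((420, 3))                 # missed_payment, credit_utilize, assets
y_train = (X_train[:, 1] > 0.35).astype(int)   # toy rule: default if utilization is high

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))  # → 1.0 on this noiseless toy data
```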
Next we examine the cross-validation errors against the cost-complexity parameter (CP). The main idea is to find the largest CP value at which the cross-validation error is still below the reference line: we stop at the largest value whose error is at or below that line, which is the selection criterion. Finding the cross-validation error and cost complexity that best fit the data set will help us produce an effective model.
Taking these two columns, cross-validation error and cost complexity, we plot them so we can see how the decision on the CP value will be made. The plot shows the cross-validation error on the y-axis against the cost complexity on the x-axis, and from it we select an appropriate model by pruning the tree. The idea of pruning is to give preference to smaller trees over larger ones, which helps avoid overfitting the data; at the same time, we do not want to choose a CP giving so small a tree that the cross-validation error is significant. In the graph we can see a horizontal red line. In general, the largest CP value that achieves a cross-validation error below the red line is chosen to prune the tree. Below the red line we have two candidate values of CP, and we pick the larger one: CP = 0.28.
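The selection rule described above can be sketched in Python: scikit-learn's `ccp_alpha` plays the role of CP, and a tolerance just above the minimum cross-validation error stands in for the red line (this threshold is a simplifying assumption, not necessarily the exact rule used in the report's plot):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that pruning actually matters.
rng = np.random.default_rng(0)
X = rng.random((600, 3))
y = (X[:, 1] + 0.1 * rng.standard_normal(600) > 0.35).astype(int)

# Candidate complexity values (scikit-learn's analogue of rpart's CP table).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# Cross-validated error for each alpha; the "red line" is a tolerance
# just above the minimum error.
errors = np.array([
    1 - cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                        X, y, cv=5).mean()
    for a in alphas])
line = errors.min() + 0.01
best_alpha = alphas[errors <= line].max()  # largest alpha below the line
print(round(float(best_alpha), 4))
```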
Using this CP to prune the tree in our model, we repeat step four and obtain a new classification tree diagram, in which row 3 has a CP value of 0.039.
In the plotted classification decision tree, each node shows the predicted class (default or no default), the predicted probability of the event (defaulting), and the percentage of observations in the node.
Node 1:
"Yes" means that the majority of observations in this node are default = Yes, i.e., the individual has defaulted. The value 0.56 is the probability of defaulting in that node, and 100% means that 100% of the observations in the training set fall in that node.
Node 2:
"Yes" means that the majority of observations in this node are default = Yes, i.e., the individual has defaulted. The value 0.88 is the probability of defaulting in that node, and 63% means that 63 percent of the observations in the training set fall in that node.
Node 3:
"Yes" means that most of the observations in this node are default = Yes, which means the individual has defaulted. The value 0.58 is the probability of defaulting in that node, and 19% means that 19 percent of the observations in the training set fall in that node.
The tree summary shows that, on average, a person with credit utilization below 27 percent has a probability of defaulting of 0.033, given that all other variables are constant. On average, a person with credit utilization below 35 percent and no assets has a probability of defaulting of 0.22, given that all other variables are constant. A person with credit utilization below 35 percent who has assets such as a car or house has, on average, a probability of defaulting of 0.83. Finally, a person with credit utilization greater than 35 percent has, on average, a probability of defaulting of 1.00, given that all other variables are constant.
Evaluating Utility of Model
Confusion Matrix (cell indices)

                 Prediction = 0    Prediction = 1
Actual = 0            00                01
Actual = 1            10                11

Confusion Matrix (outcome labels)

                 Prediction = 0    Prediction = 1
Actual = 0            TN                FP
Actual = 1            FN                TP

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
A. Accuracy is the ratio of the number of correct predictions to the total number of
observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.894444
Once again, the classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy is 0.894444: the ratio of the number of correct predictions to the total number of observations. When assessing the classification model's performance, an accuracy near one is exceptionally good. It means our model does an excellent job distinguishing the binary values, telling what is a 0 and what is a 1, and we should expect good results moving forward with this model.
B. Precision is the ratio of correct positive predictions to the total predicted positives.
Precision = TP / (TP + FP) = 0.94382
The classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our precision is 0.94382: the ratio of correct positive predictions to the total predicted positives. When assessing the classification model's performance, a precision near one is exceptionally good; it means our model does an excellent job distinguishing what is a 0 and what is a 1.
C. Recall is the ratio of correct positive predictions to the total positive examples.
Recall = TP / (TP + FN) = 0.95454
The classification model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our recall is 0.95454: the ratio of correct positive predictions to the total positive examples. When assessing the classification model's performance, a recall near one is exceptionally good; it means our model does an excellent job distinguishing what is a 0 and what is a 1.
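The raw confusion-matrix counts are not printed in the report; the counts below are an assumption, reverse-engineered so that they reproduce the reported precision (0.94382), recall (0.95454), and specificity (0.94565) on the 180-row validation set:

```python
# Assumed counts (not given in the report): TP = 84, TN = 87, FP = 5, FN = 4.
TP, TN, FP, FN = 84, 87, 5, 4

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)

print(round(accuracy, 4), round(precision, 5), round(recall, 5))
# → 0.95 0.94382 0.95455
```

Note that these counts imply an accuracy of 0.95, the figure used in the conclusion.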
Making Predictions Using Model
For an individual who has a credit utilization of 30%, has not missed payments, and owns a car and a house, the predicted label for defaulting on credit is no. Keep in mind the difference from the logistic regression model built on the same data: there we obtained values on a scale between a definite 0 and a definite 1, which then had to be thresholded into a yes or no label. Here the classification tree tells us directly either yes, they will default, or no, they will not.
For an individual who has a credit utilization of 30%, has missed payments, and owns no assets, the predicted label for defaulting on credit is also no.
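Scoring the two individuals can be sketched as follows, again on a synthetic stand-in model; the feature encoding here (including the 0–3 assets code) is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in training data; the toy rule makes utilization decisive.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 420),   # missed_payment: 0/1
    rng.random(420),           # credit_utilize: 0-1
    rng.integers(0, 4, 420),   # assets code: 0 = none ... 3 = car and house
])
y = (X[:, 1] > 0.35).astype(int)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Individual 1: no missed payments, 30% utilization, owns car and house.
# Individual 2: missed payments, 30% utilization, no assets.
people = np.array([[0, 0.30, 3],
                   [1, 0.30, 0]])
print(clf.predict(people))  # → [0 0]: label "no" (no default) for both
```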
4. Regression Decision Tree
Reporting Results
As before, the training set is the data used to train the decision tree model, and once the model is trained, we use the testing set to evaluate it. We split the original data because decision trees can overfit: a tree that fits the training data extremely well is biased toward it and does not perform well on future data. Training on one portion of the data and validating on the rest lets us check how well the tree generalizes. We were asked to split the wage growth data set into training and validation sets using an 80%/20% split.
The original economic.csv data set has 101 rows; the split gives 79 rows for the training set and 20 rows for the validation set.
Again, when constructing a decision tree it is extremely important to know which variables the algorithm should use. We were asked to create a regression decision tree for wage growth using economy, unemployment, and gdp as predictors.
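Fitting the regression tree is not shown in the report; a minimal sketch with scikit-learn's DecisionTreeRegressor on synthetic stand-in data (the wage-growth formula below is invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the economic.csv training split.
rng = np.random.default_rng(0)
n = 79
X = np.column_stack([
    rng.integers(0, 2, n),     # economy: 1 = recession
    rng.uniform(2, 10, n),     # unemployment rate (%)
    rng.uniform(-2, 5, n),     # GDP growth rate (%)
])
y = 10.0 - 0.8 * X[:, 1] + 0.5 * X[:, 2]  # hypothetical wage-growth signal

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.tree_.max_depth)  # → 2
```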
Next we examine the cross-validation errors against the cost-complexity parameter (CP). As before, the main idea is to find the largest CP value at which the cross-validation error is still below the reference line; we stop at the largest value whose error is at or below that line. Finding the cross-validation error and cost complexity that best fit the data set will help us produce an effective model.
Taking the cross-validation error and cost complexity columns, we plot them so we can see how the decision on the CP value will be made: the cross-validation error on the y-axis against the cost complexity on the x-axis. We select an appropriate model by pruning the tree. Pruning essentially gives preference to smaller trees over larger trees, which helps avoid overfitting the data, but we do not want to choose a CP giving so small a tree that the cross-validation error is significant. In the graph we can see a horizontal red line; in general, the largest CP value that achieves a cross-validation error below the red line is chosen to prune the tree. Below the red line we have two values of CP, and we pick the larger one: CP = 0.35.
Using this CP to prune the tree in our model, we repeat step four and obtain a new regression tree diagram, in which row 3 has a CP value of 0.035.
In the plotted regression decision tree, each node shows the predicted wage growth and the percentage of observations in the node.
Node 1:
7.1 means that the predicted wage growth in that node is 7.1 percent, and 100% means that 100% of the observations in the training set fall in that node.
Node 2:
4.4 means that the predicted wage growth in that node is 4.4 percent, and 41% means that 41% of the observations in the training set fall in that node.
Node 3:
8.9 means that the predicted wage growth in that node is 8.9 percent, and 59% means that 59% of the observations in the training set fall in that node.
The tree summary shows that, on average, when unemployment is >= 5.6 percent, wage growth is 2.6 percent, given that all other variables are constant. When unemployment is not >= 5.6 percent, average wage growth is 5.8 percent. When unemployment is >= 2.3 percent, average wage growth is 7.8 percent, and when unemployment is not >= 2.3 percent, average wage growth is 9.6 percent, given that all other variables are constant.
Evaluating Utility of Model
RMSE is given by the formula:

RMSE = sqrt( (1/n) Σ (y_i − ŷ_i)² )

where y_i are the observed values, ŷ_i the predicted values, and n the number of observations.
RMSE is the standard deviation of the residuals, and it can be used with a decision tree regressor to evaluate the model's performance (Chan, 2020). To see how accurate our model is in predicting values, we take the root mean square of the error between the test values and the predicted values; it is the difference between what is predicted and what is observed, expressed in the units of the response, and it is a reference for how accurate the model is. In our case, the RMSE is 0.8386. The closer the predicted and observed values are to each other, the closer the RMSE is to 0, so on average our model's wage growth predictions are off by only about 0.84 percentage points, which is not bad. The model is suitable for predicting wage growth.
Moreover, residuals that are small and close to each other also correspond to a strong relationship between the predicted and observed value sets. We build the model from the training data and use the test data to measure the error between the tested and the predicted values; the resulting model can then be used in real-world applications. A strong relationship between the predicted and observed values gives an RMSE close to 0.
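As a concrete illustration of the RMSE formula, a small Python sketch with made-up observed and predicted wage-growth values (the real validation-set values are not reproduced in the report):

```python
import numpy as np

# Hypothetical observed vs. predicted wage growth (percent) on a validation set.
observed  = np.array([7.0, 4.5, 8.9, 5.8, 2.6])
predicted = np.array([7.8, 4.4, 8.1, 5.0, 3.4])

# RMSE = sqrt( mean of squared residuals ), in the units of the response.
rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(round(float(rmse), 4))  # → 0.7169
```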
Making Predictions Using Model
Given our model, we want to predict wage growth if the economy is not in recession,
unemployment is at 3.4 percent, and the GDP growth rate is 3.5 percent. Placing these
parameters in our model, we get that the wage growth given this scenario would be
7.924 percent.
Given our model, we want to predict wage growth if the economy is in recession,
unemployment at 7.4 percent, and the GDP growth rate at 1.5 percent. Placing these parameters
in our model, we get that the wage growth given this scenario would be 7.924 percent.
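Scoring the two scenarios can be sketched on a synthetic stand-in model; the fitted values it returns are illustrative, not the report's 7.924:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in; the real tree was fit to economy, unemployment, gdp.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, 79),    # economy: 1 = recession
    rng.uniform(2, 10, 79),    # unemployment rate (%)
    rng.uniform(-2, 5, 79),    # GDP growth rate (%)
])
y = 10.0 - 0.8 * X[:, 1] + 0.5 * X[:, 2]  # invented wage-growth signal
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Scenario 1: no recession, unemployment 3.4%, GDP growth 3.5%.
# Scenario 2: recession, unemployment 7.4%, GDP growth 1.5%.
scenarios = np.array([[0, 3.4, 3.5],
                      [1, 7.4, 1.5]])
pred = reg.predict(scenarios)
print(pred[0] > pred[1])  # lower unemployment should predict higher wage growth
```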
5. Conclusion
When analyzing a huge amount of data, it is essential to find relationships between the different variables in the data sets. The two models developed in this homework assignment are different: one deals with wage growth and the other with credit default. The classification decision tree for credit default uses missed payment, credit utilization, and assets as predictors, and we used a confusion matrix to evaluate the performance of the model:
Relation between Sensitivity and Specificity

Accuracy = 0.95; this is the probability of correctly predicting both Y = 1 and Y = 0.

Sensitivity = TP / (TP + FN) = 0.95454, the ability of the test to detect a credit default.

Specificity = TN / (TN + FP) = 0.94565, the ability of the test to correctly identify those who do not default.
Based on the relationship between sensitivity and specificity, we can see that the model's accuracy for predicting credit default is 95%, which is extremely good at distinguishing between 1 and 0 in the binary response.
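A quick check of these two figures, using assumed confusion-matrix counts inferred from the reported values (the actual counts are not printed in the report):

```python
# Assumed counts inferred from the reported sensitivity (.95454)
# and specificity (.94565) on the 180-row validation set.
TP, FN = 84, 4
TN, FP = 87, 5

sensitivity = TP / (TP + FN)   # true-positive rate (= recall)
specificity = TN / (TN + FP)   # true-negative rate

print(round(sensitivity, 5), round(specificity, 5))
# → 0.95455 0.94565
```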
Then we were asked to create a regression decision tree for wage growth using economy, unemployment, and gdp as predictors. To see how accurate that model is in predicting values, we used the root mean square error between the test values and the predicted values; in our case, the RMSE is 0.8386.
Looking at two different types of models, one dealing with classification probabilities and the other with regression values, we were able to use various measures to determine each model's accuracy. The credit default model is 95% accurate, far better than our wage growth model. However, I think more variables need to be analyzed and evaluated to determine which variables will produce the model that best fits the situation. The whole objective is to develop a model that is custom to the scenario; the key is to understand what kind of data set you have and what type of model to use.
6. Citations
Chan, C., Berrier, H., Pardoe, L., & Sturdivant, R. (2020). zyBook for Applied Statistics II for
Science, Technology, Engineering, and Math (STEM).