The main purpose of this project is to predict whether a person is at risk of a heart attack, using Automated Machine Learning (AutoML) techniques.
A heart attack occurs when one or more of your coronary arteries become blocked. Over time, a buildup of fatty deposits, including cholesterol, forms substances called plaques, which can narrow the arteries. This condition, called coronary artery disease, causes most heart attacks.
We are given a data set with various attributes that are crucial for heart disease detection. We will build a model from this data using automated machine learning techniques.
AutoML is the process of automating the tasks of applying machine learning to real-world problems. We will be using the EvalML library. EvalML is an open-source AutoML library written in Python that automates a significant part of the machine learning process, letting us quickly evaluate which machine learning pipeline works best for a given data set.
In this project, we will first use several standard ML algorithms. We will then see how AutoML techniques, via EvalML, can simplify this work.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Let us import our Data Set
df= pd.read_csv("heart.csv")
df= df.drop(['oldpeak','slp','thall'],axis=1)
df.head()
| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa | output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 0 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 0 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0 | 1 |
In order to understand the purpose of this project, we need to understand the data set and what each column in it represents.
The `output` column is our target (dependent) variable, and it is the most important one in the data set. It is binary: if the target value is 0, the patient has a lower chance of a heart attack; if it is 1, the patient has a higher chance of a heart attack.
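Since `output` is the target, it helps to check how balanced the classes are before modeling. A minimal sketch, using a small synthetic series in place of the real column:

```python
import pandas as pd

# Synthetic stand-in for df['output']; the real column is one 0/1 label per patient
output = pd.Series([1, 1, 0, 1, 0, 1, 0, 1])

counts = output.value_counts()   # patients per class
balance = counts / len(output)   # class proportions

print(counts.to_dict())
print(balance.to_dict())
```

A strongly imbalanced target would make plain accuracy a misleading metric, so this check is worth doing before comparing models.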
df.shape
(303, 11)
As we can see, we have 303 rows and 11 columns in our data set.
df.isnull().sum()
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
caa         0
output      0
dtype: int64
As we can see, there are no null values in our data set. This is important to determine because null values can lead to miscalculations during our data analysis. If our data set had null values, we would have to use different techniques to deal with them.
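If nulls had been present, we could have dropped or imputed them. A small sketch on a hypothetical frame (our actual data set has no nulls):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values, for illustration only
sample = pd.DataFrame({'age': [63, np.nan, 41],
                       'chol': [233, 250, np.nan]})

dropped = sample.dropna()              # option 1: drop incomplete rows
filled = sample.fillna(sample.mean())  # option 2: impute with column means

print(dropped.shape)                   # only one row is fully complete
print(filled.isnull().sum().sum())     # no nulls remain after imputation
```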
df.corr()
| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa | output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | -0.098447 | -0.068653 | 0.279351 | 0.213678 | 0.121308 | -0.116211 | -0.398522 | 0.096801 | 0.276326 | -0.225439 |
| sex | -0.098447 | 1.000000 | -0.049353 | -0.056769 | -0.197912 | 0.045032 | -0.058196 | -0.044020 | 0.141664 | 0.118261 | -0.280937 |
| cp | -0.068653 | -0.049353 | 1.000000 | 0.047608 | -0.076904 | 0.094444 | 0.044421 | 0.295762 | -0.394280 | -0.181053 | 0.433798 |
| trtbps | 0.279351 | -0.056769 | 0.047608 | 1.000000 | 0.123174 | 0.177531 | -0.114103 | -0.046698 | 0.067616 | 0.101389 | -0.144931 |
| chol | 0.213678 | -0.197912 | -0.076904 | 0.123174 | 1.000000 | 0.013294 | -0.151040 | -0.009940 | 0.067023 | 0.070511 | -0.085239 |
| fbs | 0.121308 | 0.045032 | 0.094444 | 0.177531 | 0.013294 | 1.000000 | -0.084189 | -0.008567 | 0.025665 | 0.137979 | -0.028046 |
| restecg | -0.116211 | -0.058196 | 0.044421 | -0.114103 | -0.151040 | -0.084189 | 1.000000 | 0.044123 | -0.070733 | -0.072042 | 0.137230 |
| thalachh | -0.398522 | -0.044020 | 0.295762 | -0.046698 | -0.009940 | -0.008567 | 0.044123 | 1.000000 | -0.378812 | -0.213177 | 0.421741 |
| exng | 0.096801 | 0.141664 | -0.394280 | 0.067616 | 0.067023 | 0.025665 | -0.070733 | -0.378812 | 1.000000 | 0.115739 | -0.436757 |
| caa | 0.276326 | 0.118261 | -0.181053 | 0.101389 | 0.070511 | 0.137979 | -0.072042 | -0.213177 | 0.115739 | 1.000000 | -0.391724 |
| output | -0.225439 | -0.280937 | 0.433798 | -0.144931 | -0.085239 | -0.028046 | 0.137230 | 0.421741 | -0.436757 | -0.391724 | 1.000000 |
We run our data set through the correlation function, and the chart above shows the results. Correlation is a measure of the relationship between two variables. A value of 1 means the two variables are perfectly correlated, which only happens when a variable is compared with itself, as on the diagonal of the chart above. Values close to -1 or 1 mean the two variables being compared are highly correlated. We will now convert this to a graphical form to better understand the relationships between the variables.
sns.heatmap(df.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1620b9a50>
As you can see, the table above is converted into a graphical form. The darker a square is, the weaker the correlation between the two variables being compared; the lighter a square is, the closer the value is to 1 and the stronger the correlation between the two variables.
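Beyond eyeballing heatmap colors, we can rank features by their correlation with the target numerically. A sketch on a small synthetic frame (the column names are borrowed from our data set, but the values here are generated, not real):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for df; thalachh is built to drive output
rng = np.random.default_rng(0)
n = 200
thalachh = rng.normal(150, 20, n)
demo = pd.DataFrame({
    'thalachh': thalachh,
    'chol': rng.normal(246, 50, n),                           # unrelated noise column
    'output': (thalachh + rng.normal(0, 20, n) > 150).astype(int),
})

# Absolute correlation of each feature with the target, strongest first
ranked = demo.corr()['output'].drop('output').abs().sort_values(ascending=False)
print(ranked)
```

On the real data the same one-liner, `df.corr()['output']`, reads the last column of the table above.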
Next we perform univariate and bivariate analysis on our data. Univariate analysis looks at one variable at a time; since it deals with only one quantity that changes, it is the simplest form of analysis. Bivariate analysis looks at two variables and their relationship, examining causes and connections to find out how the two variables relate. This gives us a better understanding of how correlation is distributed through our data. We will start by looking at how patient age is distributed throughout the data set, so that we can make the best interpretations when we analyze the data further.
plt.figure(figsize=(20, 10))
plt.title("Age of Patients")
plt.xlabel("Age")
sns.countplot(x='age',data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd162022a90>
plt.figure(figsize=(20, 10))
plt.title("Sex of Patients,0=Female and 1=Male")
sns.countplot(x='sex',data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd161fda710>
From the graph above we can see the sex of the patients, where 0 = female and 1 = male. There are over 100 more males in our data set than females, which will be important to keep in mind when interpreting our findings later.
cp_data= df['cp'].value_counts().reset_index()
# use .loc (not chained indexing) to avoid pandas' SettingWithCopyWarning
cp_data.loc[3, 'index']= 'asymptomatic'
cp_data.loc[2, 'index']= 'non-anginal'
cp_data.loc[1, 'index']= 'atypical angina'
cp_data.loc[0, 'index']= 'typical angina'
cp_data
| | index | cp |
|---|---|---|
| 0 | typical angina | 143 |
| 1 | atypical angina | 87 |
| 2 | non-anginal | 50 |
| 3 | asymptomatic | 23 |
The table above shows the chest pain types of our patients. Angina is chest pain or discomfort caused when your heart muscle doesn't get enough oxygen-rich blood. It may feel like pressure or squeezing in your chest. The discomfort can also occur in your shoulders, arms, neck, jaw, abdomen, or back, and angina pain may even feel like indigestion. In addition, some people don't feel any pain but have other symptoms like shortness of breath or fatigue. Angina is not a disease, however; it's a symptom of an underlying heart problem. We can see its distribution in the diagram below, drawn from our data set.
plt.figure(figsize=(20, 10))
plt.title("Chest Pain of Patients")
sns.barplot(x=cp_data['index'],y= cp_data['cp'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fd161e5dbd0>
ecg_data= df['restecg'].value_counts().reset_index()
# use .loc (not chained indexing) to avoid pandas' SettingWithCopyWarning
ecg_data.loc[0, 'index']= 'normal'
ecg_data.loc[1, 'index']= 'having ST-T wave abnormality'
ecg_data.loc[2, 'index']= 'showing probable or definite left ventricular hypertrophy by Estes'
ecg_data
| | index | restecg |
|---|---|---|
| 0 | normal | 152 |
| 1 | having ST-T wave abnormality | 147 |
| 2 | showing probable or definite left ventricular ... | 4 |
An ECG is used to see how the heart is functioning. It mainly records how often the heart beats (heart rate) and how regularly it beats (heart rhythm). It can give us important information, for instance about possible narrowing of the coronary arteries, a heart attack or an irregular heartbeat like atrial fibrillation.
plt.figure(figsize=(20, 10))
plt.title("ECG data of Patients")
sns.barplot(x=ecg_data['index'],y= ecg_data['restecg'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fd161dd7cd0>
We can see the distribution of the ECG data, ranging from normal readings to those showing probable signs of issues.
sns.pairplot(df, hue='output')
<seaborn.axisgrid.PairGrid at 0x7fd161d4c850>
Now we can perform multivariate analysis. Multivariate analysis is based on the principles of multivariate statistics, which involves observing and analyzing more than one statistical outcome variable at a time. This tells us how all the features are correlated with each other and how they are distributed throughout our data set, shown as a grid of scatter plots and distribution charts. The hue is the `output` value: 0 is blue and 1 is orange in the diagram above. Each x and y axis corresponds to a different feature, so the grid of graphs gives us an idea of how the data is distributed throughout our data set.
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
sns.histplot(df['trtbps'], kde=True, color = 'magenta')  # distplot is deprecated
plt.xlabel("Resting Blood Pressure (mmHg)")
plt.subplot(1,2,2)
sns.histplot(df['thalachh'], kde=True, color = 'teal')
plt.xlabel("Maximum Heart Rate Achieved (bpm)")
Text(0.5, 0, 'Maximum Heart Rate Achieved (bpm)')
The first graph shows resting blood pressure and the second shows maximum heart rate achieved. Both look approximately normally distributed, which is a good sign: data that is close to normally distributed tends to give our models the best results.
plt.figure(figsize=(10,10))
sns.histplot(df['chol'], kde=True, color = 'red')
plt.xlabel("Cholesterol")
Text(0.5, 0, 'Cholesterol')
Now we do the same for cholesterol. As you can see, the data is approximately normally distributed throughout, which is a good sign and means we can feed it to our models. There are some outliers, but not enough to worry about them affecting our results much.
df.head()
| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa | output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 0 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 0 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0 | 1 |
Now that we have confirmed our data can be used to build a model, we will move forward. How will we build our model? Before building one, let's look at our data in the table below. The data set contains various parameters, ranging from continuous data to categorical data. Before feeding our data to a machine learning algorithm, it is very important that we take a preprocessing step: scaling the features so that they are on a comparable range.
Now we will apply a technique called StandardScaler. StandardScaler is an important preprocessing step performed before many machine learning models in order to standardize the range of the input data set's features. We use it to scale the data set's values so that our models give the best results: we fit the scaler on our data set and then transform the data.
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
scale.fit(df)
StandardScaler()
df= scale.transform(df)
df=pd.DataFrame(df,columns=['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
'exng', 'caa', 'output'])
df.head()
| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa | output |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 0.681005 | 1.973123 | 0.763956 | -0.256334 | 2.394438 | -1.005832 | 0.015443 | -0.696631 | -0.714429 | 0.914529 |
| 1 | -1.915313 | 0.681005 | 1.002577 | -0.092738 | 0.072199 | -0.417635 | 0.898962 | 1.633471 | -0.696631 | -0.714429 | 0.914529 |
| 2 | -1.474158 | -1.468418 | 0.032031 | -0.092738 | -0.816773 | -0.417635 | -1.005832 | 0.977514 | -0.696631 | -0.714429 | 0.914529 |
| 3 | 0.180175 | 0.681005 | 0.032031 | -0.663867 | -0.198357 | -0.417635 | 0.898962 | 1.239897 | -0.696631 | -0.714429 | 0.914529 |
| 4 | 0.290464 | -1.468418 | -0.938515 | -0.663867 | 2.082050 | -0.417635 | 0.898962 | 0.583939 | 1.435481 | -0.714429 | 0.914529 |
We can see from the chart above that our data has been standardized: each column now has a mean of 0 and a standard deviation of 1, with most values falling roughly between -2 and 2. These scaled values will fit our models better.
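The standardization StandardScaler applies can be written out by hand. A minimal sketch using the first five ages from the raw table above:

```python
import numpy as np

# Standardization: z = (x - mean) / std, applied per column; values are not
# bounded to [-1, 1], but most land within a few standard deviations of 0
x = np.array([63.0, 37.0, 41.0, 56.0, 57.0])  # first five ages from df.head()
z = (x - x.mean()) / x.std()

print(z)             # same numbers the scaled 'age' column shows
print(z.mean(), z.std())
```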
Logistic Regression - is used to calculate the probability of a binary event occurring, and to deal with issues of classification
Decision Tree - is a type of flowchart that shows a clear pathway to a decision. In terms of data analytics, it is a type of algorithm that includes conditional 'control' statements to classify data.
Random Forest - is a powerful and versatile supervised machine learning algorithm that grows and combines multiple decision trees to create a "forest." It can be used for both classification and regression problems.
K Nearest Neighbor - is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.
SVM (Support Vector Machine) - is a supervised machine learning algorithm that can be used for classification or regression of data groups. In supervised learning, the system is given both the input data and the desired output labels.
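The first idea in the list above, turning a probability into a 0/1 decision via a cutoff, is the core of logistic regression and can be sketched in a few lines:

```python
import math

def sigmoid(t):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def classify(score, cutoff=0.5):
    """Turn the predicted probability into a 0/1 class using a cutoff."""
    return 1 if sigmoid(score) >= cutoff else 0

print(classify(2.0))    # high score -> probability near 1 -> class 1
print(classify(-2.0))   # low score -> probability near 0 -> class 0
```

The fitted model computes the score as a weighted sum of the features; this sketch only shows how the score becomes a class.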
However, to build all of the models above, we first have to split our data into dependent and independent variables so that we can analyze it correctly with each of the different models.
x= df.iloc[:,:-1]
x
| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 0.681005 | 1.973123 | 0.763956 | -0.256334 | 2.394438 | -1.005832 | 0.015443 | -0.696631 | -0.714429 |
| 1 | -1.915313 | 0.681005 | 1.002577 | -0.092738 | 0.072199 | -0.417635 | 0.898962 | 1.633471 | -0.696631 | -0.714429 |
| 2 | -1.474158 | -1.468418 | 0.032031 | -0.092738 | -0.816773 | -0.417635 | -1.005832 | 0.977514 | -0.696631 | -0.714429 |
| 3 | 0.180175 | 0.681005 | 0.032031 | -0.663867 | -0.198357 | -0.417635 | 0.898962 | 1.239897 | -0.696631 | -0.714429 |
| 4 | 0.290464 | -1.468418 | -0.938515 | -0.663867 | 2.082050 | -0.417635 | 0.898962 | 0.583939 | 1.435481 | -0.714429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 0.290464 | -1.468418 | -0.938515 | 0.478391 | -0.101730 | -0.417635 | 0.898962 | -1.165281 | 1.435481 | -0.714429 |
| 299 | -1.033002 | 0.681005 | 1.973123 | -1.234996 | 0.342756 | -0.417635 | 0.898962 | -0.771706 | -0.696631 | -0.714429 |
| 300 | 1.503641 | 0.681005 | -0.938515 | 0.706843 | -1.029353 | 2.394438 | 0.898962 | -0.378132 | -0.696631 | 1.244593 |
| 301 | 0.290464 | 0.681005 | -0.938515 | -0.092738 | -2.227533 | -0.417635 | 0.898962 | -1.515125 | 1.435481 | 0.265082 |
| 302 | 0.290464 | -1.468418 | 0.032031 | -0.092738 | -0.198357 | -0.417635 | -1.005832 | 1.064975 | -0.696631 | 0.265082 |
303 rows × 10 columns
We can see from the chart above that x holds our independent variables (the features).
y= df.iloc[:,-1:]
y
| | output |
|---|---|
| 0 | 0.914529 |
| 1 | 0.914529 |
| 2 | 0.914529 |
| 3 | 0.914529 |
| 4 | 0.914529 |
| ... | ... |
| 298 | -1.093459 |
| 299 | -1.093459 |
| 300 | -1.093459 |
| 301 | -1.093459 |
| 302 | -1.093459 |
303 rows × 1 columns
We can see from the chart above that y is our dependent variable (the target).
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
From sklearn.model_selection we import train_test_split and use it to create the x_train, x_test, y_train, and y_test variables. The train_test_split function splits our data set into training and testing portions, which will be fed to our various models.
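The split sizes implied by test_size=0.3 can be checked by hand; scikit-learn rounds the test set up:

```python
import math

# With test_size=0.3, scikit-learn puts ceil(n * 0.3) rows in the test set
n_rows = 303
n_test = math.ceil(n_rows * 0.3)   # rows held out for testing
n_train = n_rows - n_test          # rows used for training

print(n_train, n_test)
```

This matches the 91 test predictions seen in the confusion matrices further down.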
from sklearn.linear_model import LogisticRegression # import from sklearn
from sklearn.preprocessing import LabelEncoder # needed for preprocessing our data
# LabelEncoder converts the scaled y_train values back to 0/1 class labels;
# .values.ravel() flattens the column vector to the 1d array sklearn expects
lbl= LabelEncoder()
encoded_y= lbl.fit_transform(y_train.values.ravel())
logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Y_pred1 = logreg.predict(x_test)
Y_pred1
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
0, 1, 0])
# Import the accuracy score
from sklearn.metrics import accuracy_score
# Import the confusion matrix
from sklearn.metrics import confusion_matrix
encoded_ytest= lbl.fit_transform(y_test.values.ravel())
Y_pred1 = logreg.predict(x_test)
lr_conf_matrix = confusion_matrix(encoded_ytest,Y_pred1 )
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Predicting the category of a categorical response is known as classification. Because logistic regression outputs a probability, we need a cutoff point at which a prediction is treated as true or false in order to produce class labels. A confusion matrix can then evaluate the model's performance on the data set: in scikit-learn's layout, the matrix's rows represent the actual outcomes, while the columns represent the predicted outcomes.
lr_conf_matrix
array([[35, 9],
[ 4, 43]])
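From this matrix we can compute the usual classification metrics by hand, reading it in scikit-learn's [[TN, FP], [FN, TP]] layout:

```python
# Values from the logistic regression confusion matrix above: [[35, 9], [4, 43]]
tn, fp, fn, tp = 35, 9, 4, 43

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives

print(accuracy, precision, recall)
```

Accuracy works out to 78/91 ≈ 0.857, matching the score printed below.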
Once again, a logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. We can see that our accuracy is 0.857143, which is close to 1. Accuracy is the ratio of the number of correct predictions to the total number of observations, so when assessing the classification model's performance, an accuracy near 1 means the model is doing a good job of distinguishing the binary values, i.e. which patients are a 0 and which are a 1. We should therefore expect good results moving forward with this model.
print(lr_acc_score*100,"%")
85.71428571428571 %
The logistic regression model achieves an accuracy of 85.714 percent. From the confusion matrix we can also read off the recall, the ratio of correct positive predictions to the total actual positives, which here is 43 / (43 + 4) ≈ 91.5 percent. By both measures the model does a good job of distinguishing which patients are a 0 and which are a 1.
from sklearn.tree import DecisionTreeClassifier
tree= DecisionTreeClassifier()
tree.fit(x_train,encoded_y)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
ypred2=tree.predict(x_test)
encoded_ytest= lbl.fit_transform(y_test.values.ravel())
tree_conf_matrix = confusion_matrix(encoded_ytest,ypred2 )
tree_acc_score = accuracy_score(encoded_ytest, ypred2)
To assess the decision tree's performance we again use the confusion matrix and the accuracy score, the ratio of correct predictions to the total number of observations, to see how well the model distinguishes which patients are a 0 and which are a 1.
tree_conf_matrix
array([[26, 18],
[ 9, 38]])
The model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Our accuracy here is 70.329 percent, the ratio of correct predictions to the total number of observations. This means the model is not doing particularly well at distinguishing which patients are a 0 and which are a 1; these are not the best model results.
print(tree_acc_score*100,"%")
70.32967032967034 %
The decision tree version of the Heart Attack Risk Predictor is about 70% accurate. More variables need to be analyzed and evaluated to find the combination that produces the model that best fits the situation; the whole objective is to develop a model that is custom to the scenario, and this is not the best result.
from sklearn.ensemble import RandomForestClassifier
rf= RandomForestClassifier()
rf.fit(x_train,encoded_y)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
ypred3 = rf.predict(x_test)
rf_conf_matrix = confusion_matrix(encoded_ytest,ypred3 )
rf_acc_score = accuracy_score(encoded_ytest, ypred3)
What is the use of training and testing sets when creating a random forest model? Random forest is a popular machine learning algorithm for making predictions on classification and regression problems. The critical idea behind random forest is that the algorithm uses not just one but multiple learners to obtain better predictive performance: multiple decision trees are fit on the training data, and the resulting model is then used to make predictions on the testing data set. How are those predictions combined? For classification, where we predict a binary variable such as yes or no, the forest uses a majority rule over its decision trees: each tree votes, and the class with the most votes wins. Using separate training and testing sets is extremely important in creating a random forest model.
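The majority rule described above can be sketched in a few lines (the votes here are hypothetical, not from fitted trees):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine individual trees' class votes by majority rule."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical decision trees vote on one patient
print(majority_vote([1, 0, 1]))   # two trees say 1 -> forest predicts 1
print(majority_vote([0, 0, 1]))   # two trees say 0 -> forest predicts 0
```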
rf_conf_matrix
array([[30, 14],
[ 8, 39]])
Once again, our model's goal is to predict whether the binary response Y takes on a value of 0 or 1. We can see that our accuracy is 75.824 percent, the ratio of the number of correct predictions to the total number of observations. This is worse than logistic regression, so this model is not doing as good a job of distinguishing which patients are a 0 and which are a 1.
print(rf_acc_score*100,"%")
75.82417582417582 %
from sklearn.neighbors import KNeighborsClassifier
error_rate= []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, encoded_y)
    pred = knn.predict(x_test)
    error_rate.append(np.mean(pred != encoded_ytest))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.xlabel('K Value')
plt.ylabel('Error rate')
plt.title('To check the correct value of k')
plt.show()
The choice of k is very critical. A small value of k means that noise will have a higher influence on the result, while a large value makes the algorithm computationally expensive and somewhat defeats the basic philosophy behind KNN (that points that are near each other are likely to have similar classes). A simple rule of thumb is to set k = n^(1/2). The main disadvantage of K Nearest Neighbors is that it can overfit: a model that fits the training data extremely well may be biased and perform poorly on future data. To avoid this overfitting issue, we train on the training set and select the k that minimizes the error on the test set; from the plot above, k = 12 gives us the best error rate.
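Both rules above can be sketched quickly. The error values here are illustrative stand-ins, not the real error_rate list:

```python
import math

# Rule of thumb from the text: k ~ sqrt(n) for n training samples (212 = 70% of 303)
k_rule = round(math.sqrt(212))

# In practice we take the k with the lowest test error from the elbow plot
error_rate = [0.20, 0.18, 0.19, 0.15, 0.17]   # made-up errors for k = 1..5
best_k = min(range(1, len(error_rate) + 1), key=lambda k: error_rate[k - 1])

print(k_rule, best_k)
```

On our data the elbow plot, not the square-root rule, is what picks k = 12.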
knn= KNeighborsClassifier(n_neighbors=12)
knn.fit(x_train,encoded_y)
ypred4= knn.predict(x_test)
knn_conf_matrix = confusion_matrix(encoded_ytest,ypred4 )
knn_acc_score = accuracy_score(encoded_ytest, ypred4)
knn_conf_matrix
array([[35, 9],
[ 5, 42]])
print(knn_acc_score*100,"%")
84.61538461538461 %
from sklearn import svm
svm= svm.SVC()
svm.fit(x_train,encoded_y)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
ypred5= svm.predict(x_test)
svm_conf_matrix = confusion_matrix(encoded_ytest,ypred5)
svm_acc_score = accuracy_score(encoded_ytest, ypred5)
svm_conf_matrix
array([[34, 10],
[ 8, 39]])
print(svm_acc_score*100,"%")
80.21978021978022 %
Now we will add the accuracies for Logistic Regression, Decision Tree, Random Forest, K Nearest Neighbor, and SVM to a DataFrame so we can compare the models' accuracy.
model_acc= pd.DataFrame({'Model' : ['Logistic Regression','Decision Tree','Random Forest','K Nearest Neighbor','SVM'],'Accuracy' : [lr_acc_score*100,tree_acc_score*100,rf_acc_score*100,knn_acc_score*100,svm_acc_score*100]})
model_acc = model_acc.sort_values(by=['Accuracy'],ascending=False)
model_acc
| | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 85.714286 |
| 3 | K Nearest Neighbor | 84.615385 |
| 4 | SVM | 80.219780 |
| 2 | Random Forest | 75.824176 |
| 1 | Decision Tree | 70.329670 |
The Logistic Regression model had the highest accuracy, at around 85 percent. These are some of the more basic models used in machine learning.
AdaBoost is one of the best-known weighted combinations of classifiers; it is sensitive to noise but conducive to strong results on certain machine learning problems. Some confusion arises from the fact that AdaBoost can be used with multiple instances of the same classifier with different parameters, so practitioners may talk about AdaBoost "having only one classifier" and wonder how the weighting occurs.
AdaBoost also presents a particular philosophy in machine learning – as an ensemble learning tool, it proceeds from the fundamental idea that many weak learners can get better results than one stronger learning entity. With AdaBoost, machine learning experts are often crafting systems that will take in a number of inputs and combine them for an optimized result.
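The weak-learner-combining idea can be sketched as a weighted vote. The alphas and predictions below are illustrative numbers, not fitted values:

```python
# AdaBoost-style final decision: each weak learner predicts +1 or -1 and is
# weighted by its alpha (its "say"); the sign of the weighted sum is the class.
learners = [(+1, 0.9), (-1, 0.3), (+1, 0.5)]   # (prediction, alpha) pairs

score = sum(pred * alpha for pred, alpha in learners)
final = +1 if score > 0 else -1

print(score, final)
```

During fitting, AdaBoost assigns each learner's alpha from its weighted error rate and re-weights the training samples the previous learners got wrong.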
from sklearn.ensemble import AdaBoostClassifier
adab= AdaBoostClassifier(base_estimator=svm,n_estimators=100,algorithm='SAMME',learning_rate=0.01,random_state=0)
adab.fit(x_train,encoded_y)
AdaBoostClassifier(algorithm='SAMME',
base_estimator=SVC(C=1.0, break_ties=False, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3,
gamma='scale', kernel='rbf', max_iter=-1,
probability=False, random_state=None,
shrinking=True, tol=0.001,
verbose=False),
learning_rate=0.01, n_estimators=100, random_state=0)
ypred6=adab.predict(x_test)
adab_conf_matrix = confusion_matrix(encoded_ytest,ypred6)
adab_acc_score = accuracy_score(encoded_ytest, ypred6)
adab_conf_matrix
array([[ 0, 44],
[ 0, 47]])
print(adab_acc_score*100,"%")
51.64835164835166 %
adab.score(x_train,encoded_y)
0.5566037735849056
adab.score(x_test,encoded_ytest)
0.5164835164835165
As we can see, AdaBoost performed worse than the other models; the confusion matrix shows it predicted only one class. This can happen when the base estimator and hyperparameters are poorly suited to the data and need further tuning.
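With its default base estimator (a depth-1 decision tree, i.e. a decision stump), AdaBoost usually behaves much better than with the SVC base used above. A minimal sketch on synthetic data (standing in for our heart dataset) illustrating the default setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for the heart dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Default base estimator is a decision stump, the classic AdaBoost weak learner
adab = AdaBoostClassifier(n_estimators=100, random_state=0)
adab.fit(X_train, y_train)

acc = accuracy_score(y_test, adab.predict(X_test))
print(f"AdaBoost (decision stumps) accuracy: {acc:.2f}")
```

Note that in newer scikit-learn versions the `base_estimator` argument used earlier is named `estimator`.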
Hyperparameter tuning with Grid Search cross-validation (CV) is another technique for finding the best parameters for a model. We will perform it with scikit-learn's GridSearchCV.
from sklearn.model_selection import GridSearchCV
model_acc
| Model | Accuracy | |
|---|---|---|
| 0 | Logistic Regression | 85.714286 |
| 3 | K Nearest Neighbor | 84.615385 |
| 4 | SVM | 80.219780 |
| 2 | Random Forest | 75.824176 |
| 1 | Decision Tree | 70.329670 |
param_grid= {
'solver': ['newton-cg', 'lbfgs', 'liblinear','sag', 'saga'],
'penalty' : ['none', 'l1', 'l2', 'elasticnet'],
'C' : [100, 10, 1.0, 0.1, 0.01]
}
These are the hyperparameters we will search over for our logistic regression model.
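Not every solver/penalty pair in this grid is valid, which is why the fit below emits a wall of warnings. One way to avoid that (a sketch, on synthetic data standing in for `x_train` and `encoded_y`) is to pass `param_grid` as a list of dicts, pairing each solver only with penalties it supports:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Each dict pairs solvers only with penalties they support,
# so no fits fail during the search.
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'],
     'C': [100, 10, 1.0, 0.1, 0.01]},
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'],
     'C': [100, 10, 1.0, 0.1, 0.01]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'],
     'C': [100, 10, 1.0, 0.1, 0.01]},
]

# Synthetic data standing in for x_train / encoded_y
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid)
grid.fit(X, y)
print(grid.best_params_)
```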
grid1= GridSearchCV(LogisticRegression(),param_grid)
grid1.fit(x_train,encoded_y)
(warnings truncated) GridSearchCV repeatedly emitted `UserWarning: Setting penalty='none' will ignore the C and l1_ratio parameters` and `FitFailedWarning: Estimator fit failed` messages, because not every solver/penalty combination in the grid is valid: `liblinear` does not support `penalty='none'`; `newton-cg`, `lbfgs`, and `sag` support only `l2` or `none`; and only `saga` supports `elasticnet`. The failing combinations are scored as NaN and skipped.
GridSearchCV(cv=None, error_score=nan,
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.0001, verbose=0,
warm_start=False),
iid='deprecated', n_jobs=None,
param_grid={'C': [100, 10, 1.0, 0.1, 0.01],
'penalty': ['none', 'l1', 'l2', 'elasticnet'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
'saga']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
grid1.best_params_
{'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
The grid search selects {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'} as the best parameters, which we now apply to our logistic regression model.
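For context, a `best_params_` result like the one above comes from fitting a grid of candidate settings with cross-validation. The following is a minimal, self-contained sketch of that process on synthetic data with a reduced grid (not the project's heart.csv; the liblinear solver is used because it supports both l1 and l2 penalties):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the heart data, just to illustrate the API
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Reduced version of the grid used above (liblinear supports l1 and l2)
param_grid = {"C": [100, 10, 1.0, 0.1, 0.01], "penalty": ["l1", "l2"]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # e.g. {'C': 1.0, 'penalty': 'l2'}
```

`GridSearchCV` refits the best combination on the full training data by default (`refit=True`), so `grid` can be used directly for prediction afterwards.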
logreg1= LogisticRegression(C=0.01,penalty='l2',solver='liblinear')
logreg1.fit(x_train,encoded_y)
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
logreg_pred= logreg1.predict(x_test)
logreg_pred_conf_matrix = confusion_matrix(encoded_ytest,logreg_pred)
logreg_pred_acc_score = accuracy_score(encoded_ytest, logreg_pred)
logreg_pred_conf_matrix
array([[33, 11],
[ 6, 41]])
print(logreg_pred_acc_score*100,"%")
81.31868131868131 %
Now let us tune our K-nearest-neighbors model the same way, using Grid Search CV for hyperparameter tuning.
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=knn, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_search.fit(x_train,encoded_y)
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
error_score=0,
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=12, p=2,
weights='uniform'),
iid='deprecated', n_jobs=-1,
param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
'n_neighbors': range(1, 21, 2),
'weights': ['uniform', 'distance']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=0)
grid_search.best_params_
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance'}
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance'} are the best parameters, which we now apply to our model.
# now we apply the best parameters to our model
knn= KNeighborsClassifier(n_neighbors=11,metric='manhattan',weights='distance')
knn.fit(x_train,encoded_y)
knn_pred= knn.predict(x_test)
knn_pred_conf_matrix = confusion_matrix(encoded_ytest,knn_pred)
knn_pred_acc_score = accuracy_score(encoded_ytest, knn_pred)
knn_pred_conf_matrix
array([[33, 11],
[ 5, 42]])
print(knn_pred_acc_score*100,"%")
82.41758241758241 %
Now let us tune our SVM the same way, using Grid Search CV for hyperparameter tuning.
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=svm, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_search.fit(x_train,encoded_y)
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
error_score=0,
estimator=SVC(C=1.0, break_ties=False, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3,
gamma='scale', kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
iid='deprecated', n_jobs=-1,
param_grid={'C': [50, 10, 1.0, 0.1, 0.01], 'gamma': ['scale'],
'kernel': ['poly', 'rbf', 'sigmoid']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=0)
grid_search.best_params_
{'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'}
{'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'} are the best parameters, which we now apply to our model.
from sklearn.svm import SVC
svc= SVC(C= 0.1, gamma= 'scale',kernel= 'sigmoid')
svc.fit(x_train,encoded_y)
SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='sigmoid',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
svm_pred= svc.predict(x_test)
svm_pred_conf_matrix = confusion_matrix(encoded_ytest,svm_pred)
svm_pred_acc_score = accuracy_score(encoded_ytest, svm_pred)
svm_pred_conf_matrix
array([[32, 12],
[ 5, 42]])
print(svm_pred_acc_score*100,"%")
81.31868131868131 %
As we will see below, the logistic regression model achieves about 85% accuracy, compared to 81–82% for the tuned models above. Logistic regression calculates the probability of a binary event occurring and handles classification problems well, so it is the best model and the one we are going to use.
logreg = LogisticRegression()
logreg.fit(x_train, encoded_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Y_pred1 = logreg.predict(x_test)
lr_conf_matrix = confusion_matrix(encoded_ytest, Y_pred1)
lr_acc_score = accuracy_score(encoded_ytest, Y_pred1)
Y_pred1
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
0, 1, 0])
lr_conf_matrix
array([[35, 9],
[ 4, 43]])
print(lr_acc_score*100,"%")
85.71428571428571 %
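Collecting the test accuracies reported in the cells above makes the model comparison explicit (the values are copied from the notebook outputs):

```python
# Test accuracies (percent) reported above
scores = {
    "Logistic Regression (tuned)": 81.32,
    "KNN (tuned)": 82.42,
    "SVM (tuned)": 81.32,
    "Logistic Regression (default)": 85.71,
}

# Pick the model with the highest test accuracy
best_model = max(scores, key=scores.get)
print(best_model, scores[best_model])
```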
The logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Predicting the category of a categorical response is known as classification. Because the model outputs a probability, we need a cutoff point above which a prediction is classified as 1 (true) and below which it is classified as 0 (false). A confusion matrix can evaluate a logistic regression model's performance on the dataset used to create the model: its rows represent the actual outcomes, while its columns represent the predicted outcomes.
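From the confusion matrix above, [[35, 9], [4, 43]] with rows as actual and columns as predicted classes, the standard metrics can be recomputed directly; a small NumPy sketch:

```python
import numpy as np

# Confusion matrix from the logistic regression model above:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = np.array([[35, 9],
               [4, 43]])

tn, fp = cm[0]  # actual 0: correctly rejected vs. false alarms
fn, tp = cm[1]  # actual 1: misses vs. correctly detected

accuracy = (tp + tn) / cm.sum()   # (43 + 35) / 91
sensitivity = tp / (tp + fn)      # recall for the positive class
specificity = tn / (tn + fp)      # recall for the negative class
print(accuracy, sensitivity, specificity)
```

The accuracy computed this way, 78/91 ≈ 0.857, matches the 85.71% reported by accuracy_score.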
# Confusion Matrix of Model enlarged
options = ["Disease", "No Disease"]
fig, ax = plt.subplots()
im = ax.imshow(lr_conf_matrix, cmap='Set3', interpolation='nearest')
# Show all ticks...
ax.set_xticks(np.arange(len(options)))
ax.set_yticks(np.arange(len(options)))
# ...and label them with the respective list entries
ax.set_xticklabels(options)
ax.set_yticklabels(options)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(options)):
    for j in range(len(options)):
        ax.text(j, i, lr_conf_matrix[i, j], ha="center", va="center", color="black")
ax.set_title("Confusion Matrix of Logistic Regression Model")
plt.xlabel('Model Prediction')
plt.ylabel('Actual Result')
fig.tight_layout()
plt.show()
print("ACCURACY of our model is ",lr_acc_score*100,"%")
ACCURACY of our model is 85.71428571428571 %
Once again, a logistic regression model's goal is to predict whether the binary response Y takes on a value of 0 or 1. Accuracy is the ratio of the number of correct predictions to the total number of observations, and our accuracy of 85.714 percent is close to 1, which is very good when assessing a classification model's performance. Our model does a good job of distinguishing the two binary values, that is, what is a 0 and what is a 1, so we can expect good results going forward.
The independent variables of our logistic regression model are Age, Sex, Exang, Ca, Cp, Trtbps, Chol, Fbs, Rest_ecg, and Thalach. The dependent variable, the target, predicts whether a person is at risk of heart disease or not, and our accuracy is 85.7 percent. Still, I think more variables need to be analyzed and evaluated to find which will produce the regression model that best fits the situation. The whole objective is to develop a regression model that is custom to the scenario; the key is to understand what a data set contains and what type of regression model to use.
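The general form of the model, with predictors x₁ … x_k standing for the features listed above, can be written as:

```latex
P(\text{output} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

The fitted coefficients β₀ … β_k are what `LogisticRegression.fit` estimates from the training data, and the 0.5 probability cutoff turns this value into a 0/1 prediction.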
import pickle
pickle.dump(logreg,open('heart.pkl','wb'))
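To confirm that a pickled model can be reloaded later and makes identical predictions, here is a small round-trip sketch. It fits a stand-in model on synthetic data (since heart.csv is not loaded here) and serializes in memory with `pickle.dumps`/`pickle.loads`, which mirrors the file-based `pickle.dump` used above:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a stand-in model on synthetic data to illustrate the round trip
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize and reload, as done with heart.pkl above (here in memory)
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The reloaded model makes identical predictions
assert (restored.predict(X) == model.predict(X)).all()
```

Note that a pickled scikit-learn model should be unpickled with the same library version it was saved with.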
AutoML is the process of automating the construction, training, and evaluation of ML models. Given data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML explores different combinations of model type, model parameters, and model architecture. We use it here to further evaluate our data set and compare against our logistic regression model.
!pip install evalml
Successfully installed Cython-0.29.17 catboost-0.26.1 category-encoders-2.2.2 cloudpickle-1.6.0 colorama-0.4.4 distributed-2.30.1 evalml-0.30.2 featuretools-0.26.1 fsspec-2021.7.0 graphviz-0.17 imbalanced-learn-0.8.0 kaleido-0.2.1 lightgbm-3.2.1 locket-0.2.1 matplotlib-3.4.3 networkx-2.5.1 nlp-primitives-1.1.0 nltk-3.6.2 numpy-1.21.2 pandas-1.3.2 partd-1.2.0 plotly-5.2.1 pmdarima-1.8.0 psutil-5.8.0 pyaml-21.8.3 pyyaml-5.4.1 requirements-parser-0.2.0 scikit-learn-0.24.2 scikit-optimize-0.8.1 scipy-1.7.1 shap-0.39.0 sktime-0.7.0 slicer-0.0.7 statsmodels-0.12.2 tenacity-8.0.1 texttable-1.6.4 threadpoolctl-2.2.0 woodwork-0.5.1 xgboost-1.4.2
df= pd.read_csv("heart.csv")
df.head()
| age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
Let us split our data set into the dependent (i.e., our target) variable and the independent variables.
x= df.iloc[:,:-1]
x
| age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | caa | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.952197 | 0.681005 | 1.973123 | 0.763956 | -0.256334 | 2.394438 | -1.005832 | 0.015443 | -0.696631 | -0.714429 |
| 1 | -1.915313 | 0.681005 | 1.002577 | -0.092738 | 0.072199 | -0.417635 | 0.898962 | 1.633471 | -0.696631 | -0.714429 |
| 2 | -1.474158 | -1.468418 | 0.032031 | -0.092738 | -0.816773 | -0.417635 | -1.005832 | 0.977514 | -0.696631 | -0.714429 |
| 3 | 0.180175 | 0.681005 | 0.032031 | -0.663867 | -0.198357 | -0.417635 | 0.898962 | 1.239897 | -0.696631 | -0.714429 |
| 4 | 0.290464 | -1.468418 | -0.938515 | -0.663867 | 2.082050 | -0.417635 | 0.898962 | 0.583939 | 1.435481 | -0.714429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 0.290464 | -1.468418 | -0.938515 | 0.478391 | -0.101730 | -0.417635 | 0.898962 | -1.165281 | 1.435481 | -0.714429 |
| 299 | -1.033002 | 0.681005 | 1.973123 | -1.234996 | 0.342756 | -0.417635 | 0.898962 | -0.771706 | -0.696631 | -0.714429 |
| 300 | 1.503641 | 0.681005 | -0.938515 | 0.706843 | -1.029353 | 2.394438 | 0.898962 | -0.378132 | -0.696631 | 1.244593 |
| 301 | 0.290464 | 0.681005 | -0.938515 | -0.092738 | -2.227533 | -0.417635 | 0.898962 | -1.515125 | 1.435481 | 0.265082 |
| 302 | 0.290464 | -1.468418 | 0.032031 | -0.092738 | -0.198357 | -0.417635 | -1.005832 | 1.064975 | -0.696631 | 0.265082 |
303 rows × 10 columns
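The feature values displayed above are z-scores, which tells us a standardization step was applied. As a quick illustration (`standardize` is a hypothetical pure-Python helper, not part of this notebook's pipeline), a column is standardized by subtracting its mean and dividing by its standard deviation:

```python
def standardize(values):
    """Z-score a numeric column: (x - mean) / std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# A standardized column always has mean 0 and standard deviation 1.
print(standardize([1.0, 2.0, 3.0]))
```

This is why, for example, an age of 63 (well above the sample mean of roughly 54) appears above as the positive z-score 0.952.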
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
# Pass a 1-D series (not a one-column frame) to avoid a DataConversionWarning
y = lbl.fit_transform(df.iloc[:, -1])
y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
import evalml
The EvalML library handles the preprocessing and splits the data for us. By default it uses 80 percent of the data for training and holds out 20 percent for testing.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(x, y, problem_type='binary')
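Under the hood, `split_data` shuffles the rows and holds out a fraction for testing (20 percent by default). A minimal sketch of that 80/20 split in plain Python (`split_80_20` is a hypothetical helper shown for intuition, not the EvalML implementation):

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle row indices and hold out the last 20% as a test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(0.8 * len(rows))           # first 80% -> train
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]  # remaining 20% -> test
    return train, test

train, test = split_80_20(list(range(100)))
print(len(train), len(test))  # → 80 20
```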
There are different problem-type parameters in EvalML. We have a binary classification problem here, which is why we pass 'binary' as the input:
evalml.problem_types.ProblemTypes.all_problem_types
[<ProblemTypes.BINARY: 'binary'>, <ProblemTypes.MULTICLASS: 'multiclass'>, <ProblemTypes.REGRESSION: 'regression'>, <ProblemTypes.TIME_SERIES_REGRESSION: 'time series regression'>, <ProblemTypes.TIME_SERIES_BINARY: 'time series binary'>, <ProblemTypes.TIME_SERIES_MULTICLASS: 'time series multiclass'>]
BINARY, MULTICLASS, REGRESSION, TIME_SERIES_REGRESSION, TIME_SERIES_BINARY, and TIME_SERIES_MULTICLASS are the available problem types; we use binary. Next, we run AutoML to select the best algorithm.
AutoML is the process of automating the construction, training and evaluation of ML models. Given a data and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the dataset. During the search, AutoML will explore different combinations of model type, model parameters and model architecture.
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')
automl.search()
Using default limit of max_batches=1. Generating pipelines to search over... 8 pipelines ready for search. ***************************** * Beginning pipeline search * ***************************** Optimizing for Log Loss Binary. Lower score is better. Using SequentialEngine to train and score pipelines. Searching up to 1 batches for a total of 9 pipelines. Allowed model families: random_forest, extra_trees, decision_tree, lightgbm, linear_model, catboost, xgboost
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline Mode Baseline Binary Classification Pipeline: Starting cross validation Finished cross validation - mean Log Loss Binary: 15.699 ***************************** * Evaluating Batch Number 1 * ***************************** Elastic Net Classifier w/ Imputer + Standard Scaler: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.488 Decision Tree Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 7.031 Random Forest Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.452 LightGBM Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.517 Logistic Regression Classifier w/ Imputer + Standard Scaler: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.488
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[08:58:54] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[08:58:54] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[08:58:55] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. XGBoost Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.534 Extra Trees Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.442 CatBoost Classifier w/ Imputer: Starting cross validation Finished cross validation - mean Log Loss Binary: 0.656 Search finished after 00:15 Best pipeline: Extra Trees Classifier w/ Imputer Best pipeline Log Loss Binary: 0.441986
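The search above optimizes Log Loss Binary, which heavily penalizes confident wrong predictions. A pure-Python sketch of the metric (`binary_log_loss` is an illustrative helper, not EvalML's own implementation):

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions contribute almost nothing;
# a 50/50 guess on a true positive costs log(2) ≈ 0.693.
print(binary_log_loss([1, 0], [0.99, 0.01]))
print(binary_log_loss([1], [0.5]))
```

This also explains the baseline's huge score of 15.699: always predicting the mode class means extreme confidence on every row it gets wrong.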
As we can see from the output above, AutoML has selected the best-fitting algorithm: the Extra Trees Classifier with an Imputer. We can also compare the rest of the models:
automl.rankings
| id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | Extra Trees Classifier w/ Imputer | 7 | 0.441986 | 0.025672 | 0.456773 | 97.184590 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 1 | 3 | Random Forest Classifier w/ Imputer | 3 | 0.451508 | 0.021426 | 0.466939 | 97.123934 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 2 | 5 | Logistic Regression Classifier w/ Imputer + St... | 5 | 0.488080 | 0.031375 | 0.520345 | 96.890973 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 3 | 1 | Elastic Net Classifier w/ Imputer + Standard S... | 1 | 0.488306 | 0.030127 | 0.519350 | 96.889536 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 4 | 4 | LightGBM Classifier w/ Imputer | 4 | 0.517282 | 0.019277 | 0.539132 | 96.704958 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 5 | 6 | XGBoost Classifier w/ Imputer | 6 | 0.534406 | 0.052329 | 0.592613 | 96.595882 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 6 | 8 | CatBoost Classifier w/ Imputer | 8 | 0.655683 | 0.002337 | 0.654572 | 95.823353 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 7 | 2 | Decision Tree Classifier w/ Imputer | 2 | 7.031453 | 0.909063 | 6.980170 | 55.210248 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 8 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 15.698798 | 0.135402 | 15.776972 | 0.000000 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
automl.best_pipeline
pipeline = BinaryClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'Extra Trees Classifier': ['Extra Trees Classifier', 'Imputer.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Extra Trees Classifier':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1}}, random_seed=0)
best_pipeline=automl.best_pipeline
We can get a detailed description of the best selected pipeline:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
*************************************
* Extra Trees Classifier w/ Imputer *
*************************************
Problem Type: binary
Model Family: Extra Trees
Pipeline Steps
==============
1. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* categorical_fill_value : None
* numeric_fill_value : None
2. Extra Trees Classifier
* n_estimators : 100
* max_features : auto
* max_depth : 6
* min_samples_split : 2
* min_weight_fraction_leaf : 0.0
* n_jobs : -1
Training
========
Training for binary problems.
Total training time (including CV): 2.1 seconds
Cross Validation
----------------
Log Loss Binary MCC Binary Gini AUC Precision F1 Balanced Accuracy Binary Accuracy Binary # Training # Validation
0 0.457 0.584 0.785 0.893 0.857 0.738 0.779 0.790 161 81
1 0.412 0.676 0.806 0.903 0.833 0.822 0.837 0.840 161 81
2 0.457 0.591 0.732 0.866 0.913 0.712 0.769 0.787 162 80
mean 0.442 0.617 0.774 0.887 0.868 0.757 0.795 0.806 - -
std 0.026 0.051 0.038 0.019 0.041 0.057 0.037 0.029 - -
coef of var 0.058 0.083 0.049 0.021 0.047 0.076 0.046 0.036 - -
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])
OrderedDict([('AUC', 0.8852813852813852),
('F1', 0.7812499999999999),
('Precision', 0.8064516129032258),
('Recall', 0.7575757575757576)])
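As a sanity check, the reported F1 is exactly the harmonic mean of the precision and recall above (`f1_from_precision_recall` is an illustrative helper):

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p, r = 0.8064516129032258, 0.7575757575757576  # values reported above
print(f1_from_precision_recall(p, r))  # → 0.78125, matching the F1 above
```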
If we want to build our model for a specific objective, we can specify it when constructing the search:
automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
problem_type='binary',
objective='auc',
additional_objectives=['f1', 'precision'],
max_batches=1,
optimize_thresholds=True)
automl_auc.search()
Generating pipelines to search over... 8 pipelines ready for search. ***************************** * Beginning pipeline search * ***************************** Optimizing for AUC. Greater score is better. Using SequentialEngine to train and score pipelines. Searching up to 1 batches for a total of 9 pipelines. Allowed model families: random_forest, extra_trees, decision_tree, lightgbm, linear_model, catboost, xgboost
Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline Mode Baseline Binary Classification Pipeline: Starting cross validation Finished cross validation - mean AUC: 0.500 ***************************** * Evaluating Batch Number 1 * ***************************** Elastic Net Classifier w/ Imputer + Standard Scaler: Starting cross validation Finished cross validation - mean AUC: 0.847 Decision Tree Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.723 Random Forest Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.874 LightGBM Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.843 Logistic Regression Classifier w/ Imputer + Standard Scaler: Starting cross validation Finished cross validation - mean AUC: 0.848
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[09:05:38] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[09:05:38] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. [09:05:39] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. XGBoost Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.849
/usr/local/lib/python3.7/dist-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
Extra Trees Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.887 CatBoost Classifier w/ Imputer: Starting cross validation Finished cross validation - mean AUC: 0.822 Search finished after 00:14 Best pipeline: Extra Trees Classifier w/ Imputer Best pipeline AUC: 0.887205
automl_auc.rankings
| id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | Extra Trees Classifier w/ Imputer | 7 | 0.887205 | 0.018958 | 0.892506 | 38.720539 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 1 | 3 | Random Forest Classifier w/ Imputer | 3 | 0.873658 | 0.013643 | 0.869779 | 37.365775 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 2 | 6 | XGBoost Classifier w/ Imputer | 6 | 0.849162 | 0.027477 | 0.818182 | 34.916166 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 3 | 5 | Logistic Regression Classifier w/ Imputer + St... | 5 | 0.848007 | 0.017890 | 0.842752 | 34.800710 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 4 | 1 | Elastic Net Classifier w/ Imputer + Standard S... | 1 | 0.847393 | 0.016866 | 0.842752 | 34.739285 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 5 | 4 | LightGBM Classifier w/ Imputer | 4 | 0.842994 | 0.013211 | 0.837224 | 34.299356 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 6 | 8 | CatBoost Classifier w/ Imputer | 8 | 0.821640 | 0.023350 | 0.796069 | 32.163982 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 7 | 2 | Decision Tree Classifier w/ Imputer | 2 | 0.722694 | 0.052690 | 0.746929 | 22.269429 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 8 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 0.500000 | 0.000000 | 0.500000 | 0.000000 | False | {'Baseline Classifier': {'strategy': 'mode'}} |
automl_auc.describe_pipeline(automl_auc.rankings.iloc[0]["id"])
*************************************
* Extra Trees Classifier w/ Imputer *
*************************************
Problem Type: binary
Model Family: Extra Trees
Pipeline Steps
==============
1. Imputer
* categorical_impute_strategy : most_frequent
* numeric_impute_strategy : mean
* categorical_fill_value : None
* numeric_fill_value : None
2. Extra Trees Classifier
* n_estimators : 100
* max_features : auto
* max_depth : 6
* min_samples_split : 2
* min_weight_fraction_leaf : 0.0
* n_jobs : -1
Training
========
Training for binary problems.
Total training time (including CV): 2.1 seconds
Cross Validation
----------------
AUC F1 Precision # Training # Validation
0 0.893 0.738 0.857 161 81
1 0.903 0.822 0.833 161 81
2 0.866 0.712 0.913 162 80
mean 0.887 0.757 0.868 - -
std 0.019 0.057 0.041 - -
coef of var 0.021 0.076 0.047 - -
best_pipeline_auc = automl_auc.best_pipeline
# get the score on holdout data
best_pipeline_auc.score(X_test, y_test, objectives=["auc"])
OrderedDict([('AUC', 0.8852813852813852)])
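For intuition, AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small pure-Python sketch of the metric (`auc_score` is an illustrative helper, not the EvalML implementation):

```python
def auc_score(y_true, scores):
    """AUC via pairwise comparisons: ties between a positive and a
    negative count as half a win for the positive."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```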
We got an 88.5 percent AUC score on the holdout data, the highest so far and better than our other models.
Save the model
best_pipeline.save("model.pkl")
Loading our Model
final_model = automl.load('model.pkl')
final_model.predict_proba(X_test)
| 0 | 1 | |
|---|---|---|
| 0 | 0.468324 | 0.531676 |
| 1 | 0.093848 | 0.906152 |
| 2 | 0.383646 | 0.616354 |
| 3 | 0.107272 | 0.892728 |
| 4 | 0.141027 | 0.858973 |
| ... | ... | ... |
| 56 | 0.268136 | 0.731864 |
| 57 | 0.846652 | 0.153348 |
| 58 | 0.861607 | 0.138393 |
| 59 | 0.739515 | 0.260485 |
| 60 | 0.878833 | 0.121167 |
61 rows × 2 columns
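To turn these class probabilities into hard 0/1 predictions, we can threshold the class-1 column. (The pipeline's own `predict` method does this internally with its tuned threshold; `to_labels` below is an illustrative helper using the default 0.5 cutoff.)

```python
def to_labels(prob_class1, threshold=0.5):
    """Map class-1 probabilities to hard 0/1 labels at a cutoff."""
    return [1 if p >= threshold else 0 for p in prob_class1]

# First five rows of the predict_proba output above (class-1 column):
probs = [0.531676, 0.906152, 0.616354, 0.892728, 0.858973]
print(to_labels(probs))  # → [1, 1, 1, 1, 1]: all five are predicted at risk
```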
This is how we can use the best pipeline to predict on unseen data: predict_proba returns, for each row of the test set, the predicted probability of each class.
Once again, our model's goal is to predict whether the binary response Y takes a value of 0 or 1. The two columns above give, for each row, the predicted probability of class 0 and class 1. For row 1, the model assigns about a 9.4 percent probability to class 0 and a 90.6 percent probability to class 1, so it is confident this patient is at risk. Probabilities this far from 0.5, together with the 88.5 percent AUC on the holdout set, indicate the model distinguishes the two classes well.
To conclude: AutoML automates the construction, training, and evaluation of ML models. Given a dataset and some configuration, it searched for the most effective and accurate model for our data, sparing us the effort of evaluating each algorithm by hand.