Heart Attack Risk Predictor

Heart Attack Risk Predictor with Automated Machine Learning Techniques

The main purpose of this project is to predict whether a person is at risk of a heart attack or not, using automated machine learning (AutoML) techniques.

What Causes a Heart Attack?

A heart attack occurs when one or more of your coronary arteries become blocked. Over time, a buildup of fatty deposits, including cholesterol, forms substances called plaques, which can narrow the arteries. This condition, called coronary artery disease, causes most heart attacks.

Understanding the problem statement:

We are given a data set containing various attributes that are crucial for heart disease detection. We will use this data to build a model with automated machine learning techniques.

AutoML is the process of automating the tasks of applying machine learning to real-world problems. We will be using the EvalML library. EvalML is an open-source AutoML library written in Python that automates a significant part of the machine learning process, letting us quickly evaluate which machine learning pipeline works best for a given data set.

Purpose of this project

In this project, we will first use different ML algorithms. Then we will see how AutoML techniques, specifically EvalML, can simplify our work on this project.

We will do the following things:

Let us import the necessary libraries and read our data set

Let us import our Data Set
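As a minimal sketch of this step: the real notebook would read the CSV from disk (the filename `heart.csv` and the column names below are assumptions, not confirmed by this document), so here a tiny stand-in DataFrame with the same columns is built so the snippet runs on its own.

```python
import pandas as pd

# In the real notebook the data would come from disk, e.g.:
#   df = pd.read_csv("heart.csv")   # filename is an assumption
# A tiny stand-in frame with the 11 columns described below:
df = pd.DataFrame({
    "age":      [63, 37, 41, 56],
    "sex":      [1, 1, 0, 1],
    "cp":       [3, 2, 1, 1],
    "trtbps":   [145, 130, 130, 120],
    "chol":     [233, 250, 204, 236],
    "fbs":      [1, 0, 0, 0],
    "restecg":  [0, 1, 0, 1],
    "thalachh": [150, 187, 172, 178],
    "exng":     [0, 0, 0, 0],
    "caa":      [0, 0, 0, 0],
    "output":   [1, 1, 1, 0],   # the Target column
})

print(df.shape)   # (rows, columns)
print(df.head())
```

With the real file, `df.shape` would report 303 rows and 11 columns, as noted below.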

Data Analysis

Understanding our DataSet:

To understand the purpose of this project, we first need to understand the data set and what each column in our heart data set represents.

Age : Age of the patient

Sex : Sex of the patient

Exang: exercise induced angina (1 = yes; 0 = no)

Ca: number of major vessels (0-3)

Cp : chest pain type

Trtbps : resting blood pressure (in mm Hg)

Chol : cholesterol in mg/dl fetched via BMI sensor

Fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

Rest_ecg : resting electrocardiographic results

Thalach : maximum heart rate achieved

Target :

This is the dependent (target) variable of our data set, and it is the most important variable.

Target is binary: if it is 0, there is a lower chance of a heart attack; if it is 1, there is a higher chance of a heart attack.

As we can see, we have 303 rows and 11 columns in our data set.

As we can see, there are no null values in our data set. This is extremely important to verify, because null values can lead to miscalculations during the data analysis process. If our data set did have null values, we would have to use different techniques, such as imputation, to deal with them.
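The null check described above can be sketched like this (a toy frame stands in for the real data so the snippet runs standalone):

```python
import pandas as pd

# Toy stand-in for the heart data set
df = pd.DataFrame({
    "age":    [63, 37, 41],
    "chol":   [233, 250, 204],
    "output": [1, 1, 0],
})

# Count missing values per column; every count should be zero
null_counts = df.isnull().sum()
print(null_counts)
print("total nulls:", null_counts.sum())
```

If any count were nonzero, we would need an imputation or row-dropping strategy before modelling.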

We are going to run our data set through the correlation function; the chart above shows the results. Correlation is a measure of the relationship between two variables. A value of 1 means the two variables are perfectly correlated, which is only possible when a variable is compared with itself, as on the diagonal of the chart above. Values close to either -1 or 1 mean the two variables being compared are highly correlated. We will convert this to a more graphical form to better understand the relationships between the variables.

As we can see from the diagram our variables are not highly correlated to each other.

As you can see, the data from above is converted into a graphical form. The darker a square is, the weaker the correlation between the two variables being compared; the lighter the square is, the closer the value is to 1 and the stronger the correlation between the two variables.
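The correlation step above can be sketched as follows (toy data stands in for the real set; the heatmap call is noted as a comment since plotting is environment-dependent):

```python
import pandas as pd

# Toy stand-in columns from the heart data set
df = pd.DataFrame({
    "age":      [63, 37, 41, 56, 57],
    "trtbps":   [145, 130, 130, 120, 140],
    "thalachh": [150, 187, 172, 178, 148],
    "output":   [1, 1, 1, 1, 0],
})

# Pairwise Pearson correlation matrix; the diagonal is always 1
# because each variable is perfectly correlated with itself
corr = df.corr()
print(corr.round(2))

# The graphical form would typically be drawn with seaborn:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)   # lighter squares = closer to 1
```

On the real data, off-diagonal values stay well below 1, matching the observation below that the features are not highly correlated.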

We will do univariate and bivariate analysis on our features.

Univariate analysis looks at one variable; it is the simplest form of analysis, since it deals with only one quantity that changes. Bivariate analysis looks at two variables and their relationship; it deals with causes and relationships, and the analysis is done to find the relationship between the two variables. This gives us a better understanding of how correlation is distributed through our data. First we will find out how the age of the patients is distributed throughout our data set, so that we can make the best interpretations when we analyze the data set further down the process.
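A univariate look at age can be sketched by binning a toy list of ages into groups and counting patients per group (the ages and bin edges here are illustrative, not from the real data):

```python
import pandas as pd

# Toy patient ages standing in for the real Age column
ages = pd.Series([29, 34, 45, 52, 54, 55, 57, 58, 60, 63, 66, 71])

# Bin into coarse age groups and count patients in each group
groups = pd.cut(ages, bins=[28, 40, 50, 67, 80])
counts = groups.value_counts().sort_index()
print(counts)
```

In the real notebook this distribution would be drawn as a histogram, e.g. `ages.plot.hist()`; on the real data the 51-67 bin dominates, as noted below.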

As we can see, the majority of patients are in the 51-67 year age group.

From the graph above we can see the sex of the patients (0 = female, 1 = male). There are over 100 more males than females in our data set. This will be extremely important to keep in mind when interpreting our findings after analyzing the data.

Next we look at the chest pain data of the patients. Angina is chest pain or discomfort caused when your heart muscle doesn't get enough oxygen-rich blood. It may feel like pressure or squeezing in your chest, and the discomfort can also occur in your shoulders, arms, neck, jaw, abdomen or back. Angina pain may even feel like indigestion. Some people don't feel any pain but have other symptoms like shortness of breath or fatigue. Angina is not a disease; it is a symptom of an underlying heart problem. The diagram below shows how chest pain is distributed in our data set.

We have seen how the chest pain category is distributed

Now we are going to do the same for our ECG values.

An ECG is used to see how the heart is functioning. It mainly records how often the heart beats (heart rate) and how regularly it beats (heart rhythm). It can give us important information, for instance about possible narrowing of the coronary arteries, a heart attack or an irregular heartbeat like atrial fibrillation.

This is our ECG Data

We can see the distribution of the ECG data, ranging from normal results to results showing probable signs of issues.

Now, after this, we can do multivariate analysis. Multivariate analysis is based on the principles of multivariate statistics, which involves observing and analyzing more than one statistical outcome variable at a time. This tells us how all the features are correlated with each other and how they are distributed throughout our data set, shown here as different bar charts and distribution charts. This is the same information as above but in the form of different charts corresponding to the multivariate analysis. The hue in the diagram above is the target value: 0 is blue and 1 is orange. The x and y axes show the different features, and the different types of graphs give us an idea of how the data is distributed throughout our data set.

Now let us see for our Continuous Variables

The first graph is resting blood pressure and the other graph is maximum heart rate achieved. Both look approximately normally distributed, which is a very good sign: to get the best results when feeding our data through our models, the data should be normally distributed.

We have done the Analysis of the data

Now we do the same for cholesterol. As you can see, our data is approximately normally distributed, which is a good sign: it means we can feed it to our models. There are some outliers in the data, but not too many, so we don't have to worry about them affecting our results much.

Now that we have confirmed that our data can be used to build our model, we will move forward. How will we build our model? Before building a model, let's look at our data in the diagram below. There are various parameters in the data set, ranging from continuous data to categorical data. Before feeding the data to a machine learning algorithm, it is very important to take a preprocessing step, which we do next.

Let us do Standardisation

Now we will apply a technique called StandardScaler. StandardScaler is an important technique that is usually performed as a preprocessing step before many machine learning models, in order to standardize the range of the input data set's features. We use the standardizer to scale the data set values so we get the best results from our models. We then fit StandardScaler on our data set.
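A minimal sketch of this step, using scikit-learn's `StandardScaler` on toy rows (the numbers are illustrative, not from the real data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows standing in for (age, trtbps, chol)
X = np.array([[63., 145., 233.],
              [37., 130., 250.],
              [41., 130., 204.],
              [56., 120., 236.]])

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(2))
print("column means:", X_scaled.mean(axis=0).round(6))  # ~0 everywhere
```

After fitting, the same `scaler` would be applied to any test rows with `scaler.transform(...)` so that train and test data share one scale.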

We can see from the chart above that our data has been scaled to zero mean and unit variance, with most values now falling roughly between -1 and 1. These scaled values will fit our models better.

We can insert this data into our ML Models

We will use the following models for our predictions :

Then we will use the ensembling techniques

Let us split our data

However, to use all the models above, we first have to split our data into dependent and independent variables, so that we can analyze it correctly with all the different types of models we are going to use.

We can see from our chart above X is our independent variable

We can see from our chart above Y is our dependent variable

From sklearn.model_selection we are going to import train_test_split and create the x_train, x_test, y_train and y_test variables. The train_test_split function takes our data set and splits it into these variables, which will then be used in our various models.
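The split described above can be sketched like this (a toy feature matrix stands in for the real data; the 80/20 split ratio here is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy independent variables X and dependent variable y
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of rows for testing; fix the seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)   # 16 training rows, 4 test rows
```

Every model below is then fit on `x_train`/`y_train` and evaluated on the held-out `x_test`/`y_test`.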

Logistic Regression

The logistic regression model's goal is to predict whether the binary response Y takes the value 0 or 1. Predicting the category of a categorical response is known as classification. Because logistic regression outputs a probability, we need a cutoff point at which a particular predicted value becomes true or false. A confusion matrix can evaluate a logistic regression model's performance on the data set used to create the model. The table's rows represent the predicted outcomes, while the columns represent the actual outcomes.
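The fit-predict-evaluate cycle above can be sketched with scikit-learn on synthetic binary data (which stands in for the heart data set; note that in scikit-learn's own convention, the confusion matrix rows are the actual labels and the columns the predictions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the heart data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the model and predict classes (sklearn applies a 0.5 probability cutoff)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
y_pred = model.predict(x_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # 2x2: rows = actual, cols = predicted
print("accuracy:", round(acc, 3))
print(cm)
```

The diagonal of the matrix counts correct predictions; accuracy is the diagonal sum divided by the total.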

image-2.png

image.png

Once again, a logistic regression model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our accuracy is 0.857143, which is close to 1. Accuracy is the ratio of the number of correct predictions to the total number of observations. An accuracy near 1 is very good when assessing a classification model's performance: our model is doing a good job of distinguishing what is a 0 and what is a 1, so we should expect good results moving forward with this model.

As we see, the Logistic Regression model has 85% accuracy

The logistic regression model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our recall is 85.714 percent, which is close to 1. Recall is the ratio of correct positive predictions to the total number of actual positives. A recall near 1 is exceptionally good when assessing a classification model's performance: our model is doing an excellent job of distinguishing what is a 0 and what is a 1.

Decision Tree

image.png

Precision is the ratio of correct positive predictions to the total predicted positives. Judged on this metric, the decision tree is not doing as well at distinguishing what is a 0 and what is a 1.

image.png

The model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our recall is 70.329 percent. Recall is the ratio of correct positive predictions to the total number of actual positives. This score is not good: the model is not doing well at distinguishing what is a 0 and what is a 1, so these are not the best results.

As we see, our Decision Tree model does not perform as well, with a score of only 70%

Here the Heart Attack Risk Predictor model is only 70% accurate. I think more variables need to be analyzed and evaluated to find which ones will produce the model that best fits the situation. The whole objective is to develop a model that is custom to the scenario, and this is not the best result.

Random Forest

Why do we use training and testing sets when creating a random forest model? Random forest is a popular machine learning algorithm for classification and regression problems. The critical part of the random forest is that the algorithm uses not just one but multiple learners to obtain better predictive performance: multiple decision trees are fit on the training data, and the resulting model is then used to make predictions on the testing set. How are the predictions made? For classification data sets, where we predict a binary variable such as yes or no, the random forest takes the majority vote of its decision trees. Each tree splits the data into smaller samples to better model the problem. Using separate training and testing sets is extremely important when creating a random forest model.
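The majority-vote idea above can be sketched with scikit-learn's `RandomForestClassifier` on synthetic data (standing in for the heart data set; the tree count is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 100 decision trees are fit on bootstrap samples of the training data;
# each test row is classified by the majority vote of the trees
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(x_train, y_train)

acc = rf.score(x_test, y_test)
print("random forest accuracy:", round(acc, 3))
```

Because each tree sees a different bootstrap sample and feature subset, the ensemble usually outperforms any single decision tree.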

image.png

Once again, our model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our accuracy is 75.82417 percent. Accuracy is the ratio of the number of correct predictions to the total number of observations. This accuracy is further from 1, which means the model is not doing as good a job of distinguishing what is a 0 and what is a 1.

Random Forest also gives us an accuracy of around 75%

K Nearest Neighbour

We have to select what k we will use for the maximum accuracy

Let's write a function for it

As we see from the graph we should select K= 12 as it gives the best error rate

The choice of k is very critical. A small value of k means that noise will have a higher influence on the result, while a large value makes the computation expensive and somewhat defeats the basic philosophy behind KNN (that points that are near are likely to have similar classes). A simple rule of thumb is to set k = n^(1/2). The main disadvantage of K Nearest Neighbours is that it can overfit: a k that fits the training data extremely well may be biased and perform poorly on future data sets. To make K Nearest Neighbours generalize well and get around this overfitting issue, we use the training set to tune k, and this is why k = 12 gives us the best error rate.
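The k-selection loop described above can be sketched like this (synthetic data stands in for the heart data set, so the best k found here will differ from the document's k = 12):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

# Fit a KNN model for each candidate k and record its test error rate
error_rates = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    error_rates[k] = np.mean(knn.predict(x_test) != y_test)

# Pick the k with the lowest error rate
best_k = min(error_rates, key=error_rates.get)
print("best k:", best_k, "error rate:", round(error_rates[best_k], 3))
```

Plotting `error_rates` against k (as the graph above does) makes the elbow where the error bottoms out easy to see.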

As we see KNN gives us an accuracy of around 85% which is good

Support Vector Machine(SVM)

We get an accuracy of 80% in SVM
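An SVM run like the one above can be sketched with scikit-learn's `SVC` on synthetic stand-in data (the kernel choice here is the default RBF; the tuned run later in the document tries other kernels):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=3)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3)

# Support vector classifier with the default RBF kernel
svm = SVC(kernel="rbf")
svm.fit(x_train, y_train)

acc = svm.score(x_test, y_test)
print("SVM accuracy:", round(acc, 3))
```

SVMs are sensitive to feature scale, which is one reason the StandardScaler step earlier matters.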

Let us see our model accuracy in Table form

Now we are going to add the accuracy for Logistic Regression, Decision Tree, Random Forest, K Nearest Neighbour and SVM to a DataFrame so we can compare the models' accuracy.

The Logistic Regression model had the highest accuracy, at around 85 percent. These are some of the more basic models used in machine learning.

Let us use one more technique known as AdaBoost, a boosting technique which combines multiple models for better accuracy.

AdaBoost builds one of the best weighted combinations of classifiers – one that is sensitive to noise but conducive to strong machine learning results. Some confusion arises from the fact that AdaBoost can be used with multiple instances of the same classifier with different parameters, so practitioners sometimes talk about AdaBoost "having only one classifier" and get confused about how the weighting occurs.

AdaBoost also presents a particular philosophy in machine learning – as an ensemble learning tool, it proceeds from the fundamental idea that many weak learners can get better results than one stronger learning entity. With AdaBoost, machine learning experts are often crafting systems that will take in a number of inputs and combine them for an optimized result.
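The weak-learners-combined idea above can be sketched with scikit-learn's `AdaBoostClassifier` on synthetic stand-in data (by default each weak learner is a shallow decision stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=4)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

# Each weak learner is trained in sequence; later learners focus on the
# examples earlier ones misclassified, and their votes are weighted
ada = AdaBoostClassifier(n_estimators=50, random_state=4)
ada.fit(x_train, y_train)

acc = ada.score(x_test, y_test)
print("AdaBoost accuracy:", round(acc, 3))
```

As the document notes below, with poorly chosen parameters AdaBoost can underperform the simpler models, which motivates the hyperparameter tuning step.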

Adaboost Classifier

Let us first use some random parameters for training the model without Hypertuning.

As we see our model has performed very poorly with just 50% accuracy

As we can see, we got worse results than with our other models. This can happen when the parameters need to be fine-tuned to get better accuracy.

We will use Grid Search CV for hyperparameter tuning

Hyperparameter tuning with Grid Search Cross-Validation (CV) is another technique for finding the best parameters for your model. We will perform hyperparameter tuning with Grid Search using scikit-learn (sklearn).
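A minimal sketch of Grid Search CV, using a logistic regression grid similar in spirit to the one applied below (the candidate values and synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=5)

# Candidate hyperparameters; every combination is tried with 5-fold CV
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["liblinear"],
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X, y)

print("best params:", grid.best_params_)
print("best CV score:", round(grid.best_score_, 3))
```

`grid.best_estimator_` is then a ready-to-use model refit on all the data with the winning parameters.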

Grid Search CV

Let us try Grid Search CV for our top 3 performing Algorithms for HyperParameter tuning

Logistic Regression

We have different attributes that we're going to define for our logistic regression model.

'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear' are our parameters we're going to apply to our regression model.

Let us apply these parameters in our model

We got an accuracy of 81%

KNN

Now let us try Grid Search CV hyperparameter tuning for our K Nearest Neighbour model

'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance' are the best parameters, which we will apply to our model.

Let's apply

We have an Accuracy of 82.5%

SVM

Now let us try Grid Search CV hyperparameter tuning for our SVM model

'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid' are the best parameters, which we will apply to our model.

Let us apply these

Accuracy is 81%

Final Verdict

After comparing all the models the best performing model is :

Logistic Regression with no Hyperparameter tuning is our best model

As we see, the Logistic Regression model has 85% accuracy compared to 81%. Logistic regression is used to calculate the probability of a binary event occurring and to deal with classification problems, and since this is our best model, it is the one we will use.

Let us build a proper confusion matrix for our model

As described earlier, a confusion matrix can evaluate the logistic regression model's performance on the data set used to create the model: the table's rows represent the predicted outcomes, while the columns represent the actual outcomes.

image.png

image-2.png

Once again, a logistic regression model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our accuracy is 85.714 percent, which is close to 1. Accuracy is the ratio of the number of correct predictions to the total number of observations. An accuracy near 1 is very good: our model is doing a good job of distinguishing what is a 0 and what is a 1, so we should expect good results moving forward.

We have successfully built a model which predicts whether a person is at risk of heart disease or not with 85.7% accuracy

In the general form of our logistic regression model, the independent variables are Age, Sex, Exang, Ca, Cp, Trtbps, Chol, Fbs, Rest_ecg and Thalach. The dependent variable, Target, predicts whether a person is at risk of heart disease or not, and our accuracy is 85.7 percent. Still, I think more variables need to be analyzed and evaluated to find which ones will produce the regression model that best fits the situation. The whole objective is to develop a regression model that is custom to the scenario; the key is to understand what type of data a data set has and what type of regression model to use.

Using Auto ML

EvalML

EvalML is an open-source AutoML library written in Python that automates a large part of the machine learning process, letting us easily evaluate which machine learning pipeline works best for a given data set.

AutoML is the process of automating the construction, training and evaluation of ML models. Given a data set and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the data. During the search, AutoML explores different combinations of model type, model parameters and model architecture. We use it to further evaluate our data set and compare against our logistic regression model.

Installing Eval ML

Let us load our DataSet.

Let us split our data set into the dependent variable, i.e. our Target variable, and the independent variables

Importing Eval ML Library

The EvalML library will do all the preprocessing and split the data for us, using 80 percent for training data and 20 percent for test data.

EvalML has different problem-type parameters; we have a binary-type problem here, which is why we are using binary as the input

Running AutoML to select the best algorithm. BINARY, MULTICLASS, REGRESSION, TIME_SERIES_REGRESSION, TIME_SERIES_BINARY and TIME_SERIES_MULTICLASS are the different problem types, and we use binary.


As we see from the output above, the AutoML classifier has given us the best-fit algorithm, which is the Extra Trees Classifier with Imputer. We can also compare the rest of the models.

We can have a Detailed description of our Best Selected Model

Now if we want to build our Model for a specific objective we can do that

We got an 88.5% AUC score, which is the highest of all and far better than our other models

Save the model

Loading our Model

Conclusion

We got an 88.5% AUC score, which is the highest of all and far better than our other models. This is how you can use AutoML to find the best model for a given data set.

Once again, our model's goal is to predict whether the binary response Y takes the value 0 or 1. We can see that our accuracy is 0.90166, which is close to 1. Accuracy is the ratio of the number of correct predictions to the total number of observations. An accuracy near 1 is very good: the model is doing a good job of distinguishing what is a 0 and what is a 1. For row 1, the predicted probability that Y takes the value 0 is about 10 percent and that it takes the value 1 is about 90.6 percent. These are good results.

AutoML is the process of automating the construction, training and evaluation of ML models. Given a data set and some configuration, AutoML searches for the most effective and accurate ML model or models to fit the data, and in our case the model it found is better than our other models.