Predicting Heart Diseases with a Machine Learning model
Any problem with the heart poses a direct threat to human life sooner or later. This is why heart disease has become a prominent health concern for mankind.
Heart Disease remains the primary cause of death worldwide. According to the American Heart Association, Cardiovascular disease (CVD) is the leading global cause of death, and accounted for approximately 18.6 million deaths in 2019. 1 out of every 4 deaths in the United States is a result of cardiovascular disease. That’s a death every 37 seconds.
The World Heart Federation predicts more than 23 million CVD-related deaths per year by 2030.
The 2019 report from the European Society of Cardiology (ESC) Atlas states that CVD causes 45% of all deaths in Europe and 37% in the EU. CVD by itself is the leading cause of mortality under 65 years of age in Europe.
According to the Australian Institute of Health and Welfare, coronary heart disease was the leading single cause of death in Australia in 2018, accounting for 17,500 deaths as the underlying cause. This represents 11% of all deaths and 42% of cardiovascular deaths.
In this project, I will be applying different Machine Learning algorithms for identifying whether the person is suffering from Heart Disease or not. The dataset used for this project is Heart Disease UCI from Kaggle. Both dataset and code for this project are available on my GitHub repository.
The original database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
The dataset contains records of 303 individuals with a set of 14 features. A detailed description of the features follows:
- age: indicates the age of the person
- sex: indicates the gender of the person (1 = male, 0 = female)
- cp: indicates the type of chest pain (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- trestbps: indicates the resting blood pressure value in mm/Hg
- chol: indicates the serum cholesterol in mg/dl
- fbs: indicates the fasting blood sugar value larger than 120mg/dl (1 = yes, 0 = no)
- restecg: indicates resting ecg (0 = normal, 1 = having ST-T wave abnormality, 2 = left ventricular hypertrophy)
- thalach: indicates the maximum heart rate achieved. Maximum heart rate is estimated from age by subtracting the person's age from 220. So for a 40-year-old, the estimated maximum heart rate is 220 - 40 = 180 beats per minute
- exang: Exercise Induced Angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: Peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
- ca: indicates the number of major vessels (0–3) colored by fluoroscopy. Fluoroscopy is used for checking the flow of blood through the coronary arteries for arterial blockages
- thal: indicates thalassemia, an inherited blood disorder that causes the body to produce less hemoglobin than normal (1 = normal, 2 = fixed defect, 3 = reversible defect)
- target: indicates whether the person is suffering from heart disease or not (0 = absence, 1 = presence)
Part 1: Let us perform an EDA (Exploratory Data Analysis) first on the dataset to understand the data better.
Before importing the data file, import all the required libraries, along with `time` (to measure training time later) and `warnings` (to suppress warning messages).
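A typical import cell for this kind of analysis might look like the following (the exact library choices here are mine and may differ slightly from the original notebook):

```python
import time        # to measure model training time later
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')  # keep notebook output free of library warnings
```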
Next, importing the dataset as well.
The dataset file 'heart.csv' is imported using pandas' read_csv command. This command creates a tabular data structure called a DataFrame, in which a first column called the 'index' makes each row of data unique, and the first row of the file supplies the label/name for each column.
So, what does this dataset contain? This can be found easily by printing a concise summary of the DataFrame with the .info() command, which shows information about the DataFrame including the index dtype, the column dtypes, non-null counts, and memory usage.
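Loading the file and printing the summary can be sketched as follows. In the project this is simply `pd.read_csv('heart.csv')`; here a two-row inline sample with the same 14 columns (illustrative values, not the full file) stands in so the snippet runs standalone:

```python
import io
import pandas as pd

# In the project: df = pd.read_csv('heart.csv')
# A tiny inline sample with the same 14 columns stands in for the file here.
sample_csv = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
57,0,0,120,236,0,1,174,0,0.2,1,0,2,1
"""
df = pd.read_csv(io.StringIO(sample_csv))

df.info()                           # index/column dtypes, non-null counts, memory usage
print(df.shape)                     # (rows, columns)
print(df.isnull().sum().sum())      # 0 -> no missing values
```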
Observation : It is evident that there are no missing values in this dataset since all the features have 303 entries.
Taking a closer look at the dataset to help reveal some interesting insights.
The gender ratio in this dataset is skewed: there are 207 males and only 96 females. Next, we need to check how many records in the target column indicate a possibility of heart disease.
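With the real dataframe these counts come from `df['sex'].value_counts()` and `df['target'].value_counts()`; the sketch below reconstructs the sex column from the counts stated above so it runs standalone:

```python
import pandas as pd

# Stand-in for df['sex']: 207 males (1) and 96 females (0), as in the dataset.
sex = pd.Series([1] * 207 + [0] * 96, name='sex')

counts = sex.value_counts()
print(counts)                   # 1: 207, 0: 96
print(counts / counts.sum())    # gender ratio, roughly 68:32
```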
Also, checking the relationship of different attributes through a correlation matrix. Here, age is clearly an important factor affecting various attributes.
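A correlation matrix can be computed with pandas' `.corr()` and visualized as a heatmap. The sketch below uses a seeded random stand-in for a few numeric columns (the real analysis would call `df.corr()` on the full dataframe):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')           # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Illustrative random stand-in for a few of the numeric feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(303, 4)),
                  columns=['age', 'trestbps', 'chol', 'thalach'])

corr = df.corr()                # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.tight_layout()
fig.savefig('corr_matrix.png')
```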
Next, plotting cholesterol vs. gender and examining the chest pain types and diagnoses for each gender.
Additionally, generating some more plots to understand the distribution of the data.
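A simple way to inspect a feature's distribution is a histogram. The snippet below sketches this for cholesterol, with seeded random values standing in for `df['chol']`:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')           # headless backend
import matplotlib.pyplot as plt

# Stand-in for df['chol']; in the project this column comes from the dataframe.
rng = np.random.default_rng(1)
chol = rng.normal(246, 50, size=303)

fig, ax = plt.subplots()
ax.hist(chol, bins=20, edgecolor='black')
ax.set_xlabel('serum cholesterol (mg/dl)')
ax.set_ylabel('count')
ax.set_title('Distribution of cholesterol')
fig.savefig('chol_hist.png')
```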
Now that we have seen the data and the distribution of some important features in it, we can create a machine learning model.
Part 2: Preparing dataset and applying machine learning algorithms to train the model.
First, defining the inputs X and output Y for the dataset. Further, using 80–20 split for the dataset to create training and test sets.
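This step can be sketched with scikit-learn's `train_test_split`; synthetic arrays stand in here for `X = df.drop('target', axis=1)` and `y = df['target']`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; in the project X and y come from the dataframe.
rng = np.random.default_rng(2)
X = rng.normal(size=(303, 13))          # 13 input features
y = rng.integers(0, 2, size=303)        # binary target

# 80-20 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)      # (242, 13) (61, 13)
```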
Since this is a classification problem, algorithms like Random Forest, Logistic Regression, Naïve Bayes (Gaussian Naïve Bayes), and SVM are natural choices for this model.
Figure below shows a sample of applying the Random Forest classifier: predicting the classifier accuracy, measuring the time required to train the model (an essential value to consider when weighing the trade-off between accuracy and training time), and printing the confusion matrix to see the false positives and false negatives.
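A minimal version of that step might look like this; `make_classification` generates synthetic binary data standing in for the heart dataset, so the printed numbers are illustrative only:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the heart dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
start = time.time()
model.fit(X_train, y_train)             # train and time the model
train_time = time.time() - start

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'accuracy: {acc:.3f}, training time: {train_time:.3f}s')
print(confusion_matrix(y_test, y_pred)) # rows: true class, cols: predicted class
```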
As expected, Random forest performs well with an accuracy of >80%.
Let's look at the accuracies of the other classification algorithms.
It can be clearly seen that Gaussian Naïve Bayes, Random Forest and Decision Tree algorithms perform better than kNN and SVM.
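Comparing the algorithms can be done with a simple loop over fitted models; again, synthetic data stands in for the real split, so the relative ranking here will not match the article's figures:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gaussian Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'kNN': KNeighborsClassifier(),
    'SVM': SVC(),
}

# Fit each model and collect its test-set accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{name:22s} {score:.3f}')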
And finally, performing cross-validation on the top three algorithms and plotting the before and after accuracy for each.
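The cross-validation step can be sketched with `cross_val_score` (I assume 10-fold here; the original notebook may use a different fold count), again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)

top_three = {
    'Gaussian Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, model in top_three.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    cv_means[name] = scores.mean()
    print(f'{name:22s} mean CV accuracy: {scores.mean():.3f} '
          f'(+/- {scores.std():.3f})')
```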
After cross-validation, the Decision Tree's accuracy dropped to 70%; hence Gaussian Naïve Bayes and Random Forest appear reliable in predicting the likelihood of heart disease, with accuracy above 75%.
Since the dataset has a ratio of roughly 70:30 for gender distribution, it is interesting to see how the selected algorithm performs if we split the dataset based on gender.
Preparing a new dataframe for the female values in the dataset.
Next, using 80:20 split for training the model with Naïve Bayes algorithm.
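These two steps, filtering the female rows and training Gaussian Naïve Bayes on an 80:20 split, can be sketched as follows; the dataframe here is a synthetic stand-in with only a few columns, so the printed accuracy is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 303 rows with a 'sex' column (1 = male, 0 = female).
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(303, 3)), columns=['age', 'chol', 'thalach'])
df['sex'] = rng.choice([1, 0], size=303, p=[0.68, 0.32])
df['target'] = rng.integers(0, 2, size=303)

# Keep only the female rows, then train on an 80:20 split.
df_female = df[df['sex'] == 0].copy()
X = df_female.drop(columns=['sex', 'target'])
y = df_female['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

nb = GaussianNB().fit(X_train, y_train)
print(f'female-only test accuracy: {nb.score(X_test, y_test):.3f}')
```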
The Gaussian Naïve Bayes algorithm appears to be faster and even more accurate in predicting the probability of a Heart Disease in female population.
Various models were used to predict whether a person is suffering from heart disease or not. The Gaussian Naïve Bayes model yields a very good performance: its accuracy was found to be 86.89%, and it retains accuracy above 75% even after cross-validation, which makes it the most reliable choice for this dataset. Under cross-validation, the model is expected to be around 80.16% accurate on average.
Human life is invaluable and hence must be protected at all costs. Predicting and preventing heart disease is one of the prime focus areas of medical research, since multiple parameters must be evaluated before any diagnosis is made; this is why applying machine learning to heart disease data is appealing to data science professionals. Although medical technology is rapidly progressing with the support of AI, only time can reveal whether machine learning models will prove to be more than decision-support systems for healthcare. Until then, they are certainly worth using to uncover hidden patterns, assist healthcare professionals in making more accurate decisions, and enable them to save lives.
Through this project I have also tried my hand at the famous old UCI Heart Disease dataset to see whether an ML model can be accurate enough to predict the likelihood of heart disease. This dataset has already been studied quite a lot in the past few years, but this is still my small contribution to exploring it and building a machine learning model for predicting heart disease.
Based on my experience with this dataset, here's what I have learnt from the project:
- Looking at the accuracy numbers, the model seems accurate enough to be able to predict heart disease from data.
- Although this machine learning model can likely generate accurate predictions, it cannot be ignored that it can only assist an expert cardiologist in decision making to a certain extent. Some parameters (like maximum heart rate, resting blood pressure, and cholesterol) fluctuate periodically, so making an accurate diagnosis based solely on the model's predicted output remains challenging.