Stroke Prediction with Machine Learning


A brain attack, also known as a stroke, occurs when the blood supply to part of the brain is interrupted, damaging brain tissue. A stroke is a medical emergency and can cause permanent brain damage, long-term disability, or even death.

According to the World Health Organization, 15 million people worldwide suffer a stroke each year. Of these, 5 million die and another 5 million are left permanently disabled. In the United States alone, more than 795,000 people have a stroke every year.

The objective of this project is to construct a machine learning model for predicting stroke and to evaluate its accuracy. We will apply several machine learning algorithms and compare which ones produce reliable results with good accuracy.

The dataset used for this project is the Stroke Prediction Dataset from Kaggle. The code for this project is available on my GitHub repository. The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides the relevant information about one patient.

The dataset contains records for 5110 individuals with a set of 12 features. A detailed description of the features follows:

  1. id: indicates the unique id
  2. gender: indicates the gender of the person (“Male”, “Female” or “Other”)
  3. age: indicates the age of the person
  4. hypertension: indicates whether the person has hypertension (1 = yes, 0 = no)
  5. heart_disease: indicates whether the person has any heart disease (1 = yes, 0 = no)
  6. ever_married: indicates the marital status of the person (“No” or “Yes”)
  7. work_type: indicates the work type of the person (“children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”)
  8. Residence_type: indicates the type of residence of the person (“Rural” or “Urban”)
  9. avg_glucose_level: indicates the person’s average blood glucose level
  10. bmi: indicates the body mass index of the person
  11. smoking_status: indicates the smoking status of the person (“formerly smoked”, “never smoked”, “smokes” or “Unknown”*) *“Unknown” means the information is unavailable for this patient
  12. stroke: indicates whether the patient had a stroke (1 = yes, 0 = no)

To begin with, let us import all required libraries and the dataset.
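The import-and-load step is not reproduced in the article, so here is a minimal sketch. The synthetic rows below stand in for the Kaggle CSV; the actual file name used in the project is an assumption.

```python
import pandas as pd
from io import StringIO

# A few synthetic rows standing in for the Kaggle file; in the project this
# would be pd.read_csv("healthcare-dataset-stroke-data.csv") (file name assumed)
raw = StringIO(
    "id,gender,age,hypertension,heart_disease,ever_married,work_type,"
    "Residence_type,avg_glucose_level,bmi,smoking_status,stroke\n"
    "9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1\n"
    "51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1\n"
    "31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1\n"
)
df = pd.read_csv(raw)
print(df.shape)  # (3, 12)
```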

After importing the dataset, we can now perform the EDA (Exploratory Data Analysis).

Getting basic information on the dataset

The output shows that only the bmi attribute has missing values.
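The check itself is a couple of calls; a toy frame with a single gap in bmi illustrates it:

```python
import pandas as pd
import numpy as np

# Toy frame with a single gap in bmi, standing in for the full dataset
df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
    "stroke": [1, 1, 1],
})
df.info()                    # dtypes and non-null counts per column
missing = df.isnull().sum()  # only bmi reports missing entries
print(missing)
```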

Additionally, let us understand the values for each attribute.

As expected, the attribute ‘id’ has 5110 unique values. The attributes {age, bmi, avg_glucose_level} are numerical in nature, whereas {gender, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status, stroke} are categorical variables.
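Separating the numerical from the categorical columns can be done with `nunique()` and `select_dtypes()`; a toy sketch:

```python
import pandas as pd

# Toy frame; in the real dataset 'id' has 5110 unique values
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67.0, 61.0, 80.0],
    "gender": ["Male", "Female", "Male"],
})
print(df.nunique())  # 'id' is unique per row
numeric = df.select_dtypes(include="number").columns.tolist()
print(numeric)       # ['id', 'age']
```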

We need to handle the missing bmi values in the dataset. Although it is possible in principle to compute bmi from an individual’s height and weight, those attributes are unavailable here. Ideally, we would check whether the entries with missing data differ systematically from the rest; for simplicity, though, the practical options are to drop the column or to impute the missing values. The 201 missing values amount to less than 5% of the column, so replacing them with the mean is reasonable, under the assumption that mean imputation will not significantly affect the results.

However, it is also important to check for outliers in the bmi variable and to see how many of them are associated with a stroke.

Only one of the outliers has a stroke associated with it. First, we replace the bmi outliers with the mean value.
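The article does not spell out its outlier rule; a common choice is the 1.5×IQR fence, sketched here on toy bmi values (the exact rule used in the original code is an assumption):

```python
import pandas as pd

# Toy bmi values with one obvious outlier
df = pd.DataFrame({"bmi": [22.0, 25.0, 27.0, 28.0, 30.0, 97.6]})

q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (df["bmi"] < lower) | (df["bmi"] > upper)
mean_bmi = df.loc[~is_outlier, "bmi"].mean()  # mean of the non-outlier values
df.loc[is_outlier, "bmi"] = mean_bmi          # replace outliers with the mean
```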

Next, we can fill all missing bmi values with the mean value.

Checking once more to ensure that there are no further missing values in the dataset.
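Mean imputation and the re-check together take two lines; a toy sketch:

```python
import pandas as pd
import numpy as np

# Toy frame; in the real data 201 bmi entries are missing
df = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, np.nan, 28.9]})
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())  # impute with the column mean
print(df.isnull().sum().sum())  # 0 -- no missing values remain
```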

Next, we can drop the id column from the dataset.

Now, a look at the gender distribution reveals 59% females and 41% males, along with a single record labelled ‘Other’. We can convert this one value to ‘Male’ to simplify the data.
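Both clean-up steps, dropping id and folding the lone ‘Other’ record into ‘Male’, are sketched here on a toy frame:

```python
import pandas as pd

# Toy frame containing the single 'Other' record
df = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["Female", "Other", "Male"],
})
df = df.drop(columns="id")                            # id carries no signal
df["gender"] = df["gender"].replace("Other", "Male")  # fold in the one record
print(df["gender"].value_counts())
```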

There are five attributes with string datatype — {gender, ever_married, work_type, Residence_type, smoking_status}. The rest of the attributes are numeric (int64/float64).

We need to convert these string values to numeric values to prepare the dataframe for modelling, since most machine learning algorithms require numeric input.
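The article does not say which encoder was used; scikit-learn’s `LabelEncoder` is one option (an assumption here), with `pd.get_dummies` as the one-hot alternative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two of the five string columns, as a toy frame; LabelEncoder assigns
# integers in alphabetical order of the values
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes"],
})
for col in ["gender", "ever_married"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df)  # Female=0/Male=1, No=0/Yes=1
```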

With the missing values handled and the datatypes set up correctly, it’s time to plot some graphs to gather insights from the data.

All these graphs and charts reveal a lot of important information about the dataset, such as:

  • Fewer than 10% of the individuals suffer from hypertension
  • A little over 5% of the people suffer from heart disease
  • Residence type is split evenly, i.e. 50% of the population is from rural areas
  • 57% of the people are employed in the private sector, and more than 65% are married
  • Only about 5% of the people had a stroke
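Percentages like the ones above can be read straight off `value_counts(normalize=True)`; a toy stand-in with a 5% positive rate:

```python
import pandas as pd

# 1 stroke in 20 rows, mirroring the ~5% positive rate in the real dataset
df = pd.DataFrame({"stroke": [0] * 19 + [1]})
share = df["stroke"].value_counts(normalize=True)
print(share.loc[1])  # 0.05
```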

Additionally, we can plot some more bar charts to understand how each of these factors relates to the target variable, i.e. the possibility of a stroke.

A correlation map gives an overall idea of how the variables affect each other as well as the outcome.
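The matrix behind such a map comes from `DataFrame.corr()`; rendering it with `seaborn.heatmap` is an assumption about how the article produced its plot. A deterministic toy sketch:

```python
import pandas as pd
import numpy as np

# Deterministic toy columns: age rises, bmi falls, stroke tracks age > 60
df = pd.DataFrame({"age": np.linspace(20, 80, 50)})
df["bmi"] = np.linspace(40, 18, 50)
df["stroke"] = (df["age"] > 60).astype(int)

corr = df.corr()  # in the article, plotted e.g. via seaborn.heatmap(corr, annot=True)
print(corr.round(2))
```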

Now, we can train the models.

Using an 80–20 split on the data, we can apply {Random Forest, Gaussian Naïve Bayes, Decision Tree, k-Nearest Neighbor, Support Vector Machines} to this classification problem.
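The training code is not shown in the article; the sketch below runs the same five classifiers on an 80–20 split of synthetic, imbalanced data (`make_classification` stands in for the cleaned dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: ~95% negative class, like the stroke data
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # the 80-20 split

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "k-Nearest Neighbor": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}
scores = {name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
          for name, model in models.items()}
print(scores)
```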

All the selected algorithms seem to perform well, with accuracies above 90%. (Note, though, that with only ~5% positive cases, plain accuracy can be misleading on imbalanced data: a model that always predicts “no stroke” would already score about 95%.)

Since all the algorithms exhibit high accuracy, it is essential to perform cross-validation to check that the models are reliable. I have applied cross-validation to the Random Forest and k-Nearest Neighbor classifiers to verify their reliability on this dataset and estimate their performance on unseen data.

I have used two cross-validation techniques, Stratified K-Fold and Stratified Shuffle Split, to explore the algorithms’ performance in more detail.
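Both strategies plug into `cross_val_score`; the sketch below uses synthetic data, and parameter choices such as `n_splits=5` are assumptions the article does not state:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, StratifiedShuffleSplit,
                                     cross_val_score)

# Synthetic imbalanced data standing in for the cleaned stroke dataframe
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)
clf = RandomForestClassifier(random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

skf_scores = cross_val_score(clf, X, y, cv=skf)  # each fold keeps the class ratio
sss_scores = cross_val_score(clf, X, y, cv=sss)  # random stratified resamples
print(skf_scores.mean(), sss_scores.mean())
```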

Based on the above graph, both algorithms perform exceptionally well and can therefore be selected for this classification problem.

Next, digging further into the dataset, we can see a discrepancy in the ‘age’ feature: the minimum age seems too low to be meaningful for this analysis.

A scatter plot of age vs bmi gives a clearer idea about age distribution.

This shows that there are entries with age < 20 years whose bmi values exceed 30. These could be considered outliers, since the majority of the entries come from the working-age population. It might therefore be interesting to see how the model performs after removing entries below 20 years of age.

Removing these entries leaves a dataset of 4144 rows with 11 features (the id column was already dropped earlier).
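The filter itself is a single boolean mask; a toy sketch:

```python
import pandas as pd

# Toy frame; on the real data this filter leaves 4144 of 5110 rows
df = pd.DataFrame({"age": [5.0, 17.0, 34.0, 61.0], "stroke": [0, 0, 1, 1]})
df = df[df["age"] >= 20].reset_index(drop=True)
print(len(df))  # 2
```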

For testing purposes, I used the Random Forest classifier on the 80–20 split, which yielded an accuracy of 92.16% against the previous 93.63%, while cross-validation gave 94.5% against the earlier 95.4%. Although the accuracy decreased slightly after removing these entries, the model is likely more reliable, since age values below 20 years do not seem worth considering for this stroke prediction task.


All the chosen algorithms performed comparably well, with accuracies above 90%. Even with cross-validation, the selected algorithms {Random Forest, kNN} remained accurate, with accuracies above 90% for both.

After cleaning the dataset of the less relevant age values, the resulting accuracies for Random Forest with and without cross-validation remained above 90%. The same approach could be applied to the rest of the algorithms to fine-tune the model further and select the best one.

In sum, the model is accurate enough to classify an input for stroke prediction using either the Random Forest or the kNN algorithm.

However, some additional steps could be tried to clean and tune the dataset and improve the model further. Some bmi values are very high, and these would skew the mean if the dataset were small. It might be interesting to remove the bmi column altogether, or to drop the entries with missing bmi values, and then test the model’s performance.

Only about 5% of the entries in the dataset correspond to a stroke. We could therefore also try a different split of the dataset to validate the model further. A feature-importance study could also help improve the model.


The above stroke prediction model is accurate for the selected algorithms. Such models could likely assist healthcare professionals in decision making.


AI provides promising ground in the race to protect human life with cutting-edge medical technology. Machine learning models are already surfacing hidden information in data, and could prove to be more than just decision-support systems for healthcare in the near future.

Looking at the accuracy numbers, the model seems accurate enough for stroke prediction. However, the model could benefit from additional data: this dataset skews heavily toward healthy individuals with no heart disease or stroke, for example.

Data Enthusiast focusing on applications of Machine Learning and Deep Learning in different domains.
