Predict whether a patient is likely to have a stroke using machine learning classification algorithms, and compare the performance of the algorithms.


According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

In this project we predict stroke using machine learning classification algorithms, then evaluate and compare their results. We carried out the following tasks:

  • Compared the performance of machine learning classification algorithms on a stroke prediction dataset.
  • Plotted pie charts, count plots, curves, and other plots using visualization libraries.
  • Applied various data preprocessing techniques.
  • Handled class imbalance.
  • Built various machine learning models.
  • Optimized the SVM and Random Forest classifiers using RandomizedSearchCV to reach the best model.

Domain: Machine Learning, Data Science.


Installing Python libraries and packages

The required Python libraries and packages are:

  • pandas
  • NumPy
  • scikit-learn (sklearn)
  • imbalanced-learn (imblearn)
  • matplotlib
  • seaborn

Features of the Dataset

The dataset contains 5111 rows. Each row provides relevant information about one patient.

  • gender: "Male", "Female" or "Other"
  • age: age of the patient
  • hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
  • heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
  • ever_married: "No" or "Yes"
  • work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
  • Residence_type: "Rural" or "Urban"
  • avg_glucose_level: average glucose level in blood
  • bmi: body mass index
  • smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" ("Unknown" means the information is unavailable for this patient)
  • stroke: 1 if the patient had a stroke or 0 if not

Data Preprocessing

The data was cleaned to make it usable for the model. The following changes were made:

Handling Missing Values - replaced the null values using SimpleImputer from scikit-learn: the median for numeric columns and the most frequent value for text columns.

import pandas as pd
from pandas.api.types import is_string_dtype
from sklearn.impute import SimpleImputer

si_X_train = pd.DataFrame() # create a new dataframe to save the train dataset
si_X_test = pd.DataFrame() # create a new dataframe to save the test dataset

for column in X_train.columns:
  if is_string_dtype(X_train[column].dtype):
    si = SimpleImputer(strategy='most_frequent') # text column: fill with the most frequent value
  else:
    si = SimpleImputer(strategy='median') # numeric column: fill with the median
  si.fit(X_train[[column]]) # learn the fill value from the training data only
  si_X_train[column] = si.transform(X_train[[column]]).flatten() # Flatten 2D matrix to 1D
  si_X_test[column] = si.transform(X_test[[column]]).flatten()
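
To make the imputation rule concrete, here is a minimal pure-Python sketch of what the two strategies compute. It is not the SimpleImputer implementation: the helper names (`fit_fill_value`, `impute`) and the sample values are made up for illustration, and `None` stands in for a missing entry.

```python
from statistics import median, mode

def fit_fill_value(train_values):
    """Learn the fill value from the non-missing training values only."""
    present = [v for v in train_values if v is not None]
    if all(isinstance(v, str) for v in present):
        return mode(present)   # text column: most frequent value
    return median(present)     # numeric column: median

def impute(values, fill_value):
    """Replace every missing entry with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Made-up values for illustration (None marks a missing entry)
bmi_train = [22.0, None, 31.5, 27.0]
bmi_test = [None, 24.0]
smoking_train = ["smokes", "never smoked", None, "never smoked"]

bmi_fill = fit_fill_value(bmi_train)          # median of the three known values
smoking_fill = fit_fill_value(smoking_train)  # most frequent category

bmi_train_filled = impute(bmi_train, bmi_fill)
bmi_test_filled = impute(bmi_test, bmi_fill)  # the test set reuses the train median
smoking_train_filled = impute(smoking_train, smoking_fill)
```

The fill value is learned from the training split and reused on the test split, so no information from the test data leaks into preprocessing.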

Handling Text Features - converted the text features into numeric values using LabelEncoder from scikit-learn.

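from pandas.api.types import is_string_dtype
from sklearn.preprocessing import LabelEncoder

categorical_features = []
for col in data.columns:
  if col=='Class': # skip the target column
    continue
  if is_string_dtype(data[col].dtype):
    categorical_features.append(col) # text column that needs encoding

le = LabelEncoder()

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)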

l_X_train = pd.DataFrame() # Train dataset --> before scaling
l_X_test = pd.DataFrame() # Test dataset --> before scaling

# Convert the text features

for column in X_train.columns:
  if column in categorical_features:
    l_X_train[column] = le.fit_transform(si_X_train[column]) # fit on train, reuse the mapping on test
    l_X_test[column] = le.transform(si_X_test[column])
  else:
    l_X_train[column] = si_X_train[column].copy() # numeric column: keep as is
    l_X_test[column] = si_X_test[column].copy()
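
As a rough illustration of what label encoding does, the sketch below maps each distinct category to an integer in sorted order, which is how LabelEncoder orders its classes. The helper names (`fit_label_mapping`, `encode`) and the sample values are hypothetical; this is not the library code.

```python
def fit_label_mapping(train_values):
    """Map each distinct category to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(train_values)))}

def encode(values, mapping):
    # A category that never appeared in the training data raises KeyError here
    return [mapping[v] for v in values]

# Made-up values for illustration
residence_train = ["Urban", "Rural", "Urban", "Rural"]
residence_test = ["Rural", "Urban"]

mapping = fit_label_mapping(residence_train)   # learned from the train split only
encoded_train = encode(residence_train, mapping)
encoded_test = encode(residence_test, mapping)
```

Note that integer codes imply an ordering between categories. That is harmless for two-valued columns such as Residence_type, but for multi-valued columns such as work_type some models may read the order as meaningful.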

Oversampling the dataset - increased the number of positive samples in the training set using RandomOverSampler from imblearn.

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=0.75) # minority class resampled to 75% of the majority class
l_X_train_ns,y_train_ns = ros.fit_resample(l_X_train,y_train)

print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))
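
To show what a ratio of 0.75 means in practice, here is a small pure-Python sketch of random oversampling. It assumes the ratio is the minority count divided by the majority count after resampling; the function name `random_oversample` and the toy data are made up, and this is an illustration rather than the imblearn implementation.

```python
import random
from collections import Counter

def random_oversample(X, y, ratio, minority_label=1, seed=0):
    """Duplicate random minority rows until minority / majority reaches `ratio`."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_majority = max(counts.values())
    target = int(ratio * n_majority)                  # desired minority count
    n_extra = max(0, target - counts[minority_label])
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    extra_idx = rng.choices(minority_idx, k=n_extra)  # sampled with replacement
    X_res = list(X) + [X[i] for i in extra_idx]
    y_res = list(y) + [y[i] for i in extra_idx]
    return X_res, y_res

# Made-up data: 20 negatives and 2 positives
X = [[float(i)] for i in range(22)]
y = [0] * 20 + [1] * 2

X_res, y_res = random_oversample(X, y, ratio=0.75)
counts_after = Counter(y_res)
print(counts_after)
```

With 20 majority samples, the minority class grows from 2 to 15 rows, all of them copies of the original two. Only the training set is resampled, so the test set keeps the original class distribution.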

Feature Scaling - Standardization

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

l_X_train_ns = ss.fit_transform(l_X_train_ns)
l_X_test = ss.transform(l_X_test)
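
Standardization rescales each feature to zero mean and unit variance using z = (x - mean) / std, where the mean and the (population) standard deviation come from the training data. The sketch below works through this on a single made-up column; the helper names are hypothetical and it is not the StandardScaler code.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from the training column."""
    return fmean(train_column), pstdev(train_column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]

# Made-up glucose levels for illustration
glucose_train = [80.0, 100.0, 120.0, 140.0]
glucose_test = [110.0, 200.0]

mean, std = fit_scaler(glucose_train)                    # statistics from train only
scaled_train = standardize(glucose_train, mean, std)
scaled_test = standardize(glucose_test, mean, std)       # test reuses the train statistics
```

Distance-based and margin-based models such as k Nearest Neighbors and SVM are sensitive to feature scale, which is why this step comes before model fitting.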

Exploratory Data Analysis (EDA)

First, we used visualization libraries to plot pie charts, count plots, curves, and other plots in order to understand the dataset better and to find the correlation between the attributes. Below are a few highlights.

Count Plot - Worktype

ax = sns.countplot(data=data, x="work_type")

Proportion of Different Smoking Categories among Stroke Population

Finding correlation to class variable using Heatmap


No Stroke vs Stroke by BMI


sns.distplot(data[data['stroke'] == 0]["bmi"], color='green') # No Stroke - green
sns.distplot(data[data['stroke'] == 1]["bmi"], color='red') # Stroke - Red

plt.title('No Stroke vs Stroke by BMI', fontsize=15)

Catplot - Heart disease

sns.catplot(x="heart_disease", y="stroke", hue='smoking_status', kind="bar", data=data);

Model Building- Machine Learning Models

The categorical variables were converted to numeric values, and the dataset was split into train and test sets with a test size of 20%.
Once the dataset was ready, we applied several machine learning classification algorithms to it and observed their performance.
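
The splitting code is not shown here. As a sketch of an 80/20 split, the pure-Python function below keeps the class mix the same in both parts (stratification), which matters when positives are rare. Whether the original split was stratified is an assumption, and the function name and toy labels are made up. In scikit-learn the equivalent would be train_test_split with test_size=0.2 and its stratify argument.

```python
import random
from collections import defaultdict

def stratified_split(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(test_size * len(indices))   # 20% of this class goes to test
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 95 negatives and 5 positives
y = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(y, test_size=0.2)

print(len(train_idx), len(test_idx))
```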

The different models used are:

  • Support Vector Machine (Gaussian SVM)
  • Naive Bayes
  • Logistic Regression
  • k Nearest Neighbors
  • Random Forest Classifier

Support Vector Machine – Gaussian SVM

from sklearn.svm import SVC
svc = SVC(kernel='rbf',random_state=0).fit(l_X_train_ns,y_train_ns)

y_pred = svc.predict(l_X_test)
model_metrics = evaluate_preds(y_test, y_pred)

Naive Bayes

from sklearn.naive_bayes import GaussianNB
naive = GaussianNB().fit(l_X_train_ns,y_train_ns)

y_pred = naive.predict(l_X_test)
model_metrics = evaluate_preds(y_test, y_pred)

Logistic Regression

from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression().fit(l_X_train_ns,y_train_ns)

y_pred = logistic.predict(l_X_test)
model_metrics = evaluate_preds(y_test, y_pred)

k Nearest Neighbours

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=40).fit(l_X_train_ns,y_train_ns)

y_pred = neigh.predict(l_X_test)
model_metrics = evaluate_preds(y_test, y_pred)


Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, criterion='entropy').fit(l_X_train_ns,y_train_ns)

y_pred = rf.predict(l_X_test)
model_metrics = evaluate_preds(y_test, y_pred)
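
The task list mentions optimizing the SVM and Random Forest classifiers with RandomizedSearchCV, but the tuning code is not shown here. The following is a minimal sketch of how such a search could look for the Random Forest. The search space, `n_iter`, `cv`, and the F1 scoring choice are all assumptions, and synthetic data from `make_classification` stands in for the preprocessed training set.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical search space; the ranges are illustrative only
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 11),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,       # number of random parameter combinations to try
    scoring="f1",   # assumed metric, chosen because the classes are imbalanced
    cv=3,           # 3-fold cross-validation for each combination
    random_state=0,
)
search.fit(X_demo, y_demo)

best_rf = search.best_estimator_  # refit on all of X_demo with the best parameters
print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.2f}")
```

Unlike an exhaustive grid search, the randomized search evaluates only `n_iter` sampled combinations, which keeps the cost predictable when the search space is large.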

Model performance

Classification Evaluation Metrics

We then compared the results using several classification metrics: accuracy, precision, recall, F1 score, and the Matthews correlation coefficient (MCC).

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

def evaluate_preds(y_test,y_pred):
    accuracy = accuracy_score(y_test,y_pred)
    precision = precision_score(y_test,y_pred)
    recall = recall_score(y_test,y_pred) 
    f1 = f1_score(y_test,y_pred)
    mcc = matthews_corrcoef(y_test,y_pred)

    metric_dict = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc
    } # A dictionary that stores the results of the evaluation metrics
    print(f"Acc: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    print(f'MCC Score: {mcc:.2f}')
    return metric_dict
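
To show how these five metrics relate to the confusion matrix, here is a small worked example in pure Python. The predictions are made up, the helper names are hypothetical, and the formulas are the standard definitions rather than the scikit-learn source.

```python
from math import sqrt

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = stroke)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics_from_counts(tp, tn, fp, fn):
    """Compute the metrics, returning 0.0 where a denominator is zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Made-up example: 2 strokes among 10 patients
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

model = metrics_from_counts(*confusion_counts(y_true, y_pred))
# A baseline that always predicts "no stroke"
baseline = metrics_from_counts(*confusion_counts(y_true, [0] * 10))

print(model)
print(baseline)
```

Both the model and the always-negative baseline reach the same accuracy here, but only the model has a non-zero recall, F1 score, and MCC. This is why accuracy alone is a weak yardstick on an imbalanced dataset like this one.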

Project Report

Pattern Lab Project Report - Stroke Prediction.pdf



