College Dropout Prediction

Overview

Our dataset, found on Kaggle, includes demographic data, socioeconomic factors, and academic performance for students enrolled in undergraduate degrees.

Our goal is to see whether a student’s background has any impact on their academic success in higher education institutions. We also evaluate bias in the dataset using Aequitas.

Some things to note about our dataset:

- Data is from students in Brazil in 2021
- Dropouts include students who changed majors or schools, so the dropout rate is higher than it would be if it counted only students who stopped their studies altogether

Installing Packages (Jupyter on VSCode)

%pip install pandas numpy statsmodels matplotlib scikit-learn seaborn --user

import math
import pandas as pd
import numpy as np
import statsmodels
import sklearn
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

For Aequitas:
- %pip install aequitas --user
- from aequitas.group import Group
  from aequitas.bias import Bias
  from aequitas.fairness import Fairness
  import aequitas.plot as ap
  from aequitas.preprocessing import preprocess_input_df
  from aequitas.plotting import Plot

Challenges

While attempting a linear regression analysis, we realized that it was not well suited to our data because our target is a binary yes/no value rather than a continuous one. We therefore switched to logistic regression, which was a much better fit for the data and for our project goals.
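The difference between the two models on a binary target can be illustrated with a minimal sketch. It does not use our dataset: the single feature, the synthetic labels, and the variable names below are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (illustrative values only):
# one numeric feature and a binary target (1 = graduate, 0 = dropout).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
p_graduate = 1 / (1 + np.exp(-(X[:, 0] - 5)))
y = (rng.uniform(size=500) < p_graduate).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
logistic = LogisticRegression().fit(X_train, y_train)

# Feature values far outside the training range expose the difference:
# the linear fit extrapolates past 0 and 1, while the logistic model
# always returns a probability between 0 and 1.
X_extreme = np.array([[-20.0], [30.0]])
linear_pred = linear.predict(X_extreme)
logistic_prob = logistic.predict_proba(X_extreme)[:, 1]

# Held-out accuracy of the logistic classifier on the synthetic data.
accuracy = logistic.score(X_test, y_test)

print("linear predictions:", linear_pred)
print("logistic probabilities:", logistic_prob)
print("logistic test accuracy:", accuracy)
```

The linear model's outputs cannot be read as probabilities once they leave the 0 to 1 range, which is the practical reason a classifier such as logistic regression suits a dropout/graduate target better.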

Conclusion

After running different correlation methods and classifiers, we found that the Curricular units 2nd sem (approved), Curricular units 1st sem (approved), and Curricular units 2nd sem (grade) columns were the most correlated with, and the most influential on, a student’s target value (dropout or graduate).
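One way to rank columns by their relationship to a binary target is `mutual_info_classif`, which is among the imports listed above. The sketch below is illustrative only: the data is synthetic, the `unrelated_noise` column is hypothetical, and the resulting ranking follows from how the toy data is built rather than reproducing our results.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data (not the Kaggle dataset): the target is 1 = graduate,
# 0 = dropout, and two columns are constructed to depend on it.
rng = np.random.default_rng(42)
n = 1000
target = rng.integers(0, 2, size=n)

df = pd.DataFrame({
    # Strongly tied to the target by construction.
    "Curricular units 2nd sem (approved)": target * 4 + rng.integers(0, 3, size=n),
    # Weaker, noisier link to the target.
    "Curricular units 2nd sem (grade)": target * 3.0 + rng.normal(10, 2, size=n),
    # Hypothetical column with no relationship to the target.
    "unrelated_noise": rng.normal(0, 1, size=n),
})

# Estimate the mutual information between each column and the target,
# then sort so the most informative columns come first.
mi = mutual_info_classif(df, target, random_state=0)
ranking = pd.Series(mi, index=df.columns).sort_values(ascending=False)

print(ranking)
```

Mutual information also captures non-linear relationships, so it can complement a plain correlation coefficient when screening columns.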

In our Aequitas analysis, when we analyzed the data frame with all protected attributes, including “Age at enrollment”, the overall fairness report showed that the dataset is unfair. When we ran the analysis excluding “Age at enrollment”, the overall fairness report showed that the dataset is fair and passes all parity tests.

For future analysis, we would like to broaden the data to more higher education institutions and record the reason a student is a “dropout” (school change, major change, or no longer pursuing a degree).
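The idea behind a parity test can be sketched without Aequitas. The example below is not the Aequitas API and not our analysis: it is a plain pandas illustration under assumed inputs, with made-up data, a hypothetical `age_group` attribute, a hypothetical `parity_report` helper, and an assumed tolerance of 0.8 (a group passes when its rate is within 0.8 to 1.25 times the reference group's rate).

```python
import pandas as pd

# Made-up toy input: 'score' is a binary prediction and 'age_group' is a
# hypothetical protected attribute. None of these values come from our data.
df = pd.DataFrame({
    "score": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
    "age_group": ["under_25"] * 5 + ["25_plus"] * 5,
})

def parity_report(data, attribute, reference, tau=0.8):
    """Compare each group's predicted positive rate to a reference group.

    A group passes when its rate, divided by the reference group's rate,
    lies between tau and 1 / tau (an assumed tolerance for this sketch).
    """
    rates = data.groupby(attribute)["score"].mean()
    disparity = rates / rates[reference]
    return pd.DataFrame({
        "predicted_positive_rate": rates,
        "disparity": disparity,
        "passes_parity": disparity.between(tau, 1 / tau),
    })

report = parity_report(df, "age_group", reference="under_25")
print(report)
```

In this toy data one group is predicted positive far less often than the reference group, so it fails the check; dropping an attribute from the analysis removes that attribute's groups from the report entirely, which is one way an overall verdict can change.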

