chawla201/Kaggle-ML-DS-Survey-2020-Analysis

Kaggle ML & DS Survey 2020

The 2020 Kaggle ML & DS Survey is the fourth edition of the annual industry-wide survey that presents a comprehensive view of the state of data science and machine learning. The survey was live for 3.5 weeks in October, and Kaggle collected a little more than 20,000 responses.

Data

Main Data:

Responses to multiple-choice questions (only a single choice can be selected) were recorded in individual columns. Responses to multiple-selection questions (multiple choices can be selected) were split into multiple columns, with one column per answer choice.
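The column layout described above can be sketched with a toy DataFrame. The column names (`Q1`, `Q7_Part_1`, ...) and the helper `multi_select_counts` are hypothetical stand-ins used only for illustration, not names confirmed by this README. The sketch assumes each multiple-selection column holds the choice label when selected and a missing value otherwise.

```python
import pandas as pd

# Toy frame mimicking the described layout. All column names and values
# here are illustrative assumptions, not taken from the survey file.
df = pd.DataFrame({
    "Q1": ["25-29", "30-34", "18-21"],        # single choice: one column
    "Q7_Part_1": ["Python", "Python", None],  # multiple selection:
    "Q7_Part_2": ["R", None, None],           # one column per answer choice
    "Q7_Part_3": [None, "SQL", "SQL"],
})

def multi_select_counts(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Count how many respondents selected each choice of one question."""
    cols = [c for c in frame.columns if c.startswith(prefix + "_Part_")]
    # Put all choice columns end to end, drop the missing (unselected)
    # cells, and count what is left.
    selected = pd.concat([frame[c] for c in cols]).dropna()
    return selected.value_counts()

counts = multi_select_counts(df, "Q7")
print(counts)
```

Because a respondent can appear in several columns of the same question, these counts can add up to more than the number of respondents.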

Supplementary Data:

Cleaned and Transformed Data:

After cleaning, transforming and splitting the provided data (kaggle_survey_2020_responses.csv), we get 4 separate DataFrames, namely:

  1. questions: Questions asked in the survey
  2. response: Responses entered by the respondents
  3. professionals: Responses by professional respondents
  4. non professionals: Responses by non-professional respondents
According to the Survey Methodology provided with the data, a respondent is categorised as `Non Professional` if they are a student, are unemployed, or have never spent money on cloud services.
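A minimal sketch of this split, under stated assumptions: the first data row carries the question text, a column `Q5` holds the job title, and a column `Q25` holds the money spent on cloud services. These column names and answer labels are assumptions made for illustration, and the toy rows stand in for the real CSV.

```python
import pandas as pd

# Toy stand-in for kaggle_survey_2020_responses.csv. The column names
# (Q5, Q25), the answer labels, and the question-text-in-first-row layout
# are assumptions, not confirmed by this README.
raw = pd.DataFrame({
    "Q5": ["Select the title most similar to your current role",
           "Student", "Data Scientist", "Currently not employed", "Data Analyst"],
    "Q25": ["Approximately how much money have you spent on cloud services?",
            None, "$1000-$9,999", "$0 ($USD)", "$0 ($USD)"],
})

questions = raw.iloc[0]                          # question text per column
response = raw.iloc[1:].reset_index(drop=True)   # the actual answers

# Non-professional: student OR unemployed OR no money spent on cloud services.
is_non_professional = (
    response["Q5"].isin(["Student", "Currently not employed"])
    | response["Q25"].eq("$0 ($USD)")
)

non_professionals = response[is_non_professional]
professionals = response[~is_non_professional]
print(len(professionals), len(non_professionals))
```

Note that the rule is a logical OR, so an employed respondent who reports no cloud spending also lands in the non-professional group.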

Technologies Used:

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Plotly

Exploratory Data Analysis:

EDA of the data provides an overview of the demographic distributions and general trends in terms of Age, Location, Qualification, Experience, etc.
Since all the graphs and plots are created using Plotly, it is advised to view the EDA Python Notebook (2020-kaggle-ml-ds-survey-analysis.ipynb) in Kaggle Notebooks or in nbviewer, as GitHub does not render interactive graphs.
We have two EDA files:

In the first one, I considered responses from all the respondents. In the second exploratory data analysis file, I considered only working professionals to get a better sense of what the professional landscape of Kagglers looks like.


Final Analysis

Scientists vs Analysts is a comparative study of Data Scientists, Data Analysts, and Business Analysts. The main focus of the analysis is the difference in the duties they perform and the difference in their annual salaries based on their country of residence and educational background. This analysis also highlights how underpaid Indian data science professionals are compared with those in the US.
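One way such a salary comparison could be set up is sketched below. It assumes salaries are recorded as range buckets and converts each bucket to its midpoint before grouping. The column names (`role`, `country`, `salary_range`), the helper `range_midpoint`, and every row are made up for illustration; they are not survey results and not necessarily the method used in the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT survey results.
# Column names are hypothetical.
toy = pd.DataFrame({
    "role": ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"],
    "country": ["India", "United States of America",
                "India", "United States of America"],
    "salary_range": ["10,000-14,999", "100,000-124,999",
                     "5,000-7,499", "60,000-69,999"],
})

def range_midpoint(text: str) -> float:
    """Turn a bucket such as '10,000-14,999' into its numeric midpoint."""
    low, high = (float(part.replace(",", "").replace("$", ""))
                 for part in text.split("-"))
    return (low + high) / 2

toy["salary_mid"] = toy["salary_range"].map(range_midpoint)

# Median midpoint salary per role, with one column per country.
median_pay = (
    toy.groupby(["role", "country"])["salary_mid"]
       .median()
       .unstack("country")
)
print(median_pay)
```

Open-ended buckets (for example a "more than" top bracket) have no midpoint and would need separate handling before this conversion.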