Further Data Cleaning Based on Business Context and Consideration #15

ykamm · 2024-02-18T06:24:32Z

After Abdul's initial cleaning, the dimensions of our dataset were: (903653, 28).

During further cleaning, I found additional placeholder values that are not usable for us (i.e., "not set", "not available in demo dataset"). Our use cases also need entries with complete geographical data, so I inspected the geographical columns thoroughly and removed rows with missing values that could not be imputed by deduction. The dimensions of our dataset are now: (193578, 35).
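For reviewers who want a quick picture of the geographical filter, here is a minimal sketch of the idea, not the notebook code itself. It assumes pandas; the column names `country` and `city` and the helper `drop_incomplete_geo` are hypothetical, and the deduction-based imputation mentioned above is left out.

```python
import numpy as np
import pandas as pd

# Placeholder strings named in this PR; the parenthesized variant
# appears in the commit message below.
PLACEHOLDERS = ["not set", "(not set)", "not available in demo dataset"]


def drop_incomplete_geo(df, geo_cols):
    """Treat placeholder strings as missing, then drop rows lacking any geo field."""
    cleaned = df.copy()
    cleaned[geo_cols] = cleaned[geo_cols].replace(PLACEHOLDERS, np.nan)
    return cleaned.dropna(subset=geo_cols)


# Toy frame; the column names are hypothetical stand-ins for the real geo columns.
demo = pd.DataFrame({
    "country": ["Canada", "(not set)", "France"],
    "city": ["Toronto", "Paris", "not available in demo dataset"],
    "pageviews": [3, 1, 7],
})
result = drop_incomplete_geo(demo, ["country", "city"])
print(result.shape)
```

Only the first toy row survives, since the other two each carry a placeholder in one geo column.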

The output of this code is the file squeaky_clean_train_data.csv, which can be used directly for future exploratory data analysis and modelling. This dataset could also serve as the common starting point before the data is used for both use cases. This can only be confirmed once Angel (Business Analyst for Use Case 1) approves this request.

This commit encompasses a series of data cleaning steps aimed at preparing the dataset for machine learning modeling. The following changes have been made:

- Bounces and NewVisits Handling: Converted `NaN` values to `0` for the 'bounces' and 'newVisits' columns to indicate non-bounce sessions and returning visits, respectively, enhancing the binary feature's consistency.
- TransactionRevenue Adjustment: Replaced `NaN` values with `0` in the 'transactionRevenue' column to represent sessions without any transactional revenue, addressing the sparsity of transactional data.
- Keyword Column Removal: Dropped the 'keyword' column due to its high volume of `NaN` values and limited predictive value, streamlining the feature set.
- IsTrueDirect Normalization: Converted `NaN` values to `False` in the 'isTrueDirect' column, clarifying the interpretation of direct versus indirect session access.
- ReferralPath Column Removal: Dropped the 'referralPath' column due to its complexity and the high percentage of missing values, simplifying the dataset.
- Campaign and Medium Enhancement: Updated the 'campaign' column to replace "(not set)" with "No Campaign" for clarity. Grouped less common mediums into an "other" category in the 'medium' column for model simplicity and removed rows where the medium was "(not set)".
- Pageviews NaN Removal: Eliminated rows with `NaN` values in the 'pageviews' column, ensuring completeness in pageview data.

These steps were taken to improve the dataset's quality and readiness for machine learning analysis, ensuring that each feature contributes meaningfully to the model's predictive accuracy.
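The steps listed above could be sketched in pandas roughly as follows. This is an illustrative reconstruction, not the code from the commit: the function name `apply_cleaning_steps` and the `min_medium_count` threshold for "less common" mediums are assumptions, and the column names are taken from the commit message as written (the real columns may carry prefixes).

```python
import numpy as np
import pandas as pd


def apply_cleaning_steps(df, min_medium_count=2):
    """Apply the cleaning steps from the commit message to a copy of df.

    min_medium_count is a hypothetical threshold; the PR does not say
    how "less common" mediums were defined.
    """
    out = df.copy()

    # NaN -> 0 for the binary flags and for revenue.
    for col in ["bounces", "newVisits", "transactionRevenue"]:
        out[col] = out[col].fillna(0)

    # Drop the sparse, hard-to-use columns.
    out = out.drop(columns=["keyword", "referralPath"])

    # NaN -> False, assuming non-missing values are already boolean.
    is_direct = out["isTrueDirect"]
    out["isTrueDirect"] = is_direct.where(is_direct.notna(), False).astype(bool)

    # Give the campaign placeholder a readable label.
    out["campaign"] = out["campaign"].replace("(not set)", "No Campaign")

    # Remove "(not set)" mediums first so they are not folded into "other",
    # then group the rare mediums.
    out = out[out["medium"] != "(not set)"].copy()
    counts = out["medium"].value_counts()
    rare = counts[counts < min_medium_count].index
    out["medium"] = out["medium"].where(~out["medium"].isin(rare), "other")

    # Drop rows without pageview data.
    out = out.dropna(subset=["pageviews"])
    return out.reset_index(drop=True)


# Toy data only; values are made up for illustration.
demo = pd.DataFrame({
    "bounces": [1, np.nan, np.nan, 1],
    "newVisits": [np.nan, 1, 1, np.nan],
    "transactionRevenue": [np.nan, 5.0e6, np.nan, np.nan],
    "keyword": [np.nan, np.nan, "shoes", np.nan],
    "isTrueDirect": [True, np.nan, True, np.nan],
    "referralPath": [np.nan, "/", np.nan, np.nan],
    "campaign": ["(not set)", "Spring", "(not set)", "(not set)"],
    "medium": ["organic", "organic", "(not set)", "affiliate"],
    "pageviews": [3, np.nan, 2, 1],
})
cleaned = apply_cleaning_steps(demo)
print(cleaned)
```

One design note: the sketch filters out "(not set)" mediums before grouping, so that a rare "(not set)" value cannot end up inside the "other" category. The commit message lists these two actions in the opposite order, so the actual notebook may differ.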

review-notebook-app · 2024-02-18T06:24:37Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Abdul-AA and others added 2 commits February 17, 2024 19:45

Made some changes for Yvan to do some EDA

c1c660e

ykamm assigned ykamm and Abdul-AA Feb 21, 2024
