Further Data Cleaning Based on Business Context and Consideration #15

Open
wants to merge 2 commits into
base: main

Conversation

ykamm
Collaborator

@ykamm ykamm commented Feb 18, 2024

After Abdul's initial cleaning, the dimensions of our dataset were: (903653, 28).

During further cleaning, I found additional values that were not usable for us (i.e., "not set", "not available in demo dataset"). Because our use cases need entries with comprehensive geographical data, I also conducted a thorough inspection of the geographical data for missing values that could not be imputed by deduction. The dimensions of our dataset are now: (193578, 35).
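For reviewers who want a quick picture of this step, here is a minimal sketch of the idea, not the notebook's actual code. It assumes hypothetical geographical column names (`continent`, `country`, `city`) and assumes "(not set)" is also a placeholder variant. It does not attempt the imputation by deduction mentioned above; it only treats placeholders as missing and drops rows with incomplete geography.

```python
import numpy as np
import pandas as pd

# Placeholder strings named in this PR; "(not set)" is an assumed variant.
PLACEHOLDERS = ["not set", "(not set)", "not available in demo dataset"]

# Hypothetical geographical columns; the real notebook may use other names.
GEO_COLUMNS = ["continent", "country", "city"]


def drop_unusable_geo_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Treat placeholder strings as missing, then keep rows with full geography."""
    cleaned = df.replace(PLACEHOLDERS, np.nan)
    return cleaned.dropna(subset=GEO_COLUMNS).reset_index(drop=True)


# Toy data for illustration only; these are not rows from the real dataset.
sample = pd.DataFrame(
    {
        "continent": ["Americas", "Europe", "Asia"],
        "country": ["Canada", "not set", "Japan"],
        "city": ["Toronto", "Paris", "not available in demo dataset"],
        "pageviews": [3, 1, 7],
    }
)
result = drop_unusable_geo_rows(sample)
print(result.shape)
```

Comparing `df.shape` before and after a call like this is one way to sanity-check a change in row count.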

The output of this code is the file squeaky_clean_train_data.csv, which can be used directly for future exploratory data analysis and modelling. This dataset could also serve as the common starting point before the data is used for both use cases. This can only be confirmed once Angel (Business Analyst for Use Case 1) approves this request.

Abdul-AA and others added 2 commits February 17, 2024 19:45
This commit encompasses a series of data cleaning steps aimed at preparing the dataset for machine learning modeling. The following changes have been made:

- Bounces and NewVisits Handling: Converted `NaN` values to `0` in the 'bounces' and 'newVisits' columns to indicate non-bounce sessions and returning visits, respectively, making both binary features consistent.

- TransactionRevenue Adjustment: Replaced `NaN` values with `0` in the 'transactionRevenue' column to represent sessions without any transactional revenue, addressing the sparsity of transactional data.

- Keyword Column Removal: Dropped the 'keyword' column due to its high volume of `NaN` values and limited predictive value, streamlining the feature set.

- IsTrueDirect Normalization: Converted `NaN` values to `False` in the 'isTrueDirect' column, clarifying the interpretation of direct versus indirect session access.

- ReferralPath Column Removal: Dropped the 'referralPath' column due to its complexity and the high percentage of missing values, simplifying the dataset.

- Campaign and Medium Enhancement: Updated the 'campaign' column to replace "(not set)" with "No Campaign" for clarity. Grouped less common mediums into an "other" category in the 'medium' column for model simplicity and removed rows where the medium was "(not set)".

- Pageviews NaN Removal: Eliminated rows with `NaN` values in the 'pageviews' column, ensuring completeness in pageview data.

These steps were taken to improve the dataset's quality and readiness for machine learning analysis, ensuring that each feature contributes meaningfully to the model's predictive accuracy.
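
The steps listed in this commit message could be sketched in pandas roughly as follows. This is an illustration under assumptions, not the committed notebook code: the function name, the `min_medium_count` threshold for what counts as a "less common" medium, and the order of the medium steps (dropping "(not set)" rows before grouping) are all hypothetical, since the commit message does not state them.

```python
import numpy as np
import pandas as pd


def apply_initial_cleaning(df: pd.DataFrame, min_medium_count: int = 2) -> pd.DataFrame:
    """Apply the cleaning steps from the commit message to a copy of df."""
    out = df.copy()

    # NaN -> 0: no bounce, returning visit, no revenue.
    for col in ["bounces", "newVisits", "transactionRevenue"]:
        out[col] = out[col].fillna(0)

    # Anything that is not True (including NaN) becomes False.
    out["isTrueDirect"] = out["isTrueDirect"].eq(True)

    # Drop the sparse columns.
    out = out.drop(columns=["keyword", "referralPath"])

    # Relabel the campaign placeholder.
    out["campaign"] = out["campaign"].replace("(not set)", "No Campaign")

    # Remove rows with an unset medium, then group rare mediums (assumed order).
    out = out[out["medium"] != "(not set)"].copy()
    counts = out["medium"].value_counts()
    rare = counts[counts < min_medium_count].index  # assumed threshold
    out["medium"] = out["medium"].where(~out["medium"].isin(rare), "other")

    # Keep only rows with a pageviews value.
    out = out.dropna(subset=["pageviews"])
    return out.reset_index(drop=True)


# Toy data for illustration only; these are not rows from the real dataset.
sample = pd.DataFrame(
    {
        "bounces": [1, np.nan, np.nan, 1, np.nan],
        "newVisits": [np.nan, 1, 1, np.nan, 1],
        "transactionRevenue": [np.nan, 1000000.0, np.nan, np.nan, np.nan],
        "keyword": [np.nan] * 5,
        "isTrueDirect": [True, np.nan, np.nan, True, np.nan],
        "referralPath": [np.nan, "/", np.nan, np.nan, np.nan],
        "campaign": ["(not set)", "AW - Apparel", "(not set)", "(not set)", "(not set)"],
        "medium": ["organic", "organic", "(not set)", "affiliate", "organic"],
        "pageviews": [2, 5, 1, 4, np.nan],
    }
)
cleaned = apply_initial_cleaning(sample)
print(cleaned)
```

On the real data, the threshold would need to be chosen from the actual distribution of 'medium' values.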
2 participants