Data preparation for Machine Learning Model

Novel Coronavirus COVID-19 Dataset

The original dataset, file names, and data definitions/structures are available at: https://github.com/MohammadFebriyanto/Bangkit_Project/tree/master/DATA

File Dataset

patient.csv                 22 columns  (Main File)
time.csv                    24 columns  
route.csv                    7 columns
case.csv                     8 columns
trend.csv                    5 columns
TotalCaseConvir_INA.csv      2 columns  (dataset file for country: Indonesia)

The data preparation steps are as follows:

Import Library

Pandas and NumPy are used for data analysis, data manipulation, and numerical computation.
Matplotlib and Seaborn are used for data visualization.
Math provides access to the mathematical functions defined by the C standard.

Load the dataset

Read the CSV files into data frames.
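As a minimal sketch of this step, the snippet below reads a CSV into a pandas DataFrame and prints a quick summary. The inline sample stands in for patient.csv; its column names and rows are assumptions made for illustration, not the real file contents, and `load_patients` is a hypothetical helper.

```python
import io

import pandas as pd

# Toy stand-in for patient.csv. Column names and rows are assumptions.
SAMPLE_CSV = """id,sex,birth_year,infection_reason,confirmed_date
1,female,1984,contact with patient,2020-01-20
2,male,1964,visit to Daegu,2020-01-24
3,male,,contact with patient,2020-01-26
"""


def load_patients(source):
    """Read a patient file (path or file-like object) into a DataFrame."""
    # parse_dates converts the date column to datetime64 while reading.
    return pd.read_csv(source, parse_dates=["confirmed_date"])


patients = load_patients(io.StringIO(SAMPLE_CSV))
patients.info()          # column types and non-null counts
print(patients.head())   # first rows for a visual check
```

With the real data, the same helper would be called with a file path such as "patient.csv" instead of the in-memory sample.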

Find information and insights from the dataset.

Missing Value

Identify and handle missing values in the training and test data.
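One possible way to do this with pandas is sketched below: count the missing values per column, drop rows where the target is missing, and fill the remaining gaps. The column names, the choice of `state` as the target, and the fill strategies (median, "unknown") are assumptions for illustration; the notebook may handle them differently.

```python
import numpy as np
import pandas as pd

# Toy data. Column names and values are assumptions for illustration.
df = pd.DataFrame({
    "birth_year": [1984.0, np.nan, 1964.0, 1990.0],
    "sex": ["female", None, "male", "female"],
    "state": ["released", "isolated", None, "isolated"],
})

# Step 1: identify missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Step 2: drop rows where the (assumed) target column is missing.
cleaned = df.dropna(subset=["state"]).copy()

# Step 3: fill numeric gaps with the median and categorical gaps with a label.
cleaned["birth_year"] = cleaned["birth_year"].fillna(cleaned["birth_year"].median())
cleaned["sex"] = cleaned["sex"].fillna("unknown")
print(cleaned)
```

When a train/test split is used, computing the fill values on the training set only and reusing them on the test set avoids leaking test information.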

Create New Features

In the COVID-19 data, the date and the age range can be useful metrics for analyzing confirmed cases. New features built by combining existing variables are often referred to as interaction terms.
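The sketch below shows one way such features could be derived: an age range from a birth year, and date parts from a confirmation date. The columns `birth_year` and `confirmed_date`, the ten-year bins, and the new column names are assumptions, not necessarily what the notebook uses.

```python
import pandas as pd

# Toy data. Column names and values are assumptions for illustration.
df = pd.DataFrame({
    "birth_year": [1935, 1984, 2001],
    "confirmed_date": pd.to_datetime(["2020-02-18", "2020-01-20", "2020-03-02"]),
})

# Approximate age at confirmation.
df["age"] = df["confirmed_date"].dt.year - df["birth_year"]

# Ten-year age ranges: [0, 10), [10, 20), ..., [100, 110).
bins = list(range(0, 111, 10))
labels = [f"{lo}-{lo + 9}" for lo in range(0, 110, 10)]
df["age_range"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# Date-derived features.
df["confirmed_month"] = df["confirmed_date"].dt.month
df["confirmed_weekday"] = df["confirmed_date"].dt.day_name()

print(df)
```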

Aggregating Numerical Variable

When a categorical variable has many distinct labels, limiting the number of labels can be a solution. In the data, the variable 'infection_reason' has several labels, and the number of confirmed cases for each one (numerical_infection) can be displayed.
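A minimal sketch of this idea follows: count the cases per 'infection_reason' label, then collapse rare labels into a single "other" bucket. The sample values and the `min_count` threshold are hypothetical; only the names 'infection_reason' and numerical_infection come from the text above.

```python
import pandas as pd

# Toy values for the 'infection_reason' column (assumed for illustration).
reasons = pd.Series(
    ["contact with patient"] * 4
    + ["visit to Daegu"] * 3
    + ["visit to Wuhan", "visit to Japan"],
    name="infection_reason",
)

# Number of confirmed cases per label.
numerical_infection = reasons.value_counts()
print(numerical_infection)

# Limit the number of labels: anything seen fewer than min_count times
# is replaced by "other". The threshold is a hypothetical choice.
min_count = 2
rare = numerical_infection[numerical_infection < min_count].index
reasons_grouped = reasons.where(~reasons.isin(rare), "other")
grouped_counts = reasons_grouped.value_counts()
print(grouped_counts)
```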

Log Transform

The log transform step applies quantile capping to the variable and then a logarithmic transformation to treat extreme values. The result is represented by confirmed_date_transform.
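The two operations named here, quantile capping and a logarithmic transformation, can be sketched as below. The helper `cap_and_log`, the 95th-percentile cap, the use of `log1p`, and the sample counts are all assumptions; the notebook's exact variable and parameters are not stated in this README.

```python
import numpy as np
import pandas as pd


def cap_and_log(values, upper_quantile=0.95):
    """Cap values at an upper quantile, then apply log(1 + x)."""
    upper = values.quantile(upper_quantile)
    capped = values.clip(upper=upper)   # quantile capping of extreme values
    return np.log1p(capped)             # log1p keeps zeros well defined


# Toy counts with one extreme value (assumed for illustration).
daily_confirmed = pd.Series([0, 1, 2, 3, 5, 8, 13, 21, 400])
transformed = cap_and_log(daily_confirmed)
print(transformed)
```

Using `log1p` rather than `log` is a design choice that avoids negative infinity when a count is zero.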

Split the dataset into Training Set and Test Set (80/20)
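An 80/20 split can be sketched with pandas alone, as below. The helper `split_train_test`, the seed, and the toy frame are assumptions; the notebook may use a different utility for the same purpose.

```python
import pandas as pd


def split_train_test(df, train_frac=0.8, seed=42):
    """Randomly split a DataFrame with a unique index into train and test parts."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)   # everything not sampled goes to the test set
    return train, test


# Toy data (assumed for illustration).
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})
train, test = split_train_test(df)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible between runs.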

Modeling (optional).

Results and Insights from Data Preparation

Based on this result, we conclude that older people, in the 80 to 90 age range, are more susceptible to the coronavirus than younger people. The number of female patients is also higher than the number of male patients. The graph conveys this more persuasively than before.
The most common reasons people get infected by the coronavirus are direct contact with another patient and a visit to Daegu. The output is given in numerical form.
Handling missing values can improve performance on the prepared data, especially for the target variable.
The main difference among all of them is the use of the log transform for positive skewness, that is, when the tail on the right side of the distribution is longer. The goal of the log transform is to reduce this skew.
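The effect on positive skewness can be checked numerically, as in the sketch below, which compares the sample skewness before and after a log transform. The counts are toy values and `log1p` is an assumed choice of transform.

```python
import numpy as np
import pandas as pd

# Toy right-skewed counts (assumed for illustration).
counts = pd.Series([1, 1, 2, 2, 3, 4, 6, 10, 25, 120])

skew_before = counts.skew()            # positive: long right tail
skew_after = np.log1p(counts).skew()   # the log compresses large values

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

A value closer to zero after the transform indicates a more symmetric distribution.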

Data Visualization

Handling Missing Values

Aggregate dates

Optimize data to be easily understood

Number of confirmed cases by infection reason

Handling Skewness
