
Covid-19 MapReduce project


Design and implementation of several simple MapReduce jobs used to analyze a dataset on the Covid-19 disease created by Our World In Data.

Design and implementation in MapReduce of:

  • a job returning the ranking of continents in decreasing order of total_cases
  • a job returning, for each location, the average number of new_tests per day
  • a job returning the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total

All the MapReduce jobs are executed with inputs of different sizes and with different Hadoop configurations (local, pseudo-distributed without YARN, and pseudo-distributed with YARN) to compare their execution times.

🚀 About Me

I'm a Computer Science Master's degree student and this is one of my university projects. See my other projects here on GitHub!


💻 The project

Dataset

The dataset is a collection of COVID-19 data maintained and updated daily by Our World In Data; it contains data on confirmed cases, deaths, hospitalizations, and other variables of potential interest.

The dataset, available in different formats, can be found here, while the data dictionary, useful for understanding the meaning of all the dataset's columns, is available here.

Data dictionary

Job 1

Implementation of a MapReduce job that returns the ranking of continents in decreasing order of total_cases.

Before the Map phase there is a Splitting phase, in which the raw lines of the dataset are divided into single blocks of data, using the “,” character to separate the different values and the “\t” character to separate the different rows of the dataset.

MAPPING PHASE

Input: all the values contained in each row of the dataset, separated by ","

Output: the continent of each nation and its number of total cases. Since total_cases is a cumulative count (the sum of the cases of all the previous days), only the value on the last day of the input period is taken (e.g. if the input covers only the March-April period, the total cases reported on 30th April are used).
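A minimal sketch of such a mapper in Java is shown below; the column indices, the cut-off date, and the class name are assumptions for illustration, not the repository's actual code.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Job 1 mapper sketch: emits (continent, total_cases) only for the rows of the
// last day of the input period, since total_cases is cumulative.
public class ContinentCasesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int CONTINENT_COL = 1;    // assumed position of "continent"
    private static final int DATE_COL = 3;         // assumed position of "date"
    private static final int TOTAL_CASES_COL = 4;  // assumed position of "total_cases"
    private static final String LAST_DAY = "2020-04-30"; // last day of a March-April input

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        if (fields.length <= TOTAL_CASES_COL
                || fields[CONTINENT_COL].isEmpty()
                || fields[TOTAL_CASES_COL].isEmpty()
                || !LAST_DAY.equals(fields[DATE_COL])) {
            return; // skip the header, aggregate rows without a continent, and other days
        }
        try {
            long totalCases = (long) Double.parseDouble(fields[TOTAL_CASES_COL]);
            context.write(new Text(fields[CONTINENT_COL]), new LongWritable(totalCases));
        } catch (NumberFormatException e) {
            // ignore malformed rows
        }
    }
}
```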

SHUFFLING PHASE

The MapReduce framework processes the output of the map function before sending it to the reduce function: the map outputs are grouped by key.

Input: output of the map function

Output: each continent appears with the list of all of its total_cases values

REDUCE PHASE

Input: output of the shuffling phase

Output: total number of cases for each continent
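A matching reducer could look like the following sketch (again illustrative, assuming the mapper above):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Job 1 reducer sketch: sums the last-day total_cases of all the nations
// belonging to the same continent.
public class ContinentCasesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text continent, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(continent, new LongWritable(sum));
    }
}
```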

FINAL RESULT

The final result is composed of key-value pairs of the form KEY = name of the continent, VALUE = number of total cases for that continent.

At the end, these key-value pairs are sorted to show the results in decreasing order of total cases.
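One common way to obtain this ordering in Hadoop is a second, sorting-only pass whose intermediate key is the case count, combined with a descending comparator; the sketch below shows one such assumed arrangement, not necessarily the one used in this repository.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Descending comparator for the sorting pass: with (total_cases, continent)
// as the intermediate pairs, the framework sorts keys from largest to smallest.
// Registered on the sorting job with:
//   job.setSortComparatorClass(DescendingLongComparator.class);
public class DescendingLongComparator extends WritableComparator {

    protected DescendingLongComparator() {
        super(LongWritable.class, true);
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b); // invert the natural ascending order
    }
}
```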


Job 1 results:

The execution of this job has been done in all 3 Hadoop configurations (local, and pseudo-distributed with and without YARN).

Three different input periods have been used; the results change with the input:

Input: MARCH-APRIL data


Input: MARCH-AUGUST data


Input: MARCH-OCTOBER data


Job 2

Implementation of a MapReduce job that returns, for each location, the average number of new_tests per day.

MAPPING PHASE

Input: all the values contained in each row of the dataset, separated by ","

Output: location and number of new_tests per day
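A sketch of this mapper, with assumed (hypothetical) column positions for location and new_tests:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Job 2 mapper sketch: emits (location, new_tests) for every row that reports
// a new_tests value.
public class LocationTestsMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private static final int LOCATION_COL = 2;   // assumed position of "location"
    private static final int NEW_TESTS_COL = 25; // assumed position of "new_tests"

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        if (fields.length <= NEW_TESTS_COL || fields[NEW_TESTS_COL].isEmpty()) {
            return; // skip the header and days without a reported new_tests value
        }
        try {
            double newTests = Double.parseDouble(fields[NEW_TESTS_COL]);
            context.write(new Text(fields[LOCATION_COL]), new DoubleWritable(newTests));
        } catch (NumberFormatException e) {
            // ignore malformed rows
        }
    }
}
```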

SHUFFLING PHASE

Input: output of the map function

Output: each location appears with the list of all of its new_tests values

REDUCE PHASE

Input: output of the shuffling phase

Output: the total number of new_tests for each location, divided by the number n of days
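A reducer sketch for this average (illustrative, assuming the mapper above):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Job 2 reducer sketch: sums all the new_tests of a location and divides the
// total by the number n of days for which a value was reported.
public class AverageTestsReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text location, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long days = 0;
        for (DoubleWritable v : values) {
            sum += v.get();
            days++;
        }
        if (days > 0) {
            context.write(location, new DoubleWritable(sum / days));
        }
    }
}
```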

FINAL RESULT

The final result is composed of key-value pairs of the form KEY = name of the location, VALUE = average number of new_tests for that location.


Job 2 results:

The execution of this job has been done in all 3 Hadoop configurations (local, and pseudo-distributed with and without YARN).

Three different input periods have been used; the results change with the input:

Input: MARCH-APRIL data


Input: MARCH-AUGUST data


Input: MARCH-OCTOBER data


Job 3

Implementation of a MapReduce job that returns the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total.

MAPPING PHASE

Input: all the values contained in each row of the dataset, separated by ","

Output: date and number of icu_patients + hosp_patients per day
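A sketch of this mapper, with assumed column positions for date, icu_patients, and hosp_patients:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Job 3 mapper sketch: emits (date, icu_patients + hosp_patients) for every
// row (i.e. every location/day) that reports both values.
public class HospitalizationsMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int DATE_COL = 3;  // assumed position of "date"
    private static final int ICU_COL = 17;  // assumed position of "icu_patients"
    private static final int HOSP_COL = 19; // assumed position of "hosp_patients"

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        if (fields.length <= HOSP_COL || fields[ICU_COL].isEmpty() || fields[HOSP_COL].isEmpty()) {
            return; // skip the header and rows without hospitalization data
        }
        try {
            long patients = (long) Double.parseDouble(fields[ICU_COL])
                          + (long) Double.parseDouble(fields[HOSP_COL]);
            context.write(new Text(fields[DATE_COL]), new LongWritable(patients));
        } catch (NumberFormatException e) {
            // ignore malformed rows
        }
    }
}
```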

SHUFFLING PHASE

Input: output of the map function

Output: each date appears with the list of all of its icu_patients + hosp_patients values

REDUCE PHASE

Input: output of the shuffling phase

Output: sum of icu_patients + hosp_patients for each date

FINAL RESULT

The final result is composed of key-value pairs of the form KEY = date, VALUE = number of icu_patients + hosp_patients for that date; only the 5 dates with the highest totals are kept, in decreasing order.
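With a single reducer, the per-date sum and the top-5 selection can be combined in one class, as in this sketch (the class name and the tie handling are simplifications):

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Job 3 reducer sketch: sums icu_patients + hosp_patients for each date and,
// in cleanup(), emits only the 5 dates with the highest totals in decreasing
// order. Assumes a single reducer; two dates with the same total overwrite
// each other here, which a real implementation should handle.
public class TopDaysReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final TreeMap<Long, String> top = new TreeMap<>(); // total -> date, ascending

    @Override
    protected void reduce(Text date, Iterable<LongWritable> values, Context context) {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        top.put(sum, date.toString());
        if (top.size() > 5) {
            top.remove(top.firstKey()); // drop the smallest total
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Long, String> e : top.descendingMap().entrySet()) {
            context.write(new Text(e.getValue()), new LongWritable(e.getKey()));
        }
    }
}
```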


Job 3 results:

The execution of this job has been done in all 3 Hadoop configurations (local, and pseudo-distributed with and without YARN).

Three different input periods have been used; the results change with the input:

Input: MARCH-APRIL data


Input: MARCH-AUGUST data


Input: MARCH-OCTOBER data


Hadoop configurations time comparison

For every job, tabular and graphical comparisons of the job's execution times in the local and pseudo-distributed configurations (with and without YARN) have been computed. All these execution times have also been measured for each of the different input sizes used.


Time comparison results discussion

  • As expected, the fastest Hadoop configuration is the local (or standalone) mode, because in this configuration Hadoop runs in a single JVM and uses the local filesystem instead of the Hadoop Distributed File System (HDFS). Furthermore, in this configuration the job runs with one mapper and one reducer, which guarantees good speed (though not for very large inputs) and is the reason the local configuration is used mainly for code debugging and testing.

  • On the other hand, the pseudo-distributed mode is used to simulate the behavior of a distributed computation by running the Hadoop daemons in different JVM instances on a single machine. This configuration uses HDFS instead of the local filesystem, and the job can run with multiple mappers and multiple reducers, which increases the execution times.

  • These times increase further when the job is run in pseudo-distributed mode with YARN, because the resource-management overhead adds to the map and reduce times.

Support

For any support, error corrections, etc., please email me at [email protected]