MapReduce, Hadoop Streaming, Pig, Hive, Apache Airflow - project

About

The goal of this project was to analyse the datasets available at http://www.cs.put.poznan.pl/kjankiewicz/bigdata/projekt1 or https://www.imdb.com/interfaces/ and find the 3 genres of feature films (titleType=movie) that engage the most actors. The final result has 3 columns: genre, number of available films of that genre, and number of actors who played a role in those films. The workflow consists of three tasks. The first analyses the title.principals.tsv file with a classic MapReduce job written in Java and counts the number of actors for every film. The second uses Hive to generate the final result, in JSON format, from the MapReduce output and the title.basics.tsv file. The last uses Apache Airflow to orchestrate the entire data workflow.
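The two processing steps described above can be sketched in plain Python on a few made-up rows. This is only an illustration of the logic, not the project's Java MapReduce job or its Hive script. The sample rows, the function names, the choice of "actor" and "actress" as the actor categories, and the alphabetical tie-break are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# Made-up sample rows; the real TSV files contain more columns.
principals = [  # (tconst, nconst, category) from title.principals.tsv
    ("tt1", "nm1", "actor"),
    ("tt1", "nm2", "actress"),
    ("tt1", "nm3", "director"),
    ("tt2", "nm1", "actor"),
    ("tt3", "nm4", "actress"),
    ("tt3", "nm5", "actor"),
    ("tt3", "nm6", "actor"),
    ("tt4", "nm7", "actor"),
]
basics = [  # (tconst, titleType, genres) from title.basics.tsv
    ("tt1", "movie", "Drama,Comedy"),
    ("tt2", "movie", "Drama"),
    ("tt3", "movie", "Action"),
    ("tt4", "tvSeries", "Drama"),
]

# Assumption: which category values count as "actors".
ACTOR_CATEGORIES = {"actor", "actress"}


def count_actors_per_film(rows):
    """Step 1 (the MapReduce role): number of actor rows per film."""
    counts = Counter()
    for tconst, _nconst, category in rows:
        if category in ACTOR_CATEGORIES:
            counts[tconst] += 1
    return counts


def top_genres(basics_rows, actors_per_film, limit=3):
    """Step 2 (the Hive role): per-genre film and actor totals for movies."""
    films = defaultdict(int)
    actors = defaultdict(int)
    for tconst, title_type, genres in basics_rows:
        if title_type != "movie" or genres == r"\N":
            continue  # keep feature films with a known genre only
        for genre in genres.split(","):
            films[genre] += 1
            actors[genre] += actors_per_film.get(tconst, 0)
    # Most actors first; ties broken alphabetically (an assumption).
    ranked = sorted(films, key=lambda g: (-actors[g], g))
    return [(g, films[g], actors[g]) for g in ranked[:limit]]


actors_per_film = count_actors_per_film(principals)
result = top_genres(basics, actors_per_film)
for genre, film_count, actor_count in result:
    print(genre, film_count, actor_count)
```

Because the first step emits only a count per film, the second step sums those counts, so an actor who appears in several films of one genre is counted once per film in this sketch.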

Project Structure

The MapReduceProject folder contains the Java project for the MapReduce task, the analyse_films.hql script that creates the final result, and solution_script, which runs all the tasks. The file projekt1.py contains the Apache Airflow definition of the entire data processing pipeline. A compressed version of the solution, ready to test with Airflow, is available in projekt1.zip.

Running the project

The project was created on Google Cloud Platform. To run the analysis, first load the input data into the projekt1/input/datasource1 and projekt1/input/datasource4 folders of your existing bucket. Next, create a cluster, for example with the command in the run_cluster.txt file. Then install and run Apache Airflow on that cluster and copy the contents of the zipped file into the airflow folder. If the previous steps were configured properly, the solution should be available to test on your localhost. When triggering the pipeline, provide your username and cluster name in the Trigger config. The output of the final task of the workflow should be identical to the contents of the result.txt file.
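The Trigger config in the Airflow UI expects a JSON object. The sketch below shows one way to prepare such a value in Python before pasting it in. The key names `username` and `cluster_name` and the helper `build_trigger_conf` are hypothetical; check projekt1.py for the keys the DAG actually reads.

```python
import json

# Hypothetical key names; the DAG in projekt1.py may expect different ones.
REQUIRED_KEYS = ("username", "cluster_name")


def build_trigger_conf(username, cluster_name):
    """Return a JSON string suitable for pasting into the Trigger config."""
    conf = {"username": username, "cluster_name": cluster_name}
    missing = [key for key in REQUIRED_KEYS if not conf[key]]
    if missing:
        raise ValueError("missing values for: " + ", ".join(missing))
    return json.dumps(conf)


# Placeholder values for illustration only.
trigger_conf = build_trigger_conf("my_user", "my-cluster")
print(trigger_conf)
```

Validating the values up front avoids starting a run that fails only once a task tries to use an empty username or cluster name.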

DAG to represent data pipeline created by me from Apache Airflow UI

Michu-dev/big-data-first-project

Folders and files

Latest commit

History

Repository files navigation

MapReduce, Hadoop Streaming, Pig, Hive, Apache Airflow - project

About

Project Structure

Running the project

DAG to represent data pipeline created by me from Apache Airflow UI

About

Topics

Resources

Stars

Watchers

Forks

Languages