datasets
notebooks
docker-compose.yml
nessie.properties
- Download the data from this repository and place it in the `datasets` folder. This project uses `Asia Towers.csv`.
- Start the containers individually, each in its own terminal tab. Running `docker-compose up` starts everything at once, but you will then need to search the combined output for the access token needed to open the Jupyter notebook.
  - `docker-compose up nessie`
  - `docker-compose up notebook`
  - `docker-compose up trino`
  - `docker-compose up minio`
- Log in to MinIO at http://localhost:9001/ with username `admin` and password `password`, then create a bucket called `warehouse`.
- Go to the Jupyter notebook (see the output of `docker-compose up notebook`) and run the Spark code that loads the data and runs the benchmarking queries.
- Open a new terminal tab and start the Trino CLI with `docker exec -it trino trino`, then run the queries from `trino-sql.md`, which is in the `notebooks` folder.
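One way the benchmarking queries from the steps above could be timed is with a small wall-clock helper. This is a minimal sketch, not the notebook's actual code: `time_query` and `run_query` are hypothetical names, and the stand-in workload exists only so the example runs without Spark or Trino.

```python
import statistics
import time
from typing import Callable, List


def time_query(run_query: Callable[[], object], repeats: int = 3) -> List[float]:
    """Return the wall-clock duration, in seconds, of each run of run_query."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # must force execution, e.g. by collecting the full result
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload (assumption): replace with a callable that runs one query.
timings = time_query(lambda: sum(range(1_000_000)), repeats=3)
print(f"median: {statistics.median(timings):.3f} s over {len(timings)} runs")
```

In the notebook, `run_query` could be something like `lambda: spark.sql(query).collect()`, since Spark evaluates lazily and only does the work once a result is requested. Repeating each query and reporting the median reduces the effect of warm-up and caching on a single run.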
- Calculate the average signal strength for each country and network.
- Count the unique units in each country.
- Calculate the maximum, minimum, and average range for each network per country.
- Combine these results in a single output.
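The four steps above map naturally onto three aggregations joined into one result. The sketch below shows that shape using an in-memory SQLite table so it runs anywhere. The table name, column names (`country`, `network`, `unit`, `avg_signal`, `cell_range`), and rows are all assumptions made for illustration; the real queries are the ones in `trino-sql.md` and may be written differently.

```python
import sqlite3

# Hypothetical schema: the real column names in Asia Towers.csv may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE towers (country TEXT, network TEXT, unit INTEGER, "
    "avg_signal REAL, cell_range REAL)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO towers VALUES (?, ?, ?, ?, ?)",
    [
        ("Japan", "LTE", 1, -90.0, 1000.0),
        ("Japan", "LTE", 2, -100.0, 3000.0),
        ("Japan", "GSM", 1, -80.0, 5000.0),
        ("India", "GSM", 3, -70.0, 2000.0),
    ],
)

COMPLEX_QUERY = """
WITH signal AS (          -- average signal strength per country and network
    SELECT country, network, AVG(avg_signal) AS mean_signal
    FROM towers GROUP BY country, network
),
units AS (                -- unique units per country
    SELECT country, COUNT(DISTINCT unit) AS unique_units
    FROM towers GROUP BY country
),
ranges AS (               -- max, min, and average range per network per country
    SELECT country, network,
           MAX(cell_range) AS max_range,
           MIN(cell_range) AS min_range,
           AVG(cell_range) AS avg_range
    FROM towers GROUP BY country, network
)
SELECT s.country, s.network, s.mean_signal, u.unique_units,
       r.max_range, r.min_range, r.avg_range
FROM signal s
JOIN units u ON u.country = s.country
JOIN ranges r ON r.country = s.country AND r.network = s.network
ORDER BY s.country, s.network
"""

rows = conn.execute(COMPLEX_QUERY).fetchall()
for row in rows:
    print(row)
```

The query forces each engine to plan several grouped aggregations and two joins, which is what makes it a heavier workload than a single scan or filter.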
- Simple query: 6.97 s
- Complex query: 19.00 s
- Simple query: 6.13 s
- Complex query: 15.26 s
- Simple query: 1.52 s
- Complex query: 4.83 s
- Simple query: 1.15 s
- Complex query: 4.01 s
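To put the gap in relative terms, the timings can be turned into speedup ratios. This is a small sketch under an assumption: that the four result pairs above are listed as Spark then Trino, with the two table layouts in the same order for each engine. The labels `layout_1` and `layout_2` are placeholders, not names from the project.

```python
# Timings in seconds, copied from the results above.
# ASSUMPTION: the pairs are ordered Spark (layout 1, layout 2), then
# Trino (layout 1, layout 2). Adjust the keys if the order differs.
timings = {
    ("spark", "layout_1"): {"simple": 6.97, "complex": 19.00},
    ("spark", "layout_2"): {"simple": 6.13, "complex": 15.26},
    ("trino", "layout_1"): {"simple": 1.52, "complex": 4.83},
    ("trino", "layout_2"): {"simple": 1.15, "complex": 4.01},
}


def speedup(layout: str, query: str) -> float:
    """Spark time divided by Trino time for the same layout and query."""
    return timings[("spark", layout)][query] / timings[("trino", layout)][query]


for layout in ("layout_1", "layout_2"):
    for query in ("simple", "complex"):
        print(f"{layout} {query}: Trino is {speedup(layout, query):.1f}x faster")
```

Each figure comes from a single reported timing, so the ratios are indicative rather than precise.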
- Trino outperformed Spark on both the simple and the complex query, most noticeably for the complex query on partitioned tables. This advantage may stem from Trino's internal design, which is optimized for querying data and executes query stages concurrently.
- The Spark queries were run from a Jupyter notebook, so performance might improve slightly in a dedicated Spark environment.
- The size of the data is also likely to influence query times across both engines. Future work could include analysis with datasets in the terabyte range.
- Additionally, exploring the performance of a distributed cluster presents an interesting area for further investigation.