Evaluation of Big Spatial Data Processing Engines

Motivation

With the advance of Apache Hadoop MapReduce and Apache Spark, extensions to query spatial data on these platforms were created. However, these platforms differ greatly in their feature sets and, of course, also in performance. While building our STARK project, we came across related projects that we needed to compare against our implementation.

In this project we will document our tests and performance evaluation of Hadoop- and Spark-based implementations.

In this evaluation we will test the following systems:

  • SpatialHadoop
  • SpatialSpark
  • GeoSpark
  • STARK

This list will be extended with newly published systems.

Data & Setup

To run "Big Data" tests, we created our own data generator. The project is available here.

All experiments were executed on our cluster of 16 nodes, where each node has:

  • Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz
  • 16 GB RAM
  • 1 TB disk
  • 1 Gbit/s network

On the cluster we have Hadoop 2.7, Spark 2.0.0 (and Spark 1.6.1 if needed), Java 1.8, and Scala 2.11.

Spatial Filter Queries

The following image shows a spatial filter operator on a polygon data set with 50,000,000 polygons (880 GB).

TODO: create image
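
To make the measured operation concrete, here is a minimal, generic Spark sketch of such a spatial filter using JTS geometries. It is not the API of any of the evaluated systems; the input path, the one-WKT-polygon-per-line format, and the query polygon are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.io.WKTReader

object SpatialFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SpatialFilter").getOrCreate()
    val sc = spark.sparkContext

    // Query range as WKT; the coordinates are purely illustrative.
    val queryWkt = "POLYGON((10 10, 10 20, 20 20, 20 10, 10 10))"

    // Assumed input: one WKT polygon per line.
    val polygons = sc.textFile("hdfs:///data/polygons.wkt")

    // Keep every polygon that intersects the query range. The WKT reader is
    // created once per partition because it is not serializable.
    val result = polygons.mapPartitions { iter =>
      val reader = new WKTReader()
      val query = reader.read(queryWkt)
      iter.filter(line => reader.read(line).intersects(query))
    }

    println(s"matching polygons: ${result.count()}")
    spark.stop()
  }
}
```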

Impact of query range size

  • SpatialHadoop: [plot: query range size]
  • SpatialSpark: [plot: query range size]
  • GeoSpark: [plot: query range size]
  • STARK: [plot: query range size]

Point Query on Polygon DS

[figure: PolyPoint]

Range Query (Polygon) on Polygon DS

[figure: PolyPoint]

Spatial Join Queries

Self-join with 1,000,000 and 10,000,000 point data sets for STARK with both partitioners.

[figure: Self-join points]

Comparison of the self-join results for STARK and GeoSpark with the best partitioner for each platform. Note: GeoSpark produced wrong results; roughly 100,000 result tuples were missing in both cases.

[figure: Self-join points]
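
For reference, the following is a naive Spark sketch of the point self-join operation being measured. The input path, record format, and distance threshold are illustrative assumptions; the evaluated engines avoid the quadratic cartesian product through spatial partitioning and indexing.

```scala
import org.apache.spark.sql.SparkSession

object PointSelfJoin {
  case class Point(id: Long, x: Double, y: Double)

  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a.x - b.x, 2) + math.pow(a.y - b.y, 2))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PointSelfJoin").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input format: one "id;x;y" record per line.
    val points = sc.textFile("hdfs:///data/points.csv").map { line =>
      val Array(id, x, y) = line.split(";")
      Point(id.toLong, x.toDouble, y.toDouble)
    }

    val maxDist = 0.5 // illustrative distance threshold

    // Naive baseline: pair every point with every other point and keep the
    // pairs within maxDist. The "a.id < b.id" condition reports each pair
    // only once and drops self-pairs.
    val joined = points.cartesian(points)
      .filter { case (a, b) => a.id < b.id && dist(a, b) <= maxDist }

    println(s"result pairs: ${joined.count()}")
    spark.stop()
  }
}
```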

TODOs

  • add runner files
  • generate missing images