Comprehensive Benchmark for Time Series Database Systems

TSM-Bench is a new benchmark that compares seven Time Series Database Systems (TSDBs) using a mixed set of workloads. It can be easily extended with new systems, queries, datasets, and workloads. The benchmark introduces a novel data generation method that augments seed real-world time series datasets, enabling realistic and scalable benchmarking. Technical details can be found in the paper TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications, PVLDB'23.

List of benchmarked systems: ClickHouse, Druid, eXtremeDB*, InfluxDB, MonetDB, QuestDB, TimescaleDB.
The benchmark evaluates bulk-loading, storage performance, offline/online query performance, and the impact of time series features on compression.
We use two datasets for the evaluation: D-LONG [d1] and D-MULTI [d2]. The evaluated datasets can be found here.
*Note: Due to license restrictions, we can only share the evaluation version of eXtremeDB. The results of the benchmarked version and the public version might diverge.

Prerequisites

Ubuntu 22 (including Ubuntu derivatives, e.g., Xubuntu); 128 GB RAM
Clone this repository (this can take a couple of minutes as it downloads one of the datasets)

Systems Setup

Install the dependencies and activate the created virtual environment

cd systems/
sh install_dep.sh
source TSMvenv/bin/activate

Install all the systems (takes ~ 15 mins)

sh install_all_sys.sh

Dataset Loading

Download and decompress Dataset 1 (takes ~ 3 mins)

cd ../datasets
sh build.sh d1

Load Dataset 1 into all the systems (takes ~ 2 hours)

sh load_all.sh d1

In case you want to load Dataset 1 into a specific system:

cd systems/{system}
sh load.sh d1

Note: To build and load the larger dataset d2, replace d1 with d2.

Experiments

Offline Workload

Activate the virtual environment, if not already done:
```
source systems/TSMvenv/bin/activate
```
The offline queries for all systems can be executed from the root folder using:
```
python3 tsm_eval.py [args]
```
Mandatory Arguments: [args] should be replaced with the name of the system, query, and dataset:

| --systems | --queries | --datasets |
|---|---|---|
| clickhouse | q1 (selection) | d1 |
| druid | q2 (filtering) | d2 |
| extremedb* | q3 (aggregation) | |
| influx | q4 (downsampling) | |
| monetdb | q5 (upsampling) | |
| questdb | q6 (average) | |
| timescaledb | q7 (correlation) | |
| all | all | all |

Optional Arguments: The following arguments vary the number of queried stations and sensors and change the predicate ranges:
- --nb_st: Number of queried stations when varying other dimensions (Default = 1)
- --nb_sr: Number of queried sensors when varying other dimensions (Default = 3)
- --n_st: Number of stations in the dataset (Default = 10)
- --n_s: Number of sensors in the dataset (Default = 100)
- --range: Query range value when varying other dimensions (Default = 1)
- --rangeUnit: Query range unit when varying other dimensions (Default = day)
- --timeout: Maximum query time after five runs (s) (Default = 20)
- --min_ts: Minimum query timestamp (Default = "2019-04-01T00:00:00")
- --max_ts: Maximum query timestamp (Default = "2019-04-30T00:00:00")
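
The arguments above can be summarized as a command-line parser. The following is a minimal sketch, not the actual code of tsm_eval.py: the function name `build_parser` is hypothetical, the flag names and defaults are copied from the lists above, and the mandatory flags are assumed to accept several values, as in the examples that use `--systems`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real tsm_eval.py)."""
    p = argparse.ArgumentParser(description="Sketch of the offline-workload arguments")
    # Mandatory arguments: one or more values each (assumption based on the examples).
    p.add_argument("--systems", nargs="+", required=True)
    p.add_argument("--queries", nargs="+", required=True)
    p.add_argument("--datasets", nargs="+", required=True)
    # Optional arguments with the documented defaults.
    p.add_argument("--nb_st", type=int, default=1)      # queried stations
    p.add_argument("--nb_sr", type=int, default=3)      # queried sensors
    p.add_argument("--n_st", type=int, default=10)      # stations in the dataset
    p.add_argument("--n_s", type=int, default=100)      # sensors in the dataset
    p.add_argument("--range", type=int, default=1)      # query range value
    p.add_argument("--rangeUnit", default="day")        # query range unit
    p.add_argument("--timeout", type=int, default=20)   # seconds
    p.add_argument("--min_ts", default="2019-04-01T00:00:00")
    p.add_argument("--max_ts", default="2019-04-30T00:00:00")
    return p


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--systems", "extremedb", "timescaledb", "--queries", "q2", "q3", "--datasets", "d1"]
)
print(args.systems, args.nb_st, args.rangeUnit)
```

Any optional flag that is not passed keeps its documented default, so the call above queries 1 station and 3 sensors over a range of 1 day.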
Results: All the runtimes and plots will be added to the results folder.
- The runtime results of the systems for a given dataset and query will be added to: results/offline/{dataset}/{query}/runtime/. The runtime plots will be added to the folder results/offline/{dataset}/{query}/plots/.
- All the queries return the runtimes by varying the number of stations (nb_st), number of sensors (nb_sr), and the range.
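
To locate the outputs of a run from a script, the folder layout described above can be rebuilt programmatically. This is a small sketch under the assumption that the layout is exactly results/{workload}/{dataset}/{query}/ with runtime and plots subfolders; the helper name `result_dirs` is hypothetical and not part of the benchmark.

```python
from pathlib import Path


def result_dirs(workload: str, dataset: str, query: str, root: str = "results") -> dict:
    """Return the runtime and plots folders for one workload/dataset/query combination."""
    base = Path(root) / workload / dataset / query
    return {"runtime": base / "runtime", "plots": base / "plots"}


dirs = result_dirs("offline", "d1", "q1")
print(dirs["runtime"].as_posix())  # results/offline/d1/q1/runtime
print(dirs["plots"].as_posix())    # results/offline/d1/q1/plots
```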
Examples:

Run query q1 on extremedb for Dataset 1 using default parameters (nb_st=1, nb_sr=3, range=1 day)

python3 tsm_eval.py --systems extremedb --queries q1 --datasets d1

Run q2 and q3 on extremedb and timescaledb for Dataset 1

python3 tsm_eval.py --systems extremedb timescaledb --queries q2 q3 --datasets d1

Run all the offline workload on all systems for Dataset 1 (takes ~ 3 hours)

python3 tsm_eval.py --systems all --queries all --datasets d1

Online Workload

This workload requires two servers: the first serves as a host machine to deploy the systems (similar to above), and the second runs as a client to generate writes and queries.

Client Setup

Clone this repo

Install dependencies:

cd systems/
sh install_dep.sh
source TSMvenv/bin/activate

Install the system libraries
```
sh install_client_lib.sh
```

Query Execution

Run the system on the host side
```
cd systems/{system}
sh launch.sh
```
If the virtual environment is not activated, activate it from the root folder using:
```
source systems/TSMvenv/bin/activate
```
Execute the online query on the client side using the --host flag (see examples below).
Stop the system on the host server
```
sh stop.sh
```

Optional Arguments:

- --host: Remote host machine name (Default = "localhost")
- --n_threads: Number of threads to use (Default = 10)
- --batch_size: Number of data points to be inserted each second, if possible (Default = 10000)

Examples:

Run query q1 in an online manner on clickhouse.

python3 tsm_eval_online.py --system clickhouse --queries q1 --host "host_address" --batch_size 10000

Run all queries online on influx using different batch sizes.

python3 tsm_eval_online.py --system influx --queries all --host "host_address" --batch_size 10000 20000 1000000

Run all queries online on questdb using one thread.

python3 tsm_eval_online.py --system questdb --queries all --n_threads 1 --host "host_address"
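
When several online runs need to be scripted from the client, the command lines above can be assembled in Python. The sketch below only builds the argument list from the flags documented in this section; the helper name `online_command` is hypothetical, and "host_address" remains a placeholder for the host machine.

```python
import shlex


def online_command(system, queries, host="localhost", n_threads=None, batch_sizes=None):
    """Build the argument list for tsm_eval_online.py from the documented flags."""
    cmd = ["python3", "tsm_eval_online.py", "--system", system,
           "--queries", *queries, "--host", host]
    if n_threads is not None:
        cmd += ["--n_threads", str(n_threads)]
    if batch_sizes:
        # Several batch sizes can be passed after a single --batch_size flag.
        cmd += ["--batch_size", *map(str, batch_sizes)]
    return cmd


cmd = online_command("influx", ["all"], host="host_address", batch_sizes=[10000, 20000])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the root folder of the repository once the system has been launched on the host.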

Notes:

We launch each system separately on the host machine and execute the online query on the client machine using the --host flag.
The maximal batch_size depends on your architecture and the selected TSDB.
Druid supports ingestion and queries concurrently, while QuestDB does not support multithreading.
If you stop the program before it terminates or shut down the system, the database might not be restored to its initial state; in that case, you need to reload the dataset on the host machine:
```
cd systems/{system}
sh load.sh
```

Results:

The runtime results of the systems will be added to: results/online/{dataset}/{query}/runtime/.
The runtime plots will be added to the folder results/online/{dataset}/{query}/plots/.
All the queries return the runtimes by varying the ingestion rate.

Storage Performance

To compute the storage performance of a given system:
```
cd systems/{system}
sh compression.sh
```
Note: {system} needs to be replaced with the name of one of the systems from the table above.

Benchmark Extension

TSM-Bench allows the integration of new systems seamlessly. We provide a step-by-step tutorial on how to integrate your system as part of the benchmark.

New queries can also be added to the benchmark. They must be added to each system's {system}/queries.sql file. Note that the order of the queries must be respected (e.g., q8 is the eighth query in the file).
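
The positional naming of queries can be illustrated with a short sketch. It assumes that queries.sql contains one query per non-empty line, which may not match the actual file layout; the helper name `index_queries` and the sample statements are hypothetical placeholders.

```python
def index_queries(sql_text: str) -> dict:
    """Map q1, q2, ... to queries according to their position in the file."""
    # Assumption: one query per non-empty line; adapt the split if the file differs.
    lines = [ln.strip() for ln in sql_text.splitlines() if ln.strip()]
    return {f"q{i}": q for i, q in enumerate(lines, start=1)}


# Placeholder content standing in for a {system}/queries.sql file.
sample = "SELECT 1;\nSELECT 2;\n\nSELECT 3;\n"
queries = index_queries(sample)
print(queries["q3"])  # SELECT 3;
```

Under this assumption, inserting a new query in the middle of the file would shift the names of all later queries, which is why new queries belong at the end.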

Time Series Generation

We provide a GAN-based generation method that augments a seed dataset with more and/or longer time series whose properties are similar to those of the seed series. The generator can be used either as a pre-trained model or by retraining the model from scratch.

Technical Report

Additional results not reported in the paper can be found here. The additional experiments cover:

Advanced analytical queries in SQL and UDF
Selection of the evaluated systems
Parameterization of the systems
Impact of data characteristics

Contributors

Abdelouahab Khelifati
Mourad Khayati
Luca Althaus

Citation

@article{DBLP:journals/pvldb/KhelifatiKDDC23,
  author       = {Abdelouahab Khelifati and
                  Mourad Khayati and
                  Anton Dign{\"{o}}s and
                  Djellel Eddine Difallah and
                  Philippe Cudr{\'{e}}{-}Mauroux},
  title        = {TSM-Bench: Benchmarking Time Series Database Systems for Monitoring
                  Applications},
  journal      = {Proc. {VLDB} Endow.},
  volume       = {16},
  number       = {11},
  pages        = {3363--3376},
  year         = {2023},
  url          = {https://www.vldb.org/pvldb/vol16/p3363-khelifati.pdf},
  doi          = {10.14778/3611479.3611532},
  timestamp    = {Mon, 23 Oct 2023 16:16:16 +0200},
  biburl       = {https://dblp.org/rec/journals/pvldb/KhelifatiKDDC23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
