Feature Request: Time Series Forecasting Datasets & Task Definitions #15
shchur
started this conversation in
OpenML Design and Feature Requests
Replies: 1 comment
-
Thanks for the proposal and taking time to illustrate the task and provide references. We are currently rewriting the server back-end so we have a hold on adding new task types, though we hope to be able to start adding task types again sometime early next year. Nevertheless, that shouldn't stop us from further discussing the task type and seeing whether we would like to adopt it when we are ready. I'll have a closer look at this later 👍 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Motivation
Time series forecasting is an extremely popular machine learning task. However, there doesn't exist a unified repository of forecasting datasets & task definitions. OpenML has an opportunity to fill this gap, which would be very extremely helpful for researchers working in this area.
Prior work
As far as I know, the only successful existing attempt at constructing such a repository of forecasting datasets has been made by the Monash Forecasting Repository. However it has several limitations:
Time series forecasting
We considered the following setting:
Dataset schema
A time series dataset can be represented as a tabular dataset with some additional constraints on the schema.
At the very minimum, the dataset should contain a single table with 3 columns:
item_id
: Unique identifier of each time seriestimestamp
: Date & time when the measurement was recordedtarget
: Time series value that needs to be predictedFor example, a dataset with 2 time series, each containing 3 observations made at daily frequency can be represented as
Such tabular representation for time series is rather standard and is used by many open-source packages (e.g., Prophet, StatsForecast) and Kaggle competitions (e.g., M5, Corporación Favorita Grocery Sales Forecasting ).
The dataset may contain additional columns corresponding to so-called covariates. For example, when predicting sales for different products, a related covariate may represent the price of each product at a given day.
Task definition
At the very minimum, the task definition should contain
prediction_length
(a.k.a.forecast_horizon
): an integer that denotes how many future values need to be predicted for each individual time series in the dataset.Train/test split
In time series forecasting, the train/test split cannot be decoupled from the task definition (as in tabular tasks). For example, given the above data and
prediction_length=1
, the train setdata_train
would correspond toand the test set
data_test
would containEach forecasting algorithm needs to produce predictions
predictions
in the following formatNote that some time series metrics depend on the historic data. That is, they are defined not as
metric(data_test, predictions)
, but rather asmetric(data_test, predictions, data_train)
(e.g., MASE).Potential extensions
Instances of forecasting problems
Kaggle competitions
Scientific publications
Beta Was this translation helpful? Give feedback.
All reactions