Skip to content

Commit

Permalink
Merge pull request #156 from fraunhoferportugal/feature/datasets
Browse files Browse the repository at this point in the history
Added the new Datasets module.
  • Loading branch information
mbarandas authored Aug 13, 2024
2 parents 9412b21 + 1af30ae commit 953b363
Show file tree
Hide file tree
Showing 20 changed files with 458 additions and 70 deletions.
55 changes: 55 additions & 0 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
name: "Tests"

on:
push:
pull_request:
types: [ opened, synchronize, reopened ]

jobs:
test:
name: Run template tests
strategy:
max-parallel: 4
fail-fast: false
matrix:
python-version: [ '3.8', '3.9' ]
platform: [ 'ubuntu-20.04', 'macos-latest', 'windows-latest' ]
runs-on: ${{ matrix.platform }}
timeout-minutes: 30

steps:
- name: Git Checkout
uses: actions/checkout@v3
with:
ref: ${{ github.head_ref }}
- name: Extract branch name
run: echo running on branch ${GITHUB_REF##*/}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v3
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -VV
python -m pip install --upgrade pip setuptools
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-test.txt
python -m pip install .
- name: Check numpy and scipy flags
run: |
python -c "import numpy as np; np.show_config()"
python -c "import scipy as sp; sp.show_config()"
# - name: Test the input/output
# run: |
# python tests/test_calc_features.py
- name: Test the data loaders
run: |
$retry_count = 3
while ($retry_count -gt 0) {
python tests/test_dataset_loaders.py
if ($LASTEXITCODE -eq 0) {
break
}
$retry_count--
Start-Sleep -Seconds 5
}
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ celerybeat-schedule
# Environments
.env
.venv
.venv-docs
env/
venv/
ENV/
Expand Down
6 changes: 5 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
[![Documentation Status](https://readthedocs.org/projects/tsfel/badge/?version=latest)](https://tsfel.readthedocs.io/en/latest/?badge=latest)
[![Documentation Status](https://readthedocs.org/projects/tsfel/badge/?version=dev)](https://tsfel.readthedocs.io/en/latest/?badge=latest)
[![license](https://img.shields.io/badge/License-BSD%203-brightgreen)](https://github.com/fraunhoferportugal/tsfel/blob/master/LICENSE.txt)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tsfel)
![PyPI](https://img.shields.io/pypi/v/tsfel)
Expand All @@ -7,6 +7,10 @@

# Time Series Feature Extraction Library

<p align="center">
<a href="https://readthedocs.org/projects/tsfel/badge/?version=latest"><img src="https://raw.githubusercontent.com/fraunhoferportugal/tsfel/feature/datasets/docs/imgs/tsfel_feature_sets.png" width="100%" alt="tsfel logo" /></a>
</p>

## Intuitive time series feature extraction

This repository hosts the **TSFEL - Time Series Feature Extraction Library** python package. TSFEL assists researchers
Expand Down
16 changes: 14 additions & 2 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,18 @@
"sphinx.ext.napoleon",
]

# If true, figures, tables and code-blocks are automatically numbered
# if they have a caption.
numfig = True

numfig_format = {
"figure": "Fig. %s",
"table": "Table %s",
"code-block": "Listing %s",
}

numfig_secnum_depth = 2

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

Expand All @@ -79,9 +91,9 @@
# built documents.
#
# The short X.Y version.
version = "0.1.6"
version = "0.1.8"
# The full version, including alpha/beta/rc tags.
release = "0.1.6"
release = "0.1.8"

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
Expand Down
4 changes: 3 additions & 1 deletion docs/descriptions/feature_list.rst
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
List of available features
==========================

The following table provides an overview of the available featurest in the current version of TSFEL.
The table below provides an overview of the features available in ``TSFEL``.

.. _feature-list:

.. autosummary::
:toctree: _generated
Expand Down
166 changes: 118 additions & 48 deletions docs/descriptions/get_started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,66 +2,54 @@
Get Started
===========

Overview
--------
``TSFEL`` is a simple yet powerful package for time series feature extraction. This page summarizes the key points to help you get started with using ``TSFEL`` for your feature extraction tasks. We begin by presenting the input data types, window splitting configurations, and the output data structure. Finally, we describe the available feature sets and important aspects of feature extraction configuration.

Time series are passed as inputs for the main TSFEL extraction method, either as arrays previously loaded in memory or stored in files on a dataset. Since TSFEL can handle multidimensional time series, a set of preprocessing methods is afterwards applied to ensure that the signal quality is adequate and time series synchronisation so that the window calculation process is properly achieved. After the feature extraction, the result is saved using a standard schema ready to be digested by most of the classification and data mining platforms. Each line corresponds to a window with the results of the feature extraction methods stored along with the corresponding columns.
Extract from time series stored in DataFrames, Series or ndarrays
-----------------------------------------------------------------

.. image:: ../imgs/tsfel_pipeline.png
:align: center
:scale: 25 %
:alt: TSFEL!

Extract from time series as array objects
-----------------------------------------

Let us start by downloading some data. A complete dataset description can be found in [1]_. On this example we will use a time series sampled by an accelerometer sensor at 50 Hz.
Let's start by downloading some data. The following method returns a single-lead electrocardiogram (ECG) recorded at 100 Hz for a duration of 10 seconds.

.. code:: python
import tsfel
import zipfile
import numpy as np
import pandas as pd
# Load the dataset from online repository
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip
data = tsfel.datasets.load_biopluxecg() # A single-lead ECG collected during 10 s at 100 Hz.
# Unzip the dataset
zip_ref = zipfile.ZipFile("UCI HAR Dataset.zip", 'r')
zip_ref.extractall()
zip_ref.close()
Let us look to the input data structure:

# Store the dataset as a Pandas dataframe.
x_train_sig = np.loadtxt('UCI HAR Dataset/train/Inertial Signals/total_acc_x_train.txt', dtype='float32')
X_train_sig = pd.DataFrame(np.hstack(x_train_sig), columns=["total_acc_x"])
.. code:: python
Let us look to the data structure:
data.head()
0 2.898565
1 2.462342
2 -0.513560
3 -5.263333
4 -8.934970
dtype: float64
.. code:: python
Our input data is a ``Series`` named *LeadII*. Note that ``TSFEL`` can also handle multivariate time series. In such cases, you will need to include the additional time series as a ``DataFrame`` or ``ndarray``.

X_train_sig.head()
Now that we have the input data, we are ready for the feature extraction step. ``TSFEL`` relies on JSON configuration files to set up the configuration for feature extraction. We provide a set of template JSON configurations that can be used out of the box to extract all or individual feature sets.

===== ===========
Id total_acc_x
----- -----------
0 1.012817
1 1.022833
2 1.022028
3 1.017877
4 1.023680
===== ===========
In this example, we will use the configuration that extracts all available features from the temporal, statistical, and spectral sets.

We have now a DataFrame composed by a unique column with associated column name. Note that TSFEL can also handle multidimensional datastreams. In that case, it will be necessary to pass the additional time series as additional columns in the dataframe.
.. code:: python
cfg = tsfel.get_features_by_domain() # Extracts the temporal, statistical and spectral feature sets.
X = tsfel.time_series_feature_extractor(cfg, data, fs=100)
X.shape # (1, 165)
We now have ``X`` as the extracted feature vector, composed of 165 features calculated for the entire length of the input data.

Now that we have the input data we are ready for the feature extraction step. TSFEL relies on dictionaries to setup the configuration of the extractly. We provide a set of template configuration dictionaries that can be used out of the box. In this example we will use the example that extracts all the available features of TSFEL. We will configure TSFEL to divide our time series in windows of equal length of size 250 points (corresponding to 5 seconds).
Alternatively, if we are interested in performing window splitting before the feature extraction, we can divide the input data into shorter 10 equal-length windows of size 100 (corresponding to 1 second).

.. code:: python
cfg_file = tsfel.get_features_by_domain() # If no argument is passed retrieves all available features
X_train = tsfel.time_series_features_extractor(cfg_file, X_train_sig, fs=50, window_size=250) # Receives a time series sampled at 50 Hz, divides into windows of size 250 (i.e. 5 seconds) and extracts all features
cfg = tsfel.get_features_by_domain() # Extracts the temporal, statistical and spectral feature sets.
X = tsfel.time_series_feature_extractor(cfg, data, fs=100, window_size=100) # Performs window splitting before feature extraction
X.shape # (10, 165)
We finally have now ``X_train`` as the final feature vector composed of 205 features calculated for each of the 3764 extracted windows.
Extract from time series stored in datasets
-------------------------------------------
Expand All @@ -74,7 +62,6 @@ TSFEL provides a method to increase flexibility when extracting features over mu
* TSFEL crawls over a given dataset root directory and extracts features from all text files which match filenames provided by the user



* **Files store time series in delimited format**

* TSFEL expects that the first column must contain the timestamp and following columns contain the time series values.
Expand All @@ -100,13 +87,95 @@ The following code block extracts features on data residing over ``main_director
output_directory=output_directory
)
Input data formats
------------------

``TSFEL`` offers flexibility concerning the datatypes and dimensions that specify the format of the time series data. It supports both univariate and multivariate series.

A time series is a series of real valued observations taken at successive equally spaced points in time. We usually define the total number of observations as *length*. Univariate series consists of a single variable evolving over time (e.g., the daily closing stock prices for a given company). Multivariate series are composed of several variables concurrently evolving over time and we usually refer to each variable as a *dimension*. For instance, the input data from an accelerometer that measures the tridimensional acceleration (x, y, and z) has 3 dimensions. Note we assume that all dimensions share the same sampling frequency.

``TSFEL`` supports time series stored in ``ndarray``, ``Series`` and ``DataFrame`` . We provide below an intuitive visual guide that summarizes the supported input data types and dimensions.

.. _visual_guide:

.. figure:: ../imgs/tsfel_visual_guide.png
:align: center
:scale: 15 %
:alt: A visual guide do data representation in TSFEL.

A visual intro to data representation in TSFEL.

Feature extraction relies on the ``time_series_feature_extractor`` method. Its base arguments are ``input_data``, which stores the time series data, and ``features_dict``, which contains the feature extraction configuration. Further explanation on feature extraction configuration is available :ref:`here <feature-config>`.

A univariate time series must be stored in one column with several rows of observations. The ``time_series_feature_extractor`` extracts features from the whole length of a time series (:numref:`visual_guide` A) or from shorter fixed-size windows (:numref:`visual_guide` B).

The several variables of multivariate time series must be stored in separate columns. Similarly, to the univariate case, ``time_series_feature_extractor`` extracts features from the whole length of a time series (:numref:`visual_guide` C) or from shorter fixed-size windows (:numref:`visual_guide` D).

Output data format
------------------

The output is always provided as a ``DataFrame`` object, with the results from different features along the columns. If the input is multivariate, the features from the various dimensions are horizontally stacked together. The rows correspond to the values extracted from the entire time series (or from the windows, in the case of window division).

.. list-table::
:widths: 15 15 15 15 15 15 10
:header-rows: 0

* -
- 0_feature_1
- 0_feature_2
- ...
- 1_feature_1
- 1_feature_2
- ...
* - Window / Time Series
-
-
- ...
-
-
- ...
* - ...
-
-
- ...
-
-
- ...

Note: If the input is stored in a ``Series`` or ``DataFrame`` objects with column names, the prefixes 0_* and 1_* are replaced by the variable names (e.g., AccX_feature_1, AccY_feature_1).

Feature sets
------------

Our feature sets are divided according to the *temporal*, *statistical, spectral*, and *fractal* domains.

.. figure:: ../imgs/tsfel_feature_sets_squared.png
:align: center
:scale: 15 %
:alt: The available feature sets in TSFEL - temporal, statistical, spectral and fractal.

The available feature sets in TSFEL - temporal, statistical, spectral and fractal.

**Temporal** features analyze the changes and patterns in data over time. They capture information such as trends, cycles, and temporal correlations, which are essential for understanding dynamic behaviors and predicting future values. Time-domain features sensitive to the ordering of observations are included in this set.

**Statistical** features summarize data using descriptive statistics. They include measures such as mean, variance, skewness, and kurtosis, providing a comprehensive overview of the data distribution, central tendency, dispersion, and shape. Features insensitive to the ordering of observations are included in this set.

**Spectral** features focus on the frequency domain of the data. By transforming data using techniques like Fourier or wavelet transforms, they reveal underlying periodicities, harmonics, and frequency components, which are crucial for identifying cyclical patterns and oscillations.

**Fractal** features describe the complexity and self-similarity of data across different scales. They are derived from fractal theory and include measures like fractal dimension, capturing the intricate patterns and irregularities that are often present in natural and complex systems.

The description of each feature is available :ref:`here <feature-list>`.

.. _feature-config:

Set up the feature extraction config file
------------------------------------------
One of the main advantages of TSFEL is providing a large number of time series features out-of-the-box. Nevertheless, there are occasions where you might not be interested in extracting the complete set. Examples comprise scenarios where the models will be deployed in low-power embedded devices, or you simply want to be more specific in what features are extracted.
``TSFEL`` provides a large number of time series features by default. However, there are occasions when you might not need to extract the complete feature set. Examples include scenarios where models will be deployed on low-power embedded devices or when you simply want to specify which features are extracted.

TSFEL divides the available features into three domains: statistical, temporal and spectral. The two methods to extract features explained above expect a configuration file - ``feat_dict`` - a dictionary containing which features and hyperparameters will be used.
The ``TSFEL`` complete feature set includes features from the statistical, temporal, spectral, and fractal domains. Some features also have their own hyperparameters, such as the number of bins for the histogram or the wavelet family. Information regarding which features to extract and their hyperparameters is stored in a JSON feature configuration file.

Bellow, we list four examples to set up the configuration dictionary.
We provide convenient methods to easily set up JSON configuration files to extract the complete feature set or feature sets related to individual domains. An example is provided below:

.. code:: python
Expand All @@ -116,10 +185,11 @@ Bellow, we list four examples to set up the configuration dictionary.
cgf_file = tsfel.get_features_by_domain("statistical") # All statistical domain features will be extracted
cgf_file = tsfel.get_features_by_domain("temporal") # All temporal domain features will be extracted
cgf_file = tsfel.get_features_by_domain("spectral") # All spectral domain features will be extracted
cgf_file = tsfel.get_features_by_domain("fractal") # All fractal domain features will be extracted
In case you want a customised set of features or a combination of features from several domains, you can need to edit the configuration dictionary (JSON). You must edit the value of the key ``use`` to ``yes`` or ``no`` as appropriate. You can load any of the previous configuration dictionaries and set to ``"use": "no"`` the features you are not interested in or edit a dictionary manually or programmatically and set the ``use`` as ``yes`` or ``no`` as appropriate. An example file is available `here <https://github.com/fraunhoferportugal/tsfel/blob/development/tsfel/feature_extraction/features.json/>`_.
To create a customized set of features, you need to edit the JSON feature configuration file accordingly by setting the value of the ``use`` key to ``yes`` or ``no`` as appropriate. Some feature extractors also have a ``parameter`` key to configure individual settings. An example file is available `here <https://github.com/fraunhoferportugal/tsfel/blob/development/tsfel/feature_extraction/features.json/>`_.

References
----------
Usage example
-------------

.. [1] `https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones <https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones>`_.
We provide an `example notebook <https://colab.research.google.com/github/fraunhoferportugal/tsfel/blob/master/notebooks/TSFEL_HAR_Example.ipynb>`_ that summarizes the complete development stack using ``TSFEL``, including feature extraction, modeling, and model evaluation applied to a Human Activity Recognition (HAR) dataset.
1 change: 1 addition & 0 deletions docs/descriptions/modules.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,4 @@ Module Reference

modules/tsfel.feature_extraction
modules/tsfel.utils
modules/tsfel.datasets
15 changes: 15 additions & 0 deletions docs/descriptions/modules/tsfel.datasets.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
datasets
========

.. automodule:: tsfel.datasets
:members:
:undoc-members:
:show-inheritance:

_single_problem_loaders
-----------------------

.. automodule:: tsfel.datasets._single_problem_loaders
:members:
:undoc-members:
:show-inheritance:
Binary file added docs/imgs/tsfel_feature_sets.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/imgs/tsfel_feature_sets_squared.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/imgs/tsfel_visual_guide.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 953b363

Please sign in to comment.