Set data source to handle pd.DataFrame correctly #301

Open
bangxiangyong opened this issue Dec 29, 2022 · 1 comment
Comments

@bangxiangyong
Member

In the function set_data_source of base_streams.py, the handling of pd.DataFrame might not be ideal, since it converts the DataFrame into a numpy array with the following snippet:

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    self._quantities = self._quantities.to_numpy()
    self._n_samples = self._quantities.shape[0]

This might confuse users and raise errors when the intended output is a DataFrame rather than a numpy array. I believe the conversion to a numpy array should be omitted:

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    self._n_samples = self._quantities.shape[0]

Here's the full code for the modified set_data_source function.

    def set_data_source(
        self,
        quantities: Union[List, DataFrame, np.ndarray] = None,
        target: Optional[Union[List, DataFrame, np.ndarray]] = None,
        time: Optional[Union[List, DataFrame, np.ndarray]] = None,
    ):
        """
        This sets the data source by providing up to three iterables: ``quantities``,
        ``time`` and ``target``, which are assumed to be aligned.

        For sensor data, we assume:
        The format shape for a 2D data stream is (timesteps, n_sensors).
        The format shape for a 3D data stream is (num_cycles, timesteps, n_sensors).

        Parameters
        ----------
        quantities : Union[List, DataFrame, np.ndarray]
            Measured quantities such as sensor readings.

        target : Optional[Union[List, DataFrame, np.ndarray]]
            Target label in the context of machine learning. This can be the
            Remaining Useful Life in a predictive maintenance application. Note that
            this can be an unobservable variable in real time and applies only to
            validation during offline analysis.

        time : Optional[Union[List, DataFrame, np.ndarray]]
            ``dtype`` can be either ``float`` or ``datetime64`` to indicate the time
            when the ``quantities`` were measured.

        """
        self._sample_idx = 0
        self._current_sample_quantities = None
        self._current_sample_target = None
        self._current_sample_time = None

        if quantities is None and target is None:
            self._quantities = list(np.arange(10))
            self._target = list(np.arange(10))
            self._time = list(np.arange(10))
            self._target.reverse()
        else:
            self._quantities = quantities
            self._target = target
            self._time = time

        # infer number of samples
        if type(self._quantities).__name__ == "list":
            self._n_samples = len(self._quantities)
        elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
            self._n_samples = self._quantities.shape[0]
        elif type(self._quantities).__name__ == "ndarray":
            self._n_samples = self._quantities.shape[0]
        self._set_data_source_type("dataset")
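
To illustrate the effect of the proposed change, here is a minimal, self-contained sketch (the MiniStream class and the column names are purely hypothetical stand-ins, not the actual class from base_streams.py): with the conversion omitted, a DataFrame passed as quantities keeps its column names and dtypes.

import pandas as pd

class MiniStream:  # hypothetical stand-in for the stream class in base_streams.py
    def set_data_source(self, quantities=None):
        self._quantities = quantities
        # infer number of samples without converting a DataFrame to numpy
        if type(self._quantities).__name__ == "list":
            self._n_samples = len(self._quantities)
        elif type(self._quantities).__name__ == "DataFrame":
            self._n_samples = self._quantities.shape[0]
        elif type(self._quantities).__name__ == "ndarray":
            self._n_samples = self._quantities.shape[0]

stream = MiniStream()
stream.set_data_source(pd.DataFrame({"sensor_1": [0.1, 0.2], "sensor_2": [1.0, 2.0]}))
print(type(stream._quantities).__name__)  # DataFrame, not ndarray
print(list(stream._quantities.columns))   # ['sensor_1', 'sensor_2']
print(stream._n_samples)                  # 2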

This could be a pull request, but I'm too occupied to start one at the moment!

@BjoernLudwigPTB
Member

Hi @bangxiangyong! Good to see you again. I guess the reason for the conversion was that, as of now, some mechanisms later in the process rely on the quantities being of type np.ndarray, although this is indeed not ideal. This could apply to printing and buffering, for instance, but I did not check. As a very first measure, I would suggest informing about the conversion and the expected output in case a DataFrame was used for initialization. In a second step, though, we should thoroughly check what would be needed to allow processing pd.DataFrames and implement the required changes.
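
A rough sketch of what that very first measure could look like in the DataFrame branch of set_data_source (the exact message and warning category are only suggestions, and import warnings would need to be added at the top of base_streams.py):

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    # tell the user that the DataFrame is converted and what to expect back
    warnings.warn(
        "quantities was provided as a pandas.DataFrame and is converted to a "
        "numpy.ndarray; column names and dtypes are not preserved.",
        UserWarning,
    )
    self._quantities = self._quantities.to_numpy()
    self._n_samples = self._quantities.shape[0]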
