Set data source to handle pd.DataFrame correctly #301

Open
bangxiangyong opened this issue Dec 29, 2022 · 1 comment
Comments

@bangxiangyong
Member

In the function set_data_source of base_streams.py, the handling of pd.DataFrame might not be ideal, since it converts the DataFrame into a numpy array with the following snippet:

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    self._quantities = self._quantities.to_numpy()
    self._n_samples = self._quantities.shape[0]

This might confuse users and raise errors when the intended output is a DataFrame rather than a numpy array. I believe the conversion to a numpy array should be omitted:

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    self._n_samples = self._quantities.shape[0]

Here's the full code for the modified set_data_source function.

    def set_data_source(
        self,
        quantities: Union[List, DataFrame, np.ndarray] = None,
        target: Optional[Union[List, DataFrame, np.ndarray]] = None,
        time: Optional[Union[List, DataFrame, np.ndarray]] = None,
    ):
        """
        This sets the data source by providing up to three iterables: ``quantities``,
        ``time`` and ``target``, which are assumed to be aligned.

        For sensor data, we assume:
        The format shape for a 2D data stream is (timesteps, n_sensors).
        The format shape for a 3D data stream is (num_cycles, timesteps, n_sensors).

        Parameters
        ----------
        quantities : Union[List, DataFrame, np.ndarray]
            Measured quantities such as sensor readings.

        target : Optional[Union[List, DataFrame, np.ndarray]]
            Target label in the context of machine learning. This can be the
            Remaining Useful Life in a predictive maintenance application. Note that
            this can be an unobservable variable in real time and applies only to
            validation during offline analysis.

        time : Optional[Union[List, DataFrame, np.ndarray]]
            ``dtype`` can be either ``float`` or ``datetime64`` to indicate the time
            when the ``quantities`` were measured.

        """
        self._sample_idx = 0
        self._current_sample_quantities = None
        self._current_sample_target = None
        self._current_sample_time = None

        if quantities is None and target is None:
            self._quantities = list(np.arange(10))
            self._target = list(np.arange(10))
            self._time = list(np.arange(10))
            self._target.reverse()
        else:
            self._quantities = quantities
            self._target = target
            self._time = time

        # infer number of samples
        if type(self._quantities).__name__ == "list":
            self._n_samples = len(self._quantities)
        elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
            self._n_samples = self._quantities.shape[0]
        elif type(self._quantities).__name__ == "ndarray":
            self._n_samples = self._quantities.shape[0]
        self._set_data_source_type("dataset")
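
To illustrate the effect of the proposed change, here is a minimal, self-contained sketch (the MiniStream class and the column names are purely hypothetical stand-ins, not the actual class from base_streams.py): with the conversion omitted, a DataFrame passed as quantities keeps its column names and dtypes.

import pandas as pd

class MiniStream:  # hypothetical stand-in for the stream class in base_streams.py
    def set_data_source(self, quantities=None):
        self._quantities = quantities
        # infer number of samples without converting a DataFrame to numpy
        if type(self._quantities).__name__ == "list":
            self._n_samples = len(self._quantities)
        elif type(self._quantities).__name__ == "DataFrame":
            self._n_samples = self._quantities.shape[0]
        elif type(self._quantities).__name__ == "ndarray":
            self._n_samples = self._quantities.shape[0]

stream = MiniStream()
stream.set_data_source(pd.DataFrame({"sensor_1": [0.1, 0.2], "sensor_2": [1.0, 2.0]}))
print(type(stream._quantities).__name__)  # DataFrame, not ndarray
print(list(stream._quantities.columns))   # ['sensor_1', 'sensor_2']
print(stream._n_samples)                  # 2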

This could be a pull request, but I'm too occupied to start one at the moment!

@BjoernLudwigPTB
Member

Hi @bangxiangyong! Good to see you again. I guess the reason for the conversion was that, as of now, some mechanisms later in the process rely on the quantities being of type np.ndarray, although this is indeed not ideal. This could apply to printing and buffering, for instance, but I did not check. As a very first measure, I would suggest informing about the conversion and the expected output in case a DataFrame was used for initialization. In a second step, though, we should thoroughly check what would be needed to allow processing pd.DataFrames and implement the required changes.
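
A rough sketch of what that very first measure could look like in the DataFrame branch of set_data_source (the exact message and warning category are only suggestions, and import warnings would need to be added at the top of base_streams.py):

elif type(self._quantities).__name__ == "DataFrame":  # dataframe or numpy
    # tell the user that the DataFrame is converted and what to expect back
    warnings.warn(
        "quantities was provided as a pandas.DataFrame and is converted to a "
        "numpy.ndarray; column names and dtypes are not preserved.",
        UserWarning,
    )
    self._quantities = self._quantities.to_numpy()
    self._n_samples = self._quantities.shape[0]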
