You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The climate data sources (ERA5DaliExternalSource and ClimateDataSourceSpec) both specify a stride parameter to read data at strides larger than what's stored in the dataset files.
But the stride is only applied when reading multiple time steps (num_steps>1)
The first index that is read remains independent of the stride. Example: https://github.com/NVIDIA/modulus/blob/main/modulus/datapipes/climate/era5_hdf5.py#L598
My understanding of the stride parameter is that it should allow reading subsamples of the data source even for the first index.
If that's not the expected usage, this bug could instead be viewed as a feature enhancement, for what a quick addition to allow coarser-granularity subsets of the same dataset on disk (e.g. read every 6 hours from a 1-hour dataset on disk)
Minimum reproducible example
No response
Relevant log output
No response
Environment details
No response
The text was updated successfully, but these errors were encountered:
I've looked at this a bit more and I think if we agree with the interpretation of stride here, then it should be sufficient to fix it by loading data[self.stride * in_idx] instead of data[in_idx]. But it may also be reasonable if one interprets the currrent definition to be consistent with the implementation.
Overall, such behavior may be left better as something for the user to extend on their own. Both the above implementations may be relevant to different users, and it's hard for them to confirm the behavior without looking at the code. If existing behavior is not what they want, then it's hard to extend the datapipe due to lack of modularity. An alternate design proposed has been proposed internally & may be more appropriate.
Version
0.7
On which installation method(s) does this occur?
No response
Describe the issue
The climate data sources (
ERA5DaliExternalSource
andClimateDataSourceSpec
) both specify astride
parameter to read data at strides larger than what's stored in the dataset files.But the stride is only applied when reading multiple time steps (
num_steps
>1)The first index that is read remains independent of the stride. Example: https://github.com/NVIDIA/modulus/blob/main/modulus/datapipes/climate/era5_hdf5.py#L598
My understanding of the stride parameter is that it should allow reading subsamples of the data source even for the first index.
If that's not the expected usage, this bug could instead be viewed as a feature enhancement, for what a quick addition to allow coarser-granularity subsets of the same dataset on disk (e.g. read every 6 hours from a 1-hour dataset on disk)
Minimum reproducible example
No response
Relevant log output
No response
Environment details
No response
The text was updated successfully, but these errors were encountered: