Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛[BUG]: Climate data sources do not set stride for first index #653

Open
sahnimanas opened this issue Aug 23, 2024 · 2 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@sahnimanas
Copy link
Contributor

Version

0.7

On which installation method(s) does this occur?

No response

Describe the issue

The climate data sources (ERA5DaliExternalSource and ClimateDataSourceSpec) both specify a stride parameter to read data at strides larger than what's stored in the dataset files.
But the stride is only applied when reading multiple time steps (num_steps>1)
The first index that is read remains independent of the stride. Example: https://github.com/NVIDIA/modulus/blob/main/modulus/datapipes/climate/era5_hdf5.py#L598

My understanding of the stride parameter is that it should allow reading subsamples of the data source even for the first index.
If that's not the expected usage, this bug could instead be viewed as a feature enhancement, for what a quick addition to allow coarser-granularity subsets of the same dataset on disk (e.g. read every 6 hours from a 1-hour dataset on disk)

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

@sahnimanas sahnimanas added ? - Needs Triage Need team to review and classify bug Something isn't working labels Aug 23, 2024
@mnabian
Copy link
Collaborator

mnabian commented Oct 17, 2024

@loliverhennigh could you please take a look at this issue?

@mnabian mnabian removed the ? - Needs Triage Need team to review and classify label Oct 17, 2024
@sahnimanas
Copy link
Contributor Author

I've looked at this a bit more and I think if we agree with the interpretation of stride here, then it should be sufficient to fix it by loading data[self.stride * in_idx] instead of data[in_idx]. But it may also be reasonable if one interprets the currrent definition to be consistent with the implementation.

Overall, such behavior may be left better as something for the user to extend on their own. Both the above implementations may be relevant to different users, and it's hard for them to confirm the behavior without looking at the code. If existing behavior is not what they want, then it's hard to extend the datapipe due to lack of modularity. An alternate design proposed has been proposed internally & may be more appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants