Optimize the peak memory usage when loading a dataset. #514
base: main
Conversation
Thanks @Light-V. Did you observe a change in index building time or search performance with this change? I have had very bad experiences without this cast, but that was a long time ago.
@maumueller Thank you for your suggestion. I will give it a try and post the performance comparison later.
Hi @maumueller, I have run ann-benchmarks on qsg-ngt with these two different ways of loading the dataset. Here is the search-result comparison: diff result.1 result.2
I've made a small optimization in the data loading process that reduces peak memory usage when handling large datasets.
Previously, we used np.array() to convert large datasets from h5py objects into NumPy arrays. While straightforward, this operation caused a spike in memory usage because it creates a new array copy, which could lead to OOM errors when working with particularly large datasets.
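To make the copy concrete, here is a small demonstration (the array shape is illustrative, not from the PR): with its default `copy=True`, `np.array()` duplicates the buffer even when the input is already an ndarray.

```python
import numpy as np

# A stand-in for a dataset already materialized in memory (~500 MB).
train = np.ones((1_000_000, 128), dtype=np.float32)

copied = np.array(train)  # copy=True by default: allocates a second ~500 MB buffer
print(np.shares_memory(train, copied))  # False: the data was duplicated
```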
To mitigate this, I have replaced np.array() with np.asarray() in the relevant sections of the code. Unlike np.array(), np.asarray() passes the input through without creating a new copy when the input is already a NumPy array. This avoids unnecessary memory allocation and is particularly effective in scenarios where memory is a constraint.
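A minimal sketch of what the change looks like in a loader (the helper name and HDF5 keys are assumptions, not the PR's exact code):

```python
import h5py
import numpy as np

def load_split(path: str, key: str) -> np.ndarray:
    """Read one split (e.g. 'train' or 'test') from an HDF5 file."""
    with h5py.File(path, "r") as f:
        data = f[key][()]    # h5py materializes the split as a NumPy array
    # Previously: np.array(data) -- allocated a second copy of the split.
    return np.asarray(data)  # returns `data` itself: no extra allocation
```

Note that np.asarray() only avoids the copy when its input is already an ndarray of the target dtype; if it is handed an h5py Dataset directly, the data still has to be read into memory once, and requesting a different dtype forces a conversion copy either way.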