PopulationSim Performance Improvements #188

Open
5 tasks
dhensle opened this issue Oct 25, 2024 · 0 comments


dhensle commented Oct 25, 2024

While the need for multiprocessing in very large-scale applications remains, there are some inefficiencies in the existing code that could be addressed to reduce computation time and/or memory use, thus reducing the need for multiprocessing in small and medium-sized applications. More lightweight, simpler-to-use software lends itself to faster application work and broader usage.
The following updates should be performed:

  • Vectorizing for loops over arrays – Because Python is an interpreted language, looping can carry a substantial performance cost: each iteration is interpreted at runtime. Where a static array is being evaluated, significant speedups can be achieved by vectorizing the loop into a function applied to the entire array rather than looping over individual values.
  • Memory-reducing “on-the-fly” calculations – In contrast to the previous point, some loops hold an entire expanded data structure in memory (e.g., household expansion) when only the data relevant to each step is needed; computing these values on the fly would substantially reduce memory requirements.
  • More rigorous type checking – There are instances where dynamic data types are not safely handled in the code. For example, when a generic Python integer is stored in a fixed-width type, the smallest width may be chosen to conserve memory. This can introduce errors when a larger integer is assigned to a variable initialized at a smaller width (e.g., an int64 value into an int32). The issue tends to occur with large unique IDs, such as Census Block Group GEOIDs.
  • Hard constraints for data weighting – The calculated maximum expansion factor can exceed the value set in settings.yaml because of how the upper-bound weights are calculated. Adding a hard constraint is preferable in some weighting applications.
  • CVXPY timeout bug – The CVXPY package provides the API interface to many of the key optimization tools used, particularly for integerization. The API has a time limit before the connection to the optimizer is closed, which can be problematic in certain multiprocessing cases because one session may reach the time limit while waiting for another session to complete. Those failed cases then fall back to a different method, yielding different results. Thus, depending on the order in which zones are processed and which of them fail, results can vary both between multiprocessing runs and relative to a single-threaded run. Ideally, the time limit could be extended, or the specific error could at least be caught in an exception handler so that the user is aware that they may need to modify their approach.
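To illustrate the first point, here is a minimal sketch (not code from PopulationSim) contrasting an interpreted per-element loop with the equivalent NumPy vectorized expression; the function names are hypothetical:

```python
import numpy as np

# Loop version (slow): one interpreted Python iteration per element.
def scale_loop(arr):
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] * 2.0 + 1.0
    return out

# Vectorized version (fast): a single expression evaluated over the whole
# array in NumPy's compiled routines.
def scale_vectorized(arr):
    return arr * 2.0 + 1.0

values = np.arange(1_000_000, dtype=np.float64)
assert np.array_equal(scale_loop(values[:100]), scale_vectorized(values[:100]))
```

On arrays of this size the vectorized form is typically one to two orders of magnitude faster, since the per-element work moves out of the Python interpreter.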
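The on-the-fly pattern in the second point can be sketched with a generator: rather than materializing the fully expanded household table at once, yield each household's expanded block and aggregate immediately. The data and function names below are hypothetical stand-ins, not PopulationSim's actual expansion code:

```python
import numpy as np

hh_ids = np.array([101, 102, 103])
weights = np.array([2, 1, 3])  # integer expansion weights per household

def expand_in_memory(ids, w):
    # Materializes the whole expanded array at once: O(sum(w)) memory.
    return np.repeat(ids, w)

def expand_streaming(ids, w):
    # Yields one household's expansion at a time: O(max(w)) memory,
    # assuming each block is consumed and discarded before the next.
    for hh, n in zip(ids, w):
        yield np.full(n, hh)

# Both approaches produce the same total number of expanded rows.
total_rows = sum(len(block) for block in expand_streaming(hh_ids, weights))
assert total_rows == len(expand_in_memory(hh_ids, weights))
```

The trade-off is that streaming only helps when downstream code can consume the expansion incrementally; steps that genuinely need the full table would be unaffected.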
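The type-checking hazard in the third point can be demonstrated with NumPy's silent downcast: a 12-digit GEOID wraps around when cast to int32 with no error raised. The GEOID value and the `safe_downcast` helper are illustrative, not part of PopulationSim:

```python
import numpy as np

# Hypothetical 12-digit Census Block Group GEOID; int32 tops out at ~2.1e9.
geoid = np.array([360470201001], dtype=np.int64)

truncated = geoid.astype(np.int32)  # silent wraparound, no error raised
assert truncated[0] != geoid[0]     # the ID has been silently corrupted

# Safer pattern: verify the target dtype can actually hold the values.
def safe_downcast(arr, dtype):
    info = np.iinfo(dtype)
    if arr.min() < info.min or arr.max() > info.max:
        raise OverflowError(f"values do not fit in {np.dtype(dtype)}")
    return arr.astype(dtype)
```

A check like this at the point of dtype assignment turns a silent data corruption into a loud, debuggable failure.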
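The hard constraint in the fourth point could be as simple as clipping the derived upper-bound weights against the configured maximum expansion factor before balancing. All values below are hypothetical; `max_expansion_factor` mirrors the settings.yaml option, and `derived_upper` stands in for whatever bound the balancer currently computes:

```python
import numpy as np

max_expansion_factor = 5.0                   # as set in settings.yaml
initial_weights = np.array([1.0, 2.0, 4.0])
derived_upper = np.array([6.0, 9.0, 25.0])   # bound computed by the balancer

# Hard cap: the upper bound can never exceed initial_weight * max factor.
hard_upper = np.minimum(derived_upper, initial_weights * max_expansion_factor)
assert np.all(hard_upper <= initial_weights * max_expansion_factor)
```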
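For the CVXPY timeout, the "at least catch the specific error" option might look like the following pure-Python sketch. `SolverTimeoutError`, `solve_with_cvxpy`, and `fallback_rounding` are hypothetical stand-ins for the actual CVXPY/solver error class and PopulationSim's integerization routines:

```python
import warnings

class SolverTimeoutError(Exception):
    """Stand-in for the specific CVXPY/solver timeout error to catch."""

def integerize(zone, solve_with_cvxpy, fallback_rounding):
    try:
        return solve_with_cvxpy(zone)
    except SolverTimeoutError as err:
        # Warn loudly instead of silently falling back, so the user knows
        # results for this zone may differ from a single-process run.
        warnings.warn(
            f"zone {zone}: optimizer timed out ({err}); falling back to "
            "simple rounding - consider fewer processes or a longer limit"
        )
        return fallback_rounding(zone)
```

Extending the solver's time limit (where the chosen backend supports one) would address the root cause; the pattern above only makes the failure visible and reproducible.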
@jpn-- jpn-- moved this to Punt in Phase 10A Nov 5, 2024