Add note on why we need to use dill instead of pickle #68

Open
kostaleonard opened this issue Sep 26, 2022 · 1 comment
Labels: documentation (Improvements or additions to documentation)

Comments

@kostaleonard
Owner

We needed to use the dill package instead of the builtin pickle module because pickle serializes classes by reference (module and qualified name) rather than by value, so it does not recursively serialize an object's method implementations. When a pickled object is loaded, any methods that the user has since redefined resolve to the new version, not the version belonging to the object at serialization time. To understand this problem, see test_serialized_data_processor_uses_original_methods() in test_serialization.py. If dill is replaced with pickle, the test fails because the object loaded from the serialized representation uses the redefined methods.

Serialization needs to capture all method implementations because a user may redefine a class in a new version of a project, which would make it impossible to know which version of the data processor class produced the versioned dataset. Saving the commit is not sufficient because the changes could be uncommitted.

Consider the following scenario. The user defines a data processor subclass and produces a versioned dataset. Later, the user decides that the versioned dataset should use a different representation and changes the data processor. If the original data processor is then loaded and its methods were not serialized along with it (recursive serialization), it will use the redefined methods and will not be able to transform new data to match the representation of its versioned dataset. Unless the user knows the exact version of the data processor that corresponded to the versioned dataset (and this version is not necessarily tied to any commit), it is impossible to perform prediction on new data.
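The scenario can be reproduced with the standard library alone. The following is a minimal sketch, not the project's actual test: the module name `demo_processors`, the `DataProcessor` class, and its `transform` method are hypothetical stand-ins, and the two source strings simulate two versions of the user's project code.

```python
import pickle
import sys
import types

# Hypothetical module standing in for the user's project code.
project = types.ModuleType("demo_processors")
sys.modules["demo_processors"] = project

# Two versions of the same (hypothetical) data processor class.
V1_SOURCE = """
class DataProcessor:
    def transform(self, x):
        return x * 2
"""

V2_SOURCE = """
class DataProcessor:
    def transform(self, x):
        return x + 100
"""

# Version 1 of the project: define the class and serialize an instance.
exec(V1_SOURCE, project.__dict__)
processor = project.DataProcessor()
result_before = processor.transform(3)  # 6
blob = pickle.dumps(processor)

# The pickle records the class by name; the method is not stored at all.
stores_class_name = b"DataProcessor" in blob  # True
stores_method = b"transform" in blob          # False

# Version 2 of the project: the user redefines the class.
exec(V2_SOURCE, project.__dict__)

# Loading looks the class up by name and finds the redefined version.
loaded = pickle.loads(blob)
result_after = loaded.transform(3)  # 103, not 6
```

The loaded object silently picks up the version 2 `transform`, which is the failure mode described above. Serializing the class definition together with the object avoids this lookup by name.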

kostaleonard added the documentation label Sep 26, 2022
kostaleonard added this to the Second release milestone Sep 26, 2022
kostaleonard self-assigned this Sep 26, 2022
@kostaleonard
Owner Author

We needed to use the dill package instead of the builtin pickle module because pickle serializes classes by reference rather than by value (no recursive serialization of classes), so any methods that the user redefines will resolve to the new version, not the version belonging to the object at serialization time.

See the pickle docs, Stack Overflow, and dill for more information.
