Add note on why we need to use dill instead of pickle #68

kostaleonard · 2022-09-26T13:54:57Z

We use the dill package instead of the built-in pickle module because pickle serializes classes by reference (module and qualified name) rather than by value. When a pickled object is loaded, its methods resolve to whatever class definition is current, so any method the user has since redefined runs in its new form, not the form the object had at serialization time. dill can serialize the class definition, including its methods, along with the object; this is what we mean by recursive serialization. To understand the problem, see test_serialized_data_processor_uses_original_methods() in test_serialization.py. If dill is swapped for pickle, the test fails because the object loaded from the serialized representation uses the redefined methods.
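To make the "by reference" point concrete, here is a minimal sketch that inspects what pickle actually writes. It uses the standard library's `Fraction` purely as a stand-in for a data processor class (an assumption for illustration; it is not part of this project) and lists the string arguments in the pickle stream.

```python
import pickle
import pickletools
from fractions import Fraction

# Fraction is only a stand-in for a user-defined class.
payload = pickle.dumps(Fraction(1, 3), protocol=4)

# Collect every string argument stored in the pickle stream.
string_args = [
    arg
    for _opcode, arg, _pos in pickletools.genops(payload)
    if isinstance(arg, str)
]

# The stream names the class ("fractions", "Fraction") but carries
# none of its method code; the class is looked up by name on load.
print(string_args)
```

Because only the module and class names are stored, whatever definition those names point to at load time is the one that gets used.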

Serialization must capture every method implementation because a user may redefine a class in a later version of a project, after which there is no way to know which version of the data processor class produced the versioned dataset. Recording the commit is not sufficient, because the changes could have been uncommitted.

Consider the following scenario. The user defines a data processor subclass and produces a versioned dataset. Later, the user decides that the versioned dataset should use a different representation and changes the data processor. If the original data processor is then loaded and its methods were not serialized with it (no recursive serialization), it will use the redefined methods and will not be able to transform new data to match its versioned dataset's representation. Unless the user knows the exact version of the data processor that corresponded to the versioned dataset (and this version is not necessarily tied to any commit), it is impossible to perform prediction on new data.
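The scenario can be reproduced with pickle alone. The sketch below is illustrative only: the module name `demo_processors`, the class `DataProcessor`, and its `transform` method are hypothetical stand-ins for a user's project code, and the two source strings simulate editing that code between saving and loading.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the user's project module.
module = types.ModuleType("demo_processors")
sys.modules["demo_processors"] = module

VERSION_1 = """
class DataProcessor:
    def transform(self, x):
        return x * 2
"""

VERSION_2 = """
class DataProcessor:
    def transform(self, x):
        return x + 100
"""

# The user defines the processor and serializes it with pickle.
exec(VERSION_1, module.__dict__)
processor = module.DataProcessor()
original_result = processor.transform(3)  # 6
payload = pickle.dumps(processor)

# Later, the user redefines the class (possibly without committing).
exec(VERSION_2, module.__dict__)

# Loading resolves the class by name, so the new method is used.
restored = pickle.loads(payload)
restored_result = restored.transform(3)  # 103, not 6

print(original_result, restored_result)
```

The restored object silently picks up the redefined `transform`, which is the behavior test_serialized_data_processor_uses_original_methods() is meant to guard against.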

kostaleonard · 2022-12-23T13:51:08Z

We use the dill package instead of the built-in pickle module because pickle serializes classes by reference rather than by value, so any methods that the user redefines will resolve to the new version, not the version belonging to the object at serialization time.

See the pickle documentation, Stack Overflow, and the dill documentation for more information.

kostaleonard added the documentation label Sep 26, 2022

kostaleonard added this to the Second release milestone Sep 26, 2022

kostaleonard self-assigned this Sep 26, 2022
