We needed to use the dill package instead of the built-in pickle module because pickle does not support recursive serialization of classes: it stores classes by reference, so any methods the user redefines will resolve to the new version rather than the version belonging to the object at serialization time. To see this problem in action, see test_serialized_data_processor_uses_original_methods() in test_serialization.py. If dill is switched to pickle, the test fails because the object loaded from the serialized representation uses the redefined methods.
We need serialization to capture all method implementations because a user may redefine a class in a new version of a project, making it impossible to know which version of the data processor class was used to produce the versioned dataset. Saving the commit is not sufficient because the relevant changes could be uncommitted.
Consider the following scenario. The user defines a data processor subclass and produces a versioned dataset. Later, the user decides that the versioned dataset should use a different representation and changes the data processor. If the original data processor is loaded and its methods were not also serialized (recursive serialization), it will use the redefined methods, and the deserialized data processor will not be able to transform new data to match its versioned dataset's representation. Unless the user knows the exact version of the data processor that corresponded to the versioned dataset (a version that is not necessarily tied to any commit), it is impossible to perform prediction on new data.
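The failure mode above can be sketched with the built-in pickle module alone (the class and method names here are hypothetical stand-ins, not the project's actual API). Because pickle serializes an instance's class by reference, the round-tripped object silently picks up whatever definition the class name points to at load time:

```python
import pickle

class DataProcessor:
    def transform(self, x):
        # Original representation: double the input.
        return x * 2

proc = DataProcessor()
payload = pickle.dumps(proc)

# The user later redefines the class, simulating an edited project.
class DataProcessor:
    def transform(self, x):
        # New representation: triple the input.
        return x * 3

restored = pickle.loads(payload)

# The in-memory object still holds a reference to the original class...
print(proc.transform(10))      # 20
# ...but the deserialized object resolves the class name afresh and
# gets the redefined method, not the one it was serialized with.
print(restored.transform(10))  # 30
```

A package like dill avoids this (for classes defined in `__main__`) by serializing the class definition itself by value, so the restored object keeps the original method implementations.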