You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs git reset --hard <other-branch> while another user B is editing a collaborative document within the same repository. A's git reset deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.
Please see #240 for information on how OOB changes are detected & signaled. This issue focuses exclusively on how OOB changes are handled by the backend.
Description
The current implementation of how OOB changes can be improved to potentially reduce data loss and to greatly improve performance on long-running collaboration sessions.
Context: how OOB changes are handled currently
The handling of OOB changes is implemented in DocumentRoom, mostly in the _on_outofband_change() method. The relevant portions of its source are:
classDocumentRoom(YRoom):
"""A Y room for a possibly stored document (e.g. a notebook)."""def__init__(
self, ...
):
...
# Listen for document changesself._document.observe(self._on_document_change)
self._file.observe(self.room_id, self._on_outofband_change)
asyncdef_on_outofband_change(self) ->None:
""" Called when the file got out-of-band changes. """self.log.info("Out-of-band changes. Overwriting the content in room %s", self._room_id)
self._emit(LogLevel.INFO, "overwrite", "Out-of-band changes. Overwriting the room.")
try:
model=awaitself._file.load_content(self._file_format, self._file_type)
exceptExceptionase:
msg=f"Error loading content from file: {self._file.path}\n{e!r}"self.log.error(msg, exc_info=e)
self._emit(LogLevel.ERROR, None, msg)
returnasyncwithself._update_lock:
self._document.source=model["content"]
self._document.dirty=False
From the above source, we see that when an OOB change occurs, DocumentRoom will fetch the latest file contents. If that does not raise an exception, then the most important block is reached, which is highlighted below. When an OOB change occurs, this block sets the source of the YDoc to the latest file contents:
To understand this, we need to know what A) what self._document is, and B) what happens when self._document.source is set.
self._document is a Ydoc from jupyter_ydoc. jupyter_ydoc provides YDoc-like classes that are wrappers around pycrdt.Doc, each of which implement a unique interface for a specific file type.
jupyter_ydoc currently provides three types of Ydoc-like classes:
YUnicode/YFile: a YDoc that represents a UTF-8 plaintext file
YNotebook: a YDoc that represents a notebook file
YBlob: a YDoc that represents a blob file
We will refer to these classes provided by jupyter_ydoc as Jupyter Ydocs, for specificity.
We now know what self._document is. It is one of the 3 Jupyter Ydocs, all of which inherit from jupyter_ydoc.YBaseDoc. Now we need to know what happens when self._document.source is set.
Here are the relevant portions of YBaseDoc's source that determine the behavior of setting its source attribute:
classYBaseDoc(ABC):
""" Base YDoc class. This class, defines the minimum API that any document must provide to be able to get and set the content of the document as well as subscribe to changes in the document. """
...
@propertydefsource(self) ->Any:
""" Returns the content of the document. :return: The content of the document. :rtype: Any """returnself.get()
@source.setterdefsource(self, value: Any):
""" Sets the content of the document. :param value: The content of the document. :type value: Any """returnself.set(value)
@abstractmethoddefget(self) ->Any:
""" Returns the content of the document. :return: Document's content. :rtype: Any """@abstractmethoddefset(self, value: Any) ->None:
""" Sets the content of the document. :param value: The content of the document. :type value: Any """
As shown above, get() and set() are implemented in the YDoc class directly inheriting from YBaseDoc. This means that ultimately, the handling of an OOB change depends on the file type, and is determined by the definition of set() in the corresponding Jupyter Ydoc.
Issue part 1: Verify that OOB change handling doesn't lead to data loss
I have looked through the latest implementation of the set() method on YUnicode, YNotebook, YBlob, and could not find any immediate concerns. Each implementation flags all of the existing items for deletion, and then inserts the new content from the beginning. This does not delete the Ydoc and therefore should not result in any data loss. I've also tested the OOB change handling in the latest version of jupyter_collaboration with this one-liner shell script:
sleep 5; cat Untitled.ipynb | jq '.cells[0].source[0] = "i am an out-of-band edit!"'> temp.ipynb && mv temp.ipynb Untitled.ipynb
This script sleeps 5 seconds, then makes an OOB change that sets the content of the first cell to a fixed string. I ran this script and immediately began making edits to the first cell until the OOB change occurred. This did not lead to any data loss for me locally.
While I have not yet found any specific code paths that lead to data loss in the latest implementation, I think that it remains debatable whether the current OOB change handling leads to data loss in certain scenarios. This is supported by the fact that myself and others have noticed OOB changes being logged by the extension immediately prior to data loss incidents. See #219 as an example.
Issue part 2: Each OOB change grows the Ydoc by at least the new file size
The implementations of set(), while correct, have the unfortunate consequence of growing the Ydoc's size by a factor of the new file size. That is, an OOB change on a 0.5 MB notebook will grow the Ydoc by at least 0.5 MB.
Wait a few seconds, then record the file size of .jupyter_ystore.db again.
On my machine, the Ystore starts at 12K and grows to 20K after making 5 OOB changes. I expected a growth in size of at least 5K, and observed a growth of 8K. This test confirms that currently, each OOB change grows the Ydoc and Ystore by at least the new file size.
Larger Ydocs lead to poorer performance. The most significant consequence is that a larger Ydoc results in a slower connection for new users, as the entire Ydoc is streamed to new clients. I will elaborate on this in a future issue concerning the YStore specifically.
Proposed next steps
Increase confidence in the claim that OOB changes do not lead to data loss. To do so, we should:
Add test coverage of OOB change handling.
Simplify OOB change handling code.
Reduce the growth in YStore size on OOB changes
For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.
It is also worth considering a more general configurable that allows for other ways of handling OOB changes. For example, we may want DocumentRoom either to a) completely ignore OOB changes, or b) to delete the YDoc from memory, drop the YStore table, and begin anew.
Credit to @ellisonbg for some of these suggestions. Thank you!
The text was updated successfully, but these errors were encountered:
For example, for "small" changes, it is preferable to compute a diff and only apply that diff as a Yjs item. This would drastically reduce the growth in YStore size per OOB change. This may be implemented as a boolean configurable to avoid a breaking change, though I personally do not see this as a breaking change.
What you call "computing a diff" would require interpreting OOB changes and map them to e.g. a notebook schema, which is not an easy task. In this issue I proposed a bridge between different document structures, that is basically solving the same issue.
To give you an idea of the complexity, say an OOB inserts a character in a notebook cell. Now you need to infer the cell index where the change applies, the position of the added character in the cell source, and translate that into a Y operation (fake API here):
For more complex changes, it will even be ambiguous to infer changes. See for instance how git diff cannot sometimes infer that something has moved, and it will just mark the change as a deletion and an addition (and git works with lines of text, which is a very basic document structure).
Introduction
(this section is mostly copied from #240)
An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs
git reset --hard <other-branch>
while another user B is editing a collaborative document within the same repository. A'sgit reset
deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.Please see #240 for information on how OOB changes are detected & signaled. This issue focuses exclusively on how OOB changes are handled by the backend.
Description
The current implementation of how OOB changes can be improved to potentially reduce data loss and to greatly improve performance on long-running collaboration sessions.
Context: how OOB changes are handled currently
The handling of OOB changes is implemented in
DocumentRoom
, mostly in the_on_outofband_change()
method. The relevant portions of its source are:From the above source, we see that when an OOB change occurs,
DocumentRoom
will fetch the latest file contents. If that does not raise an exception, then the most important block is reached, which is highlighted below. When an OOB change occurs, this block sets the source of the YDoc to the latest file contents:To understand this, we need to know what A) what
self._document
is, and B) what happens whenself._document.source
is set.self._document
is a Ydoc fromjupyter_ydoc
.jupyter_ydoc
provides YDoc-like classes that are wrappers aroundpycrdt.Doc
, each of which implement a unique interface for a specific file type.jupyter_ydoc
currently provides three types of Ydoc-like classes:YUnicode/YFile
: a YDoc that represents a UTF-8 plaintext fileYNotebook
: a YDoc that represents a notebook fileYBlob
: a YDoc that represents a blob fileWe will refer to these classes provided by
jupyter_ydoc
as Jupyter Ydocs, for specificity.We now know what
self._document
is. It is one of the 3 Jupyter Ydocs, all of which inherit fromjupyter_ydoc.YBaseDoc
. Now we need to know what happens whenself._document.source
is set.Here are the relevant portions of
YBaseDoc
's source that determine the behavior of setting itssource
attribute:As shown above,
get()
andset()
are implemented in the YDoc class directly inheriting fromYBaseDoc
. This means that ultimately, the handling of an OOB change depends on the file type, and is determined by the definition ofset()
in the corresponding Jupyter Ydoc.Issue part 1: Verify that OOB change handling doesn't lead to data loss
I have looked through the latest implementation of the
set()
method onYUnicode
,YNotebook
,YBlob
, and could not find any immediate concerns. Each implementation flags all of the existing items for deletion, and then inserts the new content from the beginning. This does not delete the Ydoc and therefore should not result in any data loss. I've also tested the OOB change handling in the latest version ofjupyter_collaboration
with this one-liner shell script:This script sleeps 5 seconds, then makes an OOB change that sets the content of the first cell to a fixed string. I ran this script and immediately began making edits to the first cell until the OOB change occurred. This did not lead to any data loss for me locally.
While I have not yet found any specific code paths that lead to data loss in the latest implementation, I think that it remains debatable whether the current OOB change handling leads to data loss in certain scenarios. This is supported by the fact that myself and others have noticed OOB changes being logged by the extension immediately prior to data loss incidents. See #219 as an example.
Issue part 2: Each OOB change grows the Ydoc by at least the new file size
The implementations of
set()
, while correct, have the unfortunate consequence of growing the Ydoc's size by a factor of the new file size. That is, an OOB change on a 0.5 MB notebook will grow the Ydoc by at least 0.5 MB.You can confirm this locally with these steps:
1kb.txt
locally.jupyter_collaboration
installed.1kb.txt
through JupyterLab..jupyter_ystore.db
.ls -lah . | grep 'ystore.db'
..jupyter_ystore.db
again.On my machine, the Ystore starts at 12K and grows to 20K after making 5 OOB changes. I expected a growth in size of at least 5K, and observed a growth of 8K. This test confirms that currently, each OOB change grows the Ydoc and Ystore by at least the new file size.
Larger Ydocs lead to poorer performance. The most significant consequence is that a larger Ydoc results in a slower connection for new users, as the entire Ydoc is streamed to new clients. I will elaborate on this in a future issue concerning the YStore specifically.
Proposed next steps
Increase confidence in the claim that OOB changes do not lead to data loss. To do so, we should:
Reduce the growth in YStore size on OOB changes
DocumentRoom
either to a) completely ignore OOB changes, or b) to delete the YDoc from memory, drop the YStore table, and begin anew.The text was updated successfully, but these errors were encountered: