
Support streaming parsing of fragments (profile) #42

Open

sandervd opened this issue Aug 24, 2023 · 2 comments

Comments

@sandervd

While the most common read pattern for clients will be to follow the end of the log, from time to time a new client will show up that wants to sync over the full history of the stream.
As I explained in #40, the issue with open fragments is that the maximum fragment size has an exponential impact on the bandwidth and processing required on the client. That argues for creating smaller fragments, but smaller fragments have their own downside: they mean more requests to the server.
This is why it would be ideal to rewrite historical fragments (say older than a day, and therefore immutable) into larger fragments.
Fetching bigger files is much more efficient than fetching many small ones (fewer connection setups, higher compression rates, ...), especially if the HTTP headers already indicate the relations so the client can add concurrency. It has one drawback in the current form: no streaming tree:Node parser exists, so essentially the entire graph of a page has to be parsed in memory.
When compacting historical fragments into these larger graphs, this could become an issue.
This is why I would suggest a default way of structuring the data in a page, such that a stream-aware parser can parse the document incrementally and emit members as they are processed.
This would significantly reduce the memory requirements in the case of large fragments.

The layout of a page (say, using Turtle serialization, as it compresses well) could look something like this (see the sketch below the list):

  • First the stream membership statements, which are required to find the tree members.
  • Then the members, one by one, ordered first by object id and then by timestamp path
    (this allows a client that is only interested in the latest state to skip earlier versions of a member, reducing the number of upserts on the database the stream is projected into).
  • Last, the relation pointing to the next page.
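
As a minimal sketch (assuming the standard TREE/LDES vocabularies; the ex: IRIs and literal values are made up for illustration), such a page could look like this:

```turtle
@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# 1. Stream membership statements first, so the parser knows which
#    subjects are members before their triples arrive.
ex:stream a ldes:EventStream ;
  ldes:timestampPath prov:generatedAtTime ;
  tree:member ex:obs1-v1, ex:obs1-v2, ex:obs2-v1 .

# 2. Member triples, grouped per member, ordered by object id and then
#    by timestamp path.
ex:obs1-v1 prov:generatedAtTime "2023-08-01T00:00:00Z"^^xsd:dateTime ;
  ex:value 1 .
ex:obs1-v2 prov:generatedAtTime "2023-08-02T00:00:00Z"^^xsd:dateTime ;
  ex:value 2 .
ex:obs2-v1 prov:generatedAtTime "2023-08-01T06:00:00Z"^^xsd:dateTime ;
  ex:value 7 .

# 3. The relation pointing to the next page comes last.
ex:page1 a tree:Node ;
  tree:relation [
    a tree:GreaterThanRelation ;
    tree:path prov:generatedAtTime ;
    tree:value "2023-08-02T00:00:00Z"^^xsd:dateTime ;
    tree:node ex:page2
  ] .
```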

As all member triples are 'grouped', the parser can read one member at a time.

As the document would still be a normal RDF file, and the only semantics added are there to support the streaming behavior, this should be completely backwards compatible for clients that don't support streaming tree parsing.
The capability could be indicated by a statement on the view.
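
For illustration only, such a statement on the view might look like the sketch below; ex:memberOrdering and ex:GroupedByMember are hypothetical terms that this profile would have to define, not existing TREE/LDES vocabulary:

```turtle
@prefix tree: <https://w3id.org/tree#> .
@prefix ex:   <http://example.org/> .

# Hypothetical capability flag on the view (terms to be defined by the profile).
ex:view a tree:Node ;
  ex:memberOrdering ex:GroupedByMember .
```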

@sandervd
Author

Perhaps we could create an LDES/protobuf serialization?

@pietercolpaert
Member

Valid point - the biggest problem at the moment is the member extraction algorithm, which takes the full HTTP response as the bounds within which other quads of a member may still be found. We’d need to extend existing serializations to indicate the bounds of a member in order to support streaming.

A protobuf LDES proto schema based on the stream's SHACL shape would indeed be interesting. I’ll see whether we can find a master's thesis for this.
