Motivate streams not ordered by publish time #10

Open
tuukka opened this issue Mar 16, 2021 · 6 comments

tuukka commented Mar 16, 2021

Intuitively, a "stream" refers to a collection of items ordered by the time they were published in the collection, so the stream only grows at its end. The specification seems to treat such streams as an important type of "stream", but not the only relevant one. (There is a note saying "A 1-dimensional fragmentation based on creation time of the immutable objects is probably going to be the most interesting and highest priority fragmentation for an LDES", which then continues "sometimes the back-end of an LDES server cannot guarantee that objects will be published chronologically".)

Should these two separate cases be motivated, perhaps already in the introduction?

@pietercolpaert
Member

@ddvlanck can you answer this one?

@ddvlanck
Collaborator

Hi @tuukka ,

As indicated in the specification, ordering the events by the time they are published in the collection is indeed one of the most interesting fragmentations, because it lets us describe more detailed relations to other pages, so that query agents can easily decide whether it is useful to visit a page. However, the backend system on which the LDES is built may not receive the events at the time they occurred. This is the case with the address and building registry in Flanders, for example, where we can receive events today that already occurred in 2019, due to human error (forgetting to indicate that a change was made), external systems, or just latency.

If we applied a time-based fragmentation in this situation, we would end up with pages that constantly change, and we would lose one of the main advantages of Linked Data Fragments: caching. Therefore, for the address and building registry, we choose to publish the events in the order in which the backend system receives them. That order is never going to change, so each page can be cached. In that situation, however, we lose the ability to describe detailed relations to other pages, because there is no real pattern in the content of the pages (events from 2012 and 2019 can be in the same page). So we just provide the link to the next page (similar to Hydra):

{
    "@id": "http://example.org?page=1",
    "tree:relation": [
        {
            "@type": "tree:Relation",
            "tree:node": "http://example.org?page=2"
        }
    ]
}

I'll update the specification to make it clearer that it is possible to have a Linked Data Event Stream without a time-based fragmentation.
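
As a rough illustration of how a client could consume pages linked only by a "next" relation, here is a minimal Python sketch. It is not taken from the specification or from any LDES client: the `PAGES` dictionary stands in for HTTP responses, and the `walk` function and its `fetch` parameter are hypothetical names. It assumes each page is plain JSON shaped like the snippet above, with `tree:node` given as a string URL.

```python
# Hypothetical in-memory pages standing in for HTTP responses.
PAGES = {
    "http://example.org?page=1": {
        "@id": "http://example.org?page=1",
        "tree:relation": [
            {"@type": "tree:Relation", "tree:node": "http://example.org?page=2"}
        ],
    },
    "http://example.org?page=2": {
        "@id": "http://example.org?page=2",
        "tree:relation": [],  # last page so far: no further relation
    },
}


def walk(start_url, fetch):
    """Return page URLs in visit order, following each tree:relation link once."""
    visited = []
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue  # guard against cycles between pages
        visited.append(url)
        page = fetch(url)
        for relation in page.get("tree:relation", []):
            queue.append(relation["tree:node"])
    return visited


# Assumption: fetch is any callable mapping a URL to a parsed page.
order = walk("http://example.org?page=1", PAGES.__getitem__)
```

Because pages published in arrival order never change once they are full, a client like this could cache every page except the last one and only re-fetch the tail.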

@tuukka
Author

tuukka commented Mar 29, 2021

Thank you for the reply @ddvlanck! This may be a concern of terminology that I'm not familiar with. I would like to understand why you call a collection an event stream even if it does not grow at its end; or conversely, why you don't define the time-based fragmentation based on when the event arrived at the stream as opposed to how a source system dates it.

In your example case, could it make sense to talk of two orderings of the events: one is "logical" (when the event occurred legally?) and another is "physical" (when the event arrived at the stream)? If I understand correctly, it would be possible to expose both as distinct properties of the events and distinct fragmentations of the stream. What's more, every event stream would be able to and could be required to provide both of these (in simple cases, they would be identical). The "physical" dimension would be useful for caching and synchronising, and the "logical" dimension would be necessary to capture the real-world changes represented by the data.
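
The difference between the two orderings can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Event` class, its `occurred` and `arrived` fields, and the sample data are assumptions, not properties defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    occurred: int  # "logical" time: year the change happened (assumed granularity)
    arrived: int   # "physical" time: sequence number assigned on arrival


log = [Event("a", 2012, 1), Event("b", 2020, 2)]
late = Event("c", 2019, 3)  # occurred in 2019 but arrived last


def physical(events):
    """Event names ordered by arrival at the stream."""
    return [e.name for e in sorted(events, key=lambda e: e.arrived)]


def logical(events):
    """Event names ordered by when the change occurred."""
    return [e.name for e in sorted(events, key=lambda e: e.occurred)]


physical_before, physical_after = physical(log), physical(log + [late])
logical_before, logical_after = logical(log), logical(log + [late])

# Physical order only appends: the old sequence is a prefix of the new one.
appends_only = physical_after[:len(physical_before)] == physical_before
# Logical order inserts in the middle, so the old sequence is not a prefix.
inserts_in_middle = logical_after[:len(logical_before)] != logical_before
```

In this sketch the late event leaves the physical sequence untouched apart from its end, while the logical sequence is rewritten in the middle, which is why only the former suits caching and synchronising.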

@pietercolpaert
Member

The core event stream fragmentation should be chosen so that as many pages as possible are immutable and can therefore be cached indefinitely. All other fragmentations, orderings, paginations, and indexes are optional.

@tuukka
Author

tuukka commented Mar 29, 2021

@pietercolpaert Right, so then you should make fragmentation by the "physical" time dimension mandatory in the spec? [Because it's what you need for 100% cache immutability, and it's always possible.]

@pietercolpaert
Member

@tuukka Good point! I think it should be a recommendation! If there were some exotic reason why you couldn't fragment by the physical time dimension, I think the LDES client would still work, just not in the most optimal way!
