[Discussion] Explicitly handle origin in further revisions? #1156

JayPanoz · 2018-10-20T18:46:44Z

Please allow me to open and use this issue as food for thought.

This is much much longer term but it seems to me that not explicitly handling the origin concept has actually been an underlying issue creating compat, interoperability and security issues for all parties involved in the EPUB ecosystem – so users, authors, distributors (???), and Reading Systems.

First things first, I’d vastly prefer not to be familiar with the origin concept and its security models, which can be painful to learn, but they do exist for a lot of reasons so it’s kinda worth putting ~~some~~ outstanding effort into understanding how it works.

That’s to say I know this is a complex issue, but it’s already dealt with ± implicitly in the spec, albeit partially i.e. remote resources, JS guidance, etc.

JS guidance is the most obvious case, as issue #873 might demonstrate. The thing is it’s been 2 years now, and the situation has not improved that much.

To be fair, kudos to the people who did put effort into that; to name a few: iBooks, to which I personally reported a security issue, Daniel Weck @ Readium, or Mantano in Bookari. I’m pretty sure I’m forgetting some people so please be assured this is much appreciated if you did too.

However, EPUB files still share the same origin in an awful lot of apps – i.e. something that is considered a security/authoring issue by other apps. It also has practical consequences for authors: if, say, one file calls localStorage.clear(), all EPUB files will lose the items they previously set (cf. education → quizzes, etc.). Web Storage also has quotas, so having a lot of books using it may create other issues for authors.
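To make the shared-origin problem concrete, here is a minimal sketch that simulates Web Storage with an in-memory map keyed by origin. The `OriginStorage` class, the local server address and the book paths are all assumptions made up for illustration; a real Reading System would rely on the rendering engine's own storage.

```javascript
// Minimal in-memory stand-in for Web Storage, keyed by origin.
// "OriginStorage" and the URLs below are hypothetical, for illustration only.
class OriginStorage {
  constructor() {
    this.areas = new Map(); // origin string -> Map of key/value pairs
  }

  // Return the storage area that a document at `url` would see.
  areaFor(url) {
    const origin = new URL(url).origin;
    if (!this.areas.has(origin)) {
      this.areas.set(origin, new Map());
    }
    return this.areas.get(origin);
  }
}

const storage = new OriginStorage();

// Two different books served by the same (hypothetical) local server.
const quizBook = storage.areaFor("http://127.0.0.1:8080/books/quiz/ch1.xhtml");
const otherBook = storage.areaFor("http://127.0.0.1:8080/books/other/ch1.xhtml");

quizBook.set("quiz-score", "7/10");
otherBook.clear(); // stands in for localStorage.clear() called by the other book

// Both books resolved to the same origin, so the quiz score is gone.
const scoreAfterClear = quizBook.get("quiz-score");
console.log(scoreAfterClear); // undefined
```

Because only the scheme, host and port make up the origin, the differing paths of the two books do nothing to separate their storage.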

This isn’t the only example though. My goal isn’t to make a super extensive list, obviously, but to point out how handling origin could help quite significantly.

By extension, origin could be unique but not persistent i.e. the Reading App assigns a random port @ launch, in which case you’ll have Web Storage spread across 127.0.0.1:3000 and 127.0.0.1:3001 for instance.
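The random-port case can be checked with the standard `URL` API. This is only a sketch: the ports are the ones mentioned above and the path is a made-up example.

```javascript
// An origin is the (scheme, host, port) triple; the path plays no part.
// The path below is hypothetical.
const firstLaunch = new URL("http://127.0.0.1:3000/OEBPS/ch1.xhtml").origin;
const secondLaunch = new URL("http://127.0.0.1:3001/OEBPS/ch1.xhtml").origin;

// Same book, same path, but a different port means a different origin,
// so anything stored during the first launch is not visible in the second.
const sameOriginAcrossLaunches = firstLaunch === secondLaunch;

console.log(firstLaunch); // http://127.0.0.1:3000
console.log(secondLaunch); // http://127.0.0.1:3001
console.log(sameOriginAcrossLaunches); // false
```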

Another issue is when the Reading App doesn’t use a (local) server behind the scenes so the file:// scheme is used, and it comes with severe restrictions in some underlying rendering engines as the whole origin is consequently opaque (e.g. no Web Storage, restricted access to images, audio, video, iframes, etc.).
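As a small illustration of the file:// case, the `URL` API serializes the origin of a file URL as the string "null", which is how an opaque origin is represented. The file path and local server address below are hypothetical, and what exactly gets restricted still depends on the rendering engine.

```javascript
// URL.origin serializes an opaque origin as the string "null".
// The file path and the local server address are made up for illustration.
const fileOrigin = new URL("file:///books/moby-dick/OEBPS/ch1.xhtml").origin;
const httpOrigin = new URL("http://127.0.0.1:3000/OEBPS/ch1.xhtml").origin;

console.log(fileOrigin); // the string "null"
console.log(httpOrigin); // http://127.0.0.1:3000
```

Note that two opaque origins both serialize to "null" but are still not treated as the same origin, so comparing the serialized strings is not a valid same-origin check.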

Speaking of iframes, we have issue #1061, and there’s once again little interop in RSs there, as they decide how to manage external links (e.g. opening the platform’s default browser, emptying the iframe’s content, etc.).

And that can also apply to nodes of content the RS clones to achieve some UX – e.g. footnotes –, which are then injected into another document/a dialog element. Should something you clone from the EPUB file into a “native element” be considered opaque, hence restricted, to make the app more secure or not? (spoiler: yes, but I can’t give more details yet).

Those are a few examples I am familiar with but I’m pretty sure there are others, and when the app is a cloud reader or is using a Webview, the origin concept + its policies apply anyway. So it seems to me the most reasonable option would be to build on top of it, and adjust the model to EPUB whenever needed.

cc @dauwhe as he mentioned that issue in his now-famous thread on twitter.


dauwhe · 2019-04-10T18:03:43Z

Absolutely important. Labeling as deferred as we cannot tackle this for 3.2.

dauwhe · 2021-02-11T20:57:37Z

The following is non-normative:

> Reading Systems need to behave as if a unique domain were allocated to each Content Document, as browser-based security relies heavily on document URLs and domains. Adopting this approach will isolate documents from each other and from other Internet domains, thereby limiting access to external URLs, cookies, DOM storage, etc.

This implies that each content document in an EPUB is in a different origin. Is this what we want?
Should this text say "origin" instead of "domain"?
One major complication is that EPUB content documents are not likely to be top-level browsing contexts--they're probably in iframes.

danielweck · 2021-02-15T10:32:42Z

I think web origin is the correct, well-defined term.
Regarding iframes / sandboxing, I wrote a short piece about Thorium here:
edrlab/thorium-reader#1375
(a few years ago I commented on this "security" topic from the perspective of an older Readium implementation)

iherman · 2021-02-15T12:42:15Z

> The following is non-normative:
>
> > Reading Systems need to behave as if a unique domain were allocated to each Content Document, as browser-based security relies heavily on document URLs and domains. Adopting this approach will isolate documents from each other and from other Internet domains, thereby limiting access to external URLs, cookies, DOM storage, etc.
>
> This implies that each content document in an EPUB is in a different origin. Is this what we want?

Oops. That definitely sounds like a bug to me. I believe each EPUB instance should have a unique domain (with the caveat below): that is how we could really consider an EPUB as a "website in a box". N.B., if the model above were followed, then no relative URL from one content document to the other would work either!

> Should this text say "origin" instead of "domain"?

First of all, if I look at the URL standard then the correct term may have to be a host and not a domain, but even that may not be accurate in practice. If this was followed, and if the RS relies on localhost and then opens up two distinct EPUB instances, then the two would share origins (both being localhost) and that is not necessarily what we want.

But, indeed, if we refer to "origin" as defined in the URL spec, this may be the right term to use.
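The localhost caveat can be sketched as follows. `originTuple` and `sameOrigin` are hypothetical helpers that only cover http(s) URLs (they ignore opaque origins), and the `epub-a.localhost` style hostnames are an assumption about how an RS might separate publications, not something the spec prescribes.

```javascript
// Break an http(s) URL into the (scheme, host, port) tuple that defines its origin.
// Simplification: opaque origins (e.g. file: URLs) are not handled here.
function originTuple(url) {
  const { protocol, hostname, port } = new URL(url);
  return { scheme: protocol.replace(":", ""), host: hostname, port };
}

// Two URLs are same-origin when all three parts of the tuple match.
function sameOrigin(a, b) {
  const x = originTuple(a);
  const y = originTuple(b);
  return x.scheme === y.scheme && x.host === y.host && x.port === y.port;
}

// Two EPUB instances served from paths on the same localhost server share an origin.
const sharedOnLocalhost = sameOrigin(
  "http://localhost:8080/epub-a/ch1.xhtml",
  "http://localhost:8080/epub-b/ch1.xhtml"
);

// With one (hypothetical) hostname per EPUB instance, the origins differ.
const separatedByHost = sameOrigin(
  "http://epub-a.localhost:8080/ch1.xhtml",
  "http://epub-b.localhost:8080/ch1.xhtml"
);

// A relative link inside one EPUB stays within that EPUB's origin.
const nextChapter = new URL("ch2.xhtml", "http://epub-a.localhost:8080/ch1.xhtml").href;
const relativeStaysInside = sameOrigin(nextChapter, "http://epub-a.localhost:8080/ch1.xhtml");

console.log(sharedOnLocalhost, separatedByHost, relativeStaysInside); // true false true
```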

> One major complication is that EPUB content documents are not likely to be top-level browsing contexts--they're probably in iframes.

So? Why is that different from using an iframe in any other content? I guess I do not see the problem.

iherman · 2021-02-15T12:55:51Z

A pair of presentations given by @lrosenthol a few years ago are very much relevant for this discussion:

Both were given at a meeting for the (now defunct) Publishing Interest Group.

mattgarrish · 2021-02-15T20:31:28Z

> I believe each EPUB instance should have a unique domain (with the caveat below): that is how we could really consider an EPUB as a "website in a box".

I think the statement above may be getting read out of context. It's guidance on how strongly to restrict untrusted scripts; it's not a model for serving documents. The note for that bullet says as much:

> In practice, Reading Systems might share domains across documents, but they still need to maintain isolation between documents.

The reason for being so strict is that it would prevent a malicious third-party script from stealing an entire publication or raiding any information shared across documents.

This is why only scripting within an iframe is recommended to be supported in reflowable epubs, as it limits what a script can access. Spine-level scripting is only recommended for fixed layouts, and that's because container scripting isn't terribly realistic.

mattgarrish · 2021-02-18T12:19:22Z

To collect some thoughts on this issue:

Even if we say that reading systems should restrict access to the content of an epub publication, which I'm personally fine with, instead of the content document a script is served from, what does this change practically? We're only talking about informative guidance listing approaches to being secure. (I still don't see this section defining a reality for how content documents themselves are served, only how scripting is locked down.)
We don't require spine-level scripting, which is the first problem authors face. If the past is any indicator, we're not likely to find consensus on changing this.
We don't define a scripting profile that has to be supported when scripting is supported, which has been a long-running question/complaint. The lack of common support for local storage, fetch, etc. compounds the lack of scripting consistency.

If we don't solve the latter issues, I don't think it really matters much what we say in the security section. I expect reading systems that are restrictive of what scripts can do at the spine level will continue to be. I think it's also somewhat inevitable that even if we define a support profile we'll still have to accept that some reading systems will take a more restrictive approach.

iherman · 2021-02-18T14:04:21Z

> To collect some thoughts on this issue:
>
> Even if we say that reading systems should restrict access to the content of an epub publication, which I'm personally fine with, instead of the content document a script is served from, what does this change practically? We're only talking about informative guidance listing approaches to being secure. (I still don't see this section defining a reality for how content documents themselves are served, only how scripting is locked down.)

I do not know whether that is the way current RS-s work. But if each content document is its own origin (which seems to be what the current text says) then scripts running in content documents cannot share data among them using storage API-s. That is why I think the current text is wrong.

And... today it might be informative guidance, but I am really considering whether it should be a MUST for an RS. I.e., an EPUB document should behave in such a way that all its resources have a single, unique origin.

> We don't require spine-level scripting, which is the first problem authors face. If the past is any indicator, we're not likely to find consensus on changing this.

You mean we do not require the capability of spine-level scripting, right? You are probably right that this may not change (we can try...) but we still must specify exactly the origin as above. If a RS does not do scripting at all, it does not change anything for it, so there is no harm for them.

There are browsers that do not do scripting (and there are users who switch scripting off entirely). In the same way, there are RS-s that do not do scripting. Nevertheless, the Web Platform for browsers is defined with an eye on full-blown scripting; there is an analogy here.

> We don't define a scripting profile that has to be supported when scripting is supported, which has been a long-running question/complaint. The lack of common support for local storage, fetch, etc. compounds the lack of scripting consistency.
>
> If we don't solve the latter issues, I don't think it really matters much what we say in the security section.

I do not think I agree. Setting the right framework in terms of origin forms the basis of, maybe, getting to a scripting profile one day (if this is what the community wants later).

> I expect reading systems that are restrictive of what scripts can do at the spine level will continue to be. I think it's also somewhat inevitable that even if we define a support profile we'll still have to accept that some reading systems will take a more restrictive approach.

I would certainly not define a scripting profile by the features it may have; I would prefer to have a set of restrictions instead. The set of restrictions is finite (and maybe small); listing the capabilities may be impossible to handle when new API-s come to the fore every day...

Actually... that there are RS-s that are more restrictive: let the market decide, eventually. I do not think the specification should include restrictions. Instead, the security and privacy sections should list potential security and privacy pitfalls that RS-s may want to be attentive to, and where they may want to restrict things (the slides of @lrosenthol list some examples). Nothing normative there, just informative (as those sections usually are).

That there are RS-s that are more restrictive: let the market decide, eventually. I do not think the specification should include restrictions.

mattgarrish · 2021-02-18T15:57:37Z

> But if each content document is its own origin (which seems to be what the current text says)

It doesn't say to assign a unique domain/origin to each document, though, or even that it's realistic that this can be done. It says for untrusted scripts, isolate them as if they have a unique domain/origin.

It's effectively saying to sandbox content documents from each other.

Is it overzealous about security? I'd say yes. Is it inherently wrong? Not if that's your approach to security.

I have no issue with changing "content documents" to "epub publications" in that paragraph, as I think the threat from untrusted scripts comes within an epub from third party content and authors should be advised to sandbox such content themselves. We shouldn't expect paranoid reading systems just because bad things can happen.

But origins and what scripting APIs are available are not intrinsically linked. Changing the informative guidance isn't going to produce support for the APIs that authors can't currently access. Even making it normative that epubs each have a unique origin won't require reading systems to enable such support.

So while it may clean up a bit of outdated advice, we still face bigger challenges to ever getting to consistent spine-level scripting support.
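For the advice that authors sandbox third-party content themselves, one option is the HTML `sandbox` attribute on an iframe: when `allow-same-origin` is left out, the framed content is placed in an opaque origin. The sketch below only builds the markup; `sandboxedFrame` is a hypothetical helper and `widget.xhtml` is a made-up file name.

```javascript
// Build markup that embeds untrusted content in a sandboxed iframe.
// `sandboxedFrame` is a hypothetical helper, not part of any EPUB or browser API.
function sandboxedFrame(src, { allowScripts = false } = {}) {
  // "allow-same-origin" is deliberately never added, so the framed document
  // gets an opaque origin and cannot reach the embedding document's storage.
  const tokens = allowScripts ? "allow-scripts" : "";
  // Escape the characters that would break out of the attribute value.
  const escaped = src.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return `<iframe src="${escaped}" sandbox="${tokens}"></iframe>`;
}

const scriptedWidget = sandboxedFrame("widget.xhtml", { allowScripts: true });
const staticWidget = sandboxedFrame("widget.xhtml");

console.log(scriptedWidget); // <iframe src="widget.xhtml" sandbox="allow-scripts"></iframe>
console.log(staticWidget); // <iframe src="widget.xhtml" sandbox=""></iframe>
```

Whether a given Reading System honors the attribute inside a content document is a separate question that this sketch does not answer.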

iherman · 2021-02-18T16:12:16Z

> > But if each content document is its own origin (which seems to be what the current text says)
>
> It doesn't say to assign a unique domain/origin to each document, though, or even that it's realistic that this can be done. It says for untrusted scripts, isolate them as if they have a unique domain/origin.
>
> It's effectively saying to sandbox content documents from each other.
>
> Is it overzealous about security? I'd say yes. Is it inherently wrong? Not if that's your approach to security.
>
> I have no issue with changing "content documents" to "epub publications" in that paragraph, as I think the threat from untrusted scripts comes within an epub from third party content and authors should be advised to sandbox such content themselves. We shouldn't expect paranoid reading systems just because bad things can happen.

O.k., we agree on this. But I may also want to say that this is a MUST, i.e., this is how an RS MUST operate: by creating, conceptually, a sandbox, as you called it. Authors should be able to rely on this.

> But origins and what scripting APIs are available are not intrinsically linked. Changing the informative guidance isn't going to produce support for the APIs that authors can't currently access. Even making it normative that epubs each have a unique origin won't require reading systems to enable such support.
>
> So while it may clean up a bit of outdated advice, we still face bigger challenges to ever getting to consistent spine-level scripting support.

That is correct. But how is this different from the browser world? There is a load of APIs defined out there, and web site designers take the risk on whether a specific API is implemented by a specific browser. The same holds for RS-s. The APIs are defined by the Web Platform (unless we want to add our own APIs, but I do not think this is something we would do in 3.3).

We can draw attention to potentially dangerous setups or scripts from a security or privacy point of view (and we should probably do that) in an informative section, but that is as far as the specification should go imho.

mattgarrish · 2021-02-18T16:28:28Z

> But how is this different from the browser world?

It's not. I'm just not sure what all we're trying to solve with this discussion. If it's only the origin question, then I think we're fine.

If we're also trying to solve the issues like being able to share cookies, local storage, etc. between content documents in an epub, I think that's a whole other can of worms and one that we've been trying to solve without a lot of luck for as long as 3.0. If not, then just ignore my ramblings... :)

dauwhe · 2021-02-18T22:36:11Z

> Even if we say that reading systems should restrict access to the content of an epub publication, which I'm personally fine with, instead of the content document a script is served from, what does this change practically? We're only talking about informative guidance listing approaches to being secure. (I still don't see this section defining a reality for how content documents themselves are served, only how scripting is locked down.)

I'm hoping to build a foundation, perhaps for future work. To do that, I think we need to identify some principles, informed by the kinds of things people hope to do in EPUB. Leonard mentioned the idea of a security boundary. To me, the fundamental security boundary should be the EPUB itself--for example, individual content documents in the EPUB should have access to the same local storage. But different EPUBs should not share the same local storage.

I also believe we need to, as much as we can, describe what we expect using the language of the web security model. My straw-person proposal would be to say that each EPUB should act as if it has a single opaque origin. Would this allow the kinds of scripting people want to do, while limiting the damage a bad script can do? I don't know, but I think it's worth exploring.
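As a rough sketch of that boundary, storage could be partitioned by publication rather than by content document. Everything here is hypothetical: `PublicationStorage` is an in-memory stand-in, and the publication identifiers are invented for the example.

```javascript
// One storage area per publication, shared by all of its content documents.
// "PublicationStorage" and the identifiers below are hypothetical.
class PublicationStorage {
  constructor() {
    this.areas = new Map(); // publication identifier -> Map of key/value pairs
  }

  // Every content document of the same publication receives the same area.
  areaFor(publicationId) {
    if (!this.areas.has(publicationId)) {
      this.areas.set(publicationId, new Map());
    }
    return this.areas.get(publicationId);
  }
}

const store = new PublicationStorage();

// Chapter 1 of book A writes a value; chapter 2 of the same book can read it.
store.areaFor("urn:example:book-a").set("last-quiz", "passed");
const seenByChapter2 = store.areaFor("urn:example:book-a").get("last-quiz");

// A different EPUB gets its own, empty area and sees nothing from book A.
const seenByOtherBook = store.areaFor("urn:example:book-b").get("last-quiz");

console.log(seenByChapter2); // passed
console.log(seenByOtherBook); // undefined
```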

I would also note that some of the risk here is not only from malicious scripting, but from poorly written scripts.

```js
// Replaces the whole document with a single paragraph.
document.querySelector('html').innerHTML = '<p>Call me Ishmael.</p>';
```

iherman · 2021-02-19T06:14:50Z

The issue was discussed in a meeting on 2021-02-18

List of resolutions:

Resolution No. 4: Update the informative statement in the core specification about origin from "content document" to "EPUB", and "domain" to "origin"


3. Origin, cont'd

See github issue #873, #1156.

Wendy Reid: this is continuing from last week's meeting

Dave Cramer: i think most of the discussion is in issue 1153
… we've struggled with how to specify scripting in epub
… we've gotten lots of questions from outside the group about how our security model ties in with the security model of the rest of the web
… we have non-normative text in the spec
… leonard has mentioned the concept of security boundaries, with origin being main boundary in Web world
… my opinion is that the text we currently have is wrong
… boundary should be around the epub and not the content document
… e.g. where content documents within an epub want to share a resource
… also, origin is more the concept we're going for, not domain (which is what we currently reference)
… could we say that each epub should be an opaque/unique origin
… even if not particularly testable, at least it is a stake in the ground
… re. how we are trying to fit into the web security model

Leonard Rosenthol: the thing that is most problematic is the difference between actually doing this in a browser with a content hosted on a real domain vs doing this on a device (mobile, desktop, etc.)
… in the device scenario the RS can completely control the origin
… the RS sets up the origin
… so your statements about every epub being its own origin can be done on device, but you can't do that on the web
… so controlling scripts within the context of an origin makes sense in the device scenario, but not in the web scenario
… that's the main issue

Dave Cramer: i hear you
… in the RS i'm aware of that are web-based, there is a pretty big disconnect between what you see in the URL bar and what is actually happening inside the RS
… is it reasonable to ask the RS to follow a stricter set of rules than would be required by the generic web security model
… say Hachette decided to put all their books on a domain, I think it's reasonable to say that if an RS were to do that they need to architect it so that all books aren't on the same origin as each other
… i see this as adding requirements to implementation if the implementation happens to be web-based

Leonard Rosenthol: the problem is that you can't do that
… we tried to do that with sub-origins, but that hasn't been touched since 2017
… never implemented seriously
… never made it through the webapp sec WG
… in your example, all your epubs are originated off the same thing, they would all share the same local storage etc.
… if those are all your books, that's fine, but once that content goes outside, there's no guarantee that books from different publishers won't be able to see each other

Dave Cramer: could you solve that problem with different subdomains for each title?

Leonard Rosenthol: yes, but only in a world where all the epubs come from the same publisher
… e.g. an epub from Patreon uploaded to Dropbox or OneDrive
… that book would have access to all of dropbox

Dave Cramer: you're kind of creating a non-conforming RS in this example

Leonard Rosenthol: that would make all web-based RS non-conforming

Wendy Reid: I think dropbox actually does have an ebook reader....

Leonard Rosenthol: they're probably taking advantage of no scripting then

Wendy Reid: i think the solution that most RS have come to is just to avoid scripting entirely
… easiest way out of the origin problem

Leonard Rosenthol: that doesn't solve other things, e.g. referencing
… trying to reference a font or other resource inside that domain as a relative link
… nothing prevents referencing outside the epub at that point (e.g. ../../)
… and assuming this is served via HTTPS, that gives it a lot more privileges than a non-secure URL
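The relative-reference concern can be illustrated with standard URL resolution. This is a sketch under stated assumptions: `staysInsidePublication` is a hypothetical check, `reader.example` and the paths are invented, and the publication root is assumed to end with a slash.

```javascript
// Hypothetical check: does a reference resolve to a URL under the publication root?
// `publicationRoot` is assumed to end with "/" so the prefix test is reliable.
function staysInsidePublication(reference, documentUrl, publicationRoot) {
  const resolved = new URL(reference, documentUrl); // standard relative resolution
  const root = new URL(publicationRoot);
  return resolved.origin === root.origin && resolved.pathname.startsWith(root.pathname);
}

// Made-up URLs for a web-hosted book.
const chapter = "https://reader.example/books/title-a/OEBPS/ch1.xhtml";
const bookRoot = "https://reader.example/books/title-a/";

// A font referenced relative to the chapter stays inside the book.
const fontInside = staysInsidePublication("fonts/serif.woff2", chapter, bookRoot);

// "../../" climbs out of the book and lands in another title on the same origin.
const escapes = staysInsidePublication("../../title-b/OEBPS/ch1.xhtml", chapter, bookRoot);

console.log(fontInside, escapes); // true false
```

The second reference is still same-origin with the first book, which is why the origin check alone would not flag it.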

Brady Duga: this really seems like a scripting issue
… you have to make sure you don't access things you don't own
… but that isn't an origin issue, that's a rights access issue
… the real problem is storing cookies, and then someone else's book accessing it

Dave Cramer: Jiminy has real world examples of this sort of stuff
… e.g. an epub in ibooks that goes and finds info about other books
… is there anything in the spec right now that says that's bad?

Brady Duga: maybe? It depends on the RS and the content
… e.g. RS for a school, where every student shares every book, that would be okay
… one book might want to check how far a student got in another book
… i.e. not a bad idea in ALL cases

Leonard Rosenthol: if, say, you're building your own software and documents, and you control the entire system there's no reason why you wouldn't want to do it that way

Dave Cramer: one thing to do is go back to our current language
… do we still want to say that every content document in the same epub should belong to a different domain?

Leonard Rosenthol: can probably change that so that each epub is its own origin, like you said earlier

Matt Garrish: the original wording came at a time when we were just starting to open epub to scripting
… we were designing it to be as restrictive as possible
… we've tried to dodge this in the past by limiting where scripting is allowed

Dave Cramer: to me it feels like a little bit of progress if we relax the current language to say "per epub" instead of "per content document"
… this leaves us vulnerable to intra-epub security issues
… but that really seems like more of an authoring problem than a problem with the spec

Brady Duga: right now the spec is more restrictive, but we're already finding examples IRL where RS are not honoring it
… from a testing perspective, it's not clear how this would be implemented

Matt Garrish: depends where we are going with this
… right now that section is only informative, so that's fine
… if we change the section to be normative, then yes, that might be an issue

Dave Cramer: given all that, should we take the baby step of updating the non-normative guidance that the boundary should be "per epub"?
… consensus on this?

Leonard Rosenthol: +1

Matt Garrish: +1

Brady Duga: does that include changing from "domain" to "origin"?

Dave Cramer: yes, i think so
… stuff about port randomization scares me a little bit as someone who wants to do something useful with scripting

Proposed resolution: Update the informative statement in the core specification about origin from "content document" to "EPUB", and "domain" to "origin" (Wendy Reid)

Matt Garrish: +1

Matthew Chan: +1

Leonard Rosenthol: +1

Wendy Reid: +1

Brady Duga: +1

Toshiaki Koike: +1

Ben Schroeter: +1

Resolution #4: Update the informative statement in the core specification about origin from "content document" to "EPUB", and "domain" to "origin"

Wendy Reid: that's everything that was on the agenda tonight

Dave Cramer: i think i do have an action item to talk to TAG about the general ideas around epub security

Wendy Reid: there is most likely going to be a special session at the business group next week about WCAG3
… silver is going to be presenting to business group about WCAG3
… extending the invitation here
… i will send out meeting details on the mailing list
… WCAG3 calls out epub as a standard several times
… probably worth providing our feedback
… meeting date is Tues 23rd, noon Boston time
… AOB?
… no? Thank you everyone, and thank you leonardr!

dauwhe added the Status-Deferred label ("The issue has been deferred to another revision") on Apr 10, 2019

mattgarrish added the Topic-ContentDocs ("The issue affects EPUB content documents") and Cat-Security ("Grouping label for all security related issues") labels on Aug 26, 2020

dauwhe removed the Status-Deferred label ("The issue has been deferred to another revision") on Feb 11, 2021

danielweck mentioned this issue Feb 15, 2021

Programmatic window.location redirection to external URLs, plus other security considerations edrlab/thorium-reader#1375

Closed

iherman mentioned this issue Feb 15, 2021

Draft privacy & security section #1511

Merged

iherman mentioned this issue Feb 19, 2021

What is origin in epub context? #873

Closed

mattgarrish mentioned this issue Feb 19, 2021

Change security best practice to unique origin per publication #1518

Merged

mattgarrish closed this as completed in #1518 Feb 26, 2021

mattgarrish added the EPUB33 label ("Issues addressed in the EPUB 3.3 revision") on Mar 9, 2021

rdeltour mentioned this issue Nov 4, 2021

What base URLs to use for URL parsing in EPUB? #1888

Closed

mattgarrish added the Spec-EPUB3 label ("The issue affects the core EPUB 3.3 Recommendation") on Sep 14, 2022
