Specify speculative HTML parsing (preload scanner) #5624

zcorpan · 2020-06-09T15:08:38Z

From #1349 (comment) , @chrishtr proposes spec edits to explicitly specify a preload scanner (as one part of a larger proposal on behavior for stylesheet loading). But the proposal doesn't say which exact approach to standardize. I believe this is implemented in slightly different ways:

Gecko: https://web.archive.org/web/20201021003137/https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
Chromium: #5624 (comment)
WebKit: ?

Are there documents that explain how this is implemented in Chromium and WebKit?

Is there interest in aligning on the observable behavior here?

cc @whatwg/html-parser @othermaciej @hsivonen @richard-townsend-arm @lilles @mfreed7

annevk · 2020-06-09T17:37:57Z

This is also relevant for whatwg/fetch#590 and rel=preload. I'm definitely in favor of getting this a bit more detailed, especially with how the fetches end up being constructed and how that affects future fetches, etc.

mfreed7 · 2020-06-20T06:16:35Z

Here is a very high-level description of the Chromium preload scanner, from @nyaxt:

At the high-level, there are two classes involved: PreloadScanner and HTMLResourcePreloader:

PreloadScanner takes a decoded string stream from the HTMLDocumentParser, drives a HTMLTokenizer internally to generate the HTML token stream (tokenizer output) and emit PreloadRequests.
HTMLResourcePreloader consumes the PreloadRequests and emits actual preload requests to the core/loader stack.

PreloadScanner consumes the token stream without actually constructing the DOM tree. It mostly only cares about the start tag tokens (% inline stylesheets), and sees if there's an interesting URL to start preloading. Since "false-positive" preload requests (== the urls PreloadScanner noticed but turns out not actually used) are very costly, PreloadScanner logic got more and more complicated over time. For example, it simulates the media query match so that it won't request images of the wrong resolution.

PreloadScanner is invoked for:

{main resource bytes received, document.write inputs} while parser is paused waiting for blocking script to run
scanning possible subresources for NoStatePrefetch feature

+@richard-townsend-arm to comment on whether any of the above has changed with the new synchronous parser

zcorpan · 2020-06-20T07:43:11Z

Thanks! What is NoStatePrefetch?

othermaciej · 2020-06-20T09:08:34Z

I'm not sure it's a good idea to fully specify it. The preload scanner is a heuristic for faster page loading, and it needs to be possible to evolve it as implementations learn new things about performance, and possibly to evolve it in implementation-specific ways.

If observable behavior differences are causing real-world interop problems, then it may make sense to specify constraints on what can be done. Ideally, there would be no preloading that is mandatory, and as few cases as possible that are forbidden. It should be like the http cache, where there are constraints but also enough flexibility to do innovative things like racing disk and network.

That said, I'm not the top expert on this area of WebKit. Tagging @cdumez as a person who may know more and @hober to make sure we follow up internally with relevant folks who are not on GitHub.

mfreed7 · 2020-06-20T23:13:23Z

Thanks! What is NoStatePrefetch?

https://developers.google.com/web/updates/2018/07/nostate-prefetch

hsivonen · 2020-06-22T09:55:39Z

Is there interest in aligning on the observable behavior here?

My understanding is that Gecko's speculative parser is capable of looking further ahead than WebKit's that was inherited into Blink. I'm not interested in making Gecko's speculative parser less capable in order to match WebKit and Blink.

IIRC, the reported interop-sensitive problem has been rather niche and isolated: Trying to do server-side responsive images by setting resolution information in a cookie via JS and being unhappy that the browser already fetched the images whose URLs were in the HTML source. However, with a multiprocess architecture, it's increasingly problematic to give guarantees about when script-set cookies are reflected in the network requests that are made, so I think we shouldn't cater to tricks like this.

See whatwg/html#5624

zcorpan · 2020-07-08T23:05:35Z

@othermaciej thanks for your thoughts. I agree that a full specification might not be desirable. I think the spec can still acknowledge that this optimization exists and specify some rules around how it should behave. This could serve web platform predictability and interoperability with regard to which URLs are speculatively prefetched.

@hsivonen thanks. Not wanting to regress on speculative parser capability is understandable. On the script-set cookie case, that's indeed something that doesn't work well together with the preload scanner optimization in particular.

For observable behavior, apart from the case @hsivonen mentioned, I'm aware of a few other cases that have been a source of complaints from web developers:

unnecessary downloads from <script src> with unsupported type (gecko issue, chromium issue)
unnecessary downloads with <picture> (webkit issue, another webkit issue, chromium issue)
double downloads with <img> when the response is not a valid image (gecko issue)

There are likely more issues, this was from a quick search.

To facilitate the discussion, I've written a few tentative tests today and a test generation script (so it's easy to add more tests): web-platform-tests/wpt#24521

Test results: https://wpt.fyi/results/html/syntax/preload-scanner.tentative/page-load?sha=ed26ec1897&label=pr_head

othermaciej · 2020-07-09T11:27:45Z

Acknowledging that the optimization exists, and perhaps giving constraints that would resolve some specific differences in observable behavior, seems like probably a good idea.

See whatwg/html#5624

zcorpan · 2020-09-03T22:57:41Z

I've written some more tests. Here are some things I learned:

Safari 112 preview:
- img-src-referrerpolicy-no-referrer fails because it includes the Referer header in the speculative fetch.
- picture-source-no-img fails, it speculatively fetches even though there's no img element.
- script-src-unsupported-type fails, it speculatively fetches the script with type=text/plain.
Chrome 86 dev:
- picture-source-br-img fails; it doesn't speculatively fetch when there's a br tag before the img in picture.
Safari and Chrome:
- Don't bother at all with svg or math tree builder rules for foreign content in the speculative parser. They tokenize with knowledge of some tags like xmp being RAWTEXT, but don't differentiate between HTML script or SVG script
- ~~In document.write(): the meta-charset-script-src test fails; a meta charset declaration after a blocking script doesn't affect the encoding of the document.~~ (The test for this had a mistake.)
Firefox 82 nightly:
- SVG script with xlink:href is speculatively fetched, but SVG script with href is not, even though href is normally supported.
- img-srcset and picture-source-br-img are flaky in Firefox for some reason. Other tests are not flaky as far as I can tell.

I've made the pass conditions be assuming the correct behavior would be to fetch (for a selective set of resource kinds) what would be fetched had the real HTML parser continued without the current script doing any "destructive" document.write. This appears to be very close to what Firefox does.

Latest test results: https://wpt.fyi/results/html/syntax/speculative-parsing/generated?label=pr_head&max-count=1&pr=24521

Now, I realize that not all of the cases I've tested have a basis of complaints from web developers. I wanted to find and highlight fundamental differences and possible bugs. What to specify is open for discussion.

yoavweiss · 2020-09-09T08:41:04Z

I think it makes sense to specify the preloader, in order to make it easier to specify things that rely on it:

Link preload headers that also include a media attribute currently wait (In both Chromium and WebKit, IIRC) for the preloader to figure out the right viewport (from <meta viewport> tokens) before preloading the resource in question. Currently this is entirely unspecified.
As Simon saw, error handling can be strictly different between different implementations, which results in potentially different web observed fetches.
As @annevk pointed out, speculative preloads rely on the unspecified "preload cache" (or memory cache in WebKit, which is subtly different). It would be good to define what the behavior is when e.g. a non-cacheable script is speculatively fetched multiple times. IIRC, that would result in a different number of fetches between Chromium and Gecko.
The parser behavior in the face of blocking in-body styles, which started this whole discussion.
We want to be sure (and test) that all browsers don't preload the wrong things when Media Queries are involved (for responsive images, as well as with the media attribute on styles).

While the preload scanner may not be a major point of interop pain, but the differences in behavior are highly observable, through Resource Timing, Service Workers and the servers fielding the actual requests.
Different browsers preload different things potentially (e.g. import styles), in ways that may surprise developers. Defining a baseline may help to reduce this divergence in bahvior and reduce surprises.

I agree with @othermaciej and @hsivonen that we wouldn't want such a specification to limit future optimizations, but at this point, I feel this optimization is mature enough that we could specify its basics, and let user agents innovate on top of that.

zcorpan · 2020-09-09T09:29:14Z

Thanks @yoavweiss!

New info for me to consider and test:

Link preload headers waiting for a meta viewport, potentially found by speculative parser
meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)

yoavweiss · 2020-09-09T09:34:08Z

meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)

I believe that to simplify the implementation, it prevents any speculative fetches period (at least in Chromium). Might be good to see what other implementations are doing, and maybe do something smarter.

zcorpan · 2020-09-09T13:01:37Z

Here are two demos with meta CSP:

From what I can tell with the Network tab devtools, in Safari and Firefox, both cases speculatively fetch the image, ignoring the CSP meta directives. Chrome doesn't fetch the image for either case.

hsivonen · 2020-09-09T13:11:23Z

SVG script with xlink:href is speculatively fetched, but SVG script with href is not, even though href is normally supported.

Filed. Thanks! Also: The name of the image test suggests that it's testing src rather than href. The spec says href. Apparently the tests haven't landed yet, so I failed to locate the test source to check.

zcorpan · 2020-09-09T14:49:07Z

@hsivonen thank you!

The tests are here web-platform-tests/wpt#24521

There are multiple image tests, both svg-image-href and svg-image-src (the latter expects no speculative fetch to happen).

zcorpan · 2020-09-29T14:42:12Z

I looked into trying to specify the integration points of speculative parsing in the spec in #5959 , and I have some questions.

Specifically, I'm a bit confused about how to specify speculative parsing for the document.write string. From irc:

I don't follow how what the spec does for <script src=X></script> where X is: document.write("<script src=Y></script><img src=Z>");
first there's https://html.spec.whatwg.org/multipage/parsing.html#scriptEndTag
"Prepare" will mark the script as the pending parser-blocking script, for the first script
"script nesting level" is zero, so the Otherwise steps are run
so far so good. This is where I want to kick off the (first) speculative parser
but then the script is loaded and executed (step 10), and the document.write will invoke the tree builder reentrantly, reaching this part again but with "script nesting level" being 1
where the spec says "Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)"
and that's what I don't get. how is the Y script loaded and executed?
Where should I kick off a second speculative parser to find the img in the document.write string

chrishtr · 2020-09-29T18:18:01Z

Isn't it that once we're back in the "outer" tree construction stage:

Tokenization resumes, getting to the <script src=Y> block, which then blocks the parser on loading and and then executing Y.
In the meantime, the preload scanner either resumes if it hasn't yet gotten to the place affected by document.write, or starts again at the document-write-affected area.

zcorpan · 2020-09-30T07:48:56Z

I thought there was special handling for document.write with the speculative parser.

In particular, the document write steps insert input into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse input without special handling.

chrishtr · 2020-10-01T14:49:20Z

I thought there was special handling for document.write with the speculative parser.

Are you speaking about the Chromium implementation?

In particular, the document write steps insert input into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse input without special handling.

Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?

mfreed7 · 2020-10-02T05:57:29Z

Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?

I think it’s more that the speculative parser might have to “rewind” back to the insertion point, since it might be past it.

zcorpan · 2020-10-02T07:57:42Z

@mfreed7 right. It will be past it as soon as it has speculatively parsed a single character.

It's possible that Gecko has a different strategy with document.write compared to Chromium/WebKit. From https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading

nsHtml5Parser has one tokenizer/tree builder pair for parsing data from document.write(). It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of document.write() data when the parser blocks without parsing the document.write() data to completion.

zcorpan · 2020-10-29T16:49:37Z

Draft spec for review

There is now a draft specification for review in #5959. (Preview of the generated spec)

At a high level, the optimization itself is optional, but if it's implemented, it needs to speculatively parse HTML as if it was following the normal HTML parser rules, and only fetch resources that would be fetched from normal processing (if the blocking script does nothing). Which kinds of resources to fetch is implementation-defined.

TPAC discussion

We discussed this topic a few days ago in the WHATWG TPAC breakout session. Minutes here: https://www.w3.org/2020/10/26-whatwg-minutes.html#preload

Commentary inline:

zcorpan: I've been working on this. Tests demonstrate interop issues.

zcorpan: document.write() external scripts cause performance problems. So the speculative parser tries to help. Gecko vs. Chromium/WebKit: Gecko more faithfully uses the real HTML parser to build a speculative tree. Chromium/WebKit only use the tokenizer, with some tree-building knowledge (e.g. know about script elements and style elements; don't know about SVGNS). This allows generating test cases to show interop differences. Don't have data on the web developer pain these interop causes, but it'd be useful to reach interop. Question is: how much detail do we want in the spec, and which behavior do we want to specify? Also we need to speculatively parse the document.write() string if you're writing an external script.

hsivonen: have you found any cases where "the right thing" differs from "what Firefox does"?

zcorpan: document.write("<meta charset>") plus a later document.write("<meta charset>") which invalidates that does not work.

This about meta charset was both minuted inaccurately and I made a mistake in my testing which I've now fixed. There are a couple of interesting bits about meta charset in Gecko's speculative parser, though:

it is ignored by the speculative parser, i.e. it does not change the encoding of the document or the URL character encoding of speculative fetches. (This only makes a difference when the page is doing something "wrong", either using a late encoding declaration, or document.write()ing an encoding declaration.)
in the "page-load" test for this (/html/syntax/speculative-parsing/generated/page-load/meta-charset-script-src.tentative.html), Firefox does not speculatively fetch the script src after the encoding declaration. I asked @hsivonen about this in IRC, but he didn't have an answer to why this happens. (The "document-write" test does speculatively fetch the script.)

smfr: reiterate othermaciej's comments in the issue that this is just an optimization done by UAs, and maybe it's not necessary to fully specify. If there are interop issues then maybe the spec should say enough to address those, but we're not convinced that it needs to be fully specified.

emilio: if we know of authors relying on particular bits, it'd be useful to document them; it's not clear whether it needs to be normative or formal.

zcorpan: so like a bullet point set of requirements or expectations

hsivonen: my main concern is not speccing things so that the less-accurate version is correct and Firefox's more-accurate version is incorrect. So either spec the correct thing or leave it hand-wavey enough that the correct version is allowed.

zcorpan: Another Firefox is not totally correct issue: it doesn't respect CSP. In Chromium the presence of CSP meta entirely disables the speculative parser. (WebKit aligns with Firefox and ignores CSP.)

emilio: that seems like a bug we could probably fix. Is there a bug on file?

Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=1673407

zcorpan: I haven't filed one. I do envision listing this as a requirement in the spec and filing a bug.

zcorpan: subtopic: the speculative cache. Ideally the spec would say something about that.

emilio: Gecko reuses the existing caches. Per-document stylesheet cache, image cache.

zcorpan: annevk brought this up in #5624.

(some discussion about stylesheet cache, subtleties, how it doesn't exist in the spec)

hsivonen: we have a hash table of URLs that have been speculatively fetched. So e.g. <img> and <script> pointing to the same URL won't speculatively fetch the same URL twice.

I haven't specified this hash table.

See whatwg/html#5624

Fixes #5624.

See whatwg/html#5624

Fixes #5624.

ke1vin4real · 2021-09-18T12:23:02Z

@mfreed7 Is there any article introduced PreloadScanner?

yoavweiss · 2021-09-18T12:42:14Z

I wrote this article many years ago, but I believe it's still mostly relevant

ke1vin4real · 2021-09-18T13:05:06Z

@yoavweiss Thanks

…e HTML parsing, a=testonly Automatic update from web-platform-tests HTML: Add tentative tests for speculative HTML parsing See whatwg/html#5624 -- wpt-commits: 9afacb73a04c1f33837eb0a7ffcd9ec16c9d477f wpt-pr: 24521

See whatwg/html#5624

Fixes whatwg#5624.

zcorpan added needs implementer interest Moving the issue forward requires implementers to express interest topic: parser labels Jun 9, 2020

zcorpan self-assigned this Jun 9, 2020

annevk added the topic: fetch label Jun 9, 2020

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020

HTML: Add tentative tests for preload scanner

1f6de08

See whatwg/html#5624

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020

HTML: Add tentative tests for preload scanner

5261717

See whatwg/html#5624

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020

HTML: Add tentative tests for preload scanner

dfa72e4

See whatwg/html#5624

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Jul 8, 2020

HTML: Add tentative tests for preload scanner

a69d3b1

See whatwg/html#5624

zcorpan mentioned this issue Jul 8, 2020

HTML: Add tentative tests for speculative HTML parsing web-platform-tests/wpt#24521

Merged

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 2, 2020

HTML: Add tentative tests for preload scanner

d1685bd

See whatwg/html#5624

annevk mentioned this issue Sep 10, 2020

Speculative Request Control privacycg/proposals#19

Open

zcorpan removed the needs implementer interest Moving the issue forward requires implementers to express interest label Sep 14, 2020

zcorpan mentioned this issue Sep 29, 2020

Define speculative HTML parsing #5959

Merged

3 tasks

zcorpan mentioned this issue Sep 30, 2020

TPAC 2020: HTML Standard meeting interest #5856

Closed

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Oct 29, 2020

HTML: Add tentative tests for preload scanner

70d25e7

See whatwg/html#5624

zcorpan added a commit that referenced this issue May 20, 2021

Define speculative HTML parsing

d65ab07

Fixes #5624.

rik mentioned this issue Jul 12, 2021

Add render blocking post siakaramalegos/sia.codes-eleventy#12

Merged

zcorpan added a commit that referenced this issue Aug 11, 2021

Define speculative HTML parsing

0dce7bb

Fixes #5624.

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Aug 31, 2021

HTML: Add tentative tests for preload scanner

47fa708

See whatwg/html#5624

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 8, 2021

HTML: Add tentative tests for preload scanner

8c6d3bc

See whatwg/html#5624

zcorpan added a commit to web-platform-tests/wpt that referenced this issue Sep 14, 2021

HTML: Add tentative tests for speculative HTML parsing

9afacb7

See whatwg/html#5624

zcorpan closed this as completed in #5959 Sep 14, 2021

zcorpan added a commit that referenced this issue Sep 14, 2021

Define speculative HTML parsing

92a152c

Fixes #5624.

Gabisampaio pushed a commit to Gabisampaio/wpt that referenced this issue Nov 18, 2021

HTML: Add tentative tests for speculative HTML parsing

66cc027

See whatwg/html#5624

mfreed7 pushed a commit to mfreed7/html that referenced this issue Jun 3, 2022

Define speculative HTML parsing

815e369

Fixes whatwg#5624.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify speculative HTML parsing (preload scanner) #5624

Specify speculative HTML parsing (preload scanner) #5624

zcorpan commented Jun 9, 2020 •

edited

Loading

annevk commented Jun 9, 2020

mfreed7 commented Jun 20, 2020

zcorpan commented Jun 20, 2020

othermaciej commented Jun 20, 2020

mfreed7 commented Jun 20, 2020

hsivonen commented Jun 22, 2020

zcorpan commented Jul 8, 2020

othermaciej commented Jul 9, 2020

zcorpan commented Sep 3, 2020 •

edited

Loading

yoavweiss commented Sep 9, 2020

zcorpan commented Sep 9, 2020

yoavweiss commented Sep 9, 2020

zcorpan commented Sep 9, 2020

hsivonen commented Sep 9, 2020 •

edited

Loading

zcorpan commented Sep 9, 2020

zcorpan commented Sep 29, 2020

chrishtr commented Sep 29, 2020 •

edited

Loading

zcorpan commented Sep 30, 2020

chrishtr commented Oct 1, 2020

mfreed7 commented Oct 2, 2020

zcorpan commented Oct 2, 2020

zcorpan commented Oct 29, 2020

ke1vin4real commented Sep 18, 2021

yoavweiss commented Sep 18, 2021

ke1vin4real commented Sep 18, 2021

Specify speculative HTML parsing (preload scanner) #5624

Specify speculative HTML parsing (preload scanner) #5624

Comments

zcorpan commented Jun 9, 2020 • edited Loading

annevk commented Jun 9, 2020

mfreed7 commented Jun 20, 2020

zcorpan commented Jun 20, 2020

othermaciej commented Jun 20, 2020

mfreed7 commented Jun 20, 2020

hsivonen commented Jun 22, 2020

zcorpan commented Jul 8, 2020

othermaciej commented Jul 9, 2020

zcorpan commented Sep 3, 2020 • edited Loading

yoavweiss commented Sep 9, 2020

zcorpan commented Sep 9, 2020

yoavweiss commented Sep 9, 2020

zcorpan commented Sep 9, 2020

hsivonen commented Sep 9, 2020 • edited Loading

zcorpan commented Sep 9, 2020

zcorpan commented Sep 29, 2020

chrishtr commented Sep 29, 2020 • edited Loading

zcorpan commented Sep 30, 2020

chrishtr commented Oct 1, 2020

mfreed7 commented Oct 2, 2020

zcorpan commented Oct 2, 2020

zcorpan commented Oct 29, 2020

Draft spec for review

TPAC discussion

ke1vin4real commented Sep 18, 2021

yoavweiss commented Sep 18, 2021

ke1vin4real commented Sep 18, 2021

zcorpan commented Jun 9, 2020 •

edited

Loading

zcorpan commented Sep 3, 2020 •

edited

Loading

hsivonen commented Sep 9, 2020 •

edited

Loading

chrishtr commented Sep 29, 2020 •

edited

Loading