I think we need to extend the PEG to specify the lower-layer protocols explicitly (i.e., chain multiple schemes together), especially since HTTP can now also run over UDP and more and more software uses HTTP as a transport/tunneling protocol. The current parsing spec is not flexible enough, and the resulting redundancies and limitations have already led to incompatible implementations (for example of UNIX domain sockets, or the "http+socket://" notation). These differences can cause security vulnerabilities when one piece of software gets "glued" to another and the WAF, frontend, and backend servers each parse a URL differently.
So far, I think the following syntax would cover all of these (new) challenges and provide the needed flexibility while still being backward compatible with the current one:
LowestLayer+HigherLayer+EvenHigherLayer://[[username]:[password]@EvenHigherLayerEndpointIdentifier]:[HigherLayerEndpointIdentifier]:[LowerLayerEndPointIdentifier]/resource
Square brackets around each attribute are optional, and the lower layers get default values if the URL does not specify them explicitly. Implementations are recommended to offer a strict parsing mode that does not try to guess anything and only accepts URLs with square brackets around every attribute and explicitly provided data (no implied application ports, no implied lower-layer protocols, ...), mainly for security, future-proofing, and reliability when used by scripts and automation, as well as for debuggability by experts and prosumers. Multiple (chained) endpoint identifiers are only allowed in the verbose version (to avoid parsing bugs and ambiguity), and the number of endpoint identifiers must match the number of specified layers 1:1 (but in reversed order).
The current username:password@ would explicitly become part of the part that specifies the HTTP endpoint, so that, for example, each layer can have its own independent login information or additional protocol-specific information; we would just hand it off as an opaque blob to the protocol that the scheme specifies.
Examples:

| Stack | Example URI | Comment |
| --- | --- | --- |
| TCP => HTTP | tcp+http://[example.com]:[80] | HTTP via TCP, no probing and no switching to e.g. UDP |
| UDP => HTTP | udp+http://[example.com]:[80] | HTTP via UDP, no probing and no switching to e.g. TCP |
| TCP => TLS => HTTP | tcp+tls+http://[example.com]:[example.com]:[443] | As a side effect, having two schemes specified for the TLS-wrapped version is no longer necessary, but keeping the already added/specified ones for backward compatibility isn't an issue anyway |
| TCP => TLS => HTTP | tcp+https://[example.com]:[443] | Same, but with HTTPS instead of "tls+http" |
| UDP => HTTP => TCP | udp+http+tcp://[48569]:[example.com]:[80] | Specifies a raw TCP stream that is tunneled through HTTP, which itself is served via a UDP connection |
| IP => TCP => TLS => HTTP | ip+tcp+tls+http://[example.com]:[example2.com]:[443]:[2001:db8::1]/foo | An IP connection to 2001:db8::1 is established that contains a TCP connection to port 443, which contains a TLS connection¹, with HTTP inside it and the SNI header of example.com |
| HTTP => HTTP | http+http://[example.com]:[username:password@example2.com] | Authenticating against example2.com to use it as an HTTP proxy to connect to example.com; also avoids the current ambiguity of whether the credentials are for the destination or the proxy |
| Socket => HTTP | socket+http://[example.com]:[/run/foo/bar.sock]/foobar | |
| File => HTTP | file+http://[example.com]:[/run/foo/bar.sock]/foobar | Same, but using file as the scheme, for an even more generic approach, as other things besides Unix sockets that are accessed like files could also be used, e.g. a (virtual) serial device |

¹: With an explicitly specified hostname example2.com to use for certificate validation. Web browsers should throw an error if this differs from the HTTP SNI, and that error should only be disableable in the options, not from the error message itself. That is application behavior, though, and shouldn't be part of the PEG: for CLI tools, debugging and development, or web proxies like those universities use for off-campus online access to journals etc., such a mismatch is very much desirable.
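To make the strict mode more concrete, here is a minimal sketch of how a parser for the chained notation might look. It is only an illustration under my own assumptions, not part of any spec: the names `Layer`, `_read_bracketed` and `parse_chained_strict` are hypothetical, every identifier must be wrapped in square brackets, nested brackets are kept as an opaque blob, and nothing is guessed or defaulted.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    scheme: str    # e.g. "tcp", "tls", "http"
    endpoint: str  # opaque identifier, handed to that layer unparsed


def _read_bracketed(text: str, pos: int) -> tuple[str, int]:
    """Return (content, index after the closing bracket) of a balanced [...] group."""
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return text[pos + 1:i], i + 1
    raise ValueError("unbalanced square brackets")


def parse_chained_strict(url: str) -> tuple[list[Layer], str]:
    """Parse lowest+...+highest://[highest id]:...:[lowest id]/resource (hypothetical strict mode)."""
    scheme_part, sep, rest = url.partition("://")
    schemes = scheme_part.split("+")
    if not sep or not all(schemes):
        raise ValueError("expected scheme1+scheme2+...:// at the start")

    endpoints = []
    pos = 0
    while rest.startswith("[", pos):
        content, pos = _read_bracketed(rest, pos)
        endpoints.append(content)
        if rest.startswith(":[", pos):
            pos += 1  # skip the ':' that separates two identifiers
        else:
            break

    resource = rest[pos:]
    if resource and not resource.startswith("/"):
        raise ValueError("unexpected data after the endpoint identifiers")
    if len(endpoints) != len(schemes):
        raise ValueError("strict mode needs exactly one identifier per layer")

    # Identifiers are written highest layer first, schemes lowest layer first.
    layers = [Layer(s, e) for s, e in zip(schemes, reversed(endpoints))]
    return layers, resource


layers, resource = parse_chained_strict(
    "ip+tcp+tls+http://[example.com]:[example2.com]:[443]:[2001:db8::1]/foo"
)
for layer in layers:
    print(layer.scheme, layer.endpoint)
print(resource)
```

Scanning for balanced brackets instead of splitting on ":" is what lets IPv6 addresses, socket paths and credentials pass through untouched, so each layer receives its identifier as an opaque string.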
This extension (or, admittedly, proposal for a new version of the PEG) is my preferred improvement, as it does not break the independence of the different protocols and allows extensibility, debuggability, and clarity (no ambiguity and no security vulnerabilities from parsing "trickery"). But if breaking backward compatibility is not an issue (e.g. because we can easily detect the "parser spec version"), then I'd prefer this alternative:
LowestLayer[[LowerLayerEndPointIdentifier]]+HigherLayer[[HigherLayerEndpointIdentifier]]+EvenHigherLayer[[username]:[password]@EndpointIdentifier]://resourcePath
Example: ip[2001:db8::1]+tcp[80]+http[example.com]://index.html (or maybe, for backward compatibility, via a new scheme in the existing style, like uri2://ip[2001:db8::1]+tcp[80]+http[example.com]/index.html)
This changes the syntax completely to have the endpoint identifiers right after the scheme part. It is cleaner and simpler to implement a parser for, but it is a drastic and breaking change to the current syntax, and it is not currently used (not even in a similar fashion) by any available implementation I'm aware of.
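To illustrate the "simpler to implement a parser for" point, a rough sketch of a parser for this alternative notation could look like the following. The function name `parse_layered` and the rule for what counts as a layer name (letters, digits, "." and "-", but no "+") are assumptions made for this example only.

```python
import re

# Assumed layer-name rule: like a URL scheme, but without "+",
# because "+" separates the layers in this notation.
_LAYER_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def parse_layered(url: str) -> tuple[list[tuple[str, str]], str]:
    """Parse name[id]+name[id]+...://resource into ([(name, id), ...], resource)."""
    layers = []
    pos = 0
    while True:
        match = _LAYER_NAME.match(url, pos)
        if match is None or not url.startswith("[", match.end()):
            raise ValueError(f"expected 'name[' at offset {pos}")

        # Find the bracket that closes this identifier, allowing nested brackets.
        depth = 0
        for end in range(match.end(), len(url)):
            if url[end] == "[":
                depth += 1
            elif url[end] == "]":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced square brackets")

        layers.append((match.group(), url[match.end() + 1:end]))
        pos = end + 1

        if url.startswith("://", pos):
            return layers, url[pos + 3:]
        if not url.startswith("+", pos):
            raise ValueError(f"expected '+' or '://' at offset {pos}")
        pos += 1


layers, resource = parse_layered("ip[2001:db8::1]+tcp[80]+http[example.com]://index.html")
print(layers)
print(resource)
```

Because every identifier sits directly next to its layer name, the parser is a single left-to-right loop and never has to match two lists against each other in reversed order.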
Limitations: Changes to the parsing spec will take a long time to reach clients, as every parser has to be updated. But the status quo of "everyone doing their own thing" to work around these limitations is worse than making a cut and having a v2, especially because backward compatibility can be implemented, so "switching the parser" should be manageable in almost all cases. Alternatively, the current syntax could serve as a "shorthand" or "auto-detect" mode of the new one, which should mainly be used in automation and pro-user use cases anyway (so it would not be visible to the "average end user" who just wants to browse Amazon or Facebook)...
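As a sketch of the "shorthand" idea, the snippet below expands a present-day http/https URL into the verbose chained form. The default table (`http` as TCP port 80, `https` as TCP plus TLS on port 443, with the TLS identifier equal to the host) and the function name `expand_shorthand` are assumptions for illustration; the proposal itself does not fix these defaults. Credentials are rejected rather than guessed at, since their placement in the verbose form is an open question.

```python
from urllib.parse import urlsplit

# Assumed defaults: (layers from lowest to highest, default port).
_DEFAULTS = {
    "http": (["tcp", "http"], 80),
    "https": (["tcp", "tls", "http"], 443),
}


def expand_shorthand(url: str) -> str:
    """Rewrite e.g. https://example.com/x as tcp+tls+http://[example.com]:[example.com]:[443]/x."""
    parts = urlsplit(url)
    if parts.scheme not in _DEFAULTS:
        raise ValueError(f"no defaults known for scheme {parts.scheme!r}")
    if parts.username is not None or parts.password is not None:
        raise ValueError("credentials are not handled in this sketch")
    if not parts.hostname:
        raise ValueError("URL has no host")

    layers, default_port = _DEFAULTS[parts.scheme]
    port = parts.port if parts.port is not None else default_port

    # One identifier per layer, highest layer first: the host name for every
    # layer above TCP, then the TCP port.
    identifiers = [parts.hostname] * (len(layers) - 1) + [str(port)]

    resource = parts.path
    if parts.query:
        resource += "?" + parts.query
    if parts.fragment:
        resource += "#" + parts.fragment

    return "+".join(layers) + "://" + ":".join(f"[{i}]" for i in identifiers) + resource


print(expand_shorthand("https://example.com/index.html"))
print(expand_shorthand("http://example.com:8080/a?b=1"))
```

A parser could run such an expansion first and then feed the result to a strict-mode parser, so that the guessing happens in exactly one place.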
I'm not sure what PEG is, but it's highly unlikely we'll change the URL parser, especially so drastically. The motivation is also somewhat hard to digest and not necessarily convincing, as browsers have adopted UDP for HTTP without a need for a change in scheme.
More things than web browsers exist, and the URL syntax is not just for browsers. Also, just a few days ago someone on Twitter asked how CORS is supposed to work with HTTP via a Unix socket, so the limitations are real.
And for web browsers/users the change wouldn't be noticeable, as the current syntax would remain as a shorthand notation, whereas the newer, more explicit one would allow all the use cases that currently aren't standardized and that everyone implements in a hundred different ways...
Address: #749, #577, ...