Anne van Kesteren

Undue base URL influence

The URL parser has many quirks due to its origins in a time where conformance test suites were atypical and implementation requirements were hidden in the examples section. Some consider these quirks deeply problematic, but personally I don’t really mind that one can write a hundred slashes after a scheme instead of two and get identical results. Sure, it would be better if that were not the case, but in the end it is something that is normalized away and therefore does not impact the fundamental aspects of the URL ecosystem.

I was reminded the other day that there is one quirk however that does yield rather undesirable results. In particular for certain (non-conforming) inputs, the result will not be failure, but the exact URL returned will depend on the presence and type of base URL. This might be best explained with examples:

InputBase URL (serialized)Output (serialized)
https:testhttps://test/
https:testhttp://example/https://test/
https:testhttps://example/https://example/test
hello:testhello:test
hello:testbye://example/hello:test
hello:testhello://example/hello:test

This quirk only impacts so-called special schemes, which include http and https. And only when they match between the input and base URL. As a user of URLs you could work around this quirk by first parsing without a base URL and only if that returns failure, parse a second time with a base URL. That does have the unfortunate side effect of being inconsistent with the web platform (for non-conforming input), but depending on your use case that might be okay.

I remember looking into whether this could be removed completely many years ago, but websites relied on it and end users trump theory.