The Encoding specification maps ISO-8859-1 to windows-1252 and expects
the windows-1252 translation table to be used, which differs from
ISO-8859-1 for 0x80-0x9F.
Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1
mapping to U+0000-U+00FF, when requesting it.
`decoder_for_exact_name` is introduced, which skips the mapping from
aliases to the encoding name done by `get_standardized_encoding`.
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.
This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
Using char causes bytes equal to or over 0x80 to be treated as a
negative value and produce incorrect results when implicitly casting to
u32.
For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80 and thus produces
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].
Required by Cloudflare's IUAM challenges.
There were two problems:
1. They didn't handle surrogates
2. They used signed chars, leading to eg 0x00e4 being treated as 0xffe4
Also add a basic test that catches both issues.
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.
One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)