This wasn't particularly difficult, and there's not much use for the
nicer interface yet either. While unveil() is of limited use in js(1)
as it should be able to open arbitrary files, I feel like we should be
able to add a pledge() call.
As noted by ECMA-402, if a supported locale contains all of a language,
script, and region subtag, then the implementation must also support the
locale without the script subtag. The most complicated example of this
is the zh-TW locale.
The list of locales in the CLDR database does not include zh-TW or its
maximized zh-Hant-TW variant. Instead, it includes the zh-Hant locale.
However, zh-Hant-TW is listed in the default-content locale list in the
cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We
must then also support the zh-Hant-TW alias without the script subtag:
zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite
heavily tested by test262.
Previously, we were just copying the locale data into default-content
locales (for example, copying the "en" data into "en-US"). Instead, we
can just define the default-content locales as aliases to their main
locales.
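Conceptually, the lookup now just follows an alias entry before loading any locale data. A rough sketch with standard containers and hypothetical names (the real entries are generated from the CLDR default-content list):

    #include <string>
    #include <unordered_map>

    // Hypothetical alias table; the generator emits the real one from the CLDR
    // (e.g. "en-US" -> "en", "zh-Hant-TW" -> "zh-Hant", "zh-TW" -> "zh-Hant").
    static const std::unordered_map<std::string, std::string> locale_aliases {
        { "en-US", "en" },
        { "zh-Hant-TW", "zh-Hant" },
        { "zh-TW", "zh-Hant" },
    };

    std::string resolve_locale_alias(std::string const& locale)
    {
        auto it = locale_aliases.find(locale);
        return it != locale_aliases.end() ? it->second : locale;
    }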
This will be used for locale aliases as well. Also rename the "property"
field in this struct to "name", as it is no longer used only for
property aliases.
Also add slightly richer parse errors now that we can include a string
literal with returned errors.
This will allow us to use TRY() when working with JSON data.
This wasn't the case for compact patterns, but unit patterns can contain
multiple (up to 2, really) identifiers that must each be recognized by
LibJS.
Each generated NumberFormat object now stores an array of identifiers
parsed. The format pattern itself is encoded with the index into this
array for that identifier, e.g. the compact format string "0K" will
become "{number}{compactIdentifier:0}".
This field is currently used to store the StringView into the compact
name/symbol in the format string. Units will need to store a similar
field, so rename the field to be more generic, and extract the parser
for it.
The compact scale of each formatting rule was precomputed in commit:
be69eae651
Using the formula: compact scale = magnitude - pattern scale
This computation was off-by-one.
For example, consider the format key "10000-count-one", which maps to
"00 thousand" in en-US. What we are really after is the exponent that
best represents the string "thousand" for values greater than 10000
and less than 100000 (the next format key). We were previously doing:
log10(10000) - "00 thousand".count("0") = 2
Which clearly isn't what we want. Instead, if we do:
log10(10000) + 1 - "00 thousand".count("0") = 3
We get the correct exponent for each format key for each locale.
This commit also renames the generated variable from "compact_scale" to
"exponent" to match the terminology used in ECMA-402.
For example, in en-US, the decimal, long compact pattern for numbers
between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand"
is the compact identifier, and the generated format pattern is now
"{number} {compactIdentifier}". This also generates that identifier as
its own field in the NumberFormat structure.
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
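A rough sketch of how the two sizes are applied when inserting separators (standard C++, hypothetical helper):

    #include <cstddef>
    #include <string>

    // Group integer digits using the primary size for the least significant
    // group and the secondary size for the rest. With primary = 3 and
    // secondary = 2, "12345678" becomes "1,23,45,678".
    std::string group_digits(std::string digits, size_t primary, size_t secondary, char separator = ',')
    {
        if (digits.size() <= primary)
            return digits;
        auto position = digits.size() - primary;
        digits.insert(position, 1, separator);
        while (position > secondary) {
            position -= secondary;
            digits.insert(position, 1, separator);
        }
        return digits;
    }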
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array returned by formatToParts will not contain the
expected currency key.
Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.
Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:
en: "¤#,##0.00"
ar: "¤\u00A0#,##0.00"
Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.
ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
For example, there isn't a unique set of data for the en-US locale;
rather, it defaults to the data for the en locale. See this commit for
much more detail: 357c97dfa8
These are used when formatting a number as currency with a display
option of "name" (e.g. for USD, the name is "US Dollars" in en-US).
These patterns appear in the CLDR in a different manner than other
number formats that are pluralized. They are of the form "{0} {1}", and
therefore do not undergo subpattern replacements.
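Substituting into such a pattern is then a plain string replacement; a minimal sketch (standard C++, hypothetical helper):

    #include <string>

    // Fill a CLDR "{0} {1}" pattern, e.g. ("{0} {1}", "1,234.56", "US dollars")
    // yields "1,234.56 US dollars".
    std::string fill_currency_unit_pattern(std::string pattern, std::string const& number, std::string const& display_name)
    {
        auto replace = [&](std::string const& key, std::string const& value) {
            auto position = pattern.find(key);
            if (position != std::string::npos)
                pattern.replace(position, key.size(), value);
        };
        replace("{0}", number);
        replace("{1}", display_name);
        return pattern;
    }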
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.
Note: There are some discrepancies between the engines on how style is
handled. In particular, running:
new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')
Gives:
SpiderMoney: "USD"
V8: "US Dollar"
LibJS: "$"
And running:
new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')
Gives:
SpiderMonkey: "$"
V8: "US Dollar"
LibJS: "$"
My best guess is V8 isn't handling style, and just returning the long
form (which is what LibJS did before this commit). And SpiderMonkey can
handle some styles, but if they don't have a value for the requested
style, they fall back to the canonicalized code passed into of().
libc++ uses a Pthread condition variable in one of its initialization
functions. This means that Pthread forwarding has to be set up in LibC
before libc++ can be initialized. Also, because LibPthread is written in
C++, (at least some) parts of the C++ standard library have to be linked
against it.
This is a circular dependency, which means that the order in which these
two libraries' initialization functions are called is undefined. In some
cases, libc++ will come first, which will then trigger an assert due to
the missing Pthread forwarding.
This issue isn't necessarily unique to LibPthread, as all libraries that
libc++ depends on exhibit the same circular dependency issue.
The reason why this issue didn't affect the GNU toolchain is that
libstdc++ is always linked statically. If we were to change that, I
believe that we would run into the same issue.
The data used for number formatting is going to grow quite a bit when
the cldr-units package is parsed. To prevent the generated UnicodeLocale
file from growing outrageously large, the number formatting data can go
into its own file. To prepare for this, move code that will be common
between the generators for UnicodeLocale and UnicodeNumberFormat to the
utility header.
This will be needed for the ComputeExponentForMagnitude AO for compact
formatting, namely step 5b:
Let exponent be an implementation- and locale-dependent (ILD) integer
by which to scale a number of the given magnitude in compact notation
for the current locale.
A number formatting pattern in the CLDR contains one or two entries,
delimited by a semi-colon. Previously, LibUnicode was just storing the
entire pattern as one string. This changes the generator to split the
pattern on that delimiter and generate the 3 unique patterns expected by
ECMA-402.
The rules for generating the 3 patterns are as follows:
* If the pattern contains 1 entry, it is the zero pattern. The positive
pattern is the zero pattern prepended with {plusSign}. The negative
pattern is the zero pattern prepended with {minusSign}.
* If the pattern contains 2 entries, the first is the zero pattern, and
the second is the negative pattern. The positive pattern is the zero
pattern prepended with {plusSign}.
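A minimal sketch of these rules (standard C++, not the actual generator code):

    #include <string>

    struct NumberPatterns {
        std::string zero;
        std::string positive;
        std::string negative;
    };

    // Derive the three patterns from a CLDR pattern such as "#,##0.00" or
    // "#,##0.00;(#,##0.00)".
    NumberPatterns parse_number_pattern(std::string const& pattern)
    {
        NumberPatterns patterns;
        auto separator = pattern.find(';');
        patterns.zero = pattern.substr(0, separator);
        patterns.positive = "{plusSign}" + patterns.zero;
        patterns.negative = (separator == std::string::npos)
            ? "{minusSign}" + patterns.zero
            : pattern.substr(separator + 1);
        return patterns;
    }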
The number system data in the CLDR contains information on how to format
numbers in a locale-dependent manner. Start parsing this data, beginning
with numeric symbol strings. For example, the symbol NaN maps to "NaN" in
the en-US locale, and "非數值" in the zh-Hant locale.
Some locales in the CLDR have alternate default numbering systems listed
under "defaultNumberingSystem-alt-*", e.g.:
"defaultNumberingSystem": "arab",
"defaultNumberingSystem-alt-latn": "latn",
"otherNumberingSystems": {
"native": "arab"
},
We were previously only parsing "defaultNumberingSystem" and
"otherNumberingSystems". This odd format appears to be an artifact of
converting from XML.
This isn't particularly important because this generates code that is
quite hidden from outside callers. But when viewing the generated code,
it's a bit nicer to read e.g. enum identifiers such as "MinusSign"
rather than "Minussign".
First off, this verifies that an initial value is always provided in
Properties.json for each property.
Second, it verifies that parsing that initial value succeeds.
This means that a call to `property_initial_value()` will always return
a valid StyleValue. :^)
This allows libraries and binaries to explicitly link against
`<library>.so.serenity`, which avoids some confusion if there are other
libraries with the same name, such as OpenSSL's `libcrypto`.
This file contains the list of locales which default to their parent
locale's values. In the core CLDR dataset, these locales have their own
files, but they are empty (except for identity data). For example:
https://github.com/unicode-org/cldr/blob/main/common/main/en_US.xml
In the JSON export, these files are excluded, so we currently are not
recognizing these locales just by iterating the locale files.
This is a prerequisite for upgrading to CLDR version 40. One of these
default-content locales is the popular "en-US" locale, which defaults to
"en" values. We were previously inferring the existence of this locale
from the "en-US-POSIX" locale (many implementations, including ours,
strip variants such as POSIX). However, v40 removes the "en-US-POSIX"
locale entirely, meaning that without this change, we wouldn't know that
"en-US" exists (we would default to "en").
For more detail on this and other v40 changes, see:
https://cldr.unicode.org/index/downloads/cldr-40#h.nssoo2lq3cba
This changes Web::Bindings::throw_dom_exception_if_needed() to return a
JS::ThrowCompletionOr instead of an Optional. This allows callers to
wrap the invocation with a TRY() macro instead of making a follow-up
call to should_return_empty(). Further, this removes all invocations to
vm.exception() in the generated bindings.
This also required converting URLSearchParams::for_each and the callback
function it invokes to ThrowCompletionOr. With this, the ReturnType enum
used by WrapperGenerator is removed as all callers would be using
ReturnType::Completion.
On my machine, this script took about 3.4 seconds, and was responsible
for essentially all of the time taken by the precommit hook.
The script is a faithful 1:1 reimplementation; even the regexes are
identical. And yet, it takes about 0.02 seconds, making the pre-commit
hook lightning fast again. Apparently Python is just faster in this
case.
Fun fact:
- Just reading all ~4000 files took bash about 1.2 seconds
- Checking the license took another 1.8 seconds in total
- Checking for math.h took another 0.4 seconds in total
- Checking for '#pragma once' took another 0.4 seconds in total
These timings are highly load-dependent, so they don't exactly add up to 3.4
seconds. However, it's good enough to determine that bash is no longer
fit for the purpose of this script.
'bootmode' now only controls which set of services are started by
SystemServer, so it is more appropriate to rename it to system_mode, and
no longer validate it in the Kernel.
Bootmode used to control framebuffers, panic behavior, and SystemServer.
This patch factors framebuffer control into a separate flag.
Note that the combination 'bootmode=self-test fbdev=on' leads to
unexpected behavior, which can only be fixed in a later commit.
This makes all pages look and feel the same, because they all use the
default CSS generated by pandoc. Also, it inserts the banner everywhere
at the top, not only into the top-level index.html.
Credit to @xSlendiX for suggesting that `-B` works here.
To ensure everything works as expected, a unit test was added with
multiple scenarios.
This binary has to have the SetUID flag, and we also bind-mount the
/usr/Tests directory to allow running of SetUID binaries.
Both at the same time because many of them call construct() in call()
and I'm not keen on adding a bunch of temporary plumbing to turn
exceptions into throw completions.
Also changes the return value of construct() to Object* instead of Value
as it always needs to return an object; allowing an arbitrary Value is a
massive foot gun.
The old versions were renamed to JS_DECLARE_OLD_NATIVE_FUNCTION and
JS_DEFINE_OLD_NATIVE_FUNCTION, and will eventually be removed once all
native functions have been converted to the new format.
Adds support for methods whose last parameter is a variadic DOMString.
We construct a Vector<String> of the remaining arguments to pass to
the C++ implementation.
As of the Clang 13 upgrade, we only need to build the toolchain once and
can use that toolchain for both x86_64 and i686. To do this, this breaks
the main Azure configuration into 3 "stages" (Lagom, Toolchain, and
Serenity), where the Serenity stage depends on the Toolchain stage.
This has the added benefit of uploading a new prebuilt toolchain cache
sooner than before, which should help alleviate pressure from PRs.
This commit updates the Clang toolchain's version to 13.0.0, which comes
with better C++20 support and improved handling of new features by
clang-format. Due to the newly enabled `-Bsymbolic-functions` flag, our
Clang binaries will only be 2-4% slower than if we dynamically linked
them, but we save hundreds of megabytes of disk space.
The `BuildClang.sh` script has been reworked to build the entire
toolchain in just three steps: one for the compiler, one for GNU
binutils, and one for the runtime libraries. This reduces the complexity
of the build script, and will allow us to modify the CI configuration to
only rebuild the libraries when our libc headers change.
Most of the compile flags have been moved out to a separate CMake cache
file, similarly to how the Android and Fuchsia toolchains are
implemented within the LLVM repo. This provides a nicer interface than
the heaps of command-line arguments.
We no longer build separate toolchains for each architecture, as the
same Clang binary can compile code for multiple targets.
The horrible mess that was `SERENITY_CLANG_ARCH` has been removed in
this commit. Clang happily accepts an `i686-pc-serenity` target triple,
which matches what our GCC toolchain accepts.
Note our Attribute class is what the spec refers to as just "Attr". The
main differences between the existing implementation and the spec are
just that the spec defines more fields.
Attributes can contain namespace URIs and prefixes. However, note that
these are not parsed in HTML documents unless the document content-type
is XML. So for now, these are initialized to null. Web pages are able to
set the namespace via JavaScript (setAttributeNS), so these fields may
be filled in when the corresponding APIs are implemented.
The main change to be aware of is that an attribute is a node. This has
implications for how attributes are stored in the Element class. Nodes
are non-copyable and non-movable because these constructors are deleted
by the EventTarget base class. This means attributes cannot be stored in
a Vector or HashMap as these containers assume copyability / movability.
So for now, the Vector holding attributes is changed to hold RefPtrs to
attributes instead. This might change when attribute storage is
implemented according to the spec (by way of NamedNodeMap).
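A simplified illustration of the storage constraint (standard C++ stand-ins rather than the real Node/EventTarget and RefPtr types):

    #include <memory>
    #include <string>
    #include <vector>

    // Stand-in for Attribute: like any Node, it is neither copyable nor movable.
    struct Attribute {
        Attribute(std::string name, std::string value)
            : name(std::move(name))
            , value(std::move(value))
        {
        }
        Attribute(Attribute const&) = delete;
        Attribute& operator=(Attribute const&) = delete;

        std::string name;
        std::string value;
    };

    struct Element {
        // std::vector<Attribute> would not compile once the vector has to copy
        // or move its elements; storing ref-counted pointers sidesteps that.
        std::vector<std::shared_ptr<Attribute>> attributes;
    };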
This adds the ParameterizedType, as `Vector<String>` doesn't encode the
full type information. It is a separate struct as you can't have
`Vector<Type>` inside of `Type`. This also makes Type RefCounted
because I had to make parse_type return a pointer to make dynamic
casting work correctly.
The reason I made it RefCounted instead of using a NonnullOwnPtr is
because it causes compiler errors that I don't want to figure out right
now.
Apparently it breaks the fuzzer build. There's probably a better fix
for this, but for now just unbreak the fuzzer build.
Keep this for non-fuzzer builds though since it's apparently a 17%
speedup for running test262 tests :^)
Lagom: Build with -fno-semantic-interposition
We build with this in non-lagom builds, and serenity's gcc even adds it
to its CC1_SPEC. Let's use it for lagom too.
Reduces the number of dynamic relocations in liblagom-js.so.0.0.0 (per
`objdump -R`) from 15133 to 14534, and increases its size back to 91M
(95156800 bytes), probably due to more inlining being possible.
This might help perf of lagom binaries.
We build with this in non-lagom builds, so there's no reason not
to use it in lagom builds as well.
Reduces the size of liblagom-js.so.0.0.0 from 94M to 90M
(from 98352784 to 93831056 bytes to be exact).
Typically size_t is used for indices, but we can take advantage of the
knowledge that there are only approximately 46K unique strings in the
generated UnicodeLocale.cpp file. Therefore, we can get away with using
u16 to store indices. There is a VERIFY that will fail if we ever exceed
the limits of u16.
On x86_64 builds, this reduces libunicode.so from 9.2 MiB to 7.3 MiB.
On i686 builds, this reduces libunicode.so from 3.9 MiB to 3.3 MiB.
These savings are entirely in the .rodata section of the shared library.
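The guard mentioned above is essentially a bounds check before narrowing; a rough sketch (standard library assert instead of AK's VERIFY, hypothetical name):

    #include <cassert>
    #include <cstdint>
    #include <limits>
    #include <string>
    #include <vector>

    // Indices into the unique-string table are emitted as u16, so fail loudly
    // if the table ever outgrows that range.
    uint16_t to_string_index(std::vector<std::string> const& unique_strings, size_t index)
    {
        assert(unique_strings.size() <= std::numeric_limits<uint16_t>::max());
        return static_cast<uint16_t>(index);
    }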
Note there are a couple of type differences between the spec and the IDL
file added in this commit. For example, we will need to support a type
of Variant to handle spec types such as "(double or sequence<double>)".
But for now, this allows web pages to construct an IntersectionObserver
with any valid type.
The *_from_string() and resolve_*_alias() generated methods are the last
remaining users of HashMap in the LibUnicode generated files (read: the
last methods not using compile-time structures). This converts these
methods to use an array of pairs mapping hash values to the desired
lookup values.
Because this code generation is the same between GenerateUnicodeData.cpp
and GenerateUnicodeLocale.cpp, this adds a GeneratorUtil.h header to the
LibUnicode generators to hold the helper that generates these lookup methods.
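Illustratively, the generated lookup reduces to hashing the input and scanning an array of (hash, value) pairs; a sketch with hypothetical names and a placeholder hash function (the real tables and hashing are produced by the generator):

    #include <array>
    #include <cstdint>
    #include <optional>
    #include <string_view>

    enum class Language : uint8_t { En, Fr };

    struct HashValuePair {
        uint32_t hash { 0 };
        Language value {};
    };

    // Placeholder hash; lookup only has to match whatever the generator used.
    constexpr uint32_t hash_string(std::string_view string)
    {
        uint32_t hash = 5381;
        for (auto ch : string)
            hash = hash * 33 + static_cast<uint8_t>(ch);
        return hash;
    }

    std::optional<Language> language_from_string(std::string_view language)
    {
        static constexpr std::array<HashValuePair, 2> table { {
            { hash_string("en"), Language::En },
            { hash_string("fr"), Language::Fr },
        } };

        auto hash = hash_string(language);
        for (auto const& pair : table) {
            if (pair.hash == hash)
                return pair.value;
        }
        return std::nullopt;
    }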
This concept is not present in ECMAScript, and it bothers me every time
I see it.
It's only used by WrapperGenerator, and even there only relevant in two
places, so let's fully remove it from LibJS and use a simple ternary
expression instead:
    cpp_name = js_name.is_null() && legacy_null_to_empty_string
        ? String::empty()
        : js_name.to_string(global_object);
Previously this would generate the following code:
    JS::Value foo_value;
    if (!foo.is_undefined())
        foo_value = foo;
Which is dangerous as we're passing an empty value around, which could
be exposed to user code again. This is fine with "= null", for which it
also generates:
    else
        foo_value = JS::js_null();
So, in summary: a value of type `any`, not `required`, with no default
value and no initializer from user code will now default to undefined
instead of an empty value.
Meta/Lagom/ReadMe.md never had any other name; not sure how that typo
happened.
The link to the non-existent directory is especially vexing because the
text goes on to explain that we don't want such a directory to exist.
Found by running markdown-checker, and 'wget'ing all external links.
The list-format strings used for Intl.ListFormat are small, but quite
heavily duplicated. For example, the string "{0}, {1}" appears 6,519
times. Generate unique strings for this data to avoid duplication.
In the generated UnicodeLocale.cpp file, there are 296,408 strings for
localizations of languages, territories, scripts, currencies & keywords.
Of these, only 43,848 (14.8%) are actually unique, so there are quite a
large number of duplicated strings.
This generates a single compile-time array to store these strings. The
arrays for the localizations now store an index into this single array
rather than duplicating any strings.
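On the generator side, deduplication is just an index-assigning map; a rough sketch with standard containers (the real generator uses AK types):

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Each distinct string is stored once; localization tables then store
    // indices into `strings` instead of copies of the text.
    struct UniqueStringTable {
        std::vector<std::string> strings;
        std::unordered_map<std::string, size_t> indices;

        size_t ensure(std::string const& string)
        {
            auto it = indices.find(string);
            if (it != indices.end())
                return it->second;
            strings.push_back(string);
            return indices[string] = strings.size() - 1;
        }
    };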
Some CLDR languages.json / territories.json files contain localizations
for some languages/territories that are otherwise not present in the CLDR
database. We already don't generate anything in UnicodeLocale.cpp for
these anomalies, but this will stop us from even storing that data in
the generator's memory.
This doesn't affect the output of the generator, but will have an effect
after an upcoming commit to unique-ify all of the strings in the CLDR.
There are only 112 code points with special casing rules, so this array
is quite small (compared to the 34,626-entry UnicodeData hash map that is
also storing this data). Removing all casing rules from UnicodeData will
happen in a subsequent commit.
Currently, all casing information (simple and special) is stored in a
compile-time array of size 34,626, then statically copied to a hash map
at runtime. In an effort to reduce the resulting memory usage, store the
simple casing rules in standalone compile-time arrays. The uppercase map
is size 1,450 and the lowercase map is size 1,433. Any code point not in
a map will implicitly have an identity mapping.
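A sketch of the lookup with the implicit identity fallback (hypothetical table excerpt, standard C++):

    #include <array>
    #include <cstdint>

    struct CaseMapping {
        uint32_t code_point { 0 };
        uint32_t mapping { 0 };
    };

    // Hypothetical excerpt of the generated uppercase table.
    static constexpr std::array<CaseMapping, 2> s_uppercase_mappings { {
        { 'a', 'A' },
        { 'b', 'B' },
    } };

    constexpr uint32_t to_uppercase_simple(uint32_t code_point)
    {
        for (auto const& mapping : s_uppercase_mappings) {
            if (mapping.code_point == code_point)
                return mapping.mapping;
        }
        // Any code point not present in the table maps to itself.
        return code_point;
    }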
It's "Clang" (capitalized). Silently building gcc if the toolchain
isn't known seems a bit unfriendly. So print a warning for this
(and for other unknown toolchains).
I used "git grep -FIn http://" to find all occurrences, and looked at
each one. If an occurrence was really just a link, and if a https
version exists, and if our Browser can access it at least as well as the
http version, then I changed the occurrence to https.
I'm happy to report that I didn't run into a single site where Browser
can't deal with the https version.
Having IDL constructors call FooWrapper::create(impl) directly created
a wrapper without telling the impl object about it. This meant that we
had wrapped C++ objects with a null wrapper() pointer.
This introduces 3 classes: NodeList, StaticNodeList and LiveNodeList.
NodeList is the base of the static and live versions. Static is a
snapshot whereas live acts on the underlying data and thus exhibits
the same issues we currently have with HTMLCollection.
They were split into separate classes to not have them weirdly
mish-mashed together.
The create functions for static and live both return a NonnullRefPtr to the base
class. This is to prevent having to do awkward casting at creation
and/or return, as the bindings expect to see the base NodeList only.