Commit graph

77 commits

Author SHA1 Message Date
Luke Wilde
c0a64f7317 LibWeb: Check for HTML integration points in the tree constructor
This particularly implements these two points:
- "If the adjusted current node is an HTML integration point and the
   token is a start tag"
- "If the adjusted current node is an HTML integration point and the
   token is a character token"

This also adds spec comments to the tree constructor.
2021-10-01 12:26:41 +02:00
Andreas Kling
831fdcaabc LibWeb: Add the PageTransitionEvent interface and fire "pageshow" events
We now fire "pageshow" events at the appropriate time during document
loading (done by the parser.)

Note that there are no corresponding "pagehide" events yet.
2021-09-26 12:47:51 +02:00
Andreas Kling
508edcd217 LibWeb: Add a "page showing" flag to documents
This will be used to determine whether "pageshow" and "pagehide" events
are appropriate. We won't actually make use of it until we implement
more of history traversal and document unloading.
2021-09-26 12:47:51 +02:00
Andreas Kling
a2f77a2e39 LibWeb: Implement "update the current document readiness" from spec
The only difference from what we were already doing is that setting the
same ready state twice no longer fires a "readystatechange" event.
I don't think that could happen in practice though.
2021-09-26 12:47:51 +02:00
Andreas Kling
8496024756 LibWeb: Store HTML document ready state as an enum 2021-09-26 12:47:51 +02:00
Andreas Kling
dbba0a520f LibWeb: Allow HTML parser to delay delivery of the document "load" event
We will now spin in "the end" until there are no more "things delaying
the load event". Of course, nothing actually uses this yet, and there
are a lot of things that need to.
2021-09-26 02:00:00 +02:00
Andreas Kling
e7af6af626 LibWeb: Implement more of HTMLParser::the_end() and bring closer to spec 2021-09-26 00:52:19 +02:00
Andreas Kling
e452550fda LibWeb: Split out "The end" from the HTML parsing spec to a function
Also add a spec link and some comments.
2021-09-26 00:04:33 +02:00
Andreas Kling
f67648f872 LibWeb: Rename HTMLDocumentParser => HTMLParser 2021-09-25 23:36:43 +02:00
Ben Wiederhake
32e98d0924 Libraries: Use AK::Variant default initialization where appropriate 2021-09-21 04:22:52 +04:30
Andreas Kling
c34da16089 LibWeb: Make <script src> loads partially async (by following the spec)
Instead of firing up a network request and synchronously blocking for it
to finish via a nested event loop, we now start an asynchronous request
when encountering <script src>.

Once the script load finishes (or fails), it gets executed at one of the
synchronization points in the HTML parser.

This solves some long-standing issues with random unexpected events
getting dispatched in the middle of parsing.
2021-09-20 17:22:25 +02:00
Andreas Kling
e11ae33c66 LibWeb: Pop entire stack of open elements at the end of parsing 2021-09-20 17:22:25 +02:00
Andreas Kling
cb895edad4 LibWeb: Move Attribute into the DOM namespace 2021-09-16 01:39:47 +02:00
Andreas Kling
70398645f3 LibWeb: Improvements to error handling in HTML foreign content parsing
Follow the spec more closely when encountering an invalid start or end
tag during foreign content parsing.
2021-09-14 23:49:45 +02:00
Luke Wilde
f62477c093 LibWeb: Implement HTML fragment serialisation and use it in innerHTML
The previous implementation was about a half implementation and was
tied to Element::innerHTML. This separates it and puts it into
HTMLDocumentParser, as this is in the parsing section of the spec.

This provides a near finished HTML fragment serialisation algorithm,
bar namespaces in attributes and the `is` value.
2021-09-14 02:09:18 +02:00
Idan Horowitz
4629f2e4ad LibWeb: Add the Web::URL namespace and move URLEncoder to it
This namespace will be used for all interfaces defined in the URL
specification, like URL and URLSearchParams.

This has the unfortunate side-effect of requiring us to use the fully
qualified AK::URL name whenever we want to refer to the AK class, so
this commit also fixes all such references.
2021-09-13 01:43:10 +02:00
Andreas Kling
882c7b1295 LibWeb: Spin the event loop in HTML parser until scripts can run
Call HTML::EventLoop::spin_until() from the HTML parser when deciding
whether we can run a script yet.

Note that spin_until() actually doesn't do any work yet.
2021-09-09 02:30:54 +02:00
TheFightingCatfish
08359ba578 LibWeb: Fix regression of "contenteditable" attribute 2021-07-31 17:39:28 +02:00
ovf
898b8ffcb6 LibWeb: Avoid assertion failure on parsing numeric character references 2021-07-28 18:32:22 +02:00
ovf
13c7d55320 LibWeb: Fix parsing of character references in attribute values 2021-07-27 00:03:43 +02:00
Max Wipfli
ccae0cae45 LibWeb: Rename HTMLToken::doctype_data() => ensure_doctype_data()
This renames the accessor to better reflect what it does, as this will
allocate a DoctypeData struct if there is none.
2021-07-17 16:24:57 +04:30
Max Wipfli
519a1cdc22 LibWeb: Change HTMLToken storage architecture
This completely changes how HTMLTokens store their data. Previously,
space was allocated for all token types separately. Now, the HTMLToken's
data is stored in just a String, two booleans and a Variant.

This change reduces sizeof(HTMLToken) from 68 to 32. Also, this reduces
raw tokenization time by around 20 to 50 percent, depending on the page.
Full document parsing time (with HTMLDocumentParser, on a local HTML
page without any dependency files) is reduced by between 4 and 20
percent, depending on the page.

Since tokenizing HTML pages can easily generated 50'000 tokens and more,
the storage has been designed in a way that avoids heap allocations
where possible, while trying to reduce the size of the tokens. The only
tokens which need to allocate on the heap are thus DOCTYPE tokens (max.
1 per document), and tag tokens (but only if they have attributes). This
way, only around 5 percent of all tokens generated need to allocate on
the heap (except for StringImpl allocations).
2021-07-17 16:24:57 +04:30
Max Wipfli
8a4c44db8c LibWeb: Make HTMLTokens non-copyable 2021-07-17 16:24:57 +04:30
Max Wipfli
7eb294df0d LibWeb: Move HTMLToken in HTMLDocumentParser
This replaces a copy construction of an HTMLToken with a move(). This
allows HTMLToken to be made non-copyable in a further commit.
2021-07-17 16:24:57 +04:30
Max Wipfli
2532bdfabf LibWeb: Remove friend class declarations from HTMLToken
Since all interaction with the HTMLToken class now happens over getters
and setters, there is no more need for HTMLTokenizer and
HTMLDocumentParser to have direct access to the members.
2021-07-17 16:24:57 +04:30
Max Wipfli
25cba4387b LibWeb: Add HTMLToken(Type) constructor and use it 2021-07-17 16:24:57 +04:30
Max Wipfli
f2e3c770f9 LibWeb: Use setter for HTMLToken::m_{start,end}_position 2021-07-17 16:24:57 +04:30
Max Wipfli
8b31e41692 LibWeb: Change HTMLToken::m_doctype into named DoctypeData struct
This is in preparation for an upcoming storage change of HTMLToken. In
contrast to the other token types, the accessor can hand out a mutable
reference to allow users to change parts of the DoctypeData easily.
2021-07-17 16:24:57 +04:30
Max Wipfli
918bde98b1 LibWeb: Hide implementation details of HTMLToken attribute list
Previously, HTMLToken would expose the Vector<Attribute> directly to
its users. In preparation for a future change, all users now use
implementation-agnostic APIs which do not expose the Vector directly.
2021-07-17 16:24:57 +04:30
Max Wipfli
15d8635afc LibWeb: User getter+setter for HTMLToken tag name and self-closing flag 2021-07-17 16:24:57 +04:30
Max Wipfli
1aeafcc58b LibWeb: Use getter and setter for Character type HTMLTokens
While storing the code point in a UTF-8 encoded String in horrendously
inefficient, this problem will be addressed at a later stage.
2021-07-17 16:24:57 +04:30
Max Wipfli
e8e9426b4f LibWeb: User getter and setter for Comment type HTMLTokens 2021-07-17 16:24:57 +04:30
Max Wipfli
f886aa15b8 LibWeb: Rename HTMLToken::AttributeBuilder struct to Attribute
This does not contain StringBuilders anymore, so it can do with a
simpler name: Attribute.
2021-07-17 16:24:57 +04:30
Max Wipfli
d82f3eb085 LibWeb: Make HTMLToken::{Position,AttributeBuilder} structs public
There was and is no reason for those to be private. Making them public
also allows us to explicitly specify the return type of some getters.
2021-07-17 16:24:57 +04:30
Max Wipfli
e22a34badb LibWeb: Fix assertion failures in HTMLTokenizer
The *TagName states are all very similar, so it seems to be correct to
apply the fix from #8761 to all of those states.

This fixes #8788.
2021-07-16 11:55:55 +02:00
Max Wipfli
2404ad6897 LibWeb: Fix assertion failure when tokenizing JS regex literals
This fixes parsing the following regular expression: /</g;

It also adds a simple script element to the HTMLTokenizer regression
test, which also contains that specific regex.
2021-07-15 01:47:22 +02:00
Max Wipfli
bb2aed7d76 LibWeb: Correct behavior of Comment* states in HTMLTokenizer
Previously, this would lead to assertion failures when parsing HTML
comments. This fixes #8757.
2021-07-15 00:48:45 +02:00
Max Wipfli
af0b483123 LibWeb: VERIFY an empty builder when emitting tokens in HTMLTokenizer 2021-07-15 00:48:45 +02:00
Max Wipfli
045a6a566b LibWeb: Remove unused HTMLTokenizer::m_input member variable 2021-07-14 23:03:36 +02:00
Max Wipfli
35f32ac170 LibWeb: Change HTMLToken.h to east const style 2021-07-14 23:03:36 +02:00
Max Wipfli
125982943a LibWeb: Change HTMLTokenizer.{cpp,h} to east const style 2021-07-14 23:03:36 +02:00
Gunnar Beutner
300823c314 LibWeb: Use move() when enqueuing tokens in HTMLTokenizer
We're not using the current token anymore once it's enqueued so let's
use move() when enqueuing the tokens.
2021-07-14 23:03:36 +02:00
Gunnar Beutner
c3ad8e9a52 LibWeb: Remove StringBuilder from HTMLToken::m_comment_or_character 2021-07-14 23:03:36 +02:00
Gunnar Beutner
3aa202c432 LibWeb: Remove StringBuilder from HTMLToken::m_tag 2021-07-14 23:03:36 +02:00
Gunnar Beutner
901d71148b LibWeb: Remove StringBuilders from HTMLToken::AttributeBuilder 2021-07-14 23:03:36 +02:00
Gunnar Beutner
992964aa7d LibWeb: Remove StringBuilders from HTMLToken::m_doctype 2021-07-14 23:03:36 +02:00
Gunnar Beutner
2150609590 LibWeb: Remove more unused StringBuilders in HTMLToken
These fields aren't read anywhere but I didn't feel like removing
them outright.
2021-07-14 23:03:36 +02:00
Gunnar Beutner
d9e52997e2 LibWeb: Use an Optional<String> to track the last HTML start tag
Using an HTMLToken object here is unnecessary because the only
attribute we're interested in is the tag_name.
2021-07-14 23:03:36 +02:00
Luke
e9eae9d880 LibWeb: Add extracting character encoding from a meta content attribute
Some Gmail emails contain this.
2021-07-13 20:23:44 +02:00
Andreas Kling
ee3a73ddbb AK: Rename downcast<T> => verify_cast<T>
This makes it much clearer what this cast actually does: it will
VERIFY that the thing we're casting is a T (using is<T>()).
2021-06-24 19:57:01 +02:00