For now, this uses BOM-marked UTF-16BE and UTF-8 strings in page body text.
These markers should be ignored in body text.
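
For reference, a minimal sketch of how such markers are recognized: a PDF
text string starting with the bytes FE FF is UTF-16BE, and PDF 2.0 adds
EF BB BF for UTF-8. The enum and function names below are hypothetical and
are not LibPDF's actual API.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration; not LibPDF's actual API.
enum class TextStringEncoding {
    PDFDocEncoding,
    UTF16BE,
    UTF8,
};

// Inspects the leading bytes of a PDF text string for a byte order mark.
// Strings without a recognized mark are treated as PDFDocEncoding.
TextStringEncoding detect_text_string_encoding(std::string_view bytes)
{
    if (bytes.size() >= 2 && bytes[0] == '\xFE' && bytes[1] == '\xFF')
        return TextStringEncoding::UTF16BE;
    if (bytes.size() >= 3 && bytes[0] == '\xEF' && bytes[1] == '\xBB' && bytes[2] == '\xBF')
        return TextStringEncoding::UTF8;
    return TextStringEncoding::PDFDocEncoding;
}
```

A caller would run this check only on text strings such as the ones in the
document info dict, and skip it for strings shown in page content.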
Hand-written, with `set fenc=latin1` and `set binary` in vim, and
xref etc. fixed up by running

    mutool clean Tests/LibPDF/encoding.pdf Tests/LibPDF/encoding.pdf

as usual.
There were two problems:
1. parse_compressed_object_with_index() parses indirect objects
without going through Parser::parse_indirect_value(), so
push_reference() / pop_reference() weren't called.
Manually call them, both for the indirect object containing
the object stream and for the indirect object within the
object stream.
2. The indirect object within the object stream got decrypted
twice: Once when the object stream data itself got decrypted,
and then incorrectly a second time when the object data within
the stream was read. To fix, disable encryption while parsing
object stream data (since it's already decrypted).
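
The two fixes above can be sketched with scope guards. This is a simplified
model, not the actual change: `SketchParser`, `ReferenceScope`,
`EncryptionDisabledScope`, `ParseState` and `parse_compressed_object()` are
hypothetical, and the signatures of push_reference() / pop_reference() are
assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; not LibPDF's actual classes.
struct Reference {
    std::uint32_t index;
    std::uint32_t generation;
};

class SketchParser {
public:
    void push_reference(Reference reference) { m_references.push_back(reference); }
    void pop_reference() { m_references.pop_back(); }
    std::size_t reference_depth() const { return m_references.size(); }

    bool encryption_enabled() const { return m_encryption_enabled; }
    void set_encryption_enabled(bool enabled) { m_encryption_enabled = enabled; }

private:
    std::vector<Reference> m_references;
    bool m_encryption_enabled { true };
};

// Pushes a reference on construction and pops it on destruction.
class ReferenceScope {
public:
    ReferenceScope(SketchParser& parser, Reference reference)
        : m_parser(parser)
    {
        m_parser.push_reference(reference);
    }
    ~ReferenceScope() { m_parser.pop_reference(); }

private:
    SketchParser& m_parser;
};

// Turns decryption off and restores the previous setting on destruction.
class EncryptionDisabledScope {
public:
    explicit EncryptionDisabledScope(SketchParser& parser)
        : m_parser(parser)
        , m_previous(parser.encryption_enabled())
    {
        m_parser.set_encryption_enabled(false);
    }
    ~EncryptionDisabledScope() { m_parser.set_encryption_enabled(m_previous); }

private:
    SketchParser& m_parser;
    bool m_previous;
};

// Parser state as seen while the inner object is being parsed.
struct ParseState {
    std::size_t reference_depth;
    bool encryption_enabled;
};

ParseState parse_compressed_object(SketchParser& parser, Reference stream_reference, Reference object_reference)
{
    // Fix 1a: the object stream is itself an indirect object. Its data
    // would be read (and decrypted once) under this reference.
    ReferenceScope stream_scope(parser, stream_reference);

    // Fix 2: the stream data is already decrypted, so objects parsed out
    // of it must not be decrypted again.
    EncryptionDisabledScope no_decryption(parser);

    // Fix 1b: track the indirect object inside the object stream as well.
    ReferenceScope object_scope(parser, object_reference);

    // A real parser would parse the inner object here; the sketch only
    // records the state that parse would observe.
    return { parser.reference_depth(), parser.encryption_enabled() };
}
```

Using guards rather than paired manual calls means the reference stack and
the encryption flag are restored on every exit path, including early error
returns.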
The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/
which according to readme.md at the same location is CC0.
I created this by typing "sup" into TextEdit.app on macOS 13.4,
hitting Cmd-P to bring up the print dialog, clicking the PDF button
at the bottom, changing Title and Author to "sup", clicking
"Security Options…", and checking "Require password to open document"
(with password "sup").
This file tests several things:
- It has a compressed stream as first object. This used to make the
linearization dict detection logic assert.
- It uses AES encryption, using version 4 of the encryption
dict. This used to not be implemented.
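
As a rough illustration of the second point: with /V 4, the encryption dict
selects the algorithm through a crypt filter whose /CFM entry is `None`,
`V2` (RC4) or `AESV2` (AES-128). The sketch below assumes the /V value and
the /CFM name have already been read from the dict; the enum and function
names are hypothetical, not LibPDF's.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names for illustration; not LibPDF's actual API.
enum class CryptMethod {
    None,
    RC4,
    AESV2,
};

// Maps the encryption dict's /V value and the crypt filter's /CFM name
// to a crypt method. Returns an empty optional for unhandled combinations.
std::optional<CryptMethod> select_crypt_method(int v, std::string_view cfm)
{
    // /V 1 and /V 2 always use RC4 and have no crypt filters.
    if (v == 1 || v == 2)
        return CryptMethod::RC4;

    // /V 4 picks the method via the crypt filter's /CFM entry.
    if (v == 4) {
        if (cfm == "None")
            return CryptMethod::None;
        if (cfm == "V2")
            return CryptMethod::RC4;
        if (cfm == "AESV2")
            return CryptMethod::AESV2;
    }

    return std::nullopt;
}
```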
Let's put test files with the tests themselves, instead of a random user
directory. (But still copy them so they appear in the user directory
for convenience.)
Add a unit test for each sample PDF file that currently exists in the
anon user's `~/Documents/pdf` directory:
- linear.pdf
- non-linearized.pdf
- complex.pdf
Each test ensures that the PDF document is parsed and that the page
count matches the expected value.