There are many assumptions in the stack that argc is not zero, and
argv[0] points to a valid string. The recent pwnkit exploit on Linux
was able to exploit this assumption in the `pkexec` utility
(a SUID-root binary) to escalate from any user to root.
By convention `execve(..)` should always be called with at least one
valid argument, so lets enforce that semantic to harden the system
against vulnerabilities like pwnkit.
Reference: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
These checks in `sys$execve` could trip up the system whenever you try
to execute an `.so` file. For example, double-clicking `libwasm.so` in
Terminal crashes the kernel.
This changes the program header alignment checks to reflect the same
checks in LibELF, and passes the requested alignment on to
`::try_allocate_range()`.
Since we don't return normally from this function, let's make it a
little extra difficult to accidentally leak something by leaving it on
the stack in this function.
This ensures that everything allocated on the stack in Process::exec()
gets cleaned up. We had a few leaks related to the parsing of shebang
(#!) executables that get fixed by this.
Since we don't return from sys$execve() when it's successful, we have to
take special care to tear down anything we've allocated.
Turns out we were not doing this for the full executable path itself.
Previously we would crash the process immediately when a promise
violation was found during a syscall. This is error prone, as we
don't unwind the stack. This means that in certain cases we can
leak resources, like an OwnPtr / RefPtr tracked on the stack. Or
even leak a lock acquired in a ScopeLockLocker.
To remedy this situation we move the promise violation handling to
the syscall handler, right before we return to user space. This
allows the code to follow the normal unwind path, and grantees
there is no longer any cleanup that needs to occur.
The Process::require_promise() and Process::require_no_promises()
functions were modified to return ErrorOr<void> so we enforce that
the errors are always propagated by the caller.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
Previously we were assigning to Process::m_space before actually
entering the new address space (assigning it to CR3.)
If a thread was preempted by the scheduler while destroying the old
address space, we'd then attempt to resume the thread with CR3 pointing
at a partially destroyed address space.
We could then crash immediately in write_cr3(), right after assigning
the new value to CR3. I am hopeful that this may have been the bug
haunting our CI for months. :^)
As required by posix. Also rename Thread::clear_signals to
Thread::reset_signals_for_exec since it doesn't actually clear any
pending signals, but rather does execve related signal book-keeping.
Instead of signalling allocation failure with a bool return value
(false), we now use ErrorOr<void> and return ENOMEM as appropriate.
This allows us to use TRY() and MUST() with Vector. :^)
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
Found due to smelly code in InodeFile::absolute_path.
In particular, this replaces the following misleading methods:
File::absolute_path
This method *never* returns an actual path, and if called on an
InodeFile (which is impossible), it would VERIFY_NOT_REACHED().
OpenFileDescription::try_serialize_absolute_path
OpenFileDescription::absolute_path
These methods do not guarantee to return an actual path (just like the
other method), and just like Custody::absolute_path they do not
guarantee accuracy. In particular, just renaming the method made a
TOCTOU bug obvious.
The new method signatures use KResultOr, just like
try_serialize_absolute_path() already did.
This patch converts all the usage of AK::String around sys$execve() to
using KString instead, allowing us to catch and propagate OOM errors.
It also required changing the kernel CommandLine helper class to return
a vector of KString for the userspace init program arguments.
Make use of the new FileDescription::try_serialize_absolute_path() to
avoid String in favor of KString throughout much of sys$execve() and
its helpers.
I had to look at this for a moment before I realized that sys$execve()
and the spawning of /bin/SystemServer at boot are taking two different
paths out of exec().
Add a comment to help the next person looking at it. :^)
Before this patch, we were leaking a ref on the open file description
used for the interpreter (the dynamic loader) in sys$execve().
This surfaced when adapting the syscall to use TRY(), since we were now
correctly transferring ownership of the interpreter to Process::exec()
and no longer holding on to a local copy of it (in `elf_result`).
Fixing the leak uncovered another problem. The interpreter description
would now get destroyed when returning from do_exec(), which led to a
kernel panic when attempting to acquire a mutex.
This happens because we're in a particularly delicate state when
returning from do_exec(). Everything is primed for the upcoming context
switch into the new executable image, and trying to block the thread
at this point will panic the kernel.
We fix this by destroying the interpreter description earlier in
do_exec(), at the point where we no longer need it.
When executing a dynamically linked program, we need to pass the main
program executable via a file descriptor to the dynamic loader.
Before this patch, we were allocating an FD for this purpose long after
it was safe to do anything fallible. If we were unable to allocate an
FD we would simply panic the kernel(!)
We now hoist the allocation so it can fail before we've committed to
a new executable.
Due to the use of ELF::Image::for_each_program_header(), we were
previously unable to use TRY() in the ELF loading code (since the return
statement inside TRY() would only return from the iteration callback.)
Once we commit to a new executable image in sys$execve(), we can no
longer return with an error to whoever called us from userspace.
We must make sure to surface any potential errors before that point.
This patch moves signal trampoline allocation before the commit.
A number of other things remain to be moved.
We previously allowed Thread to exist in a state where its m_name was
null, and had to work around that in various places.
This patch removes that possibility and forces those who would create a
thread (or change the name of one) to provide a NonnullOwnPtr<KString>
with the name.
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.