The stack is misaligned at this point for some reason, this is a hack
that makes the resulting object "correctly" aligned, thus avoiding a
KUBSAN error.
Mere mortals like myself cannot understand more than two lines of
assembly without a million comments explaining what's happening, so do
that and make sure no one has to go on a wild stack state chase when
hacking on these.
POSIX requires that sigaction() and friends set a _process-wide_ signal
handler, so move signal handlers and flags inside Process.
This also fixes a "pid/tid confusion" FIXME, as we can now send the
signal to the process and let that decide which thread should get the
signal (which is the thread with tid==pid, but that's now the Process's
problem).
Note that each thread still retains its signal mask, as that is local to
each thread.
These are not technically required, since the Thread constructor
already sets these, but they are set on i686, so let's try and keep
consistent behaviour between the different archs.
While investigating why gdb is failing when it calls `PT_CONTINUE`
against Serenity I noticed that the names of the programs in the
System Monitor didn't make sense. They were seemingly stale.
After inspecting the kernel code, it became apparent that the sequence
occurs as follows:
1. Debugger calls `fork()`
2. The forked child calls `PT_TRACE_ME`
3. The `PT_TRACE_ME` instructs the forked process to block in the
kernel waiting for a signal from the tracer on the next call
to `execve(..)`.
4. Debugger waits for forked child to spawn and stop, and then it
calls `PT_ATTACH` followed by `PT_CONTINUE` on the child.
5. Currently the `PT_CONTINUE` fails because of some other yet to
be found bug.
6. The process name is set immediately AFTER we are woken up by
the `PT_CONTINUE` which never happens in the case I'm debugging.
This chain of events leaves the process suspended, with the name of
the original (forked) process instead of the name we inherit from
the `execve(..)` call.
To avoid such confusion in the future, we set the new name before we
block waiting for the tracer.
Arguments larger than 32bit need to be passed as a pointer on a 32bit
architectures. sys$profiling_enable has u64 event_mask argument,
which means that it needs to be passed as an pointer. Previously upper
32bits were filled by garbage.
This matches the likes of the adopt_{own, ref}_if_nonnull family and
also frees up the name to allow us to eventually add OOM-fallible
versions of these functions.
Move the definitions for maximum argument and environment size to
Process.h from execve.cpp. This allows sysconf(_SC_ARG_MAX) to return
the actual argument maximum of 128 KiB to userspace.
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.
These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
Rename the bound socket accessor from socket() to bound_socket().
Also return RefPtr<LocalSocket> instead of a raw pointer, to make it
harder for callers to mess up.
Previously we would return a bytes written value of 0 if the writing end
of the socket was full. Now we either exit with EAGAIN if the socket
description is non-blocking, or block until the description can be
written to.
This is mostly a copy of the conditions in sys$write but with the "total
nwritten" parts removed as sys$sendmsg does not have that.
This commit removes the usage of HashMap in Mutex, thereby making Mutex
be allocation-free.
In order to achieve this several simplifications were made to Mutex,
removing unused code-paths and extra VERIFYs:
* We no longer support 'upgrading' a shared lock holder to an
exclusive holder when it is the only shared holder and it did not
unlock the lock before relocking it as exclusive. NOTE: Unlike the
rest of these changes, this scenario is not VERIFY-able in an
allocation-free way, as a result the new LOCK_SHARED_UPGRADE_DEBUG
debug flag was added, this flag lets Mutex allocate in order to
detect such cases when debugging a deadlock.
* We no longer support checking if a Mutex is locked by the current
thread when the Mutex was not locked exclusively, the shared version
of this check was not used anywhere.
* We no longer support force unlocking/relocking a Mutex if the Mutex
was not locked exclusively, the shared version of these functions
was not used anywhere.
We were marking the execing thread as Runnable near the end of
Process::do_exec().
This was necessary for exec in processes that had never been scheduled
yet, which is a specific edge case that only applies to the very first
userspace process (normally SystemServer). At this point, such threads
are in the Invalid state.
In the common case (normal userspace-initiated exec), making the current
thread Runnable meant that we switched away from its current state:
Running. As the thread is indeed running, that's a bogus change!
This created a short time window in which the thread state was bogus,
and any attempt to block the thread would panic the kernel (due to a
bogus thread state in Thread::block() leading to VERIFY_NOT_REACHED().)
Fix this by not touching the thread state in Process::do_exec()
and instead make the first userspace thread Runnable directly after
calling Process::exec() on it in try_create_userspace_process().
It's unfortunate that exec() can be called both on the current thread,
and on a new thread that has never been scheduled. It would be good to
not have the latter edge case, but fixing that will require larger
architectural changes outside the scope of this fix.
There are many assumptions in the stack that argc is not zero, and
argv[0] points to a valid string. The recent pwnkit exploit on Linux
was able to exploit this assumption in the `pkexec` utility
(a SUID-root binary) to escalate from any user to root.
By convention `execve(..)` should always be called with at least one
valid argument, so lets enforce that semantic to harden the system
against vulnerabilities like pwnkit.
Reference: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt