Linux did the same thing 18 years ago and their reasons for the change
are similar to ours - https://github.com/torvalds/linux/commit/7d12e78
Most interrupt handlers (i.e. IRQ handlers) never used the register
state reference anywhere so there's simply no need of passing it around.
I didn't measure the performance boost but surely this change can't make
things worse anyway.
Instead, start by trying to read a buffer with size of Elf_Ehdr, and
check it for the shebang sign. If it's indeed an executable with shebang
then read again from the file, now with PAGE_SIZE size, which should
suffice for finding the interpreter path.
However, if the executable is an ELF, we quickly validate it and then
pass the preliminary buffer to the find_elf_interpreter_for_executable
method.
That method calculates the last byte offset which is needed to read all
of the program headers, so we don't just assume 4096 bytes is sufficient
anymore. The same pattern is applied when loading the interpreter ELF
main header and its program headers.
Resolve a regression caused by 01e1af732b.
This unbreaks coredump generation, because we need to use the VFS root
context of the crashed process and not of the FinalizerTask, as it will
hold an empty VFS root context that is assigned to kernel processes.
The LibELF validate_program_headers method tried to do too many things
at once, and as a result, we had an awkward return type from it.
To be able to simplify it, we no longer allow passing a StringBuilder*
but instead we require to pass an Optional<Elf_Phdr> by reference so
it could be filled with actual ELF program header that corresponds to
an INTERP header if such found.
As a result, we ensure that only certain implementations that actually
care about the ELF interpreter path will actually try to load it on
their own and if they fail, they can have better diagnostics for an
invalid INTERP header.
This change also fixes a bug that on which we failed to execute an ELF
program if the INTERP header is located outside the first 4KiB page of
the ELF file, as the kernel previously didn't have support for looking
beyond that for that header.
These 2 syscalls are responsible for unsharing resources in the system,
such as hostname, VFS root contexts and process lists.
Together with an appropriate userspace implementation, these syscalls
could be used for creating a sandbox environment (containers) for user
programs.
Similarly to VFSRootContext and ScopedProcessList, this class intends
to form resource isolation as well.
We add this class as an infrastructure preparation of hostname contexts
which should allow processes to obtain different hostnames on the same
machine.
There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.
Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".
This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
This new syscall will be used by the upcoming runc (run-container)
utility.
In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.
Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
simplify the idea of being jailed to a simple ProtectedValues boolean
flag. This means that we no longer check of matching jail pointers
anywhere in the Kernel code.
To set a process as jailed, a new prctl option was added to set a
Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
A process can either iterate on the global process list, or if it's
attached to a scoped process list, then only over that list.
This essentially replaces the need of checking the Jail pointer of a
process when iterating over process lists.
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.
For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.
The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.
To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.
We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
We don't really need it, and the entire functionality can be organically
intergrated into the VirtualConsole class, to switch between the Virtual
consoles, and manage initialization of all consoles in the global array.
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.
When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
We add a prctl option which would be called once after the dynamic
loader has finished to do text relocations before calling the actual
program entry point.
This change makes it much more obvious when we are allowed to change
a region protection access from being writable to executable.
The dynamic loader should be able to do this, but after a certain point
it is obvious that such mechanism should be disabled.
We have many places in the kernel code that we have boolean flags that
are only set once, and never reset again but are checked multiple times
before and after the time they're being set, which matches the purpose
of the SetOnce class.
This removes the allocate_tls syscall and adds an archctl option to set
the fs_base for the current thread on x86-64, since you can't set that
register from userspace. enter_thread_context loads the fs_base for the
next thread on each context switch.
This also moves tpidr_el0 (the thread pointer register on AArch64) to
the register state, so it gets properly saved/restored on context
switches.
The userspace TLS allocation code is kept pretty similar to the original
kernel TLS code, aside from a couple of style changes.
We also have to add a new argument "tls_pointer" to
SC_create_thread_params, as we otherwise can't prevent race conditions
between setting the thread pointer register and signal handling code
that might be triggered before the thread pointer was set, which could
use TLS.
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump clang-format version. So, since there are no
real downsides, let's commit them now.
While this clutters Process.cpp a tiny bit, I feel that it's worth it:
- 2x speed on the kcov_loop benchmark. Likely more during fuzzing.
- Overall code complexity is going down with this change.
- By reducing the code reachable from __sanitizer_cov_trace_pc code,
we can now instrument more code.
Sticking this to the function source has multiple benefits:
- We instrument more code, by not excluding entire files.
- NO_SANITIZE_COVERAGE can be used in Header files.
- Keeping the info with the source code, means if a function or
file is moved around, the NO_SANITIZE_COVERAGE moves with it.
This easily led to kernel deadlocks if the stopped thread held an
important global mutex (like the disk cache lock) while blocking.
Resolve this by ensuring stopped threads have a chance to return to the
userland boundary before actually stopping.
Since the POSIX sigaltstack manpage suggests allocating the stack
region using malloc(), and many heap implementations (including ours)
store heap chunk metadata in memory just before the vended pointer,
we would end up zeroing the metadata, leading to various crashes.
We should consider whether the selected Thread is within the same jail
or not.
Therefore let's make it clear to callers with jail semantics if a called
method checks if the desired Thread object is within the same jail.
As for Thread::for_each_* methods, currently nothing in the kernel
codebase needs iteration with consideration for jails, so the old
Thread::for_each* were simply renamed to include "ignoring_jails" suffix
in their names.
These syscalls are not necessary on their own, and they give the false
impression that a caller could set or get the thread name of any process
in the system, which is not true.
Therefore, move the functionality of these syscalls to be options in the
prctl syscall, which makes it abundantly clear that these operations
could only occur from a running thread in a process that sees other
threads in that process only.
This patch ensures that the shutdown procedure can complete due to the
fact we don't kill kernel processes anymore, and only stop the scheduler
from running after the filesystems unmount procedure.
We also need kernel processes during the shutdown procedure, because we
rely on the WorkQueue threads to run WorkQueue items to complete async
IO requests initiated by filesystem sync & unmounting, etc.
This is also simplifying the code around the killing processes, because
we don't need to worry about edge cases such as the FinalizerTask
anymore.
Now that support for 32-bit x86 has been removed, we don't have to worry
about the top half of `off_t`/`u64` values being chopped off when we try
to pass them in registers. Therefore, we no longer need the workaround
of pointers to stack-allocated values to syscalls.
Note that this changes the system call ABI, so statically linked
programs will have to be re-linked.
Using the kernel stack is preferable, especially when the examined
strings should be limited to a reasonable length.
This is a small improvement, because if we don't actually move these
strings then we don't need to own heap allocations for them during the
syscall handler function scope.
In addition to that, some kernel strings are known to be limited, like
the hostname string, for these strings we also can use FixedStringBuffer
to store and copy to and from these buffers, without using any heap
allocations at all.
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This ensures that if a Process or a
Thread were created, there's a guarantee that setting a new name will
never fail, as only copying of strings should be done to that static
storage.
The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
For a long time, our shutdown procedure has basically been:
- Acquire big process lock.
- Switch framebuffer to Kernel debug console.
- Sync and lock all file systems so that disk caches are flushed and
files are in a good state.
- Use firmware and architecture-specific functionality to perform
hardware shutdown.
This naive and simple shutdown procedure has multiple issues:
- No processes are terminated properly, meaning they cannot perform more
complex cleanup work. If they were in the middle of I/O, for instance,
only the data that already reached the Kernel is written to disk, and
data corruption due to unfinished writes can therefore still occur.
- No file systems are unmounted, meaning that any important unmount work
will never happen. This is important for e.g. Ext2, which has
facilites for detecting improper unmounts (see superblock's s_state
variable) and therefore requires a proper unmount to be performed.
This was also the starting point for this PR, since I wanted to
introduce basic Ext2 file system checking and unmounting.
- No hardware is properly shut down beyond what the system firmware does
on its own.
- Shutdown is performed within the write() call that asked the Kernel to
change its power state. If the shutdown procedure takes longer (i.e.
when it's done properly), this blocks the process causing the shutdown
and prevents any potentially-useful interactions between Kernel and
userland during shutdown.
In essence, current shutdown is a glorified system crash with minimal
file system cleanliness guarantees.
Therefore, this commit is the first step in improving our shutdown
procedure. The new shutdown flow is now as follows:
- From the write() call to the power state SysFS node, a new task is
started, the Power State Switch Task. Its only purpose is to change
the operating system's power state. This task takes over shutdown and
reboot duties, although reboot is not modified in this commit.
- The Power State Switch Task assumes that userland has performed all
shutdown duties it can perform on its own. In particular, it assumes
that all kinds of clean process shutdown have been done, and remaining
processes can be hard-killed without consequence. This is an important
separation of concerns: While this commit does not modify userland, in
the future SystemServer will be responsible for performing proper
shutdown of user processes, including timeouts for stubborn processes
etc.
- As mentioned above, the task hard-kills remaining user processes.
- The task hard-kills all Kernel processes except itself and the
Finalizer Task. Since Kernel processes can delay their own shutdown
indefinitely if they want to, they have plenty opportunity to perform
proper shutdown if necessary. This may become a problem with
non-cooperative Kernel tasks, but as seen two commits earlier, for now
all tasks will cooperate within a few seconds.
- The task waits for the Finalizer Task to clean up all processes.
- The task hard-kills and finalizes the Finalizer Task itself, meaning
that it now is the only remaining process in the system.
- The task syncs and locks all file systems, and then unmounts them. Due
to an unknown refcount bug we currently cannot unmount the root file
system; therefore the task is able to abort the clean unmount if
necessary.
- The task performs platform-dependent hardware shutdown as before.
This commit has multiple remaining issues (or exposed existing ones)
which will need to be addressed in the future but are out of scope for
now:
- Unmounting the root filesystem is impossible due to remaining
references to the inodes /home and /home/anon. I investigated this
very heavily and could not find whoever is holding the last two
references.
- Userland cannot perform proper cleanup, since the Kernel's power state
variable is accessed directly by tools instead of a proper userland
shutdown procedure directed by SystemServer.
The recently introduced Firmware/PowerState procedures are removed
again, since all of the architecture-independent code can live in the
power state switch task. The architecture-specific code is kept,
however.