Commit graph

1225 commits

Author SHA1 Message Date
logkos
3a610931f5 Kernel/Syscalls: Require inet promise for AF_INET6 domain 2024-10-05 12:52:10 -04:00
Liav A.
b93ca74d81 Kernel: Add a prctl option to enter jail mode until an execve syscall
In addition to the already existing option to enter jail mode (which is
set indefinitely), there should be a less restrictive option that should
allow exiting jail mode when doing the execve syscall.

This option will be useful for programs that need this kind of security
layer only in their runtime, but they're meant to actually initiate
another program in the end.
2024-10-03 12:39:45 +02:00
Liav A.
b90a36d2a9 Kernel+Userland: Rename jailed => jailed_until_exit
In all instances, it should be clear that the jailing of a process is
ending when the process exits.

This is a preparation before introducing another option to set a process
as jailed until it calls the execve syscall.
2024-10-03 12:39:45 +02:00
Liav A.
fdf3e0aca1 Kernel: Don't assume sizes of needed buffers early in the execve syscall
Instead, start by trying to read a buffer with size of Elf_Ehdr, and
check it for the shebang sign. If it's indeed an executable with shebang
then read again from the file, now with PAGE_SIZE size, which should
suffice for finding the interpreter path.

However, if the executable is an ELF, we quickly validate it and then
pass the preliminary buffer to the find_elf_interpreter_for_executable
method.

That method calculates the last byte offset which is needed to read all
of the program headers, so we don't just assume 4096 bytes is sufficient
anymore. The same pattern is applied when loading the interpreter ELF
main header and its program headers.
2024-09-01 20:52:55 +02:00
Liav A.
4aec3f4ef9 Kernel+Userland: Simplify loading of an ELF interpreter path
The LibELF validate_program_headers method tried to do too many things
at once, and as a result, we had an awkward return type from it.

To be able to simplify it, we no longer allow passing a StringBuilder*
but instead we require to pass an Optional<Elf_Phdr> by reference so
it could be filled with actual ELF program header that corresponds to
an INTERP header if such found.

As a result, we ensure that only certain implementations that actually
care about the ELF interpreter path will actually try to load it on
their own and if they fail, they can have better diagnostics for an
invalid INTERP header.

This change also fixes a bug that on which we failed to execute an ELF
program if the INTERP header is located outside the first 4KiB page of
the ELF file, as the kernel previously didn't have support for looking
beyond that for that header.
2024-07-21 15:38:52 +02:00
Liav A.
c0f55d4b11 Kernel: Add a check on ELF interpreter to verify we open a regular file
While extremely unlikely, it's possible to change the dynamic loader
to a non regular file, which will result in a kernel panic upon VERIFY
of the `interpreter_description->inode()` statement.
2024-07-21 15:38:52 +02:00
Liav A.
03ae9fdb0a Kernel: Check condition earlier for ELF file type
It makes no sense to do all of the loading work just to figure out that
the ELF file is an object file that is a result of compiling and not
an actual executable.

In addition to that, we should disallow running coredumps as well, so
the condition is changed now to only allow ET_DYN or ET_EXEC ELF files.
2024-07-21 15:38:52 +02:00
Liav A.
0e6624dc86 Kernel: Introduce the unshare syscall family
These 2 syscalls are responsible for unsharing resources in the system,
such as hostname, VFS root contexts and process lists.

Together with an appropriate userspace implementation, these syscalls
could be used for creating a sandbox environment (containers) for user
programs.
2024-07-21 11:44:23 +02:00
Liav A.
e52abd4c09 Kernel: Introduce the HostnameContext class
Similarly to VFSRootContext and ScopedProcessList, this class intends
to form resource isolation as well.
We add this class as an infrastructure preparation of hostname contexts
which should allow processes to obtain different hostnames on the same
machine.
2024-07-21 11:44:23 +02:00
Liav A.
3692af528e Kernel: Move most of VirtualFileSystem code to be in a namespace
There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.

Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".

This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
2024-07-21 11:44:23 +02:00
Liav A.
4370bbb3ad Kernel+Userland: Introduce the copy_mount syscall
This new syscall will be used by the upcoming runc (run-container)
utility.

In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
2024-07-21 11:44:23 +02:00
Liav A.
dd59fe35c7 Kernel+Userland: Reduce jails to be a simple boolean flag
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.

The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.

Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
  simplify the idea of being jailed to a simple ProtectedValues boolean
  flag. This means that we no longer check of matching jail pointers
  anywhere in the Kernel code.
  To set a process as jailed, a new prctl option was added to set a
  Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
  A process can either iterate on the global process list, or if it's
  attached to a scoped process list, then only over that list.
  This essentially replaces the need of checking the Jail pointer of a
  process when iterating over process lists.
2024-07-21 11:44:23 +02:00
Liav A.
91c87c5b77 Kernel+Userland: Prepare for considering VFSRootContext when mounting
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.

For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
2024-07-21 11:44:23 +02:00
Liav A.
01e1af732b Kernel/FileSystem: Introduce the VFSRootContext class
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.

The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.

To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.

We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
2024-07-21 11:44:23 +02:00
Ryan Castellucci
a2a6bc5348 Documentation: Fix some minor ESL grammar issues
There are a few instances where comments and documentation have minor
grammar issues likely resulting from English being the author's second
language.

This PR fixes several such cases, changing to idiomatic English and
resolving where it is unclear whether the user or program/code is
being referred to.
2024-07-03 00:17:46 +02:00
Liav A.
ecc9c5409d Kernel: Ignore dirfd if absolute path is given in VFS-related syscalls
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.

When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
2024-06-01 19:25:15 +02:00
implicitfield
4574a8c334 Kernel+LibC+LibCore: Implement mknodat(2) 2024-05-14 22:30:39 +02:00
implicitfield
05cf1327ed Kernel: Make utimensat ignore the dirfd when given an absolute path 2024-05-14 22:30:39 +02:00
Liav A.
15ddc1f17a Kernel+Userland: Reject W->X prot region transition after a prctl call
We add a prctl option which would be called once after the dynamic
loader has finished to do text relocations before calling the actual
program entry point.

This change makes it much more obvious when we are allowed to change
a region protection access from being writable to executable.
The dynamic loader should be able to do this, but after a certain point
it is obvious that such mechanism should be disabled.
2024-05-14 12:41:51 -06:00
Liav A.
e756567341 Kernel+Userland: Convert process syscall region enforce flag to SetOnce
This flag is set only once, and should never reset once it has been set,
making it an ideal SetOnce use-case.
It also simplifies the expected conditions for the enabling prctl call,
as we don't expect a boolean flag, but rather the specific prctl option
will always set (enable) Process' AddressSpace syscall region enforcing.
2024-05-14 12:41:51 -06:00
Dan Klishch
cc5bacf886 Kernel: Allow annotating initially loaded executable segments
This allows marking regions as VirtualMemoryRangeFlags::SyscallCode in
static executables.
2024-05-07 16:36:38 -06:00
Sönke Holz
243d7003a2 Kernel+LibC+LibELF: Move TLS handling to userspace
This removes the allocate_tls syscall and adds an archctl option to set
the fs_base for the current thread on x86-64, since you can't set that
register from userspace. enter_thread_context loads the fs_base for the
next thread on each context switch.
This also moves tpidr_el0 (the thread pointer register on AArch64) to
the register state, so it gets properly saved/restored on context
switches.

The userspace TLS allocation code is kept pretty similar to the original
kernel TLS code, aside from a couple of style changes.

We also have to add a new argument "tls_pointer" to
SC_create_thread_params, as we otherwise can't prevent race conditions
between setting the thread pointer register and signal handling code
that might be triggered before the thread pointer was set, which could
use TLS.
2024-04-19 16:46:47 -06:00
Andrew Kaster
a65c385057 Kernel: Don't try to copy empty Vector in sys$recvmsg
If there's no fds to copy in a message with proper space for an
SCM_RIGHTS set of cmsg headers, then don't try to copy them.

This avoids a Kernel panic when recvmsg-ing, as copy_to_user(p, 0, 0)
hits a VERIFY.
2024-04-19 16:38:55 -04:00
Dan Klishch
5ed7cd6e32 Everywhere: Use east const in more places
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump clang-format version. So, since there are no
real downsides, let's commit them now.
2024-04-19 06:31:19 -04:00
Sönke Holz
04ca9f393f Kernel/riscv64: Implement create_thread 2024-03-25 14:10:05 -06:00
Sönke Holz
65724efac3 Kernel/riscv64: Implement fork 2024-03-25 14:10:05 -06:00
Sönke Holz
faede8c93a Kernel/riscv64: Implement execve 2024-03-25 14:10:05 -06:00
Idan Horowitz
e38ccebfc8 Kernel: Stop swallowing thread unblocks while process is stopped
This easily led to kernel deadlocks if the stopped thread held an
important global mutex (like the disk cache lock) while blocking.
Resolve this by ensuring stopped threads have a chance to return to the
userland boundary before actually stopping.
2024-02-10 08:42:53 +01:00
Idan Horowitz
458e990b7b Kernel: Stop locking the scheduler spinlock before the ptrace mutex
Locking a mutex while holding a spinlock is always wrong, but in the
case of the scheduler lock, it also causes an assertion failure. (Which
would be triggered by 2 separate threads trying to ptrace at the same
time).
2024-02-10 08:42:53 +01:00
hanaa12G
7abda6a36f Kernel: Add new sysconf option _SC_GETGR_R_SIZE_MAX 2024-01-06 04:59:50 -07:00
Idan Horowitz
519214697b Kernel: Mark sys$getsockname as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
ed5406e47d Kernel: Mark sys$getpeername as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
24a60c5a10 Kernel: Mark sys$ioctl as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
d63667dbf1 Kernel: Mark sys$kill_thread as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
b44628c1fb Kernel: Mark sys$join_thread as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
82e6090f47 Kernel: Mark sys$detach_thread as not needing the big process lock
This syscall does not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
b49a0e2c61 Kernel: Mark sys$create_thread as not needing the big process lock
Now that the master TLS region is spinlock protected, this syscall does
not access any big process lock protected resources.
2023-12-26 19:20:21 +01:00
Idan Horowitz
6a4b93b3e0 Kernel: Protect processes' master TLS with a fine-grained spinlock
This moves it out of the scope of the big process lock, and allows us
to wean some syscalls off it, starting with sys$allocate_tls.
2023-12-26 19:20:21 +01:00
Idan Horowitz
a49b7e92eb Kernel: Shrink instead of expand sigaltstack range to page boundaries
Since the POSIX sigaltstack manpage suggests allocating the stack
region using malloc(), and many heap implementations (including ours)
store heap chunk metadata in memory just before the vended pointer,
we would end up zeroing the metadata, leading to various crashes.
2023-12-24 16:11:35 +01:00
Idan Horowitz
1bea780a7f Kernel: Reject loading ELF files with no loadable segments
If there's no loadable segments then there can't be any code to execute
either. This resolves a crash these kinds of ELF files would cause from
the directly following VERIFY statement.
2023-12-15 21:36:25 +01:00
Idan Horowitz
2a6b492c7f Kernel: Copy over TLS region size and alignment when forking
Previously we would unintentionally leave them zero-initialized,
resulting in any threads created post fork (but without execve) having
invalid thread local storage pointers stored in their FS register.
2023-12-15 21:36:03 +01:00
Daniel Bertalan
45d81dceed Everywhere: Replace ElfW(type) macro usage with Elf_type
This works around a `clang-format-17` bug which caused certain usages to
be misformatted and fail to compile.

Fixes #8315
2023-12-01 10:02:39 +02:00
Liav A
5dba1dedb7 Kernel: Don't warn when running dynamically-linked ELF without PT_INTERP
We could technically copy the dynamic loader to other path and run it
from there, so let's not assume paths.
If the user is so determined to do such thing, then a warning is quite
meaningless.
2023-11-27 09:27:34 -07:00
Idan Horowitz
16a53c811e Kernel: Treat a backlog argument of 0 to listen() as if it was 1
As per POSIX, the behavior of listen() with a backlog value of 0 is
implementation defined: "A backlog argument of 0 may allow the socket
to accept connections, in which case the length of the listen queue may
be set to an implementation-defined minimum value."
Since creating a socket that can't accept any connections seems
relatively useless, and as other platforms (Linux, FreeBSD, etc) chose
to support accepting connections with this backlog value, support it as
well by normalizing it to 1.
2023-11-25 16:34:38 +01:00
Sönke Holz
da88d766b2 Kernel/riscv64: Make the kernel compile
This commits inserts TODOs into all necessary places to make the kernel
compile on riscv64!
2023-11-10 15:51:31 -07:00
Uku Loskit
ecbb1df01b Kernel/Syscalls: Allow root to ptrace any process
Previously root (euid=0) was not able to ptrace any dumpable process
as expected. This change fixes this.
2023-11-06 10:03:07 +01:00
Romain Chardiny
6d31d81309 Kernel: Allow negative value for backlog in sys$listen 2023-11-04 17:35:54 +01:00
Liav A
1b00618fd9 Kernel+Userland: Replace the beep syscall with the new /dev/beep device
There's no need to have separate syscall for this kind of functionality,
as we can just have a device node in /dev, called "beep", that allows
writing tone generation packets to emulate the same behavior.

In addition to that, we remove LibC sysbeep function, as this function
was never being used by any C program nor it was standardized in any
way.
Instead, we move the userspace implementation to LibCore.
2023-11-03 15:19:33 +01:00
kleines Filmröllchen
398d271a46 Kernel: Share Processor class (and others) across architectures
About half of the Processor code is common across architectures, so
let's share it with a templated base class. Also, other code that can be
shared in some ways, like FPUState and TrapFrame functions, is adjusted
here. Functions which cannot be shared trivially (without internal
refactoring) are left alone for now.
2023-10-03 16:08:29 -06:00
Liav A
cbaa3465a8 Kernel: Add jail semantics to methods iterating over thread lists
We should consider whether the selected Thread is within the same jail
or not.
Therefore let's make it clear to callers with jail semantics if a called
method checks if the desired Thread object is within the same jail.

As for Thread::for_each_* methods, currently nothing in the kernel
codebase needs iteration with consideration for jails, so the old
Thread::for_each* were simply renamed to include "ignoring_jails" suffix
in their names.
2023-09-15 11:06:48 -06:00