This patch implements a simple version of the futex (fast userspace
mutex) API in the kernel and uses it to make the pthread_cond_t API's
block instead of busily sched_yield().
An arbitrary userspace address is passed to the kernel as a "token"
that identifies the futex and you can then FUTEX_WAIT and FUTEX_WAKE
that specific userspace address.
FUTEX_WAIT corresponds to pthread_cond_wait() and FUTEX_WAKE is used
for pthread_cond_signal() and pthread_cond_broadcast().
I'm pretty sure I'm missing something in this implementation, but it's
hopefully okay for a start. :^)
This is a little strange, but it's how I understand things should work.
The first thread in a new process now has TID == PID.
Additional threads subsequently spawned in that process all have unique
TID's generated by the PID allocator. TIDs are now globally unique.
The idea of all processes reliably having a main thread was nice in
some ways, but cumbersome in others. More importantly, it didn't match
up with POSIX thread semantics, so let's move away from it.
This thread gets rid of Process::main_thread() and you now we just have
a bunch of Thread objects floating around each Process.
When the finalizer nukes the last Thread in a Process, it will also
tear down the Process.
There's a bunch of more things to fix around this, but this is where we
get started :^)
This patch adds a single "kernel info page" that is mappable read-only
by any process and contains the current time of day.
This is then used to implement a version of gettimeofday() that doesn't
have to make a syscall.
To protect against race condition issues, the info page also has a
serial number which is incremented whenever the kernel updates the
contents of the page. Make sure to verify that the serial number is the
same before and after reading the information you want from the page.
The kernel now supports basic profiling of all the threads in a process
by calling profiling_enable(pid_t). You finish the profiling by calling
profiling_disable(pid_t).
This all works by recording thread stacks when the timer interrupt
fires and the current thread is in a process being profiled.
Note that symbolication is deferred until profiling_disable() to avoid
adding more noise than necessary to the profile.
A simple "/bin/profile" command is included here that can be used to
start/stop profiling like so:
$ profile 10 on
... wait ...
$ profile 10 off
After a profile has been recorded, it can be fetched in /proc/profile
There are various limits (or "bugs") on this mechanism at the moment:
- Only one process can be profiled at a time.
- We allocate 8MB for the samples, if you use more space, things will
not work, and probably break a bit.
- Things will probably fall apart if the profiled process dies during
profiling, or while extracing /proc/profile
This patch makes SharedBuffer use a PurgeableVMObject as its underlying
memory object.
A new syscall is added to control the volatile flag of a SharedBuffer.
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
The main thread of each kernel/user process will take the name of
the process. Extra threads will get a fancy new name
"ProcessName[<tid>]".
Thread backtraces now list the thread name in addtion to tid.
Add the thread name to /proc/all (should it get its own proc
file?).
Add two new syscalls, set_thread_name and get_thread_name.
It's now possible to load a .o file into the kernel via a syscall.
The kernel will perform all the necessary ELF relocations, and then
call the "module_init" symbol in the loaded module.
Add an initial implementation of pthread attributes for:
* detach state (joinable, detached)
* schedule params (just priority)
* guard page size (as skeleton) (requires kernel support maybe?)
* stack size and user-provided stack location (4 or 8 MB only, must be aligned)
Add some tests too, to the thread test program.
Also, LibC: Move pthread declarations to sys/types.h, where they belong.
This can be implemented entirely in userspace by calling tcgetattr().
To avoid screwing up the syscall indexes, this patch also adds a
mechanism for removing a syscall without shifting the index of other
syscalls.
Note that ports will still have to be rebuilt after this change,
as their LibC code will try to make the isatty() syscall on startup.
Have pthread_create() allocate a stack and passing it to the kernel
instead of this work happening in the kernel. The more of this we can
do in userspace, the better.
This patch also unexposes the raw create_thread() and exit_thread()
syscalls since they are now only used by LibPthread anyway.
It's now possible to block until another thread in the same process has
exited. We can also retrieve its exit value, which is whatever value it
passed to pthread_exit(). :^)
This patch adds pthread_create() and pthread_exit(), which currently
simply wrap our existing create_thread() and exit_thread() syscalls.
LibThread is also ported to using LibPthread.
POSIX's openat() is very similar to open(), except you also provide a
file descriptor referring to a directory from which relative paths
should be resolved.
Passing it the magical fd number AT_FDCWD means "resolve from current
directory" (which is indeed also what open() normally does.)
This fixes libarchive's bsdtar, since it was trying to do something
extremely wrong in the absence of openat() support. The issue has
recently been fixed upstream in libarchive:
https://github.com/libarchive/libarchive/issues/1239
However, we should have openat() support anyway, so I went ahead and
implemented it. :^)
Fixes#748.
Instead of the big ugly switch statement, build a lookup table using
the syscall enumeration macro.
This greatly simplifies the syscall implementation. :^)
It's not safe to use a raw pointer for Process::m_tty. A pseudoterminal
pair will disappear when file descriptors are closed, and we'd end up
looking dangly. Just use a RefPtr.
Scheduling priority is now set at the thread level instead of at the
process level.
This is a step towards allowing processes to set different priorities
for threads. There's no userspace API for that yet, since only the main
thread's priority is affected by sched_setparam().
Add the ability to both pass arguments to scripts with shebangs
(./script argument1 argument2) and to specify them in the shebang line
(#!/usr/local/bin/bash -x -e)
Fixes#585
The way it gets the entropy and blasts it to the buffer is pretty
ugly IMHO, but it does work for now. (It should be replaced, by
not truncating a u32.)
It implements an (unused for now) flags argument, like Linux but
instead of OpenBSD's. This is in case we want to distinguish
between entropy sources or any other reason and have to implement
a new syscall later. Of course, learn from Linux's struggles with
entropy sourcing too.
This patch adds three separate per-process fault counters:
- Inode faults
An inode fault happens when we've memory-mapped a file from disk
and we end up having to load 1 page (4KB) of the file into memory.
- Zero faults
Memory returned by mmap() is lazily zeroed out. Every time we have
to zero out 1 page, we count a zero fault.
- CoW faults
VM objects can be shared by multiple mappings that make their own
unique copy iff they want to modify it. The typical reason here is
memory shared between a parent and child process.
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
This patch makes it possible to *run* text files that start with the
characters "#!" followed by an interpreter.
I've tested this with both the Serenity built-in shell and the Bash
shell, and it works as expected. :^)
The fchdir() function is equivalent to chdir() except that the
directory that is to be the new current working directory is
specified by a file descriptor.
This patch adds support for TLS according to the x86 System V ABI.
Each thread gets a thread-specific memory region, and the GS segment
register always points _to a pointer_ to the thread-specific memory.
In other words, to access thread-local variables, userspace programs
start by dereferencing the pointer at [gs:0].
The Process keeps a master copy of the TLS segment that new threads
should use, and when a new thread is created, they get a copy of it.
It's basically whatever the PT_TLS program header in the ELF says.
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.
Let's just not support building that way, so String.h can have an
objectively nicer name. :^)
This commit drastically changes how signals are handled.
In the case that an unblocked thread is signaled it works much
in the same way as previously. However, when a blocking syscall
is interrupted, we set up the signal trampoline on the user
stack, complete the blocking syscall, return down the kernel
stack and then jump to the handler. This means that from the
kernel stack's perspective, we only ever get one system call deep.
The signal trampoline has also been changed in order to properly
store the return value from system calls. This is necessary due
to the new way we exit from signaled system calls.
You can now munmap() a part of a region. The kernel will then create
one or two new regions around the "hole" and re-map them using the same
physical pages as before.
This goes towards fixing #175, but not all the way since we don't yet
do munmap() across multiple mappings.
It is now possible to unmount file systems from the VFS via `umount`.
It works via looking up the `fsid` of the filesystem from the `Inode`'s
metatdata so I'm not sure how fragile it is. It seems to work for now
though as something to get us going.
This patch adds the mprotect() syscall to allow changing the protection
flags for memory regions. We don't do any region splitting/merging yet,
so this only works on whole mmap() regions.
Added a "crash -r" flag to verify that we crash when you attempt to
write to read-only memory. :^)
This is not perfect as it uses a lot of VM, but since the buffers are
supposed to be temporary it's not super terrible.
This could be improved by giving back the unused VM to the kernel's
RangeAllocator after finishing the buffer building.
In the userspace, this mimics the Linux pipe2() syscall;
in the kernel, the Process::sys$pipe() now always accepts
a flags argument, the no-argument pipe() syscall is now a
userspace wrapper over pipe2().
It is now possible to mount ext2 `DiskDevice` devices under Serenity on
any folder in the root filesystem. Currently any user can do this with
any permissions. There's a fair amount of assumptions made here too,
that might not be too good, but can be worked on in the future. This is
a good start to allow more dynamic operation under the OS itself.
It is also currently impossible to unmount and such, and devices will
fail to mount in Linux as the FS 'needs to be cleaned'. I'll work on
getting `umount` done ASAP to rectify this (as well as working on less
assumption-making in the mount syscall. We don't want to just be able
to mount DiskDevices!). This could probably be fixed with some `-t`
flag or something similar.