Commit graph

1196 commits

Author SHA1 Message Date
Tom
e3190bd144 Revert "Kernel: Allocate shared memory regions immediately"
This reverts commit fe6b3f99d1.
2021-01-02 20:56:35 +01:00
Andreas Kling
fe6b3f99d1 Kernel: Allocate shared memory regions immediately
Lazily committed shared memory was not working in situations where one
process would write to the memory and another would only read from it.

Since the reading process would never cause a write fault in the shared
region, we'd never notice that the writing process had added real
physical pages to the VMObject. This happened because the lazily
committed pages were marked "present" in the page table.

This patch solves the issue by always allocating shared memory up front
and not trying to be clever about it.
2021-01-02 16:57:31 +01:00
Andreas Kling
5dae85afe7 Kernel: Pass "shared" flag to Region constructor
Before this change, we would sometimes map a region into the address
space with !is_shared(), and then moments later call set_shared(true).

I found this very confusing while debugging, so this patch makes us pass
the initial shared flag to the Region constructor, ensuring that it's in
the correct state by the time we first map the region.
2021-01-02 16:57:31 +01:00
Andreas Kling
9ec9d20e84 Kernel: Fix bad VMObject iteration in sys$purge()
We were fooling ourselves into thinking all VMObjects are anonymous and
then tried to call purge() on them as if they were.
2021-01-02 13:34:29 +01:00
Tom
e87eaf5df0 Kernel: Fix memory corruption when rolling back regions in execve
We need to free the regions before reverting the paging scope to the
original one when rolling back changes due to an error. This fixes
silent memory corruption.
2021-01-01 23:43:44 +01:00
Tom
2f429bd2d5 Kernel: Pass new region owner to Region::clone 2021-01-01 23:43:44 +01:00
Tom
bf9be3ec01 Kernel: More gracefully handle out-of-memory when creating PageDirectory 2021-01-01 23:43:44 +01:00
Tom
476f17b3f1 Kernel: Merge PurgeableVMObject into AnonymousVMObject
This implements memory commitments and lazy-allocation of committed
memory.
2021-01-01 23:43:44 +01:00
Tom
b2a52f6208 Kernel: Implement lazy committed page allocation
By designating a committed page pool we can guarantee to have physical
pages available for lazy allocation in mappings. However, when forking
we will overcommit. The assumption is that worst-case it's better for
the fork to die due to insufficient physical memory on COW access than
the parent that created the region. If a fork wants to ensure that all
memory is available (trigger a commit) then it can use madvise.

This also means that fork now can gracefully fail if we don't have
enough physical pages available.
2021-01-01 23:43:44 +01:00
Tom
e21cc4cff6 Kernel: Remove MAP_PURGEABLE from mmap
This brings mmap more in line with other operating systems. Prior to
this, it was impossible to request memory that was definitely committed,
instead MAP_PURGEABLE would provide a region that was not actually
purgeable, but also not fully committed, which meant that using such memory
still could cause crashes when the underlying pages could no longer be
allocated.

This fixes some random crashes in low-memory situations where non-volatile
memory is mapped (e.g. malloc, tls, Gfx::Bitmap, etc) but when a page in
these regions is first accessed, there is insufficient physical memory
available to commit a new page.
2021-01-01 23:43:44 +01:00
Tom
c3451899bc Kernel: Add MAP_NORESERVE support to mmap
Rather than lazily committing regions by default, we now commit
the entire region unless MAP_NORESERVE is specified.

This solves random crashes in low-memory situations where e.g. the
malloc heap allocated memory, but using pages that haven't been
used before triggers a crash when no more physical memory is available.

Use this flag to create large regions without actually committing
the backing memory. madvise() can be used to commit arbitrary areas
of such regions after creating them.
2021-01-01 23:43:44 +01:00
Tom
bc5d6992a4 Kernel: Memory purging improvements
This adds the ability for a Region to define volatile/nonvolatile
areas within mapped memory using madvise(). This also means that
memory purging takes into account all views of the PurgeableVMObject
and only purges memory that is not needed by all of them. When calling
madvise() to change an area to nonvolatile memory, return whether
memory from that area was purged. At that time also try to remap
all memory that is requested to be nonvolatile, and if insufficient
pages are available notify the caller of that fact.
2021-01-01 23:43:44 +01:00
Andreas Kling
7c3b6b10e4 Kernel: Remove the limited use of AK::TypeTraits we had in the kernel
This was only used for VMObject and we can do without it there. This is
preparation for migrating to dynamic_cast-based helpers in userspace.
2021-01-01 15:32:44 +01:00
Andrew Kaster
a3a9016701 DynamicLoader: Tell the linker to not add a PT_INTERP header
Use the GNU LD option --no-dynamic-linker. This allows uncommenting some
code in the Kernel that gets upset if your ELF interpreter has its own
interpreter.
2021-01-01 02:12:28 +01:00
Linus Groh
91332515a6 Kernel: Add sys$set_coredump_metadata() syscall
This can be used by applications to store information (key/value pairs)
likely useful for debugging, which will then be embedded in the coredump.
2020-12-30 16:28:27 +01:00
Andreas Kling
af28a8ad11 Kernel: Hold InodeVMObject reference while inspecting it in sys$mmap() 2020-12-29 15:43:35 +01:00
Andreas Kling
30dbe9c78a Kernel+LibC: Add a very limited sys$mremap() implementation
This syscall can currently only remap a shared file-backed mapping into
a private file-backed mapping.
2020-12-29 02:20:43 +01:00
Liav A
247517cd4a Kernel: Introduce the DevFS
The DevFS along with DevPtsFS give a complete solution for populating
device nodes in /dev. The main purpose of DevFS is to eliminate the
need of device nodes generation when building the system.

Later on, DevFS will assist with exposing disk partition nodes.
2020-12-27 23:07:44 +01:00
Andreas Kling
0e2b7f9c9a Kernel: Remove the per-process icon_id and sys$set_process_icon()
This was a goofy kernel API where you could assign an icon_id (int) to
a process which referred to a global shbuf with a 16x16 icon bitmap
inside it.

Instead of this, programs that want to display a process icon now
retrieve it from the process executable instead.
2020-12-27 01:16:56 +01:00
AnotherTest
7b5aa06702 Kernel: Allow 'elevating' unveil permissions if implicitly inherited from '/'
This can happen when an unveil follows another with a path that is a
sub-path of the other one:
```c++
unveil("/home/anon/.config/whoa.ini", "rw");
unveil("/home/anon", "r"); // this would fail, as "/home/anon" inherits
                           // the permissions of "/", which is None.
```
2020-12-26 16:10:04 +01:00
AnotherTest
a9184fcb76 Kernel: Implement unveil() as a prefix-tree
Fixes #4530.
2020-12-26 11:54:54 +01:00
Andreas Kling
1cfdaf96c4 Kernel: Reset the process dumpable flag on successful non-setid exec
Once we've committed to a new memory layout and non-setid credentials,
we can reset the dumpable flag.
2020-12-26 01:31:24 +01:00
Andreas Kling
82f86e35d6 Kernel+LibC: Introduce a "dumpable" flag for processes
This new flag controls two things:
- Whether the kernel will generate core dumps for the process
- Whether the EUID:EGID should own the process's files in /proc

Processes are automatically made non-dumpable when their EUID or EGID is
changed, either via syscalls that specifically modify those ID's, or via
sys$execve(), when a set-uid or set-gid program is executed.

A process can change its own dumpable flag at any time by calling the
new sys$prctl(PR_SET_DUMPABLE) syscall.

Fixes #4504.
2020-12-25 19:35:55 +01:00
Andreas Kling
ed5c26d698 AK: Remove custom %w format string specifier
This was a non-standard specifier alias for %04x. This patch replaces
all uses of it with new-style formatting functions instead.
2020-12-25 17:05:05 +01:00
Andreas Kling
89d3b09638 Kernel: Allocate new main thread stack before committing to exec
If the allocation fails (e.g ENOMEM) we want to simply return an error
from sys$execve() and continue executing the current executable.

This patch also moves make_userspace_stack_for_main_thread() out of the
Thread class since it had nothing in particular to do with Thread.
2020-12-25 16:22:01 +01:00
Andreas Kling
2f1712cc29 Kernel: Move ELF auxiliary vector building out of Process class
Process had a couple of members whose only purpose was holding on to
some temporary data while building the auxiliary vector. Remove those
members and move the vector building to a free function in execve.cpp
2020-12-25 15:23:35 +01:00
Andreas Kling
40e9edd798 LibELF: Move AuxiliaryValue into the ELF namespace 2020-12-25 14:48:30 +01:00
Andreas Kling
6c9a6bea1e Kernel+LibELF: Abort ELF executable load sooner when something fails
Make it possible to bail out of ELF::Image::for_each_program_header()
and then do exactly that if something goes wrong during executable
loading in the kernel.

Also make the errors we return slightly more nuanced than just ENOEXEC.
2020-12-25 14:42:42 +01:00
Andreas Kling
791b32e3c6 Kernel: Remove an unnecessary cast in sys$execve() 2020-12-25 14:16:35 +01:00
Andreas Kling
9c640e67ac Kernel: Don't fetch full inode metadata in sys$execve()
We only need the size, so let's not fetch all the metadata.
2020-12-25 14:15:33 +01:00
Andreas Kling
c3eddbcb49 Kernel: Add back missing ELF::Image validity check
If the image is not a valid ELF we should just fail ASAP.
2020-12-25 14:13:44 +01:00
Andreas Kling
4986f268a5 Kernel: Convert dbg() => dbgln() in sys$execve() 2020-12-25 12:51:35 +01:00
Andreas Kling
09129782de Kernel: Simplify ELF loading logic in sys$execve() somewhat
Get rid of the lambda functions and put the logic inline in the program
header traversal loop instead. This makes the code quite a bit shorter
and hopefully makes it easier to see what's going on.
2020-12-25 02:33:57 +01:00
Andreas Kling
1e4c010643 LibELF: Remove ELF::Loader and move everyone to ELF::Image
This commit gets rid of ELF::Loader entirely since its very ambiguous
purpose was actually to load executables for the kernel, and that is
now handled by the kernel itself.

This patch includes some drive-by cleanup in LibDebug and CrashDaemon
enabled by the fact that we no longer need to keep the ref-counted
ELF::Loader around.
2020-12-25 02:14:56 +01:00
Andreas Kling
7551a66f73 Kernel+LibELF: Move sys$execve()'s loading logic from LibELF to Kernel
It was really weird that ELF loading was performed by the ELF::Loader
class instead of just being done by the kernel itself. This patch moves
all the layout logic from ELF::Loader over to sys$execve().

The kernel no longer cares about ELF::Loader and instead only uses an
ELF::Image as an interpreting wrapper around executables.
2020-12-25 01:22:55 +01:00
Itamar
0cb636078a Kernel+LibELF: Allow Non ET_DYN executables to have an interpreter 2020-12-24 21:34:51 +01:00
Itamar
d64d0451e5 Kernel: Fix mmap with specific address for file backed mappings 2020-12-24 21:34:51 +01:00
Andreas Kling
1e21d49e86 Kernel: Fix wrong-looking overflow check in sys$execve()
This was harmless since sizeof(length) and sizeof(strings) are both 4
on x86 but let's check the right things regardless.
2020-12-23 20:34:22 +01:00
Andreas Kling
6bfbc5f5f5 Kernel: Don't allow modifying IOPL via sys$ptrace() or sys$sigreturn()
It was possible to overwrite the entire EFLAGS register since we didn't
do any masking in the ptrace and sigreturn syscalls.

This made it trivial to gain IO privileges by raising IOPL to 3 and
then you could talk to hardware to do all kinds of nasty things.

Thanks to @allesctf for finding these issues! :^)

Their exploit/write-up: https://github.com/allesctf/writeups/blob/master/2020/hxpctf/wisdom2/writeup.md
2020-12-22 19:38:25 +01:00
Andreas Kling
2dfe5751f3 Kernel: Abort core dump generation if any substep fails
And make an effort to propagate errors out from the inner parts.
This fixes an issue where the kernel would infinitely loop in coredump
generation if the TmpFS filled up.
2020-12-22 10:09:41 +01:00
Tom
5f51d85184 Kernel: Improve time keeping and dramatically reduce interrupt load
This implements a number of changes related to time:
* If a HPET is present, it is now used only as a system timer, unless
  the Local APIC timer is used (in which case the HPET timer will not
  trigger any interrupts at all).
* If a HPET is present, the current time can now be as accurate as the
  chip can be, independently from the system timer. We now query the
  HPET main counter for the current time in CPU #0's system timer
  interrupt, and use that as a base line. If a high precision time is
  queried, that base line is used in combination with quering the HPET
  timer directly, which should give a much more accurate time stamp at
  the expense of more overhead. For faster time stamps, the more coarse
  value based on the last interrupt will be returned. This also means
  that any missed interrupts should not cause the time to drift.
* The default system interrupt rate is reduced to about 250 per second.
* Fix calculation of Thread CPU usage by using the amount of ticks they
  used rather than the number of times a context switch happened.
* Implement CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE and use it
  for most cases where precise timestamps are not needed.
2020-12-21 18:26:12 +01:00
Lenny Maiorani
765936ebae
Everywhere: Switch from (void) to [[maybe_unused]] (#4473)
Problem:
- `(void)` simply casts the expression to void. This is understood to
  indicate that it is ignored, but this is really a compiler trick to
  get the compiler to not generate a warning.

Solution:
- Use the `[[maybe_unused]]` attribute to indicate the value is unused.

Note:
- Functions taking a `(void)` argument list have also been changed to
  `()` because this is not needed and shows up in the same grep
  command.
2020-12-21 00:09:48 +01:00
Andreas Kling
34e9df3c5e Kernel: Randomize memory location of the dynamic loader :^)
This should make it a little bit harder for those who would mess with
our loader.
2020-12-20 18:49:24 +01:00
Andreas Kling
02ef3f6343 Kernel: Ptrace should not assert on poke in non-mapped tracee memory 2020-12-20 18:49:24 +01:00
Andreas Kling
9bf02c32c0 Kernel: Activate SUID/SGID credentials earlier in sys$execve()
Switch on the new credentials before loading the new executable into
memory. This ensures that attempts to ptrace() the program from an
unprivileged process will fail.

This covers one bug that was exploited in the 2020 HXP CTF:
https://hxp.io/blog/79/hxp-CTF-2020-wisdom2/

Thanks to yyyyyyy for finding the bug! :^)
2020-12-20 18:49:18 +01:00
Andreas Kling
5505159a94 Kernel: Silence debug spam about select() being interrupted 2020-12-20 16:06:52 +01:00
Andreas Kling
e5eda151b4 Kernel: Silence debug spam when running dynamically linked programs 2020-12-20 16:06:39 +01:00
Andreas Kling
8e79bde2b7 Kernel: Move KBufferBuilder to the fallible KBuffer API
KBufferBuilder::build() now returns an OwnPtr<KBuffer> and can fail.
Clients of the API have been updated to handle that situation.
2020-12-18 19:22:26 +01:00
Tom
c4176b0da1 Kernel: Fix Lock race causing infinite spinning between two threads
We need to account for how many shared lock instances the current
thread owns, so that we can properly release such references when
yielding execution.

We also need to release the process lock when donating.
2020-12-16 23:38:17 +01:00
Andreas Kling
4befc2c282 Kernel: Avoid null dereference in sys$profiling_disable()
If we can't create a profiling coredump object, we shouldn't try to
call write() on it.
2020-12-15 11:25:51 +01:00