We were marking the execing thread as Runnable near the end of
Process::do_exec().
This was necessary for exec in processes that had never been scheduled
yet, which is a specific edge case that only applies to the very first
userspace process (normally SystemServer). At this point, such threads
are in the Invalid state.
In the common case (normal userspace-initiated exec), making the current
thread Runnable meant that we switched away from its current state:
Running. As the thread is indeed running, that's a bogus change!
This created a short time window in which the thread state was bogus,
and any attempt to block the thread would panic the kernel (due to a
bogus thread state in Thread::block() leading to VERIFY_NOT_REACHED().)
Fix this by not touching the thread state in Process::do_exec()
and instead make the first userspace thread Runnable directly after
calling Process::exec() on it in try_create_userspace_process().
It's unfortunate that exec() can be called both on the current thread,
and on a new thread that has never been scheduled. It would be good to
not have the latter edge case, but fixing that will require larger
architectural changes outside the scope of this fix.
This makes sure DeviceManagement::try_create_device will call the
static factory function (if available) instead of directly calling the
constructor, which will allow us to move OOM-fallible calls out of
Device constructors.
If we panic the kernel for a storage-related reason, we might as well be
helpful and print out a list of detected storage devices and their
partitions to help with debugging.
Reasons for such a panic include:
- No boot device with the given name found
- No boot device with the given UUID found
- Failing to open the root filesystem after determining a boot device
Before attempting to remove the device while handling an AHCI port
interrupt, check if m_connected_device is even non-null.
This happened during my bare metal run and caused a kernel panic.
A number of crashes in this `VERIFY_NOT_REACHED` case have been
reported on discord. Lets add some tracing to gather more information
and help diagnose what is the cause of these crashes.
There are many assumptions in the stack that argc is not zero, and
argv[0] points to a valid string. The recent pwnkit exploit on Linux
was able to exploit this assumption in the `pkexec` utility
(a SUID-root binary) to escalate from any user to root.
By convention `execve(..)` should always be called with at least one
valid argument, so lets enforce that semantic to harden the system
against vulnerabilities like pwnkit.
Reference: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
This allows us to enable Write-Combine on e.g. framebuffers,
significantly improving performance on bare metal.
To keep things simple we right now only use one of up to three bits
(bit 7 in the PTE), which maps to the PA4 entry in the PAT MSR, which
we set to the Write-Combine mode on each CPU at boot time.
InodeWatcher::register_inode was already partially fallible, but the
insertion of the inodes and watch descriptions into their respective
hash maps was not. Note that we cannot simply TRY the insertion into
both, as that could result in an inconsistent state, instead we must
remove the inode from the inode hash map if the insertion into the
watch description hash map failed.
There's no need to perform it this early, and until the MemoryManager
is initialized we have very limited kmalloc capacity, so let's try and
keep anything that's not required to be there out of there.
There was a bug while calculating the next index in submit_sync_sqe
function. Use the NVMeQueue's class variable m_qdepth instead of the
hardcoded IO_QUEUE_SIZE.
Before this commit all consume_until overloads aside from the Predicate
one would consume (and ignore) the stop char/string, while the
Predicate overload would not, in order to keep behaviour consistent,
the other overloads no longer consume the stop char/string as well.
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
If there's nobody listening for the crash signal, fall back to the
normal crash path where we get some debug output about what happened.
Thanks to Idan for suggesting the fix.
Before this change, our dynamic linker's global constructor handler
relied on the GNU linker implicitly including the content of `.ctors`
section inside `.init_array`. The mold linker does not do this, so
global constructors would fail to be called in the mold-built userland.
There is no point in sticking to `.ctors`, as most other systems already
use the superior `.init_array` scheme. This commit changes the kernel
linker script to not discard this new section, and enables it by default
in our toolchain.
We now have a function to install a (currently default) vector
table, meaning that any exceptions (or interrupts for that matter)
will be caught by the processor and routed to one of the vectors
inside the table.
Instead, try to allocate the DMA buffer before trying to construct the
NVMeQueue. This allows us to fail early if we can't allocate the DMA
buffer before allocating and creating the heavier NVMeQueue object.