There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.
Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".
This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
This new syscall will be used by the upcoming runc (run-container)
utility.
In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.
Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
simplify the idea of being jailed to a simple ProtectedValues boolean
flag. This means that we no longer check of matching jail pointers
anywhere in the Kernel code.
To set a process as jailed, a new prctl option was added to set a
Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
A process can either iterate on the global process list, or if it's
attached to a scoped process list, then only over that list.
This essentially replaces the need of checking the Jail pointer of a
process when iterating over process lists.
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.
For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.
The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.
To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.
We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
AnonymousVMObject::try_clone() computed how many shared cow pages to
commit by counting all VMObject pages that were not shared_zero_pages.
This means that lazy_committed_pages were also being included in the
count. This is a problem because the page fault handling code for
lazy_committed_pages does not allocate from
m_shared_committed_cow_pages. So more pages than necessary were being
committed.
This fixes this overcommitting problem by skipping lazy_committed_pages
when counting how many pages to commit.
This commit introduces VMObject::remap_regions_single_page(). This
method remaps a single page in all regions associated with a VMObject.
This is intended to be a more efficient replacement for remap_regions()
in cases where only a single page needs to be remapped.
This commit also updates the cow page fault handling code to use this
new method.
Writes to a MAP_SHARED | MAP_ANONYMOUS mmap region were not visible to
other processes sharing the mmap region. This was happening because the
page fault handler was not remapping the VMObject's m_regions after
allocating a new page.
This commit fixes the problem by calling remap_regions() after assigning
a new page to the VMObject in the page fault handler. This remapping
only occurs for shared Regions.
This commit makes the following minor changes to handle_zero_fault():
* cleans up a call to static_cast(), replacing it with a reference (a
future commit will also use this reference).
* replaces a call to vmobject() with the new reference mentioned above.
* moves the definition of already_handled to inside the block where
already_handled is used.
After a fork(), page faults on anonymous mmaps can cause a redundant
page fault to occur.
This happens because VMObjects for anonymous mmaps are initially filled
with references to the lazy_committed_page or shared_zero_page. If there
is a fork, VMObject::try_clone() is called and all pages of the VMObject
are marked as cow (via the m_cow_map).
Page faults on a zero/lazy page are handled by handle_zero_fault().
handle_zero_fault() does not update m_cow_map, so if the page was marked
cow before the fault, it will still be marked cow after the fault. This
causes a second (redundant) page fault when the CPU retries the write.
This commit removes the redundant page fault by not marking zero/lazy
pages as cow in m_cow_map.
AddressSpace::try_allocate_split_region() was updating the cow map of
new_region based on the cow map of source_region.
The problem is that both new_region and source_region reference the
same vmobject and the same cow map, so these cow map updates didn't
actually change anything.
This commit:
* removes the cow map updates from try_allocate_split_region()
* removes Region::set_should_cow() since it is no longer used
InodeVMObjects now track dirty and clean pages. This tracking of
dirty and clean pages is used by the msync and purge syscalls.
dirty page tracking works using the following rules:
* when a new InodeVMObject is made, all pages are marked clean.
* writes to clean InodeVMObject pages will cause a page fault,
the fault handler will mark the page as dirty.
* writes to dirty InodeVMObject pages do not cause page faults.
* if msync is called, only dirty pages are flushed to storage (and
marked clean).
* if purge syscall is called, only clean pages are discarded.
This device was a short-lived solution to allow userspace (WindowServer)
to easily support hotplugging mouse devices with presumably very small
modifications on userspace side.
Now that we have a proper mechanism to propagate hotplug events from the
DeviceMapper program to any program that needs to get such events, we no
longer need this device, so let's remove it.
After the previous commit, we are able to create a comprehensive list of
all devices' major number allocations.
To help userspace to distinguish between character and block devices, we
expose 2 sysfs nodes so userspace can decide which list it needs to open
in order to iterate on it.
We used to allocate major numbers quite randomly, with no common place
to look them up if needed.
This commit is changing that by placing all major number allocations
under a new C++ namespace, in the API/MajorNumberAllocation.h file.
We also add the foundations of what is needed before we can publish this
information (allocated numbers for block and char devices) to userspace.
Instead of putting everything in one hash map, let's distinguish between
the devices based on their type.
This change makes the devices semantically separated, and is considered
a preparation before we could expose a comprehensive list of allocations
per major numbers and their purpose.
There is simply no advantage with putting both virtual console devices
and serial TTY devices on the same major number.
Putting them on separate major numbers greatly simplifies the allocation
mechanism on the DeviceMapper code, because it no longer needs to
calculate offsets of minor numbers, and should start from number 0 to
theoretically infinite amount of device nodes.
We don't really need it, and the entire functionality can be organically
intergrated into the VirtualConsole class, to switch between the Virtual
consoles, and manage initialization of all consoles in the global array.
There are a few instances where comments and documentation have minor
grammar issues likely resulting from English being the author's second
language.
This PR fixes several such cases, changing to idiomatic English and
resolving where it is unclear whether the user or program/code is
being referred to.
The methods try_create_with_size() and try_create_purgeable_with_size()
on AnonymousVMObject are almost identical, other than one member
that gets set (m_purgeable). This patch makes
try_create_purgeable_with_size() call try_create_with_size() so that
both methods re-use the same code.
This is done by using a FixedStringBuffer as the foundation to perform
string formatting, which ensures that we avoid memory allocations in
the prekernel stage.
These methods do basically nothing right now, because we don't allocate
memory in the prekernel stage.
It's only here for a later commit when we bring up assertion formatting
and printing.
With this change, we no longer preallocate blocks when an inode's size
is updated, and instead only allocate the minimum amount of blocks when
the inode is actually written to.
This is a large commit, since this is essentially a complete rewrite of
the key low-level functions that handle reading/writing blocks. This is,
however, a necessary prerequisite of being able to write holes.
The previous version of `flush_block_list()` (along with its numerous
helper functions) was entirely reliant on all blocks being sequential.
In contrast to the previous implementation, the new version
of `flush_block_list()` simply writes out the difference between the old
block list and the new block list by calculating the correct indirect
block(s) to update based on the relevant block's logical index.
`compute_block_list()` has also been rewritten, since the estimated
amount of meta blocks was incorrectly calculated for files with holes as
a result of the estimated amount of blocks being a function of the file
size. Since it isn't possible to accurately compute the shape of the
block list without traversing it, we no longer try to perform such a
computation, and instead simply search through all of the allocated
indirect blocks.
`compute_block_list_with_meta_blocks()` has also been removed in favor
of the new `compute_meta_blocks()`, since meta blocks are fundamentally
distinct from data blocks due to there being no mapping between any
logical block index and the physical block index.
Since we now only store blocks that are actually allocated, it is
entirely valid for the block list to be empty, so this commit lifts the
restrictions on accessing inodes with an empty block list.
The methods try_release_clean_pages() and release_all_clean_pages() in
InodeVMObject are almost identical. This commit makes them both use the
same code path.
In the VMObject code there are multiple examples of loops over
the VMObject's regions (using for_each_region()) that call remap()
on each region.
To clean up usage of this pattern, this patch adds a method in
VMObject that does this remapping loop. VMObject code that needs
to remap its regions call the new method.
It can be possible for a request to be blocked on another request, so
this patch allows us to send more requests even when a request is
already pending.
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.
When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
This means that userspace breakpoint traps no longer panic the kernel.
This also causes us to no longer panic on failed assertions in userspace
when using gcc, as gcc compiles __builtin_trap to breakpoint
instructions on RISC-V.
I did a mistake and set the kernel_physical_base value to be just on
the actual linked kernel ELF start offset, while this value should
represent together with KERNEL_MAPPING_BASE the actual higher-half load
address.
By changing this value, we resolve a bug in which disabling KASLR
doesn't work and will cause the prekernel to hang on this statement:
```c++
VERIFY(kernel_load_base >= kernel_mapping_base + 0x200000);
```
With this change, ".*make.*" function family now does error checking
earlier, which improves experience while using clangd. Note that the
change also make them instantiate classes a bit more eagerly, so in
LibVideo/PlaybackManager, we have to first define SeekingStateHandler
and only then make() it.
Co-Authored-By: stelar7 <dudedbz@gmail.com>
As MMIO is placed at fixed physical addressed, and does not need to be
backed by real RAM physical pages, there's no need to use PhysicalPage
instances to track their pages.
This results in slightly reduced allocations, but more importantly
makes MMIO addresses which end up after the normal RAM ranges work,
like 64-bit PCI BARs usually are.
Prepare to remove biglock on PCI::Access in a future commit, so we can
ensure we only lock a spinlock on a precise PCI HostController if needed
instead of the entire subsystem.
We never used these virtual methods outside their own implementation,
so let's stop pretending that we should be able to utilize this for
unknown purpose.
The new baked image is a Prekernel and a Kernel baked together now, so
essentially we no longer need to pass the Prekernel as -kernel and the
actual kernel image as -initrd to QEMU, leaving the option to pass an
actual initrd or initramfs module later on with multiboot.
Before of this change, actually setting the m_access to contain the
HasBeen{Readeable,Writable,Executable} bits was done by the method of
Region set_access_bit which added ORing with (access << 4) when enabling
a certain access bit to achieve this.
Now this is changed and when calling set_{readeable,writable,executable}
methods, they will set an appropriate SetOnce flag that could be checked
later.
We add a prctl option which would be called once after the dynamic
loader has finished to do text relocations before calling the actual
program entry point.
This change makes it much more obvious when we are allowed to change
a region protection access from being writable to executable.
The dynamic loader should be able to do this, but after a certain point
it is obvious that such mechanism should be disabled.
This flag is set only once, and should never reset once it has been set,
making it an ideal SetOnce use-case.
It also simplifies the expected conditions for the enabling prctl call,
as we don't expect a boolean flag, but rather the specific prctl option
will always set (enable) Process' AddressSpace syscall region enforcing.
Nobody uses this functionality. I used this code on my old 2007 ICH7
test machine about a year ago, but bare metal is a small aspect of the
project, so it's safe to assume that nobody really tests this piece of
code.
Therefore, let's drop this for good and focus on more modern hardware.
The following command was used to clang-format these files:
clang-format-18 -i $(find . \
-not \( -path "./\.*" -prune \) \
-not \( -path "./Base/*" -prune \) \
-not \( -path "./Build/*" -prune \) \
-not \( -path "./Toolchain/*" -prune \) \
-not \( -path "./Ports/*" -prune \) \
-type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There was a recent release of clang-format version 18.1.5 which fixes
errant spaces around `->` in these files.
LLVM 18 otherwise throws errors, as we use '-mgeneral-regs-only' in the
kernel.
The functions had to be moved into a .S, as there is no
'-mno-general-regs-only' and also no nice way to remove
'-mgeneral-regs-only' for a single .cpp file.
Since CLINT interrupts are wired directly into the hart, instead of
going through an interrupt controller (the PLIC), trying to handle them
through the normal numbered-interrupt mechanism will just complicate it
for no reason.
Instead we now handle them directly in the trap handler.
The USB::Pipe is abstracted from the actual USB host controller
implementation, so don't include the UHCIController.h file.
Also, we missed an include to UserOrKernelBuffer.h, so this is added to
ensure the code can still compile.
We have many places in the kernel code that we have boolean flags that
are only set once, and never reset again but are checked multiple times
before and after the time they're being set, which matches the purpose
of the SetOnce class.
"register asm" variables don't preserve the register value, so the call
to calculate_physical_to_link_time_address_offset in the asm input
operands is allowed to clobber a0.
We were reading the value instead of setting it (as required by the
specification). This worked only when we booted with a bootloader which
initialized NVMe before us.
The default type for integer literals is signed int, so we were
accidentally smearing those bits to the upper 32 bit of the result.
This resulted in extremely unreasonable timeouts.
We were accidentally doing a 16-bit read instead of an 8-bit read,
meaning we would also read the 'CACHE_LINE_SIZE' field immediately
following it, and never actually continue.
The following command was used to clang-format these files:
clang-format-18 -i $(find . \
-not \( -path "./\.*" -prune \) \
-not \( -path "./Base/*" -prune \) \
-not \( -path "./Build/*" -prune \) \
-not \( -path "./Toolchain/*" -prune \) \
-not \( -path "./Ports/*" -prune \) \
-type f -name "*.cpp" -o -name "*.mm" -o -name "*.h")
There are a couple of weird cases where clang-format now thinks that a
pointer access in an initializer list, e.g. `m_member(ptr->foo)`, is a
lambda return statement, and it puts spaces around the `->`.
This simply reads the current cycle count from the cycle CSR.
x86-64 uses the similar rdtsc instruction here, which also may or may
not tick at a constant rate.
This is a large commit because it implements a lot of stuff to make
add_child simpler to get working. This allows us to create new files
on a FAT partition.
This is the first part of write support, it allows for full file
modification, but no creating or removing files yet.
Co-Authored-By: implicitfield <114500360+implicitfield@users.noreply.github.com>
Caching the cluster list allows us to fill the two fields in the
InodeMetadata. While at it, don't cache the metadata as when we
have write support having to keep both InodeMetadata and FATEntry
correct is going to get very annoying.
dbgln() will always take its arguments by reference when possible, which
causes UB when dealing with packed structs. To avoid this, we now
explicitly copy all members whose alignment requirements aren't met.
This removes the allocate_tls syscall and adds an archctl option to set
the fs_base for the current thread on x86-64, since you can't set that
register from userspace. enter_thread_context loads the fs_base for the
next thread on each context switch.
This also moves tpidr_el0 (the thread pointer register on AArch64) to
the register state, so it gets properly saved/restored on context
switches.
The userspace TLS allocation code is kept pretty similar to the original
kernel TLS code, aside from a couple of style changes.
We also have to add a new argument "tls_pointer" to
SC_create_thread_params, as we otherwise can't prevent race conditions
between setting the thread pointer register and signal handling code
that might be triggered before the thread pointer was set, which could
use TLS.
If there's no fds to copy in a message with proper space for an
SCM_RIGHTS set of cmsg headers, then don't try to copy them.
This avoids a Kernel panic when recvmsg-ing, as copy_to_user(p, 0, 0)
hits a VERIFY.
These changes are compatible with clang-format 16 and will be mandatory
when we eventually bump clang-format version. So, since there are no
real downsides, let's commit them now.
Small position independent code model (which we end up using after this
change) is suitable for us since the kernel is not expected to grow more
than 2Gb in size. This might be a bit risky since this model is not
mentioned anywhere except for System V ABI document but experiments show
that the kernel compiled with this change works just fine.
This otherwise caused a race condition between the signal dispatcher
(which sets sepc to the signal trampoline) and sepc being updated in the
trap handler.
We obviously have to keep the sepc set by the signal dispatcher and not
increment it afterwards.
While this clutters Process.cpp a tiny bit, I feel that it's worth it:
- 2x speed on the kcov_loop benchmark. Likely more during fuzzing.
- Overall code complexity is going down with this change.
- By reducing the code reachable from __sanitizer_cov_trace_pc code,
we can now instrument more code.
Sticking this to the function source has multiple benefits:
- We instrument more code, by not excluding entire files.
- NO_SANITIZE_COVERAGE can be used in Header files.
- Keeping the info with the source code, means if a function or
file is moved around, the NO_SANITIZE_COVERAGE moves with it.
This reverts commit 9dbec601b0.
For KCOV to be performant (or at least not even slower) we need to
mmap the PC buffer from both user and kernel space at the same time.
You can't mmap a character device, so this change didn't make sense.
Plus even if we did invent a new method to exfiltrate the coverage
information out of the kernel, it would be incompatible with existing
kernel fuzzers. That would be kind of annoying. 🙃
This simple delay loop uses the time CSR to wait for the given amount
of time. The tick frequency of the CSR is read from the
/cpus/timebase-frequency devicetree property.
Instead, rewrite the region page fault handling code to not use
PageFault::type() on RISC-V.
I split Region::handle_fault into having a RISC-V-specific
implementation, as I am not sure if I cover all page fault handling edge
cases by solely relying on MM's own region metadata.
We should probably also take the processor-provided page fault reason
into account, if we decide to merge these two implementations in the
future.
This commit also removes the unnecessary user_sp RegisterState member.
We never use the kernel stack pointer on entry, so we can simply always
store the stack pointer of the previous privilege mode in sp.
Also remove the sp member from mcontext, as RISC-V doesn't have a
dedicated stack pointer register.
sp is defined to be x2 (x[1] in our case) by the ABI.
I probably accidentally included sp while copying the struct from
aarch64.
sepc has to be incremented before the call to syscall_handler,
as we otherwise would return to the ecall instruction, resulting in an
infinite trap loop.
We can't increment it after syscall_handler, as sepc might get changed
while handling the syscall.
The value of this field is incremented by one, as a value of 0 for this
field means 1 entry supported.
A value of 0xffff for CAP.MQES would incorrectly by truncated to 0x0000,
if we don't increase the bit width of the return type.
RFC9293 states that from the TimeWait state the TCPSocket
should wait the MSL (2mins) for delayed segments to expire
so that their sequence numbers do not clash with a new
connection's sequence numbers using the same ip address
and port number. The wait also ensures the remote TCP peer
has received the ACK to their FIN segment.
Please be aware that I only have NIC with chip version 6 so
this is the only one that I have tested. Rest was implemented
via looking at Linux rtl8169 driver. Also thanks to IdanHo
for some initial work.
Either we mount from a loop device or other source, the user might want
to obfuscate the given source for security reasons, so this option will
ensure this will happen.
If passed during a mount, the source will be hidden when reading from
the /sys/kernel/df node.
Similarly to DevPtsFS, this filesystem is about exposing loop device
nodes easily in /dev/loop, so userspace doesn't need to do anything in
order to use new devices immediately.
Instead of returning a raw pointer, which could be technically invalid
when using it in the caller function, we return a valid RefPtr of such
device.
This ensures that the code in DevPtsFS is now safe from a rare race
condition in which the SlavePTY device is gone but we still have a
pointer to it.
This device is a block device that allows a user to effectively treat an
Inode as a block device.
The static construction method is given an OpenFileDescription reference
but validates that:
- The description has a valid custody (so it's not some arbitrary file).
Failing this requirement will yield EINVAL.
- The description custody points to an Inode which is a regular file, as
we only support (seekable) regular files. Failing this requirement
will yield ENOTSUP.
LoopDevice can be used to mount a regular file on the filesystem like
other supported types of (physical) block devices.
Instead of using C-arrays, and manually counting their lengths, use
AK::Array. And pass these arrays around as spans, instead of as pointer-
and-length pairs.
Such operation is almost equivalent to writing on an Inode, so lock the
Inode m_inode_lock exclusively.
All FileSystem Inode implementations then override a new method called
truncate_locked which should implement the actual truncating.
thread_context_first_enter reuses the context restoring code in the
trap handler, just like other arches already do.
The `ld x2, 1*8(sp)` is unnecessary in the trap handler, as the stack
pointer should be equal to the stack pointer slot in the RegisterState
if the trap is from supervisor mode (and we currently don't support
user traps).
This load will however make us unable to reuse that code for
thread_context_first_enter.
This commit adds two functions which save/restore the entire FPU state.
On RISC-V, you only need to save the floating pointer registers
themselves and the fcsr CSR, which contains the entire state of the F/D
extensions.
Some real hardware apparently uses smaller BAR sizes than sizeof(HBA)
with a completely filled port_regs member.
Change the port_regs array to a flexible array member, so we don't panic
while verifying that the BAR size is large enough to map this struct.
Accesses to this array are already bounds checked against
AHCI::Limits::MaxPorts.
Allowing creation of StorageDevicePartition objects for any arbitrary
BlockDevice objects means that we could technically create a
StorageDevicePartition for another StorageDevicePartition which is
obviously not the intention for this code. Instead, require to pass a
StorageDevice reference to ensure this cannot happen.
It is expected that these class members will be set when the object is
created (so they're set in the class constructor method) and never
change again, as its the driver responsibility to find these values
before creating a StorageDevice object.
This makes it easier to rely on these values later on as we don't expect
them to ever change for a StorageDevice object during its lifetime.
It calculated the disk size with the zero-based max addressable block
value.
For example, for a disk device with a block size of 512 bytes that has 2
LBAs so it can address LBA 0 and LBA 1 (so m_max_addressable_block is 1)
the calculated disk size will be 512 instead of 1024 bytes.
We remove can_read() and can_write(), as both of these methods should be
implemented for proper blocking support.
For our case, the previous code will simply block the user if they tries
to read beyond the max addressable offset, which is not a correct
behavior.
Instead, just do proper EOF guarding when calling read() and write() on
such objects.
Add a method for matehmatical operations when verifying IO operation
boundaries.
Also, make max_addressable_block method non-virtual, since no other
derived class actually has ever overrided this method.
RFC9293 states that a closed socket should reply to all non-RST
packets with an RST. This change implements this behaviour as
specified in section 3.5.2 in bullet point 1.
The former automatically adapts the prefix to binary and octal
output, and is what we already use in the majority of cases.
Patch generated by:
rg -l '0x\{' | xargs sed -i '' -e 's/0x{:/{:#/'
I ran it 4 times (until it stopped changing things) since each
invocation only converted one instance per line.
No behavior change.
According to the Intel Software Developer's Manual Volume 3A section
6.12.1.3, the interrupt gate type means the IF flag is cleared to
prevent nested interruption. The trap gate type does not modify the
IF flag. Thus the gate type argument is important when constructing
an IDTEntry.
On x86_64 and x86, storage_segment (bit 12 counting from 0)
is always 0 according to the Intel Software Developer's Manual,
volume 3A, section 6.11 and section 6.14.1. It has therefore
been removed as a parameter from IDTEntry's constructor and
hardwired to 0.
In a TTY's non-canonical mode, data availability can be configured by
setting VMIN and VTIME to determine the minimum amount of bytes to read
and the timeout between bytes, respectively. Some ports (such as SRB2)
set VMIN to 0 which effectively makes reading a TTY such as stdin a
non-blocking read. We didn't support this, causing ports to hang as soon
as they try to read stdin without any data available.
Add a very duct-tapey implementation for the case where VMIN == 0 by
overwriting the TTY's description's blocking status; 3 FIXMEs are
included to make sure we clean this up some day.
This makes it possible to use MakeIndexSequqnce in functions like:
template<typename T, size_t N>
constexpr auto foo(T (&a)[N])
This means AK/StdLibExtraDetails.h must now include AK/Types.h
for size_t, which means AK/Types.h can no longer include
AK/StdLibExtras.h (which arguably it shouldn't do anyways),
which requires rejiggering some things.
(IMHO Types.h shouldn't use AK::Details metaprogramming at all.
FlatPtr doesn't necessarily have to use Conditional<> and ssize_t could
maybe be in its own header or something. But since it's tangential to
this PR, going with the tried and true "lift things that cause the
cycle up to the top" approach.)
This easily led to kernel deadlocks if the stopped thread held an
important global mutex (like the disk cache lock) while blocking.
Resolve this by ensuring stopped threads have a chance to return to the
userland boundary before actually stopping.
Locking a mutex while holding a spinlock is always wrong, but in the
case of the scheduler lock, it also causes an assertion failure. (Which
would be triggered by 2 separate threads trying to ptrace at the same
time).
This helps ensure no one accidentally accesses m_requests without first
locking it's spinlock. In fact this change fixed such a case, since
process_cq() implicitly assumed the caller locked the lock, which was
not the case for NVMePollQueue::submit_sqe().
Due to an incorrect lambda scope capture declaration, we would copy the
result status at the start of the function, before it actually got
updated with the final status. Capture it by reference instead to
ensure we report the updated result.
Instead of assuming data races won't occur and trying to somehow verify
it with manual un-atomic tracking, we can just use a recursive spinlock
instead of a normal one, to resolve the original deadlock.
Most of the actual logic is identical, with the only real difference
being that one wraps it with an async work item.
Merge the implementations to reduce duplications (which will also
require the fixes in the next commits to only be done once).
Multiple fields in sstatus are defined as WPRI "Reserved Writes Preserve
Values, Reads Ignore Values", which means we have to preserve their
values when writing to other fields in the same CSR.
We don't really need to touch any fields except SIE.
Interrupts are probably already disabled, but just to be safe,
disable them explicitly.
We need to handle the character map to set the code point before we can
reassign the correct key to the queued_event.key. This fixes keyboard
shortcuts using the incorrect keys based on the keyboard layout.
Automarks are similar to bookmarks placed by the terminal, allowing the
user to selectively remove a single command and its output from the
terminal scrollback.
This commit implements a single way to add marks: automatically placing
them when the shell becomes interactive.
To make sure the shell behaves correctly after its expected prompt
position changes, the terminal layer forces a resize event to be passed
to the shell on such (possibly) partial clears; this also has the nice
side effect of fixing the disappearing prompt on the preexisting "clear
including history" action: Fixes#4192.
I just copy-pasted microseconds_since_boot and
set_interrupt_interval_usec from aarch64.
However, on RISC-V, they are not in microseconds.
The TimerRegisters struct is also unused.
current_time and set_compare can also be private and static.
IRQHandler is not the correct class to inherit from, as the timer
is not connected to an IRQController.
Each hart has one of these Timers directly connected to it.
I originally added this header because I misunderstood how
IRQControllers are supposed to be used.
I thought that I would need a IRQController class for the hart-local
interrupt controller, but apparently, this class is supposed to be used
for non-local interrupt controllers like the IOAPIC or RISC-V PLIC.
x86 LAPICs don't use this class either.
I am not sure why 096cecb95e disabled the stack protector and sanitizers
for all files, but this is not necessary.
Only the pre_init code needs to run without them, as that code runs
identity mapped.
SysFS, ProcFS and DevPtsFS were all sending filetype 0 when traversing
their directories, but it is actually very easy to send proper filetypes
in these filesystems.
This patch binds all RAM backed filesystems to use only one enum for
their internal filetype, to simplify the implementation and allow
sharing of code.
Please note that the Plan9FS case is currently not solved as I am not
familiar with this filesystem and its constructs.
The ProcFS mostly keeps track of the filetype, and a fix was needed for
the /proc root directory - all processes exhibit a directory inside it
which makes it very easy to hardcode the directory filetype for them.
There's also the `self` symlink inode which is now exposed as DT_LNK.
As for SysFS, we could leverage the fact everything inherits from the
SysFSComponent class, so we could have a virtual const method to return
the proper filetype.
Most of the files in SysFS are "regular" files though, so the base class
has a non-pure virtual method.
Lastly, the DevPtsFS simply hardcodes '.' and '..' as directory file
type, and everything else is hardcoded to send the character device file
type, as this filesystem is only exposing character pts device files.
This is for some reason needed for riscv64 clang, as otherwise the
kernel.map file would grow too big to fit in its section inside the
kernel image.
None of our other architectures have temporary locals in their
kernel.map.
We initialize the MMU by first setting up the page tables for the
kernel image and the initial kernel stack.
Then we jump to a identity mapped page which makes the newly created
kernel root page table active by setting `satp` and then jumps to
`init`.
This trap handler uses the SBI to print an error message via a newly
introduced panic function, which is necessary as `pre_init` is running
identity mapped.
Also add a header file for `pre_init.cpp` as we wan't to use the panic
and `dbgln` function in the MMU init code as well.
We first try to use the newer "SRST" extension for rebooting and
shutting down and if that fails, we try to shutdown using the legacy
"System Shutdown" extension (which can't reboot, so we always shutdown).
The kernel will halt, if we return from here due to all attempts at
rebooting / shutting down failing.
This device will be used by userspace to read mouse packets from all
mouse devices that are attached to the machine.
This change is a preparation before we can enable seamless hotplug
capabilities in WindowServer for mouse devices, without any major change
on the userspace side.
We do this by implementing the following fixes:
- The Key_Plus is assigned to a proper map entry index now which is 0x4e
both on the keypad and non-keypad keys.
- Shift+Q now prints out "Q" properly on scan code set 2.
- Key BackSlash (or Pipe on shift key being pressed down) is now working
properly as well.
- Key_Pipe (which is "|" for en-US layout) is now working in scan code
set 2.
- Numpad keys as well as the decimal separator key are working again.