This replaces all usages of Cacheable::Yes with MemoryType::Normal and
Cacheable::No with either MemoryType::NonCacheable or MemoryType::IO,
depending on the context.
The Page{Directory,Table}::set_cache_disabled function has therefore also
been replaced with a more appropriate set_memory_type function.
Adding a memory_type "getter" would not be as easy, as some
architectures may not support all memory types, so getting the memory
type again may be a lossy conversion. The is_cache_disabled function
was never used, so simply remove it altogether.
There is no difference between MemoryType::NonCacheable and
MemoryType::IO on x86 for now.
Other architectures currently don't respect the MemoryType at all.
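The mapping described above could look roughly like the following sketch. PageTableEntrySketch, x86_cache_bits_for and PTE_CACHE_DISABLED are hypothetical names used only for illustration; only the MemoryType values and the set_memory_type name come from this commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the kernel enum; only the three names are from the text.
enum class MemoryType {
    Normal,       // regular cacheable RAM (formerly Cacheable::Yes)
    NonCacheable, // uncached memory (formerly Cacheable::No)
    IO,           // device memory such as MMIO (formerly Cacheable::No)
};

// x86 page table entries use bit 4 (PCD) to disable caching for a page.
constexpr uint64_t PTE_CACHE_DISABLED = 1ull << 4;

// On x86 there is no difference between NonCacheable and IO for now.
constexpr uint64_t x86_cache_bits_for(MemoryType type)
{
    switch (type) {
    case MemoryType::Normal:
        return 0;
    case MemoryType::NonCacheable:
    case MemoryType::IO:
        return PTE_CACHE_DISABLED;
    }
    return 0;
}

// Setter only: NonCacheable and IO collapse to the same bits, so reading the
// memory type back from the entry would be a lossy conversion.
struct PageTableEntrySketch {
    uint64_t raw { 0 };

    void set_memory_type(MemoryType type)
    {
        raw = (raw & ~PTE_CACHE_DISABLED) | x86_cache_bits_for(type);
    }
};
```

Because two memory types share one encoding here, a getter could not tell them apart, which is the reason given above for not adding one.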
This commit reorganizes the BootInfo struct definition so it can be
shared for all architectures.
The existing free extern "C" boot info variables have been removed and
replaced with a global BootInfo struct, 'g_boot_info'.
On x86-64, the BootInfo is directly copied from the Prekernel-provided
struct.
On AArch64 and RISC-V, BootInfo is populated during pre_init.
Writes to SharedInodeVMObjects could cause a Protection Violation if a
page was marked as dirty by a different process.
This happened due to a combination of 2 things:
* handle_dirty_on_write_fault() was skipped if a page was already marked
as dirty
* when a page was marked as dirty, only the Region that caused the page
fault was remapped
This commit:
* fixes the crash by making handle_fault() stop checking if a page was
marked dirty before running handle_dirty_on_write_fault()
* modifies handle_dirty_on_write_fault() so that it always marks the
page as dirty and remaps the page (this avoids a 2nd bug that was
never hit due to the 1st bug)
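A simplified model of the fixed behaviour, assuming each region keeps its own per-page mapping state. RegionSketch, SharedInodeVMObjectSketch and handle_write_fault are illustrative names and not the kernel's actual types; only handle_dirty_on_write_fault() is named in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: one "mapped writable" bit per page, per region.
struct RegionSketch {
    std::vector<bool> mapped_writable;
};

struct SharedInodeVMObjectSketch {
    std::vector<bool> dirty;
    std::vector<RegionSketch*> regions;

    // Always mark the page dirty and remap it in every region that maps this
    // object, not only in the region that took the fault.
    void handle_dirty_on_write_fault(size_t page_index)
    {
        dirty[page_index] = true;
        for (auto* region : regions)
            region->mapped_writable[page_index] = true;
    }

    // The fault path no longer consults dirty[page_index] before calling the
    // handler, so a page dirtied by another process is still remapped.
    void handle_write_fault(size_t page_index)
    {
        handle_dirty_on_write_fault(page_index);
    }
};
```

In this model, a write fault on a page that another process already dirtied still ends with the page mapped writable everywhere, instead of being skipped.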
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) be responsible for this type of isolation.
Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
simplify the idea of being jailed to a simple ProtectedValues boolean
flag. This means that we no longer check for matching jail pointers
anywhere in the Kernel code.
To set a process as jailed, a new prctl option was added to set a
Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
A process can either iterate on the global process list, or if it's
attached to a scoped process list, then only over that list.
This essentially replaces the need to check the Jail pointer of a
process when iterating over process lists.
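The two changes above can be pictured with the following sketch. All type and function names here (ProcessSketch, ScopedProcessListSketch, visible_pids) are hypothetical, and a plain private bool stands in for the kernel's SetOnce flag.

```cpp
#include <cassert>
#include <vector>

struct ProcessSketch;

// Hypothetical scoped list: the only processes a member of this scope can see.
struct ScopedProcessListSketch {
    std::vector<ProcessSketch*> processes;
};

struct ProcessSketch {
    explicit ProcessSketch(int pid)
        : pid(pid)
    {
    }

    int pid;
    ScopedProcessListSketch* scoped_list { nullptr };

    // There is deliberately no way to clear the flag again.
    void set_jailed() { m_jailed = true; }
    bool is_jailed() const { return m_jailed; }

private:
    bool m_jailed { false };
};

// Iterate over the scoped list if the viewer is attached to one, otherwise
// over the global list. No jail pointers are compared anywhere.
std::vector<int> visible_pids(ProcessSketch const& viewer, std::vector<ProcessSketch*> const& all_processes)
{
    std::vector<ProcessSketch*> const* list = &all_processes;
    if (viewer.scoped_list)
        list = &viewer.scoped_list->processes;

    std::vector<int> pids;
    for (auto const* process : *list)
        pids.push_back(process->pid);
    return pids;
}
```

Being jailed and being PID-isolated are independent in this model: the flag restricts what the process may do, while the attached list decides what it can see.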
AnonymousVMObject::try_clone() computed how many shared cow pages to
commit by counting all VMObject pages that were not shared_zero_pages.
This means that lazy_committed_pages were also being included in the
count. This is a problem because the page fault handling code for
lazy_committed_pages does not allocate from
m_shared_committed_cow_pages. So more pages than necessary were being
committed.
This fixes this overcommitting problem by skipping lazy_committed_pages
when counting how many pages to commit.
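As a rough illustration of the corrected count, assuming each VMObject page slot can be classified into one of three kinds (PageKind and count_cow_pages_to_commit are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of what a VMObject page slot refers to.
enum class PageKind {
    SharedZero,    // reference to the shared_zero_page
    LazyCommitted, // reference to the lazy_committed_page
    Allocated,     // a real physical page
};

// Count how many shared cow pages a clone needs to commit. Lazily committed
// pages are skipped because their fault path does not allocate from the
// shared committed cow pages.
size_t count_cow_pages_to_commit(std::vector<PageKind> const& pages)
{
    size_t count = 0;
    for (auto kind : pages) {
        if (kind == PageKind::SharedZero || kind == PageKind::LazyCommitted)
            continue;
        ++count;
    }
    return count;
}
```

Before the fix, the equivalent loop would only have skipped PageKind::SharedZero, so every lazily committed slot inflated the commitment.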
This commit introduces VMObject::remap_regions_single_page(). This
method remaps a single page in all regions associated with a VMObject.
This is intended to be a more efficient replacement for remap_regions()
in cases where only a single page needs to be remapped.
This commit also updates the cow page fault handling code to use this
new method.
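A minimal sketch of the idea, assuming each region maps a contiguous window of the VMObject's pages. MappedRegionSketch and its members are hypothetical; only remap_regions_single_page() is named in this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical region that maps VMObject pages [first_page, first_page + page_count).
struct MappedRegionSketch {
    size_t first_page;
    size_t page_count;
    std::vector<size_t> remapped; // region-relative indices that were remapped

    void remap_vmobject_page(size_t page_index)
    {
        // A region that does not cover this page has nothing to do.
        if (page_index < first_page || page_index >= first_page + page_count)
            return;
        remapped.push_back(page_index - first_page);
    }
};

struct VMObjectSketch {
    std::vector<MappedRegionSketch*> regions;

    // Touch exactly one page per region instead of remapping whole regions.
    void remap_regions_single_page(size_t page_index)
    {
        for (auto* region : regions)
            region->remap_vmobject_page(page_index);
    }
};
```

The cost is proportional to the number of regions rather than to the number of pages they map, which is where the efficiency gain over a full remap would come from.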
Writes to a MAP_SHARED | MAP_ANONYMOUS mmap region were not visible to
other processes sharing the mmap region. This was happening because the
page fault handler was not remapping the VMObject's m_regions after
allocating a new page.
This commit fixes the problem by calling remap_regions() after assigning
a new page to the VMObject in the page fault handler. This remapping
only occurs for shared Regions.
This commit makes the following minor changes to handle_zero_fault():
* cleans up a call to static_cast(), replacing it with a reference (a
future commit will also use this reference).
* replaces a call to vmobject() with the new reference mentioned above.
* moves the definition of already_handled to inside the block where
already_handled is used.
After a fork(), page faults on anonymous mmaps can cause a redundant
page fault to occur.
This happens because VMObjects for anonymous mmaps are initially filled
with references to the lazy_committed_page or shared_zero_page. If there
is a fork, VMObject::try_clone() is called and all pages of the VMObject
are marked as cow (via the m_cow_map).
Page faults on a zero/lazy page are handled by handle_zero_fault().
handle_zero_fault() does not update m_cow_map, so if the page was marked
cow before the fault, it will still be marked cow after the fault. This
causes a second (redundant) page fault when the CPU retries the write.
This commit removes the redundant page fault by not marking zero/lazy
pages as cow in m_cow_map.
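The resulting cow map construction can be sketched as follows. PageKind and build_cow_map are hypothetical names, and a std::vector<bool> stands in for m_cow_map.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of what a VMObject page slot refers to.
enum class PageKind {
    SharedZero,    // reference to the shared_zero_page
    LazyCommitted, // reference to the lazy_committed_page
    Allocated,     // a real physical page
};

// Build the cow map for a clone. Zero/lazy slots are left unmarked: their
// first write is resolved by the zero fault path, which does not clear cow
// bits, so marking them would only trigger a second fault.
std::vector<bool> build_cow_map(std::vector<PageKind> const& pages)
{
    std::vector<bool> cow_map(pages.size(), false);
    for (size_t i = 0; i < pages.size(); ++i)
        cow_map[i] = pages[i] == PageKind::Allocated;
    return cow_map;
}
```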
AddressSpace::try_allocate_split_region() was updating the cow map of
new_region based on the cow map of source_region.
The problem is that both new_region and source_region reference the
same vmobject and the same cow map, so these cow map updates didn't
actually change anything.
This commit:
* removes the cow map updates from try_allocate_split_region()
* removes Region::set_should_cow() since it is no longer used
InodeVMObjects now track dirty and clean pages. This tracking of
dirty and clean pages is used by the msync and purge syscalls.
Dirty page tracking works using the following rules:
* when a new InodeVMObject is made, all pages are marked clean.
* writes to clean InodeVMObject pages will cause a page fault,
the fault handler will mark the page as dirty.
* writes to dirty InodeVMObject pages do not cause page faults.
* if msync is called, only dirty pages are flushed to storage (and
marked clean).
* if purge syscall is called, only clean pages are discarded.
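The rules above can be modelled as a small state machine. This is only a sketch: InodeVMObjectSketch and its methods are hypothetical, and page faults are modelled as a boolean return value rather than real hardware faults.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class InodeVMObjectSketch {
public:
    // A new object starts with every page clean (and, in this model, present).
    explicit InodeVMObjectSketch(size_t page_count)
        : m_dirty(page_count, false)
        , m_present(page_count, true)
    {
    }

    // Returns true if the write caused a (modelled) page fault.
    bool write(size_t page)
    {
        m_present[page] = true;
        if (m_dirty[page])
            return false; // dirty pages are writable, no fault
        m_dirty[page] = true; // the fault handler marks the page dirty
        return true;
    }

    // Flush only dirty pages and mark them clean; returns how many were flushed.
    size_t msync()
    {
        size_t flushed = 0;
        for (size_t i = 0; i < m_dirty.size(); ++i) {
            if (!m_dirty[i])
                continue;
            m_dirty[i] = false;
            ++flushed;
        }
        return flushed;
    }

    // Discard only clean pages; returns how many were discarded.
    size_t purge()
    {
        size_t discarded = 0;
        for (size_t i = 0; i < m_present.size(); ++i) {
            if (!m_present[i] || m_dirty[i])
                continue;
            m_present[i] = false;
            ++discarded;
        }
        return discarded;
    }

    bool is_dirty(size_t page) const { return m_dirty[page]; }
    bool is_present(size_t page) const { return m_present[page]; }

private:
    std::vector<bool> m_dirty;
    std::vector<bool> m_present;
};
```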
The methods try_create_with_size() and try_create_purgeable_with_size()
on AnonymousVMObject are almost identical, other than one member
that gets set (m_purgeable). This patch makes
try_create_purgeable_with_size() call try_create_with_size() so that
both methods re-use the same code.
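The delegation might look like the sketch below. It assumes a nullptr return in place of the kernel's error type and invents a zero-size failure case purely so that the failure path is visible; only the two method names and m_purgeable come from this commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class AnonymousVMObjectSketch {
public:
    // Hypothetical failure mode: a zero-sized object is rejected.
    static std::unique_ptr<AnonymousVMObjectSketch> try_create_with_size(size_t size)
    {
        if (size == 0)
            return nullptr;
        return std::unique_ptr<AnonymousVMObjectSketch>(new AnonymousVMObjectSketch(size));
    }

    // Reuses the common creation path and only sets the one differing member.
    static std::unique_ptr<AnonymousVMObjectSketch> try_create_purgeable_with_size(size_t size)
    {
        auto object = try_create_with_size(size);
        if (object)
            object->m_purgeable = true;
        return object;
    }

    bool is_purgeable() const { return m_purgeable; }
    size_t size() const { return m_size; }

private:
    explicit AnonymousVMObjectSketch(size_t size)
        : m_size(size)
    {
    }

    size_t m_size;
    bool m_purgeable { false };
};
```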
The methods try_release_clean_pages() and release_all_clean_pages() in
InodeVMObject are almost identical. This commit makes them both use the
same code path.
In the VMObject code there are multiple examples of loops over
the VMObject's regions (using for_each_region()) that call remap()
on each region.
To clean up usage of this pattern, this patch adds a method in
VMObject that does this remapping loop. VMObject code that needs
to remap its regions now calls the new method.
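A sketch of the helper, assuming for_each_region() takes a callback and remap() takes no arguments. The remap_regions name and the sketch types are illustrative stand-ins.

```cpp
#include <cassert>
#include <vector>

// Hypothetical region that just counts how often it was remapped.
struct RegionSketch {
    int remap_count { 0 };
    void remap() { ++remap_count; }
};

class VMObjectSketch {
public:
    void add_region(RegionSketch& region) { m_regions.push_back(&region); }

    template<typename Callback>
    void for_each_region(Callback callback)
    {
        for (auto* region : m_regions)
            callback(*region);
    }

    // The loop that used to be repeated at every call site lives here once.
    void remap_regions()
    {
        for_each_region([](RegionSketch& region) {
            region.remap();
        });
    }

private:
    std::vector<RegionSketch*> m_regions;
};
```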
As MMIO is placed at fixed physical addresses, and does not need to be
backed by real RAM physical pages, there's no need to use PhysicalPage
instances to track their pages.
This results in slightly reduced allocations, but more importantly
makes MMIO addresses which end up after the normal RAM ranges work,
as 64-bit PCI BARs usually do.
The new baked image is a Prekernel and a Kernel baked together now, so
essentially we no longer need to pass the Prekernel as -kernel and the
actual kernel image as -initrd to QEMU, leaving the option to pass an
actual initrd or initramfs module later on with multiboot.
Before this change, setting m_access to contain the
HasBeen{Readable,Writable,Executable} bits was done by the Region method
set_access_bit, which additionally ORed in (access << 4) when enabling a
certain access bit.
Now, the set_{readable,writable,executable} methods set an appropriate
SetOnce flag that can be checked later.
This flag is set only once, and should never reset once it has been set,
making it an ideal SetOnce use-case.
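To contrast the two approaches, here is a small sketch. The bit values and the names old_set_access_bit and RegionAccessSketch are assumptions for illustration, and a plain bool stands in for the SetOnce flag.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative bit layout: the "has been" bits sit 4 bits above the access bits.
constexpr uint8_t Write = 1u << 1;
constexpr uint8_t HasBeenWritable = Write << 4;

// Old scheme: enabling an access bit also ORs in (access << 4); disabling
// clears only the access bit, so the shifted bit stays set.
constexpr uint8_t old_set_access_bit(uint8_t bits, uint8_t access, bool enable)
{
    return enable ? uint8_t(bits | access | (access << 4)) : uint8_t(bits & ~access);
}

// New scheme: the history is a separate flag that is only ever set.
class RegionAccessSketch {
public:
    void set_writable(bool enable)
    {
        m_writable = enable;
        if (enable)
            m_has_been_writable = true; // never reset once set
    }

    bool is_writable() const { return m_writable; }
    bool has_been_writable() const { return m_has_been_writable; }

private:
    bool m_writable { false };
    bool m_has_been_writable { false };
};
```

Both variants record the same history; the second simply makes the one-way nature of the flag explicit in the type's interface.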
It also simplifies the expected conditions for the enabling prctl call,
as we don't expect a boolean flag, but rather the specific prctl option
will always set (enable) Process' AddressSpace syscall region enforcing.
We have many places in the kernel code where boolean flags are set only
once and never reset, but are checked multiple times before and after
they are set, which matches the purpose of the SetOnce class.
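The pattern described here can be sketched as a tiny class. This is not the actual SetOnce implementation; the member names and the use of std::atomic are assumptions made for the illustration.

```cpp
#include <atomic>
#include <cassert>

// A flag that starts unset, can be set any number of times, and offers no
// way to go back to the unset state.
class SetOnceSketch {
public:
    void set() { m_value.store(true, std::memory_order_release); }
    bool was_set() const { return m_value.load(std::memory_order_acquire); }

private:
    std::atomic<bool> m_value { false };
};
```

Compared to a bare bool, the missing "clear" operation documents the intent and lets the compiler reject code that tries to reset the flag.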
Instead, rewrite the region page fault handling code to not use
PageFault::type() on RISC-V.
I split Region::handle_fault into having a RISC-V-specific
implementation, as I am not sure if I cover all page fault handling edge
cases by solely relying on MM's own region metadata.
We should probably also take the processor-provided page fault reason
into account, if we decide to merge these two implementations in the
future.
This commit adds minimal support for compiler-instrumentation based
memory access sanitization.
Currently we only support detection of kmalloc redzone accesses, and
kmalloc use-after-free accesses.
Support for inline checks (for improved performance), and for stack
use-after-return detection is left for future PRs.
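The two supported checks can be illustrated with a heavily simplified shadow-memory model. Everything here is an assumption made for the sketch: one shadow entry per byte, a bump allocator, an 8-byte redzone, and all names. A real sanitizer runtime is considerably more compact and is driven by compiler-inserted calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Poison : uint8_t {
    None,    // valid, allocated memory
    Redzone, // padding around allocations (and never-allocated memory)
    Freed,   // memory that has been released
};

class ShadowHeapSketch {
public:
    static constexpr size_t redzone_size = 8;

    // Everything starts out poisoned; allocations unpoison their own bytes.
    explicit ShadowHeapSketch(size_t size)
        : m_shadow(size, Poison::Redzone)
    {
    }

    // Returns the offset of the new allocation, leaving a redzone before it.
    size_t allocate(size_t size)
    {
        size_t offset = m_next + redzone_size;
        assert(offset + size + redzone_size <= m_shadow.size());
        for (size_t i = 0; i < size; ++i)
            m_shadow[offset + i] = Poison::None;
        m_next = offset + size;
        return offset;
    }

    void release(size_t offset, size_t size)
    {
        for (size_t i = 0; i < size; ++i)
            m_shadow[offset + i] = Poison::Freed;
    }

    // What an instrumented load or store would consult before touching memory.
    Poison check_access(size_t offset, size_t size) const
    {
        for (size_t i = 0; i < size; ++i) {
            if (offset + i >= m_shadow.size())
                return Poison::Redzone;
            if (m_shadow[offset + i] != Poison::None)
                return m_shadow[offset + i];
        }
        return Poison::None;
    }

private:
    std::vector<Poison> m_shadow;
    size_t m_next { 0 };
};
```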
Our existing AnonymousVMObject cloning flow contains an optimization
wherein purgeable VMObjects which are marked volatile during the clone
are created as a new zero-filled VMObject (as if it was purged), which
lets us skip the expensive COW process.
Unfortunately, one crucial part was missing: marking the cloned region
as purged (which is the value returned from madvise when unmarking the
region as volatile). As a result, the userland logic was left unaware of
the effective zeroing of its memory region, resulting in odd behaviour
and crashes in places like our malloc's large allocation support.
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This ensures that if a Process or a
Thread were created, there's a guarantee that setting a new name will
never fail, as only copying of strings should be done to that static
storage.
The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
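The idea of a fixed, in-place name buffer can be sketched as below. FixedNameBufferSketch is a hypothetical stand-in for FixedCharBuffer, and silently truncating over-long names is an assumption of this sketch rather than something stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

template<size_t Size>
class FixedNameBufferSketch {
public:
    // Copies at most Size characters into the inline storage. No allocation
    // happens, so storing a new name has no failure path.
    void store(std::string_view name)
    {
        m_length = std::min(name.size(), Size);
        std::copy_n(name.data(), m_length, m_storage);
    }

    std::string_view view() const { return std::string_view(m_storage, m_length); }

private:
    char m_storage[Size] {};
    size_t m_length { 0 };
};

// The limits mentioned above: 32 characters for processes, 64 for threads.
using ProcessNameSketch = FixedNameBufferSketch<32>;
using ThreadNameSketch = FixedNameBufferSketch<64>;
```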
Once we move to a more proper shutdown procedure, processes other than
the finalizer task must be able to perform cleanup and finalization
duties, not least because the finalizer task itself needs to be cleaned
up by someone. This global variable, mirroring the early boot flags,
allows a future shutdown process to perform cleanup on its own.
Note that while this *could* be considered a weakening in security, the
attack surface is minimal and the results are not dramatic. To exploit
this, an attacker would have to gain a Kernel write primitive to this
global variable (bypassing KASLR among other things) and then gain some
way of calling the relevant functions, all of this only to destroy some
other running process. The same effect can be achieved with LPE which
can often be gained with significantly simpler userspace exploits (e.g.
of setuid binaries).