Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.
Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
To enforce this, we create two separate mappings of the same underlying
physical page. A writable mapping for the kernel, and a read-only one
for userspace (the one returned by sys$get_kernel_info_page.)
Every process keeps its own ELF executable mapped in memory in case we
need to do symbol lookup (for backtraces, etc.)
Until now, it was mapped in a way that made it accessible to the
program, despite the program not having mapped it itself.
I don't really see a need for userspace to have access to this right
now, so let's lock things down a little bit.
This patch makes it inaccessible to userspace and exposes that fact
through /proc/PID/vm (per-region "user_accessible" flag.)
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
This patch makes it possible to make memory regions non-readable.
This is enforced using the "present" bit in the page tables.
A process that hits an not-present page fault in a non-readable
region will be crashed.
The fault was happening when retrieving a current backtrace for the
SystemServer process.
To generate a backtrace, we go into the paging scope of the process,
meaning we temporarily switch to using its page directory as our own.
Because kernel VM is allocated on demand, it's possible for a process's
mappings above the 3GB mark to be out-of-date. Normally this just gets
fixed up transparently by the page fault handler (which simply copies
the PDE from the canonical MM.kernel_page_directory() into the current
process.)
However, if the current kernel *stack* is in a piece of memory that
the backtraced process lacks up-to-date PDE's for, we still get a page
fault, but are unable to handle it, since the CPU wants to push to the
stack as part of calling the page fault handler. So we're screwed and
it's a triple-fault.
Fix this by always updating the kernel VM mappings before switching
into a paging scope. In practical terms, this is a 1KB memcpy() that
happens when generating a backtrace, or doing exec().
Then only allow regions with that bit to be manipulated via munmap()
and mprotect(). This prevents messing with non-mmap()ed regions in
a process's address space (stacks, shared buffers, ...)
The kernel is now no longer identity mapped to the bottom 8MiB of
memory, and is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however is still
identity mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
Now the userspace page allocator will search through physical regions,
and stop the search as it finds an available page.
Also remove an "address of" sign since we don't need that when
counting size of physical regions
Move the kernel image to the 1 MB physical mark. This prevents it from
colliding with stuff like the VGA memory. This was causing us to end
up with the BIOS screen contents sneaking into kernel memory sometimes.
This patch also bumps the kmalloc heap size from 1 MB to 3 MB. It's not
the perfect permanent solution (obviously) but it should get the OOM
monkey off our backs for a while.
After the page fault handler has found the region in which the fault
occurred, do the rest of the work in the region itself.
This patch also makes all fault types consistently crash the process
if a new page is needed but we're all out of pages.
Since the kernel page tables are shared between all processes, there's
no need to (implicitly) flush the TLB for them on every context switch.
Setting the G bit on kernel page tables allows the CPU to keep the
translation caches around.
This patch changes the parameter to Region::map() to be a PageDirectory
since that matches how we think about the memory model:
Regions are views onto VMObjects, and are mapped into PageDirectories.
Each Process has a PageDirectory. The kernel also has a PageDirectory.
Since a Region is merely a "window" onto a VMObject, it can both begin
and end at a distance from the VMObject's boundaries.
Therefore, we should always be computing indices into a VMObject's
physical page array by adding the Region's "first_page_index()".
There was a whole bunch of code that forgot to do that. This fixes
many wrong behaviors for Regions that start part-way into a VMObject.
We were doing a temporary STI/CLI in MemoryManager::zero_page() to be
able to acquire the VMObject's lock before zeroing out a page.
This logic was inherited from the inode fault handler, where we need
to enable interrupts anyway, since we might need to interact with the
underlying storage device.
Zero-fill faults don't actually need to lock the VMObject, since they
are already guaranteed exclusivity by interrupts being disabled when
entering the fault handler.
This is different from inode faults, where a second thread can often
get an inode fault for the same exact page in the same VMObject before
the first fault handler has received a response from the disk.
This is why the lock exists in the first place, to prevent this race.
This fixes an intermittent crash in sys$execve() that was made much
more visible after I made userspace stacks lazily allocated.
This patch adds three separate per-process fault counters:
- Inode faults
An inode fault happens when we've memory-mapped a file from disk
and we end up having to load 1 page (4KB) of the file into memory.
- Zero faults
Memory returned by mmap() is lazily zeroed out. Every time we have
to zero out 1 page, we count a zero fault.
- CoW faults
VM objects can be shared by multiple mappings that make their own
unique copy iff they want to modify it. The typical reason here is
memory shared between a parent and child process.
Instead of allocating and populating a Copy-on-Write bitmap for each
Region up front, wait until we actually clone the Region for sharing
with another process.
In most cases, we never need any CoW bits and we save ourselves a lot
of kmalloc() memory and time.
When splitting an Region that's already the result of an earlier split,
we have to take the Region's offset-in-VMObject into account since it
may be non-zero.
We were just blindly trusting that the bootloader would only give us
page-aligned memory regions. This is apparently not always the case,
so now we can try to repair those regions.
Fixes#601
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
Put one unused page on each side of VM allocations to make invalid
accesses more likely to generate crashes.
Note that we will not add this guard padding for mmap() at a specific
memory address, only to "mmap it anywhere" requests.