mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-01-23 08:35:19 -05:00
f8c1d4ca55
The documentation for user space pkeys was a bit dated including things such as Amazon and distribution testing information which is irrelevant now. Update the documentation. This also streamlines adding the Supervisor pkey documentation later on. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220419170649.1022246-2-ira.weiny@intel.com
98 lines
3.3 KiB
ReStructuredText
98 lines
3.3 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
======================
|
|
Memory Protection Keys
|
|
======================
|
|
|
|
Memory Protection Keys provide a mechanism for enforcing page-based
|
|
protections, but without requiring modification of the page tables when an
|
|
application changes protection domains.
|
|
|
|
Pkeys Userspace (PKU) is a feature which can be found on:
|
|
* Intel server CPUs, Skylake and later
|
|
* Intel client CPUs, Tiger Lake (11th Gen Core) and later
|
|
* Future AMD CPUs
|
|
|
|
Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
|
|
a "protection key", giving 16 possible keys.
|
|
|
|
Protections for each key are defined with a per-CPU user-accessible register
|
|
(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
|
|
and Write Disable) for each of 16 keys.
|
|
|
|
Being a CPU register, PKRU is inherently thread-local, potentially giving each
|
|
thread a different set of protections from every other thread.
|
|
|
|
There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
|
|
register. The feature is only available in 64-bit mode, even though there is
|
|
theoretically space in the PAE PTEs. These permissions are enforced on data
|
|
access only and have no effect on instruction fetches.
|
|
|
|
Syscalls
|
|
========
|
|
|
|
There are 3 system calls which directly interact with pkeys::
|
|
|
|
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
|
|
int pkey_free(int pkey);
|
|
int pkey_mprotect(unsigned long start, size_t len,
|
|
unsigned long prot, int pkey);
|
|
|
|
Before a pkey can be used, it must first be allocated with
|
|
pkey_alloc(). An application calls the WRPKRU instruction
|
|
directly in order to change access permissions to memory covered
|
|
with a key. In this example WRPKRU is wrapped by a C function
|
|
called pkey_set().
|
|
::
|
|
|
|
int real_prot = PROT_READ|PROT_WRITE;
|
|
pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
|
|
ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
|
|
ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
|
|
... application runs here
|
|
|
|
Now, if the application needs to update the data at 'ptr', it can
|
|
gain access, do the update, then remove its write access::
|
|
|
|
pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
|
|
*ptr = foo; // assign something
|
|
pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again
|
|
|
|
Now when it frees the memory, it will also free the pkey since it
|
|
is no longer in use::
|
|
|
|
munmap(ptr, PAGE_SIZE);
|
|
pkey_free(pkey);
|
|
|
|
.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
|
|
An example implementation can be found in
|
|
tools/testing/selftests/x86/protection_keys.c.
|
|
|
|
Behavior
|
|
========
|
|
|
|
The kernel attempts to make protection keys consistent with the
|
|
behavior of a plain mprotect(). For instance if you do this::
|
|
|
|
mprotect(ptr, size, PROT_NONE);
|
|
something(ptr);
|
|
|
|
you can expect the same effects with protection keys when doing this::
|
|
|
|
pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
|
|
pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
|
|
something(ptr);
|
|
|
|
That should be true whether something() is a direct access to 'ptr'
|
|
like::
|
|
|
|
*ptr = foo;
|
|
|
|
or when the kernel does the access on the application's behalf like
|
|
with a read()::
|
|
|
|
read(fd, ptr, 1);
|
|
|
|
The kernel will send a SIGSEGV in both cases, but si_code will be set
|
|
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
|
|
the plain mprotect() permissions are violated.
|