This commit fixes a kernel panic that happened when unmounting
a disk due to an invalid memory access.
This was because `DiskCache` initializes two linked lists that use
an argument `KBuffer` as the storage for their elements.
Since the member `KBuffer` was declared after the two lists,
when `DiskCache`'s destructor was called, then `KBuffer`'s destructor
was called before the ones of the two lists, causing a page fault in
the kernel.
This commit reached that goal of "safely discarding" a filesystem by
doing the following:
1. Stop using the s_file_system_map HashMap as it was an unsafe measure
to access pointers of FileSystems. Instead, make sure to register all
FileSystems at the VFS layer, with an IntrusiveList, to avoid problems
related to OOM conditions.
2. Make sure to cleanly remove the DiskCache object from a BlockBased
filesystem, so the destructor of such object will not need to do that in
the destruction point.
3. For ext2 filesystems, don't cache the root inode at m_inode_cache
HashMap. The reason for this is that when unmounting an ext2 filesystem,
we lookup at the cache to see if there's a reference to a cached inode
and if that's the case, we fail with EBUSY. If we keep the m_root_inode
also being referenced at the m_inode_cache map, we have 2 references to
that object, which will lead to fail with EBUSY. Also, it's much simpler
to always ask for a root inode and get it immediately from m_root_inode,
instead of looking up the cache for that inode.
The idea is to enable mounting FileSystem objects across multiple mounts
in contrast to what happened until now - each mount has its own unique
FileSystem object being attached to it.
Considering a situation of mounting a block device at 2 different mount
points at in system, there were a couple of critical flaws due to how
the previous "design" worked:
1. BlockBasedFileSystem(s) that pointed to the same actual device had a
separate DiskCache object being attached to them. Because both instances
were not synchronized by any means, corruption of the filesystem is most
likely achieveable by a simple cache flush of either of the instances.
2. For superblock-oriented filesystems (such as the ext2 filesystem),
lack of synchronization between both instances can lead to severe
corruption in the superblock, which could render the entire filesystem
unusable.
3. Flags of a specific filesystem implementation (for example, with xfs
on Linux, one can instruct to mount it with the discard option) must be
honored across multiple mounts, to ensure expected behavior against a
particular filesystem.
This patch put the foundations to start fix the issues mentioned above.
However, there are still major issues to solve, so this is only a start.
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
These functions used to return booleans which withheld useful
error information for callers. Internally they would suppress
and convert Error objects. We now log or propagate these errors
up the stack.
When writing into a block via an O_DIRECT open file description, we
would first flush every dirty block *except* the one we're about to
write into.
The purpose of flushing is to ensure coherency when mixing direct and
indirect accesses to the same file. This patch fixes the issue by only
flushing the affected block.
Our behavior was totally backwards and could end up doing a lot of
unnecessary work while avoiding the work that actually mattered.
Since this function always returns a CacheEntry& (after potentially
evicting someone else to make room), let's call it "ensure" instead of
"get" to match how we usually use these terms.
If the data passed to sys$write() is backed by a not-yet-paged-in inode
mapping, we could end up in a situation where we get a page fault when
trying to copy data from userspace.
If that page fault handler tried reading from an inode that someone else
had locked while waiting for the disk cache lock, we'd deadlock.
This patch fixes the issue by copying the userspace data into a local
buffer before acquiring the disk cache lock. This is not ideal since it
incurs an extra copy, and I'm sure we can think of a better solution
eventually.
This was a frequent cause of startup deadlocks on x86_64 for me. :^)
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
Create the disk cache up front, so we can verify it succeeds.
Make the KBuffer allocation fail-able, so we can properly handle
failure when the user asks up to mount a Ext2 filesystem under
OOM conditions.
The file system lock is meant to protect the file system metadata
(super blocks, bitmaps, etc.) Not protect processes from reading
independent parts of the disk at once.
This patch introduces a new lock to protect the *block cache* instead,
which is the real thing that needs synchronization.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This should allow creating intrusive lists that have smart pointers,
while remaining free (compared to the impl before this commit) when
holding raw pointers :^)
As a sidenote, this also adds a `RawPtr<T>` type, which is just
equivalent to `T*`.
Note that this does not actually use such functionality, but is only
expected to pave the way for #6369, to replace NonnullRefPtrVector<T>
with intrusive lists.
As it is with zero-cost things, this makes the interface a bit less nice
by requiring the type name of what an `IntrusiveListNode` holds (and
optionally its container, if not RawPtr), and also requiring the type of
the container (normally `RawPtr`) on the `IntrusiveList` instance.
This may seem like a no-op change, however it shrinks down the Kernel by a bit:
.text -432
.unmap_after_init -60
.data -480
.debug_info -673
.debug_aranges 8
.debug_ranges -232
.debug_line -558
.debug_str -308
.debug_frame -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
Since these filesystems operate on an underlying file descriptor
and rely on its offset for correctness, let's use the FS lock to
serialize these operations.
This also means that FS subclasses can rely on block-level read/write
operations being atomic.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This way, if something goes wrong, we get to keep the actual error.
Also, KResults are nodiscard, so we have to deal with that in Ext2FS
instead of just silently ignoring I/O errors(!)
BlockBasedFileSystem::read_block method should get a reference of
a UserOrKernelBuffer.
If we need to force caching a block, we will call other method to do so.
Problem:
- `(void)` simply casts the expression to void. This is understood to
indicate that it is ignored, but this is really a compiler trick to
get the compiler to not generate a warning.
Solution:
- Use the `[[maybe_unused]]` attribute to indicate the value is unused.
Note:
- Functions taking a `(void)` argument list have also been changed to
`()` because this is not needed and shows up in the same grep
command.
This makes misses in the BlockBasedFS's LRU block cache faster by
storing the cache entries in one of two doubly-linked list.
Dirty and clean cache entries are kept in two separate lists, and
move between them when their state changes. This can probably be
improved upon further.