1
0
Fork 0
mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-01-24 01:09:38 -05:00
linux/fs
Yang Xu 1639a49ccd
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current implementation:

* S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

  Since umask stripping has an effect on S_ISGID inheritance, e.g., by
  stripping the S_IXGRP bit from the file to be created and all relevant
  filesystems have to call posix_acl_create() before inode_init_owner()
  where we currently take care of S_ISGID handling S_ISGID handling is
  order dependent. IOW, whether or not you get a setgid bit depends on
  POSIX ACLs and umask and in what order they are called.

  Note that technically filesystems are free to impose their own
  ordering between posix_acl_create() and inode_init_owner() meaning
  that there's additional ordering issues that influence S_SIGID
  inheritance.

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).

The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:

sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.

Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.

The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_subvol_inode() but only creates directories and doesn't
  raise S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-21 11:34:16 +02:00
..
9p 9p: fix EBADF errors in cached mode 2022-06-17 06:03:30 +09:00
adfs
affs
afs netfs: do not unlock and put the folio twice 2022-07-14 10:10:12 +02:00
autofs
befs
bfs
btrfs for-5.19-rc7-tag 2022-07-16 13:48:55 -07:00
cachefiles cachefiles: narrow the scope of flushed requests when releasing fd 2022-07-05 16:12:21 +01:00
ceph netfs: do not unlock and put the folio twice 2022-07-14 10:10:12 +02:00
cifs smb3: workaround negprot bug in some Samba servers 2022-07-13 19:59:47 -05:00
coda
configfs
cramfs
crypto
debugfs
devpts
dlm
ecryptfs
efivarfs
efs
erofs Changes since last update: 2022-06-01 11:54:29 -07:00
exfat exfat: use updated exfat_chain directly during renaming 2022-06-09 21:26:32 +09:00
exportfs
ext2 ext2: fix fs corruption when trying to remove a non-empty directory with IO error 2022-06-16 10:55:45 +02:00
ext4 ext4: fix a doubled word "need" in a comment 2022-06-18 19:36:20 -04:00
f2fs f2fs: do not count ENOENT for error case 2022-06-21 08:29:56 -07:00
fat Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
freevxfs SPDX changes for 5.19-rc1 2022-06-03 10:34:34 -07:00
fscache fscache: Fix invalidation/lookup race 2022-07-05 16:12:55 +01:00
fuse libnvdimm for 5.19 2022-05-27 15:49:30 -07:00
gfs2 Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
hfs
hfsplus
hostfs
hpfs
hugetlbfs hugetlbfs: zero partial pages during fallocate hole punch 2022-06-16 19:11:32 -07:00
iomap Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
isofs
jbd2 fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment 2022-06-16 10:36:09 -04:00
jffs2 This pull request contains fixes for JFFS2, UBI and UBIFS 2022-06-03 14:42:24 -07:00
jfs JFS: One bug fix and some code cleanup 2022-05-27 15:59:21 -07:00
kernfs
ksmbd vfs: fix copy_file_range() regression in cross-fs copies 2022-06-30 15:16:38 -07:00
lockd lockd: fix nlm_close_files 2022-07-11 15:49:56 -04:00
minix
netfs netfs: do not unlock and put the folio twice 2022-07-14 10:10:12 +02:00
nfs NFSv4: Add an fattr allocation to _nfs4_discover_trunking() 2022-06-30 16:13:00 -04:00
nfs_common
nfsd Notable regression fixes: 2022-07-14 12:29:43 -07:00
nilfs2 nilfs2: fix incorrect masking of permission flags for symlinks 2022-07-03 15:42:33 -07:00
nls
notify fanotify: refine the validation checks on non-dir inode mask 2022-06-28 11:18:13 +02:00
ntfs Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
ntfs3 Ntfs3 for 5.19 2022-06-03 16:57:16 -07:00
ocfs2 fs: move S_ISGID stripping into the vfs_*() helpers 2022-07-21 11:34:16 +02:00
omfs
openpromfs
orangefs
overlayfs ovl: turn of SB_POSIXACL with idmapped layers temporarily 2022-07-08 15:48:31 +02:00
proc Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
pstore
qnx4
qnx6
quota quota: Prevent memory allocation recursion while holding dq_lock 2022-06-06 10:08:10 +02:00
ramfs
reiserfs
romfs
smbfs_common Add various fsctl structs 2022-05-23 20:24:12 -05:00
squashfs Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
sysfs
sysv Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
tracefs tracefs: Fix syntax errors in comments 2022-06-17 19:01:28 -04:00
ubifs This pull request contains fixes for JFFS2, UBI and UBIFS 2022-06-03 14:42:24 -07:00
udf Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
ufs
unicode
vboxsf
verity Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
xfs xfs: prevent a UAF when log IO errors race with unmount 2022-07-01 09:09:52 -07:00
zonefs zonefs: fix zonefs_iomap_begin() for reads 2022-06-08 19:13:55 +09:00
aio.c
anon_inodes.c
attr.c fs: account for group membership 2022-06-14 12:18:47 +02:00
bad_inode.c
binfmt_aout.c
binfmt_elf.c
binfmt_elf_fdpic.c
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
buffer.c
char_dev.c
compat_binfmt_elf.c
coredump.c
d_path.c
dax.c libnvdimm for 5.19 2022-05-27 15:49:30 -07:00
dcache.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c
exec.c fix race between exit_itimers() and /proc/pid/timers 2022-07-11 09:52:59 -07:00
fcntl.c
fhandle.c
file.c fix the breakage in close_fd_get_file() calling conventions change 2022-06-05 15:03:03 -04:00
file_table.c Descriptor handling cleanups 2022-06-04 18:52:00 -07:00
filesystems.c
fs-writeback.c writeback: Fix inode->i_io_list not be protected by inode->i_lock error 2022-06-06 09:54:30 +02:00
fs_context.c
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c fs: move S_ISGID stripping into the vfs_*() helpers 2022-07-21 11:34:16 +02:00
internal.h Cleanups (and one fix) around struct mount handling. 2022-06-04 19:00:05 -07:00
io-wq.c
io-wq.h
io_uring.c io_uring: check that we have a file table when allocating update slots 2022-07-09 07:02:10 -06:00
ioctl.c
Kconfig
Kconfig.binfmt m68knommu: changes for linux 5.19 2022-05-30 10:56:18 -07:00
kernel_read_file.c
libfs.c
locks.c
Makefile
mbcache.c
mount.h
mpage.c
namei.c fs: move S_ISGID stripping into the vfs_*() helpers 2022-07-21 11:34:16 +02:00
namespace.c Cleanups (and one fix) around struct mount handling. 2022-06-04 19:00:05 -07:00
no-block.c
nsfs.c
open.c RISC-V Patches for the 5.19 Merge Window, Part 1 2022-05-31 14:10:54 -07:00
pipe.c Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c vfs: fix copy_file_range() regression in cross-fs copies 2022-06-30 15:16:38 -07:00
readdir.c
remap_range.c Revert "vf/remap: return the amount of bytes actually deduplicated" 2022-07-14 15:35:24 -07:00
select.c
seq_file.c
signalfd.c
splice.c
stack.c
stat.c RISC-V Patches for the 5.19 Merge Window, Part 1 2022-05-31 14:10:54 -07:00
statfs.c
super.c
sync.c
sysctls.c
timerfd.c
userfaultfd.c
utimes.c
xattr.c