https://bugzilla.redhat.com/show_bug.cgi?id=812798#c45
Richard W.M. Jones 2014-02-01 08:42:44 EST

The actual bug (selinux-policy) is of course fixed.

Unfortunately fixing the guestmount + xattr problem is quite a lot more complex than it seems. I will try and find the links to everything when I have more reasonable internet access.

The basic problem is it requires us to change the libguestfs API in order to allow multithreaded guestmount to be written, since threads are required in order for guestmount to "answer" the getxattr call that happens during mount when SELinux is enabled.
Thanks for the explanation, Richard. I'm going to investigate the workaround I mentioned where I do file copying via guestmount, and write special code to use libguestfs directly for xattrs.
Let me write an explanation of this bug as best I remember it. This is going to be quite long because it involves describing the history and the current APIs.

History lessons
---------------

FUSE has several levels of API. Although demo/simple programs use the high level API, libguestfs and any serious FUSE user is going to use the medium or low level APIs because that offers a lot more control.

The way we use the medium level API is that we call [ignoring error handling etc]:

  ch = fuse_mount (the_mount_point, &some_options);
  f = fuse_new (ch, &some_options, &operations, sizeof operations, opaque);
  fuse_loop (f);

The important point here is that fuse_mount calls into the kernel to create the mountpoint, fuse_new (I believe) just creates a userspace object, and fuse_loop is where filesystem requests get processed.

In other words, there is a gap between where the mountpoint is created in the kernel, and when filesystem operations (requests) can be processed by that same mountpoint. If some other userspace process comes along in that time, say reading the mountpoint, the userspace process is blocked until request processing begins.

There's no race condition here. As long as fuse_loop is called, sooner or later processes accessing the mountpoint will get unblocked and be able to complete filesystem operations on the new mountpoint.

Normally the kernel does not perform any filesystem operations on the mountpoint during the mount. If it did, that would cause a deadlock because fuse_mount would never be able to return, and so fuse_loop would never be entered, and so requests would never start to be processed.

Unfortunately, with SELinux this changes. SELinux (for perfectly good reasons) has to determine the SELinux label of the mountpoint. This is stored in the FUSE filesystem, so SELinux has to issue a getxattr call to get that.
It has to do it during the mount, since otherwise it is possible that a process could jump in just before SELinux has worked out the label and run some operation against the unlabelled filesystem (ie. it could be a security problem). So SELinux issues this getxattr call during the mount (fuse_mount), resulting in the deadlock described in the previous paragraph.

Of course it isn't going to work if SELinux starts unconditionally doing getxattr calls on FUSE filesystems. FUSE filesystems which follow the long-established FUSE API would all deadlock.

Therefore SELinux has to distinguish between what we might call "traditional FUSE API filesystems" and "FUSE filesystems that are able to handle xattr during mount". It does this by having two labels which I believe are:

  fs_noxattr_type(fusefs_t)    (traditional API)
  fs_type(fusefs_t)            (can handle xattr during mount)

[https://bugzilla.redhat.com/show_bug.cgi?id=812798#c17]

So how do we actually handle xattr during mount?
------------------------------------------------

Well it's not easy. Jeff Darcy actually explains it better than I could, so go and read that instead:

https://bugzilla.redhat.com/show_bug.cgi?id=811217#c4

Basically it involves having multiple threads or processes, opening /dev/fuse explicitly (which I believe is even lower level than the low-level FUSE API -- we might need to patch libfuse), and passing the fd between threads.

More history lessons: the libguestfs API
----------------------------------------

Originally FUSE was implemented in a separate program (guestmount). However we realized soon after that FUSE functionality was pretty useful for all libguestfs API users, and for that reason we reimplemented FUSE support as a libguestfs API:

http://libguestfs.org/guestfs.3.html#mount-local

guestmount is now just a thin wrapper that does command-line parsing and calls into this API.
But you don't need to use guestmount, you can call the API directly as in this example:

http://libguestfs.org/guestfs-examples.3.html#example:-the-mount-local-api

The API is basically split into two parts:

guestfs_mount_local() calls:

  ch = fuse_mount (the_mount_point, &some_options);
  f = fuse_new (ch, &some_options, &operations, sizeof operations, g);

guestfs_mount_local_run() calls:

  fuse_loop (f);

(cf. traditional FUSE API as described above)

The libguestfs API "ignores" threads: Callers have to promise not to reuse the same guestfs_h* handle in two threads at the same time. The mount-local part of the libguestfs API also ignores threads [to some extent, this is not the whole truth].

So how do we do this in libguestfs?
-----------------------------------

The current mount-local API model simply does not work for this case. That means we need a new API to handle it.

What exactly this new API looks like is not currently very clear to me. First of all we'd need to write a standalone FUSE program which works right with SELinux (or examine glusterfs very carefully). That would give us an idea of what the shape of the new API might be.

Threads are going to be an issue here.

Also backwards compatibility is going to be an issue. We absolutely can not break existing mount-local API users.
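
To make the ordering problem concrete, here is a minimal sketch in plain C. It is a simulation only: nothing in it talks to FUSE, libguestfs or the kernel. The names (struct channel, serve_requests, simulated_mount, mount_with_helper) and the label string returned are all hypothetical. Two pipes stand in for the FUSE channel, simulated_mount plays the part of a mount call that cannot return until a getxattr request is answered, and a forked helper process plays the part of the request loop. The point is only that something must already be serving requests while the mount call is blocked.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the FUSE channel: requests flow through req,
   replies through rep.  Nothing here touches the real /dev/fuse. */
struct channel { int req[2]; int rep[2]; };

/* Plays the role of the request loop: answer getxattr requests until EOF. */
static void serve_requests(int req_fd, int rep_fd)
{
    char name[64];
    ssize_t n;

    while ((n = read(req_fd, name, sizeof name - 1)) > 0) {
        name[n] = '\0';
        /* Illustrative label only; a real filesystem would look it up. */
        const char *label = strcmp(name, "security.selinux") == 0
            ? "system_u:object_r:fusefs_t:s0" : "";
        if (write(rep_fd, label, strlen(label) + 1) < 0)
            break;
    }
}

/* Plays the role of a mount call with SELinux enabled: it asks for the
   label and cannot return until somebody answers. */
static int simulated_mount(struct channel *ch, char *label, size_t len)
{
    static const char attr[] = "security.selinux";
    ssize_t n;

    assert(len > 0);
    if (write(ch->req[1], attr, strlen(attr)) < 0)
        return -1;
    n = read(ch->rep[0], label, len - 1);   /* blocks until answered */
    if (n <= 0)
        return -1;
    label[n] = '\0';
    return 0;
}

/* Start the helper BEFORE mounting, so the request made during the mount
   has someone to answer it.  Returns 0 and fills label on success. */
int mount_with_helper(char *label, size_t len)
{
    struct channel ch;
    pid_t pid;
    int r;

    if (pipe(ch.req) == -1)
        return -1;
    if (pipe(ch.rep) == -1) {
        close(ch.req[0]); close(ch.req[1]);
        return -1;
    }
    pid = fork();
    if (pid == -1) {
        close(ch.req[0]); close(ch.req[1]);
        close(ch.rep[0]); close(ch.rep[1]);
        return -1;
    }
    if (pid == 0) {                         /* helper: serve requests */
        close(ch.req[1]);
        close(ch.rep[0]);
        serve_requests(ch.req[0], ch.rep[1]);
        _exit(0);
    }
    close(ch.req[0]);
    close(ch.rep[1]);
    r = simulated_mount(&ch, label, len);
    close(ch.req[1]);                       /* EOF stops the helper */
    close(ch.rep[0]);
    waitpid(pid, NULL, 0);
    return r;
}
```

If the helper were started only after simulated_mount returned (the analogue of calling fuse_loop after fuse_mount), the read in simulated_mount would block forever, which is the deadlock described above.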
A few more thoughts about this, mainly notes to self ...

(1) You can set (eg) user xattrs via FUSE at the moment. For example in a guestmount-ed disk:

  $ setfattr home -n user.test -v system_u:object_r:home_root_t:s0
  $ getfattr -d -m ^user home
  # file: home
  user.test="system_u:object_r:home_root_t:s0"

I checked with guestmount and the lsetxattr call is passed through to libguestfs:

  libguestfs: trace: lsetxattr "user.test" "system_u:object_r:home_root_t:s0" 32 "/home"
  libguestfs: trace: lsetxattr = 0
  libguestfs: trace: lgetxattrs "/home"
  libguestfs: trace: lgetxattrs = <struct guestfs_xattr_list *>

(2) However the security.selinux attribute is handled specially by some layer in the host kernel. Writes are not permitted:

  $ setfattr home -n security.selinux -v system_u:object_r:home_root_t:s0
  setfattr: home: Operation not supported

And reads always return a fixed value:

  $ getfattr -d -m ^security home
  # file: home
  security.selinux="system_u:object_r:fusefs_t:s0"

I checked in guestmount, and libguestfs does not even see a lgetxattr call in this case.

(3) If we get SELinux labels working over FUSE, it's not clear to me what will happen if you label a guest file with a label which is not known by the host SELinux policy. (Say for example you need to label a RHEL 6 guest, using a Fedora 20 host). It may be that setting the security.selinux attribute can be made to work (ie. using lsetxattr, but not setfilecon).
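
For programs that set xattrs themselves rather than through setfattr, the experiment in (1) can be sketched with the Linux lsetxattr(2) and lgetxattr(2) system calls. This is a minimal, Linux-only sketch; the function name xattr_round_trip is hypothetical, and whether the calls succeed depends on the filesystem and on the attribute namespace (per (2), a security.selinux write on the FUSE mount is expected to fail).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Set name=value on path (without following symlinks), then read it back.
   Returns 0 if the value read back matches, -1 with errno set otherwise. */
int xattr_round_trip(const char *path, const char *name, const char *value)
{
    char buf[256];
    size_t len;
    ssize_t n;

    assert(path && name && value);
    len = strlen(value);

    if (lsetxattr(path, name, value, len, 0) == -1)
        return -1;                  /* e.g. ENOTSUP, as seen in (2) */

    n = lgetxattr(path, name, buf, sizeof buf);
    if (n == -1)
        return -1;

    if ((size_t) n != len || memcmp(buf, value, len) != 0) {
        errno = EIO;                /* stored value differs from what we set */
        return -1;
    }
    return 0;
}
```

Calling xattr_round_trip("home", "user.test", "system_u:object_r:home_root_t:s0") inside the mount would correspond to the setfattr/getfattr pair in (1); the value is 32 bytes long, which matches the length shown in the lsetxattr trace.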
I reread the bug history here, and realized that I could make it work by entirely disabling SELinux on the build server - to make it work on the client side. Kind of ironic, but it's OK for now.

That de-escalates the priority of this bug a lot for me - I have more work to do to be sure updates work right wrt. SELinux on the client side, which is more important.

For (3) - it should work as long as the writing process has "mac_admin".
An alternative workaround for this would be to avoid FUSE, and have programs directly call into the libguestfs API to set the xattrs. That would be *really* painful to do though for OSTree, because it's heavily oriented around writing to the raw filesystem APIs.

It might be possible to split the writes so that "normal" stuff goes via FUSE, but all xattrs are done in a second pass where we unmount the FUSE mount, then use the libguestfs API just for xattrs.
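
The two-pass split could be sketched as a queue of deferred xattrs: during the FUSE pass each xattr is recorded instead of written, and after unmounting the queue is replayed through a setter callback. Everything here (xattr_setter, xattr_queue, queue_add, queue_replay, counting_setter) is a hypothetical illustration, not an existing API. The callback's argument order (name, value, length, path) follows the lsetxattr trace shown earlier in this bug, so a real setter could plausibly wrap the libguestfs lsetxattr call; counting_setter is only a stand-in that counts invocations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical setter callback; returns 0 on success, -1 on error. */
typedef int (*xattr_setter)(void *opaque, const char *name,
                            const char *val, int vallen, const char *path);

struct deferred_xattr {
    char *path, *name, *val;
    struct deferred_xattr *next;
};

struct xattr_queue {
    struct deferred_xattr *head;
    struct deferred_xattr **tail;   /* where the next entry gets linked */
};

static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

void queue_init(struct xattr_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Pass 1: remember the xattr instead of writing it through FUSE. */
int queue_add(struct xattr_queue *q, const char *path,
              const char *name, const char *val)
{
    struct deferred_xattr *x = calloc(1, sizeof *x);
    if (!x)
        return -1;
    x->path = copy_string(path);
    x->name = copy_string(name);
    x->val = copy_string(val);
    if (!x->path || !x->name || !x->val) {
        free(x->path); free(x->name); free(x->val); free(x);
        return -1;
    }
    *q->tail = x;
    q->tail = &x->next;
    return 0;
}

/* Pass 2: replay in insertion order.  Stops calling the setter after the
   first failure, but always frees every entry and empties the queue. */
int queue_replay(struct xattr_queue *q, xattr_setter set, void *opaque)
{
    struct deferred_xattr *x, *next;
    int r = 0;

    for (x = q->head; x; x = next) {
        next = x->next;
        if (r == 0 &&
            set(opaque, x->name, x->val, (int) strlen(x->val), x->path) == -1)
            r = -1;
        free(x->path); free(x->name); free(x->val); free(x);
    }
    queue_init(q);
    return r;
}

/* Stand-in setter for illustration: just counts how often it is called. */
int counting_setter(void *opaque, const char *name,
                    const char *val, int vallen, const char *path)
{
    (void) name; (void) val; (void) path;
    assert(vallen >= 0);
    ++*(int *) opaque;
    return 0;
}
```

Keeping the setter behind a callback means the FUSE-facing code needs no knowledge of libguestfs, which matters for a tool written against the raw filesystem APIs.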
(In reply to Colin Walters from comment #5)
> An alternative workaround for this would be to avoid FUSE, and have programs
> directly call into the libguestfs API to set the xattrs. That would be
> *really* painful to do though for OSTree, because it's heavily oriented
> around writing to the raw filesystem APIs.
>
> It might be possible to split the writes so that "normal" stuff goes via
> FUSE, but all xattrs are done in a second pass where we unmount the FUSE
> mount, then use the libguestfs API just for xattrs.

A couple of things are happening for 1.28, being driven by ptoscano:

- There will be a "relabel this filesystem" API call. If you have
  an SELinux policy in the guest, then there will just be a single
  call you have to make to relabel the whole guest filesystem.

- We're going to make the whole API thread-safe, which means we
  can implement a multi-threaded guestmount which implements
  SELinux labels (note that my reservations in comment 3 about
  whether SELinux will allow this to work across different guest/host
  policies may still apply).
(In reply to Richard W.M. Jones from comment #6)
> - There will be a "relabel this filesystem" API call. If you have
> an SELinux policy in the guest, then there will just be a single
> call you have to make to relabel the whole guest filesystem.

That will likely work for "mainline", but it's unlikely to work for OSTree-based installs. In the OSTree model there is more than one OS in a physical storage - potentially many. Each with a potentially different SELinux policy.

So what I am currently doing is using the "default deployment" (ie the first in the boot order) to relabel the disk and itself:

https://git.gnome.org/browse/ostree/commit/?id=e11de9357cea643b45a2e5e3f94d33dbd84d9ca3

Unfortunately the OSTree model invalidates all of the "high level" virt-* tools that are expecting to find exactly one operating system in the physical /. Another good example of this is I can't use the "-i" option to guestmount because there's no /etc/fstab - that's really /ostree/deploy/fedora-atomic/deploy/$deployment/etc/fstab for a given $deployment.

That's a conversation to have somewhere else though...

> - We're going to make the whole API thread-safe, which means we
> can implement a multi-threaded guestmount which implements
> SELinux labels (note that my reservations in comment 3 about
> whether SELinux will allow this to work across different guest/host
> policies may still apply).

I think so - assuming a process has self:capability { mac_admin } it can lay down security.selinux values unknown to the system policy. Likewise if we call the raw getxattr() I believe we should see the untranslated value, even if the label isn't known to the system.
There are several issues here and I think it's best to discuss the design on the mailing list. However my brief thoughts are in-line below.

(In reply to Colin Walters from comment #7)
> (In reply to Richard W.M. Jones from comment #6)
>
> > - There will be a "relabel this filesystem" API call. If you have
> > an SELinux policy in the guest, then there will just be a single
> > call you have to make to relabel the whole guest filesystem.
>
> That will likely work for "mainline", but it's unlikely to work for
> OSTree-based installs.

The design is by no means set in stone, and so we should discuss what your requirements are and make it so that it works for the existing user [virt-customize/virt-builder] and your use-case too. If we need to have multiple APIs then we can do that too.

I will start a thread and CC you & Pino.

> In the OSTree model there is more than one OS in a
> physical storage - potentially many. Each with a potentially different
> SELinux policy.
>
> So what I am currently doing is using the "default deployment" (ie the first
> in the boot order) to relabel the disk and itself:
>
> https://git.gnome.org/browse/ostree/commit/?id=e11de9357cea643b45a2e5e3f94d33dbd84d9ca3
>
> Unfortunately the OSTree model invalidates all of the "high level" virt-*
> tools that are expecting to find exactly one operating system in the
> physical /. Another good example of this is I can't use the "-i" option to
> guestmount because there's no /etc/fstab - that's really
> /ostree/deploy/fedora-atomic/deploy/$deployment/etc/fstab for a given
> $deployment.

Indeed, and another thing that could be fixed. Note that we've already been through this with btrfs -- libguestfs can now (often) recognize multiple btrfs snapshots as different operating systems. Although the '-i' option still won't work as it currently requires a single root. (Could also be fixed ..)

> > - We're going to make the whole API thread-safe, which means we
> > can implement a multi-threaded guestmount which implements
> > SELinux labels (note that my reservations in comment 3 about
> > whether SELinux will allow this to work across different guest/host
> > policies may still apply).
>
> I think so - assuming a process has self:capability { mac_admin } it can lay
> down security.selinux values unknown to the system policy. Likewise if we
> call the raw getxattr() I believe we should see the untranslated value, even
> if the label isn't known to the system.

OK that's hopeful.
(In reply to Colin Walters from comment #7)
> (In reply to Richard W.M. Jones from comment #6)
> I think so - assuming a process has self:capability { mac_admin } it can lay
> down security.selinux values unknown to the system policy.

Correct

> Likewise if we
> call the raw getxattr() I believe we should see the untranslated value, even
> if the label isn't known to the system.

Sorta correct. If you have mac admin (both in DAC and SELinux) you will see the unknown label. If you don't, you will see unlabeled_t no matter what userspace interface you use.
Let's move discussion of the SELinux relabelling API to this thread: https://www.redhat.com/archives/libguestfs/2014-May/msg00094.html