Bug 2034630
Summary:               [libvirt] Add support to run virtiofsd inside a user namespace (unprivileged)
Product:               Red Hat Enterprise Linux 9
Component:             libvirt
libvirt sub component: Storage
Version:               9.0
Status:                CLOSED MIGRATED
Severity:              high
Priority:              medium
Target Milestone:      rc
Target Release:        ---
Hardware:              Unspecified
OS:                    Unspecified
Reporter:              Vivek Goyal <vgoyal>
Assignee:              Ján Tomko <jtomko>
QA Contact:            Lili Zhu <lizhu>
CC:                    chhu, danken, dwalsh, fjin, gmaglione, hreitz, jcall, jsuchane, jtomko, jwatt, kkiwi, mkletzan, mszeredi, slopezpa, stefanha, virt-maint, xiagao, xuzhang, yafu
Keywords:              MigratedToJIRA, Triaged
Flags:                 pm-rhel: mirror+
Type:                  Feature Request
Doc Type:              If docs needed, set a value
Last Closed:           2023-09-22 16:35:56 UTC
Description
Vivek Goyal
2021-12-21 14:49:44 UTC
Thinking more about idmapped mounts: we probably don't want to use them in the beginning, because they will shift an unprivileged id back to root. IOW, if a user creates a file as uid "1000", it might be saved as uid "0" back on disk, and if one accesses it through a non-idmapped mount it will be visible as a root-owned file. IOW, this takes us back to the risk of the VM user dropping setuid-root binaries and somehow arranging for an unprivileged entity on the host to execute them.

I guess in the simplest form we need to first wire it up without idmapped mounts and suggest a chown of the shared dir as needed. In the simplest form, we can probably run all virtiofsd instances using the same uid/gid ranges. This still makes sure there are no real root-owned files in the shared dir. And if all VMs are running as user "qemu" and different qemu processes don't have any uid/gid-based isolation between them, then it would probably be OK to run virtiofsd as "qemu" acting as pseudo-root, with all instances using the same uid/gid mapping. It simplifies setup. One downside is that one virtiofsd can attack another if it manages to break out of its sandbox.

(In reply to Vivek Goyal from comment #2)
Maybe even simpler would be just running the virtiofsd instance under the same privileges as qemu. That could be a step in the right direction in order to figure out what's needed next. It would already secure some bits, and I would imagine give us enough of a headache to make sure it works properly.

(In reply to Martin Kletzander from comment #3)
> (In reply to Vivek Goyal from comment #2)
> Maybe even simpler one would be just running the virtiofsd instance under
> the same privileges as qemu. That could be a step in the right direction in
> order to figure out what's needed next. It would already secure some bits
> and I would imagine give us enough of a headache to make sure it works
> properly.

IIUC, you are suggesting running virtiofsd as the "qemu" user without user namespaces? If yes, that will not get us far, because we lose the ability to switch between arbitrary user ids as needed by the guest in many cases (like Kata Containers).

Anyway, with user namespaces that is effectively what we are doing. The qemu user sets up a user namespace and becomes root inside it. That way it is practically running with the privileges of qemu on the host, but inside the namespace it runs as root and is able to switch between the uids/gids visible inside the user namespace.
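As a rough illustration of what is described above (an unprivileged user such as "qemu" creating a user namespace and becoming root inside it), a minimal sketch with util-linux unshare:

    # become "root" inside a new user namespace; only the caller's own
    # uid/gid is mapped, so no extra privilege is gained on the host
    unshare --user --map-root-user id
    # -> reports uid=0(root) inside the namespace; files created there
    #    are still owned by the invoking user on the host

With only this single mapping, attempts to switch to any other uid inside the namespace fail, which is exactly the limitation discussed in the following comments.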
(In reply to Vivek Goyal from comment #4)
Oh, I see. Would that require all the user IDs to be mapped to IDs in the user namespace? Wouldn't it be the same as just dropping the capabilities of virtiofsd running as root (without user namespaces)?

(In reply to Martin Kletzander from comment #5)
> (In reply to Vivek Goyal from comment #4)
> Oh, I see. Would that require all the user IDs being mapped to the IDs in
> the user namespace? Wouldn't it be the same like just dropping the
> capabilities for virtiofsd running as root (without user namespaces)?

Yes, setting up a user namespace will require assigning some subuids/subgids to the "qemu" user and then mapping those subuids/subgids into the newly launched user namespace. In theory you can launch a new user namespace by mapping only the "qemu" user as "root" inside it and nothing else, but in that case virtiofsd will fail to switch to other uids/gids and the guest will see all of those errors.

So to support filesystem semantics with multiple uids/gids and switching between them, we will need to create a user namespace with the qemu user's subuids/subgids mapped into it. I think one open question here is what range of subuids/subgids the qemu user can use so that it does not conflict with other use cases. Right now I don't see any sort of fixed allocation of subuids/subgids per user, so if we arbitrarily choose a range, it can conflict with other use cases.

I am copying Dan Walsh; he might know if something has happened in this area. Until then, I guess we can ask the system admin to allocate a range of subuids/subgids to the qemu user which libvirt will make use of. This involves a manual step on the sysadmin's part, which is not good, and we will need to get rid of it somehow later. But that's not necessarily a libvirt problem; it's more of a subuid/subgid allocation problem in the system/cluster.

Dan, has there been any progress on the issue of how to allocate subuids/subgids for a user in an automated way (without conflicting with other users)? We might need a chunk of subuid/subgid allocation for the "qemu" user.

In Podman we suggest that users allocate the top 2 billion UIDs for the range in /etc/subuid and /etc/subgid with the user name "containers".

    podman run --userns=auto

grabs unused (by it) 64k blocks from that range to assign to its containers. I would suggest that libvirt do the same. It would be nice if there was some way to collaborate, but right now the only standard is the /etc/subuid and /etc/subgid files. With the support of libsubid, these files can now come from the network, which might make this easier or more difficult.

(In reply to Daniel Walsh from comment #8)
> In Podman we suggest that users allocate the top 2 billion UIDs for the
> range in /etc/subuid and /etc/subgid with the user name containers.
>
> podman run --userns=auto
>
> Grabs unused (By it) 64k Blocks from that range to assign to its containers.
> I would suggest that libvirt do the same. It would be nice if there was some
> way to collaborate, but right now the only standard is the /etc/subuid and
> /etc/subgid files.
>
> With the support of libsubid, now these files can come from the internet,
> which might make this more easier or more difficult.

Hmmm, so the uid/gid range is probably 32-bit, which is around 4 billion uids. With the top 2 billion gone to Podman, the next range could be picked by libvirt. Libvirt might not have to grab that big a range, since the number of VMs launched is unlikely to be as high as for native containers. If a 64K range is given to each VM and one launches at most 512/1024 VMs on a host, then maybe 32 or 64 million subuids/subgids could be reserved for the qemu user.

Anyway, it looks like for now reserving subuids/subgids will be done by the sysadmin, and libvirt will just need to either use a user-specified subuid/subgid range or automatically pick a subrange. We should probably start simple, i.e. allow the user to specify the subuid/subgid range in the virtiofsd configuration; libvirt just needs to verify that the user owns these subuids/subgids before mapping them. Automatic selection of subuid/subgid ranges should probably be the next step.
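A sketch of what such a manual allocation for the qemu user could look like. The start offset and count are illustrative only (a hypothetical 64M block, chosen below the range used by "containers"); any real range must not overlap other users' entries:

    # one line in each of /etc/subuid and /etc/subgid, format "user:first_subid:count"
    qemu:1000000000:67108864

With such entries in place, newuidmap/newgidmap (from shadow-utils), or whatever libvirt ends up using, can map ids from that block into the virtiofsd user namespace.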
I played around with 'unshare' and the C impl of virtiofsd (qemu-common-6.2.0-2.fc34.x86_64):

* fv_socket_lock tries to mkdir /var/run/virtiofsd to lock the pathname; I would not expect it to do that for unprivileged users
* even the chroot sandbox requires capabilities (I have not investigated further what caps are required by the rest of virtiofsd)

Libvirt already does uid_map for libvirt_lxc containers:
https://libvirt.org/formatdomain.html#container-boot
so it should be possible to reuse some of that code.

Asking the libvirt user to provide the uidmap should be easily doable. Automatic assignment for the 'qemu' user should be theoretically possible in libvirtd alone, since no other program needs to reserve subuids so far. But to make it usable for regular users with unprivileged libvirt, some sort of coordination is needed, and virtiofsd can't require all the capabilities it has now (but that might be out of scope of this bug).

(In reply to Ján Tomko from comment #10)
> I played around with 'unshare' and the C impl of virtiofsd
> (qemu-common-6.2.0-2.fc34.x86_64):

The current C version of virtiofsd does not run well inside user namespaces. I think somebody tried it and it did not work; we never put effort into investigating and making it work. The Rust virtiofsd works inside user namespaces, and given that it is the future, we are not planning to add user namespace support to the C version. So any testing we do with user namespaces needs to be done with the Rust virtiofsd:
https://gitlab.com/virtio-fs/virtiofsd

> * fv_socket_lock tries to mkdir /var/run/virtiofsd to lock the pathname,
>   I would not expect it to do that for unprivileged users

Agreed. Where are we supposed to store the local state of an unprivileged user? We use qemu_get_local_state_pathname(). I assumed that it would automatically give some path, say in $HOME/.local/, and then we could save the pid file in, say, $HOME/.local/virtiofsd/<pidfile>. Not sure what qemu's convention for unprivileged users is.

> * even the chroot sandbox requires capabilities (I have not investigated
>   further what caps are required by the rest of virtiofsd)

The Rust version seems to work reasonably well. We can't enable certain features like --inode-file-handles. Down the line we will not be able to use SELinux either, as that needs to set trusted xattrs and requires CAP_SYS_ADMIN. IOW, we expect that not all features will work when running unprivileged. So at the end of the day it will be a trade-off between security and functionality/feature richness.

> Libvirt already does uid_map for libvirt_lxc containers:
> https://libvirt.org/formatdomain.html#container-boot
> so it should be possible to reuse some of that code.

Nice. I am assuming you are referring to <idmap>. This looks good.

> Asking the libvirt user to provide the uidmap should be easily doable.
>
> Automatic assignment for the 'qemu' user should be theoretically possible in
> libvirtd alone since no other program needs to reserve subuids so far.

Sounds reasonable. I think Dan was mentioning that there are upstream changes which automatically allocate a small subuid/subgid range to a user upon creation. I am not sure about the details, though.

> But to make it usable for regular users with unprivileged libvirt, some sort
> of coordination is needed and virtiofsd can't require all the capabilities it
> has now (but that might be out of scope of this bug)
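For reference, the <idmap> element from the LXC documentation linked above looks roughly like this (the values are illustrative; whether the virtiofs support ends up reusing this exact element is what the rest of the thread discusses):

    <idmap>
      <uid start='0' target='1000' count='10'/>
      <gid start='0' target='1000' count='10'/>
    </idmap>

Here guest-side ids 0-9 would be backed by host ids 1000-1009.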
Can we get an update? Has this RFE been planned for? How should we prioritize it?

(In reply to Klaus Heinrich Kiwi from comment #12)
> Can we get an update? Has this RFE been planned for? How should we
> prioritize it?

Assigning medium priority/severity according to my understanding of the issue, and it looks like Ján targeted it for RHEL 9.2.0, which is good. Thanks.

Containers are just allocating the top half of all UIDs in the /etc/subuid and /etc/subgid files:

    containers:2147483647:2147483648

There is no way I know of yet to coordinate this other than those two files.

Upstream we have a few requests for this feature; GNOME Boxes is also waiting for it [0] to be able to support virtio-fs.

[0] https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/292

May I increase the severity of this bug, as it blocks consumption of virtiofs by KubeVirt and OpenShift Virtualization.

(In reply to Dan Kenigsberg from comment #20)
> May I increase the severity of this bug, as it blocks consumption of
> virtiofs by KubeVirt and OpenShift Virtualization.

I am wondering what the plan is w.r.t. usage of user namespaces in KubeVirt. If KubeVirt decides to use user namespaces and launch all its pods in user namespaces of their own, then virtiofsd can probably run in that same user namespace (and not set up one of its own separately). I think that will be a simpler model than virtiofsd sandboxing itself using a user namespace. Does KubeVirt have any plans to start making use of user namespaces for its pods?

v1 proposed upstream:
https://listman.redhat.com/archives/libvir-list/2023-September/242012.html

(In reply to Ján Tomko from comment #23)
> v1 proposed upstream:
> https://listman.redhat.com/archives/libvir-list/2023-September/242012.html

Is the idea to create a user namespace and then launch virtiofsd, or will libvirt pass the id mapping to virtiofsd's --uid-map/--gid-map parameters?

(In reply to Ján Tomko from comment #23)
> v1 proposed upstream:
> https://listman.redhat.com/archives/libvir-list/2023-September/242012.html

I'd suggest not using virtiofsd's --uid-map/--gid-map parameters; in the future we want to move all the sandboxing code to an external tool (but we can discuss it).
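For context, passing the mapping straight to virtiofsd would mean libvirt launching it with something conceptually like the sketch below. The binary path, socket path, shared directory, id range, and in particular the option value syntax are assumptions for illustration only; consult virtiofsd --help for the version in use:

    # hypothetical invocation: map guest-side ids onto the qemu user's
    # subordinate id block (all values and the mapping syntax are illustrative)
    /usr/libexec/virtiofsd \
        --socket-path=/run/user/107/virtiofsd-fs0.sock \
        --shared-dir=/srv/shared \
        --uid-map=:0:1000000000:65536: \
        --gid-map=:0:1000000000:65536: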
(In reply to German Maglione from comment #25)
> (In reply to Ján Tomko from comment #23)
> > v1 proposed upstream:
> > https://listman.redhat.com/archives/libvir-list/2023-September/242012.html
>
> I'll suggest not using virtiofsd's --uid-map/--gid-map parameters, in the
> future we want to remove all the sandboxing code to an external tool (but we
> can discuss it)

The version (or rather, draft) I wrote uses virtiofsd's --uid-map. So far, the user needs to allocate and specify the uids themselves.

It also allows running as non-root without ID mapping (I hadn't noticed when virtiofsd started to support that).

As for sandboxing, having libvirt create a user namespace for virtiofsd seems reasonable to me (but unnecessary for KubeVirt, if they create their own namespace for virtiofsd). But I can't imagine how libvirt would set up seccomp or separate capabilities for different threads of virtiofsd, so from libvirt's point of view this new tool would become a wrapper for virtiofsd (preferably it could find out about it from the virtiofsd.json file).

(In reply to Ján Tomko from comment #27)
> The version (or rather - draft) I wrote uses virtiofsd's --uid-map. So far,
> the user needs to allocate and specify the uids themself.
>
> It also allows running as non-root without ID mapping (I haven't noticed when
> virtiofsd started to support that).

Recently, since version 1.6.0.

> As for sandboxing, having libvirt create a user namespace for virtiofsd seems
> reasonable to me (but unnecessary for the KubeVirt, if they create their own
> namespace for virtiofsd).

I agree with you, I don't think libvirt user NS support is necessary for KubeVirt.

> But I can't imagine how libvirt would set up seccomp or separate capabilities
> for different threads of virtiofsd, so from libvirt's point of view this new
> tool would become a wrapper for virtiofsd (preferably it could find out about
> it from the virtiofsd.json file)

(I probably use the term sandboxing too broadly.) The idea is to create a user namespace and launch virtiofsd as "fake" root inside it with --sandbox=chroot or --sandbox=none; virtiofsd will drop capabilities and set the seccomp filters, so from virtiofsd's point of view it will be running as root. But if you think virtiofsd should be the one creating the user namespace instead of libvirt, we can work on the required changes when we move the sandboxing code to an external tool (which will not be soon, btw).
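A rough shell sketch of the flow described above: enter a user namespace as "fake" root and start the Rust virtiofsd there. The binary location and paths are illustrative, and this minimal form maps only the invoking user to root; a real setup would additionally map a subordinate id range (via newuidmap/newgidmap, or the --map-users/--map-groups options of newer util-linux unshare) so virtiofsd can switch between ids:

    # become "root" inside a new user namespace and run virtiofsd there;
    # --sandbox=none or --sandbox=chroot, per the discussion above
    unshare --user --map-root-user \
        /usr/libexec/virtiofsd \
            --socket-path=/tmp/vhost-fs0.sock \
            --shared-dir=/srv/shared \
            --sandbox=none

The resulting vhost-user socket can then be handed to QEMU in the usual way.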
(In reply to Vivek Goyal from comment #21)
> (In reply to Dan Kenigsberg from comment #20)
> > May I increase the severity of this bug, as it blocks consumption of
> > virtiofs by KubeVirt and OpenShift Virtualization.
>
> I am wondering what the plan is w.r.t. usage of user namespaces in KubeVirt.
>
> If KubeVirt decides to use user namespaces and launch all its pods in user
> namespaces of their own, then virtiofsd can probably run in that same user
> namespace (and not set up one of its own separately). I think that will be a
> simpler model than virtiofsd sandboxing itself using a user namespace.
>
> Does KubeVirt have any plans to start making use of user namespaces for its
> pods?

I may well have misunderstood the nature of this bug. I raised attention to it only because it is the only RHEL bug marked as blocking https://issues.redhat.com/browse/CNV-27131 . If this is not correct, please fix the dependency. I suspect that I should have asked for higher priority on https://issues.redhat.com/browse/RHELPLAN-165875 instead.

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated.

Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of two footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

    "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.