Bug 2034630 - [libvirt] Add support to run virtiofsd inside a user namespace (unprivileged) [NEEDINFO]
Summary: [libvirt] Add support to run virtiofsd inside a user namespace (unprivileged)
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Assignee: Ján Tomko
QA Contact: Lili Zhu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-21 14:49 UTC by Vivek Goyal
Modified: 2023-08-15 14:35 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Feature Request
Target Upstream Version:
Embargoed:
vgoyal: needinfo? (jtomko)




Links
Red Hat Issue Tracker CNV-27131 (last updated 2023-08-15 10:47:32 UTC)
Red Hat Issue Tracker RHELPLAN-106361 (last updated 2021-12-21 14:51:56 UTC)

Description Vivek Goyal 2021-12-21 14:49:44 UTC
Description of problem:

virtiofsd (rust) now has the capability to run unprivileged (inside a user namespace). The next step is to figure out how to integrate this capability into libvirt and where it will make the most sense.

Thanks to German, who configured, tested, and provided instructions on how to run virtiofsd unprivileged.

https://listman.redhat.com/archives/virtio-fs/2021-December/msg00054.html

One big advantage of running virtiofsd as non-root is improved system security: a guest can no longer drop setuid-root binaries on the host and somehow manage to execute them, and there are more examples like this. So if we can run virtiofsd unprivileged, I think that's a huge win in terms of system security w.r.t. virtiofsd.

One limitation of running unprivileged is that one cannot create block or character device nodes; that is not allowed.

Hence I am opening this bug to figure out how this functionality can be integrated into libvirt so that users can enable this unprivileged/rootless mode.

I think if users are running unprivileged VMs, it will make sense to run virtiofsd unprivileged. Even in the normal case, I think qemu runs as the "qemu" user, so it will probably make sense not to run virtiofsd as root but as the "qemu" user instead.

There will probably be a few dependencies:

- We need to allocate a range of uid/gid to the qemu user (a hypothetical /etc/subuid entry is sketched after this list).

- We need to select a range dynamically to set up a user namespace. Maybe a
  range of 64K for each virtiofsd instance. If a filesystem is being shared by
  multiple VMs, they will have to use the same uid/gid range.

- Depending on the use case, the shared directory will either have to be chowned
  according to the uid/gid range of the user namespace, or one will have to create
  an idmapped mount mapping. Basic idmapped mount support is now upstream, so
  that's not a technology barrier anymore.
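
For illustration, the allocation mentioned in the first item could look like the following /etc/subuid and /etc/subgid entries (the format is name:start:count; the range here is purely hypothetical):

  # /etc/subuid and /etc/subgid -- grant the qemu user a block of subordinate IDs
  # (example: 512 VMs x 64K ids = 33554432; the start value is chosen arbitrarily)
  qemu:1879048192:33554432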


Opening this bug to carry out the conversation about how libvirt and its users can benefit from this new unprivileged virtiofsd mode and how to integrate it up the stack.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Vivek Goyal 2021-12-21 15:40:51 UTC
Thinking more about idmapped mounts: we probably don't want to use them in the beginning, because they will shift the unprivileged id back to root. IOW, if a file is created as uid "1000", it might be saved as uid "0" back on disk, and if one accesses it through a non-idmapped mount, it will be visible as a root-owned file.

IOW, this takes us back to the risk of the VM user dropping setuid-root binaries and somehow arranging for an unprivileged entity on the host to execute them.

I guess in the simplest form we need to first wire it up without idmapped mounts and suggest chowning the shared dir as needed.
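
A minimal sketch of that chown approach, assuming the hypothetical subuid block from the description (1879048192) and a shared tree that is currently all root-owned:

  # shift ownership of the shared dir into the uid/gid block mapped into the
  # user namespace; a tree with mixed ownership would need per-uid shifting
  # (or, later, an idmapped mount)
  chown -R 1879048192:1879048192 /path/to/shared/dir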

Comment 2 Vivek Goyal 2021-12-21 20:45:44 UTC
In the simplest form, we can probably run all virtiofsd instances using the same uid/gid ranges. This still makes sure there are no real root-owned files in the shared dir.

And if all VMs are running as user "qemu" and different qemu processes don't have any uid/gid based isolation between
them, then it would probably be OK to run virtiofsd as "qemu" acting as pseudo-root, with all processes using the same uid/gid mapping.

It simplifies setup. One downside is that one virtiofsd can attack another if it manages to break out of its sandbox.

Comment 3 Martin Kletzander 2022-01-21 09:44:30 UTC
(In reply to Vivek Goyal from comment #2)
Maybe an even simpler option would be just running the virtiofsd instance under the same privileges as qemu.  That could be a step in the right direction in order to figure out what's needed next.  It would already secure some bits and I would imagine give us enough of a headache to make sure it works properly.

Comment 4 Vivek Goyal 2022-01-21 14:46:52 UTC
(In reply to Martin Kletzander from comment #3)
> (In reply to Vivek Goyal from comment #2)
> Maybe an even simpler option would be just running the virtiofsd instance
> under the same privileges as qemu.  That could be a step in the right
> direction in order to figure out what's needed next.  It would already
> secure some bits and I would imagine give us enough of a headache to make
> sure it works properly.

IIUC, you are suggesting that we run virtiofsd as the "qemu" user without user namespaces?

If yes, that will not go too far, because we lose the ability to switch between arbitrary user ids as needed by the guest in
many cases (like Kata Containers).

Anyway, with user namespaces, that's what we are effectively doing: the qemu user will set up a user namespace and become root inside it. That way it is practically running with the privileges of qemu on the host, but inside the namespace it runs as root and is able to switch between the uid/gids visible inside the user namespace.
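
For illustration, this is the effect described above, using util-linux unshare as a stand-in for what libvirt would do when spawning virtiofsd:

  # run as the unprivileged qemu user: become uid 0 inside a new user namespace
  # while remaining the qemu uid on the host
  unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
  # prints "0" and a one-line map along the lines of:  0  <qemu-uid>  1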

Comment 5 Martin Kletzander 2022-01-24 08:17:34 UTC
(In reply to Vivek Goyal from comment #4)
Oh, I see.  Would that require all the user IDs being mapped to the IDs in the user namespace?  Wouldn't it be the same as just dropping the capabilities for virtiofsd running as root (without user namespaces)?

Comment 6 Vivek Goyal 2022-01-24 14:03:10 UTC
(In reply to Martin Kletzander from comment #5)
> (In reply to Vivek Goyal from comment #4)
> Oh, I see.  Would that require all the user IDs being mapped to the IDs in
> the user namespace?  Wouldn't it be the same as just dropping the
> capabilities for virtiofsd running as root (without user namespaces)?

Yes, setting up a user namespace will require assigning some subuids/subgids to the "qemu" user and then mapping those
subuids/subgids into the newly launched user namespace.

In theory you can launch a new user namespace by just mapping the "qemu" user to "root" inside the user namespace with no
other mappings. But in that case, virtiofsd will fail to switch to other uids/gids and the guest will see all
those errors.

So to support filesystem semantics with multiple uids/gids and switching between them, we will need to
create a user namespace with the qemu user's subuids/subgids mapped into it.
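
A rough sketch of that mapping using the shadow-utils helpers (the pid and the subuid block are placeholders; libvirt would presumably reuse its existing LXC idmap machinery rather than shelling out):

  # after the child (running as qemu) has done unshare(CLONE_NEWUSER),
  # write its maps from the parent namespace using qemu's subordinate ranges
  newuidmap $CHILD_PID 0 1879048192 65536   # inside uids 0-65535 -> qemu's subuid block
  newgidmap $CHILD_PID 0 1879048192 65536   # same for gids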

I think one open question here is what range of subuids/subgids the qemu user can use so that it does not
conflict with other use cases. Right now I don't see any sort of fixed allocation of subuids/subgids per
user, so if we arbitrarily choose a range, it can conflict with other use cases.

I am copying Dan Walsh. He might know if something has happened in this area.

Until then, I guess we can ask the system admin to allocate a range of subuids/subgids to the qemu user, which libvirt will make use of. This involves a manual step on the sysadmin's part, which is not good, and we will need to get rid of it somehow later. But that's not necessarily a libvirt problem; it's more of a subuid/subgid allocation problem in the system/cluster.
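
For that manual step, what we would ask the sysadmin to do is roughly this (recent shadow-utils; the range is the same hypothetical block as above):

  # reserve a block of subordinate uids/gids for the qemu user
  usermod --add-subuids 1879048192-1912602623 --add-subgids 1879048192-1912602623 qemu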

Comment 7 Vivek Goyal 2022-01-24 14:05:27 UTC
Dan, has there been any progress on the issue of how to allocate subuids/subgids for a user in an automated way (without conflicting with other users)? We might need a chunk of subuid/subgid allocation for the "qemu" user.

Comment 8 Daniel Walsh 2022-01-24 14:12:51 UTC
In Podman we suggest that users allocate the top 2 billion UIDs as the range in /etc/subuid and /etc/subgid under the user name "containers".

podman run --userns=auto

It grabs unused (by it) 64K blocks from that range to assign to its containers.  I would suggest that libvirt do the same. It would be nice if there was some way to collaborate, but right now the only standard is the /etc/subuid and /etc/subgid files.

With the support of libsubid, these files can now come from the internet, which might make this easier or more difficult.
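
For reference, the block Podman picked for a given container can be inspected from inside it (illustrative command; the alpine image is just an example):

  # podman chooses an unused block from the "containers" range for this container
  podman run --rm --userns=auto alpine cat /proc/self/uid_map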

Comment 9 Vivek Goyal 2022-01-24 14:51:14 UTC
(In reply to Daniel Walsh from comment #8)
> In Podman we suggest that users allocate the top 2 billion UIDs as the
> range in /etc/subuid and /etc/subgid under the user name "containers".
> 
> podman run --userns=auto
> 
> It grabs unused (by it) 64K blocks from that range to assign to its
> containers.  I would suggest that libvirt do the same. It would be nice if
> there was some way to collaborate, but right now the only standard is the
> /etc/subuid and /etc/subgid files.
> 
> With the support of libsubid, these files can now come from the internet,
> which might make this easier or more difficult.


Hmmm, so the uid/gid range is probably 32-bit, which is around 4 billion uids. The top 2 billion have gone to Podman, so maybe the next range can be picked by libvirt.

Libvirt might not have to grab that big a range, as the number of VMs launched might not be as high as for native containers.

So if a 64K range is given to each VM, and one might at most launch 512/1024 VMs on a host, then maybe 32 million
or 64 million subuids/subgids could be reserved for the qemu user.

Anyway, it looks like for now reserving subuids/subgids will be done by the sysadmin, and libvirt will just need to either use
a user-specified subuid/subgid range or automatically pick one subrange.

We should probably start simple: allow the user to specify a subuid/subgid range in the virtiofsd configuration. libvirt will just need to verify that the user has ownership of these subuids/subgids before mapping them.
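
That ownership check could be as simple as comparing the requested range against the qemu user's entries (a sketch; libvirt could equally query this through libsubid):

  # list qemu's subordinate ID allocations; the requested subuid/subgid range
  # must fall entirely within one of these entries
  grep '^qemu:' /etc/subuid /etc/subgid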

Automatic selection of subuid/subgid ranges should probably be the next step.

Comment 10 Ján Tomko 2022-02-04 15:36:14 UTC
I played around with 'unshare' and the C impl of virtiofsd (qemu-common-6.2.0-2.fc34.x86_64):

* fv_socket_lock tries to mkdir /var/run/virtiofsd to lock the pathname;
I would not expect it to do that for unprivileged users.
* even the chroot sandbox requires capabilities (I have not investigated further what caps
are required by the rest of virtiofsd)

Libvirt already does uid_map for libvirt_lxc containers:
https://libvirt.org/formatdomain.html#container-boot
so it should be possible to reuse some of that code.
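
For reference, the <idmap> element documented on that page looks like this (the target and count values here are illustrative, matching the hypothetical subuid block discussed earlier):

  <idmap>
    <!-- map guest-visible ids 0..65535 to the host subordinate block -->
    <uid start='0' target='1879048192' count='65536'/>
    <gid start='0' target='1879048192' count='65536'/>
  </idmap>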

Asking the libvirt user to provide the uidmap should be easily doable.

Automatic assignment for the 'qemu' user should be theoretically possible in libvirtd alone since
no other program needs to reserve subuids so far.

But to make it usable for regular users with unprivileged libvirt, some sort of coordination
is needed, and virtiofsd can't require all the capabilities it has now (but that might be out of scope of this bug).

Comment 11 Vivek Goyal 2022-02-04 16:51:29 UTC
(In reply to Ján Tomko from comment #10)
> I played around with 'unshare' and the C impl of virtiofsd
> (qemu-common-6.2.0-2.fc34.x86_64):
> 

The current C version of virtiofsd does not run well inside user namespaces. I think somebody tried it and it did not work;
we never put in the effort to investigate and make it work.

The Rust virtiofsd works inside user namespaces, and given that's the future, we are not planning to add user namespace
support to the C version of virtiofsd.

So any testing we do with user namespaces needs to be done with the Rust virtiofsd.

https://gitlab.com/virtio-fs/virtiofsd

> * fv_socket_lock tries to mkdir /var/run/virtiofsd to lock the pathname,
> I would not expect it to do that for unprivileged users

Agreed. Where are we supposed to store the local state of an unprivileged user? We
use qemu_get_local_state_pathname(). I assumed that it would automatically
give some path, say in $HOME/.local/, and then we could save the pid file in,
say, $HOME/.local/virtiofsd/<pidfile>.

Not sure what the qemu convention for unprivileged users is.

> * even the chroot sandbox requires capabilities (I have not investigated
> further what caps
> are required by the rest of virtiofsd)

The Rust version seems to work reasonably well. We can't enable certain features like
--inode-file-handles. Down the line we will not be able to use SELinux either, as
that needs to set trusted xattrs and requires CAP_SYS_ADMIN.

IOW, we expect that not all features will work when running unprivileged. So at
the end of the day it will be a trade-off between security and functionality/feature-richness.

> 
> Libvirt already does uid_map for libvirt_lxc containers:
> https://libvirt.org/formatdomain.html#container-boot
> so it should be possible to reuse some of that code.

Nice. I am assuming you are referring to <idmap>. This looks good.

> 
> Asking the libvirt user to provide the uidmap should be easily doable.
> 
> Automatic assignment for the 'qemu' user should be theoretically possible in
> libvirtd alone since
> no other program needs to reserve subuids so far.

Sounds reasonable. I think Dan was mentioning that there are upstream changes which
automatically allocate a small subuid/subgid range to a user upon creation. I am not
sure about the details though.

> 
> But to make it usable for regular users with unprivileged libvirt, some sort
> of coordination
> is needed and virtiofsd can't require all the capabilities it has now (but
> that might be out of scope of this bug)

Comment 12 Klaus Heinrich Kiwi 2022-05-25 15:26:37 UTC
Can we get an update? Has this RFE been planned for? How should we prioritize it?

Comment 13 Klaus Heinrich Kiwi 2022-08-15 17:20:56 UTC
(In reply to Klaus Heinrich Kiwi from comment #12)
> Can we get an update? Has this RFE been planned for? How should we
> prioritize it?

Assigning medium priority/severity according to my understanding of the issue. It looks like Ján has targeted it for RHEL 9.2.0, which is good. Thanks.

Comment 14 Daniel Walsh 2022-08-15 17:37:52 UTC
Containers are just allocating the top half of all UIDs in the /etc/subuid and /etc/subgid files.

containers:2147483647:2147483648

There is no way I know of yet to coordinate this other than those two files.

Comment 16 German Maglione 2023-03-16 14:01:50 UTC
Upstream we have a few requests for this feature; GNOME Boxes is also waiting for it [0] to be able to support virtio-fs.

[0] https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/292

Comment 20 Dan Kenigsberg 2023-08-15 10:47:32 UTC
May I increase the severity of this bug, as it blocks consumption of virtiofs by KubeVirt and OpenShift Virtualization?

