Bug 1593897 - [RFE] Support for consuming pre-setup macvtap devices
Status: CLOSED NEXTRELEASE
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Libvirt Maintainers
Reported: 2018-06-21 19:20 UTC by Fabian Deutsch
Modified: 2019-09-17 00:13 UTC
Last Closed: 2019-09-17 00:13:09 UTC

Description Fabian Deutsch 2018-06-21 19:20:02 UTC
Description of problem:
In KubeVirt we run libvirt in a pod.

Today the pod requires the CAP_NET_ADMIN capability because it is setting up the networking.

In order to also drop this privilege, it would be good if libvirt could just consume a pre-created (macv)tap device.

Or more broadly: libvirt needs to be in a position where it does not require CAP_* anymore when consuming a network device.

The details need to be sorted out, but at least along those lines.

Comment 1 Laine Stump 2018-06-29 22:56:46 UTC
A bit of long-winded background:

Until sometime in 2016 (libvirt 1.3.3 I believe) it was possible to have a guest use an existing tap device by specifying it in the <interface>:


    <interface type='ethernet'>
       <target dev='mytap'/>
       ...
    </interface>

If any setup of the tap device was required, an external script was executed to perform the setup:

    <script path='/etc/qemu-ifup-mynet'/>

Normally the script would be written to create the named device, then link it up to whatever connection was desired. The problem with this was that this was implemented via the qemu netdev "script" option, meaning that qemu had to be run as root in order to perform operations on the device, and libvirt prefers that qemu be run unprivileged (a variation of the same problem you're trying to solve).

commit 9c17d665fd changed that though - it moves creation of the tap device (and execution of the script) into libvirt itself, which is a good thing for people who previously created the device with the qemu-run script (since qemu no longer needs to be run as root), but what it missed was that the new code *always* attempts to create a device with the given name, and fails if said device already exists. This was obviously not what was intended, but was just missed in the reviews; on the other hand, it has been more than 2 years since this change went in and nobody has complained, so at least we didn't break anybody's setup :-)

Since a macvtap device is treated identically to a regular tap device by qemu, I would be tempted to say that we can solve your problem by simply fixing the behavior of <interface type='ethernet'> to check for an existing tap device before attempting to create it (and then you could just give it the name of an existing macvtap device), but there are a couple of issues:

1) libvirt sends all tap devices to qemu as fd's, not as device names. If we're to continue doing that for these pre-created macvtap devices, libvirt would need to open the device to get a handle; I don't know yet if just opening the device requires CAP_NET_ADMIN (I think it doesn't, and if so then this is a non-issue)

2) I noticed at least one difference in the handling of macvtap devices vs. standard tap - libvirt normally creates standard tap devices with IFF_UP (online), but creates macvtap devices with ~IFF_UP (offline), then sets them online just before turning on the guest CPUs. The reason for this is explained in commit 82977058 (in short, because a macvtap device shares the MAC address of the guest, setting it online too early leads to the host originating traffic with the guest's MAC, so switches will learn the location of the device and start sending traffic for that MAC to the connected port before the guest is ready to start). This doesn't matter when you're just starting up a guest, but it causes real problems during migration - until we made this change to libvirt, traffic to a migrating guest with macvtap interfaces would be lost during the migration. I don't know what kubevirt's migration expectations/capabilities are, so I can't say if this will be a problem or not.
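
As a rough illustration of points 1) and 2), the sketch below inspects an already-created device through sysfs: it reports whether IFF_UP is set and works out which character device would be opened to obtain an fd for a macvtap. It assumes the usual Linux convention that a macvtap with ifindex N is exposed as /dev/tapN. The helper names and the `sysfs_root` parameter are hypothetical, chosen only for this example, and this is not how libvirt itself implements the check.

```python
import os

IFF_UP = 0x1  # administrative "up" bit, as defined in <linux/if.h>


def read_sysfs_attr(ifname, attr, sysfs_root="/sys/class/net"):
    """Return one sysfs attribute of a network interface, stripped."""
    with open(os.path.join(sysfs_root, ifname, attr)) as f:
        return f.read().strip()


def is_admin_up(ifname, sysfs_root="/sys/class/net"):
    """True if the interface flags (a hex string such as 0x1003) include IFF_UP."""
    flags = int(read_sysfs_attr(ifname, "flags", sysfs_root), 16)
    return bool(flags & IFF_UP)


def macvtap_chardev_path(ifname, sysfs_root="/sys/class/net"):
    """Path of the character device for a macvtap (assumed to be /dev/tap<ifindex>)."""
    ifindex = int(read_sysfs_attr(ifname, "ifindex", sysfs_root))
    return "/dev/tap%d" % ifindex
```

Reading sysfs needs no special capability, so a check like this could run inside the unprivileged pod. Whether opening the resulting /dev/tapN then succeeds depends on the permissions of that device node, which is the open question raised in 1).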

(I'm curious why you are so keen to eliminate CAP_NET_ADMIN though - there are a lot of things that libvirt could be doing for you if you leave it enabled. Setting the interface online/offline is just one example. A few other things that come to mind: it can set the MTU of a tap device (and the host bridge it's connected to), set the IP address of the local side of a standard tap device (in case you don't want it connected to a bridge), add routes to the host, manage a Linux host bridge's MAC table, and automatically update the RX filter of a macvtap interface based on changes in the guest. Of course every privilege given to every process increases the attack surface, but in this case it's only a single process (libvirtd), not every qemu process. Is there a specific reason you're trying to eliminate CAP_NET_ADMIN, or is it just a general desire to reduce attack surface? Remember that if libvirtd isn't doing this network device stuff, some other process will need to do it instead...)

Comment 2 Laine Stump 2018-07-01 03:27:02 UTC
Another thing that may be problematic - unlike tap devices, the MAC address of a macvtap device must match the MAC address of the guest interface (for regular tap devices, the MAC address of the tap device must *not* match the MAC address of the guest interface). This means that, in order to use a "pre-setup" macvtap device, the device must have been configured with the same MAC address as is specified in libvirt's config for the guest interface. This will preclude any possibility of pre-creating a pool of macvtap devices, then using them unmodified for different guests. Instead, the macvtap will need to have its MAC address set by [whatever it is that *does* have CAP_NET_ADMIN] just prior to telling libvirt to start the guest.

(If you were already planning to create the macvtap devices on demand anyway (just at the next level above libvirt), then this will be another non-issue.)
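
A minimal sketch of the pre-flight check this implies: before starting the guest, whatever prepares the macvtap could confirm that the device's MAC equals the MAC in the guest's interface config. The function names and the `sysfs_root` parameter are hypothetical, and the example only reads the address from sysfs; actually changing it would still require CAP_NET_ADMIN.

```python
import os


def normalize_mac(mac):
    """Canonicalise a MAC address to lowercase, colon-separated hex octets."""
    parts = mac.strip().lower().replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("not a 6-octet MAC address: %r" % mac)
    octets = [int(p, 16) for p in parts]
    if any(o < 0 or o > 0xFF for o in octets):
        raise ValueError("octet out of range in MAC address: %r" % mac)
    return ":".join("%02x" % o for o in octets)


def device_mac(ifname, sysfs_root="/sys/class/net"):
    """Read the MAC address the kernel currently reports for ifname."""
    with open(os.path.join(sysfs_root, ifname, "address")) as f:
        return normalize_mac(f.read())


def mac_matches_guest(ifname, guest_mac, sysfs_root="/sys/class/net"):
    """True if the pre-created device already carries the guest's MAC."""
    return device_mac(ifname, sysfs_root) == normalize_mac(guest_mac)
```

For a macvtap the result should be True before the guest starts; for a standard tap device, per the comment above, the two addresses are expected to differ.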

Comment 3 Martin Kletzander 2018-07-03 10:47:55 UTC
Hopefully I'm not missing some things here, but as far as I understand, from KubeVirt's POV there is no need for libvirt to do anything that requires CAP_*_ADMIN or any capability(*).  Everything is prepared on the cluster level and that's why libvirt would just use a pre-existing device.  The address setting, duplication, link state etc. is something that you WANT to have done on cluster level without libvirt changing anything.

I was under the impression that someone needs to open the device if you want the tap to be used.  Since libvirt started doing that to run QEMU unprivileged, it most probably needs a capability (it would also make sense since it enables you to do more, let's say privileged, things).  I wanted to demonstrate that on the ping command, but apparently it doesn't work the same way it did, so I need to investigate before using it for the explanation.  If you have some idea in the meantime, let us know.

Comment 4 Laine Stump 2018-07-06 02:51:18 UTC
A related discussion started on libvir-list today:


  https://www.redhat.com/archives/libvir-list/2018-July/msg00323.html

It made me realize that

1) the "old" behavior with existing tap devices isn't as broken as I'd thought - I was just trying it using a macvtap device, which behaves differently; with an existing standard tap device, <interface type='ethernet'> will use the existing device, the only problematic part is that when qemu terminates, the tap is auto-deleted (even though I *think* I created it properly, I used "tunctl -t mytap").

2) macvtap devices *don't* behave identically to tap - if you try to "create" an existing tap device, the existing device will be re-used, but if you try to "create" an existing macvtap device, you will get an EINVAL.


(the TL;DR of (1) and (2) is essentially that half of what I wrote in comment 1 is wrong)

3) Anyway, we use the TUNSETIFF ioctl to create / get the fd of an existing tap device, and not only does that probably require CAP_NET_ADMIN, it also won't work easily/directly when a process inside a container is trying to get the fd for a tap device that was created outside the container (at least I *think* that's what's being said in the initial email of the above-referenced thread :-)
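
For readers unfamiliar with the ioctl mentioned in 3), here is a hedged Python sketch of what "get the fd of an existing tap device" looks like at the syscall level. The constants are the values from <linux/if_tun.h> on x86 and ARM (the TUNSETIFF number differs on some other architectures), and the function names are invented for this example. It is not libvirt's code, and whether the ioctl succeeds depends on the privilege and namespace issues discussed above.

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA  # _IOW('T', 202, int) on x86/ARM; an assumption elsewhere
IFF_TAP = 0x0002        # layer-2 (Ethernet) device rather than layer-3 tun
IFF_NO_PI = 0x1000      # no extra packet-information header on each frame
IFNAMSIZ = 16           # maximum interface name length, including the NUL


def build_ifreq(ifname, flags=IFF_TAP | IFF_NO_PI):
    """Pack a struct ifreq: 16-byte name, 16-bit flags, padding up to 40 bytes."""
    name = ifname.encode("ascii")
    if len(name) >= IFNAMSIZ:
        raise ValueError("interface name too long: %r" % ifname)
    return struct.pack("16sH22x", name, flags)


def attach_tap(ifname, tun_path="/dev/net/tun"):
    """Open the tun control node and bind the fd to ifname via TUNSETIFF.

    If ifname already exists as a standard tap, the kernel attaches to it;
    otherwise it tries to create it. Raises OSError if not permitted.
    """
    fd = os.open(tun_path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(ifname))
    except OSError:
        os.close(fd)
        raise
    return fd
```

This path applies to standard tap devices only; as noted in 2), a macvtap is not reached through /dev/net/tun in the same way.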

So definitely (and unfortunately) this isn't as simple as fixing a regression.

I think further discussion should continue on the email thread, and we can post results here.

Comment 5 Fabian Deutsch 2018-07-18 15:43:51 UTC
This netdev list thread might be related: https://www.spinics.net/lists/netdev/msg513558.html

Comment 6 Laine Stump 2019-09-17 00:13:09 UTC
This capability is now upstream, as detailed in Bug 1723367. It will be in libvirt-5.8.0 upstream.

