Bug 908888

Summary: qemu now requires CAP_COMPROMISE_KERNEL for working PCI passthrough
Product: [Fedora] Fedora Reporter: Laine Stump <laine>
Component: kernelAssignee: Kyle McMartin <kmcmartin>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: ajia, alex.williamson, berrange, clalancette, crobinso, cwei, dyuan, fweimer, gansalmon, honzhang, itamar, jason, jforbes, jonathan, jwboyer, jyang, kernel-maint, kmcmartin, laine, libvirt-maint, madhu.chinakonda, mathis.gavillon, mjg59, pbonzini, peterm, rjones, sassmann, veillard, virt-maint
Target Milestone: ---Keywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-17 18:15:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
sysfs cap check on open none

Description Laine Stump 2013-02-07 19:07:42 UTC
Description of problem:


When the configuration for a virtual machine (guest) requests a physical device of the host to be assigned to the guest via PCI passthrough, libvirt (which runs as root, with all capabilities bits set) opens /sys/bus/pci/devices/nnnn:nn:nn.nn/config, then creates a qemu-kvm process running as user qemu with all cap bits cleared, and sends the open fd for the sysfs file to qemu. In the past, qemu was able to perform all required operations for PCI passthrough via this fd without requiring any extra capabilities bits to be set.

Beginning with Fedora 18, there is a new capability bit CAP_COMPROMISE_KERNEL. When cleared, a process is prohibited from doing anything that would fail on a UEFI "Secure Boot" system (whether the current system is SB or not).

Steps to Reproduce:
0. Find a host with a working IOMMU so that PCI passthrough should
   theoretically work, and a reasonable device to assign to a guest
   (for example, a well known network card).

1. set "user = root, group = root, and clear_emulator_capabilities = 0"
   in /etc/libvirt/qemu.conf and restart libvirtd.service

2. configure and debug a guest definition until you have PCI passthrough
   of the device to the guest properly working.

3. Stop the guest, the comment out the 3 lines of qemu.conf mentioned above
   in step 1, and restart libvirtd.service

4. Try to start the guest again. It will fail. /var/log/libvirt/qemu/$guestname.log will contain the error message from qemu.

Step 4 *should* be successful.

Comment 1 Laine Stump 2013-02-07 19:30:07 UTC
I have a set of patches for libvirt that manage to set CAP_COMPROMISE_KERNEL for non-root qemu-kvm, and that does indeed solve the problem.

During discussion in IRC, Dan Berrange pointed out that the need for this cap bit is likely due to the kernel checking for the cap at every read/write of the file, rather than just on open (as seems to be the case most of the rest of the time).

While we could grudgingly patch libvirt to set CAP_COMPROMISE_KERNEL, that opens the host system up to a much wider range of attack from rogue guests than would exist if the kernel check for CAP_COMPROMISE_KERNEL happened only when the sysfs file was opened.

(As aside: yes, we're aware that this means PCI passthrough via qemu's "pci-assign" device will by definition not work on any system using UEFI Secure Boot (since this capability bit is merely preventing us from doing anything that would anyway be prevented at a lower level by SB). newer qemu (starting with 1.3) will have vfio-pci to replace that, and libvirt will hopefully have that soon too. But Fedora 18 is using qemu-1.2, and as far as I know, qemu never rebases within a Fedora release (similar to libvirt).

Comment 2 Laine Stump 2013-02-12 18:53:07 UTC
I just realized that the initial bug description was missing crucial information in the 2nd paragraph:

The current implementation of CAPS_COMPROMISE_KERNEL (which is a patch to the F18 kernel that hasn't yet been applied upstream) checks this capability bit not only when open() is called on the sysfs file in question (/sys/bus/pci/devices/nnnn:nn:nn.nn/config) but also when read/write is called. Since libvirt (which has CAP_COMPROMISE_KERNEL) opens the file and passes the open fd to qemu, it's okay to restrict open(), but qemu (which has *no* cap bits set) needs to read/write this file, so it fails.

Comment 3 Laine Stump 2013-02-18 17:31:43 UTC
Note that there was also an selinux issue: Bug 895161. selinux was essentially duplicating the checks of CAP_COMPROMISE_KERNEL, but the selinux policy has now been modified to allow that for all processes that are svirt_t

Comment 4 Josh Boyer 2013-02-19 16:17:01 UTC
(In reply to comment #2)
> I just realized that the initial bug description was missing crucial
> information in the 2nd paragraph:
> 
> The current implementation of CAPS_COMPROMISE_KERNEL (which is a patch to
> the F18 kernel that hasn't yet been applied upstream) checks this capability
> bit not only when open() is called on the sysfs file in question
> (/sys/bus/pci/devices/nnnn:nn:nn.nn/config) but also when read/write is
> called. Since libvirt (which has CAP_COMPROMISE_KERNEL) opens the file and
> passes the open fd to qemu, it's okay to restrict open(), but qemu (which
> has *no* cap bits set) needs to read/write this file, so it fails.

Actually, it doesn't check the capability on open() at all.  That is essentially the problem.

The sysfs filesystem shares a single open function for most of the files under /sys.  It's part of the core sysfs functionality, and there are no existing hooks to override the open function so you cannot easily place per-file cap checks on files for open().  The cap checks in place are in read/write because those files are bin_attr types, and those do have hooks for read/write.

Kyle was looking at providing a capability hook so that if it existed it the core sysfs open function would check it, and if not nothing additional was done.  Not sure how that wound up.

Comment 5 Kyle McMartin 2013-02-19 16:47:30 UTC
Created attachment 699527 [details]
sysfs cap check on open

Attaching unfinished diff for history reasons. (This was what was in the kernel build Laine tested.)

Comment 8 Daniel Berrangé 2013-03-27 14:31:25 UTC
I see these patches are now posted upstream http://thread.gmane.org/gmane.linux.kernel/1459165

Can anyone confirm whether the upstream posting suffers the same flaw of requiring CAP_COMPROMISE_KERNEL for write(), instead of only for open() ?

Comment 9 Josh Boyer 2013-03-27 14:34:40 UTC
(In reply to comment #8)
> I see these patches are now posted upstream
> http://thread.gmane.org/gmane.linux.kernel/1459165
> 
> Can anyone confirm whether the upstream posting suffers the same flaw of
> requiring CAP_COMPROMISE_KERNEL for write(), instead of only for open() ?

Yes, the patches posted upstream are identical to what is currently in Fedora.

Comment 10 Daniel Berrangé 2013-03-27 14:38:21 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I see these patches are now posted upstream
> > http://thread.gmane.org/gmane.linux.kernel/1459165
> > 
> > Can anyone confirm whether the upstream posting suffers the same flaw of
> > requiring CAP_COMPROMISE_KERNEL for write(), instead of only for open() ?
> 
> Yes, the patches posted upstream are identical to what is currently in
> Fedora.

Urgh. Can someone involved in LKML upstream highlight that these kernel patches cause a serious regression in userspace virt functionality & should not be merged without first fixing the sysfs code to check on open().

Comment 11 Josh Boyer 2013-03-27 15:05:26 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > > I see these patches are now posted upstream
> > > http://thread.gmane.org/gmane.linux.kernel/1459165
> > > 
> > > Can anyone confirm whether the upstream posting suffers the same flaw of
> > > requiring CAP_COMPROMISE_KERNEL for write(), instead of only for open() ?
> > 
> > Yes, the patches posted upstream are identical to what is currently in
> > Fedora.
> 
> Urgh. Can someone involved in LKML upstream highlight that these kernel
> patches cause a serious regression in userspace virt functionality & should
> not be merged without first fixing the sysfs code to check on open().

Noted.  Adding the check in open isn't trivial.

I doubt those patches are going upstream anytime soon anyway.  They're on the third round of posting, and the capability they rely on isn't merged yet and its name is still being bikeshedded.

Comment 12 Josh Boyer 2013-03-27 15:33:35 UTC
Laine, Daniel, just so I'm clear on things, you expect pci passthrough to break in the SB case right?  Laine said as much in comment #1, and said that vfio will eventually be used.

So if we're mostly just concerned about the cap check breaking the non-SB case, then there is a somewhat quick fix that can be done here.  I doubt such a fix will be suitable as an upstream solution, but we can just replace that cap checks in the read/write functions with efi_enabled(EFI_SECURE_BOOT).  That will evaluate to false on non-SB machines.

Comment 13 Daniel Berrangé 2013-03-27 16:13:49 UTC
(In reply to comment #12)
> Laine, Daniel, just so I'm clear on things, you expect pci passthrough to
> break in the SB case right?  Laine said as much in comment #1, and said that
> vfio will eventually be used.

In the immediate term concerned with making sure we get working PCI passthrough when  SB is not enabled, to solve the regression wrt previous Fedora's.

vfio is our long term new architecture for PCI passthrough, which we're expecting to be compatible with SB enabled machines.

I'm not fussed if traditional PCI passthrough does not work when SB is enabled, but it would be a nice bonus if it did, to ease the transition to vfio.

> So if we're mostly just concerned about the cap check breaking the non-SB
> case, then there is a somewhat quick fix that can be done here.  I doubt
> such a fix will be suitable as an upstream solution, but we can just replace
> that cap checks in the read/write functions with
> efi_enabled(EFI_SECURE_BOOT).  That will evaluate to false on non-SB
> machines.

So you're saying replace

  if (!capable(CAP_COMPROMISE_KERNEL))
     ...EPERM..

with 

  if (efi_enabled(EFI_SECURE_BOOT) && !capable(CAP_COMPROMISE_KERNEL)
     ...EPERM..

if so that would certainly be fine with us as a immediate fix for Fedora, while upstream figures out the long term fix.

Comment 14 Josh Boyer 2013-03-27 16:46:06 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Laine, Daniel, just so I'm clear on things, you expect pci passthrough to
> > break in the SB case right?  Laine said as much in comment #1, and said that
> > vfio will eventually be used.
> 
> In the immediate term concerned with making sure we get working PCI
> passthrough when  SB is not enabled, to solve the regression wrt previous
> Fedora's.
> 
> vfio is our long term new architecture for PCI passthrough, which we're
> expecting to be compatible with SB enabled machines.
> 
> I'm not fussed if traditional PCI passthrough does not work when SB is
> enabled, but it would be a nice bonus if it did, to ease the transition to
> vfio.
> 
> > So if we're mostly just concerned about the cap check breaking the non-SB
> > case, then there is a somewhat quick fix that can be done here.  I doubt
> > such a fix will be suitable as an upstream solution, but we can just replace
> > that cap checks in the read/write functions with
> > efi_enabled(EFI_SECURE_BOOT).  That will evaluate to false on non-SB
> > machines.
> 
> So you're saying replace
> 
>   if (!capable(CAP_COMPROMISE_KERNEL))
>      ...EPERM..
> 
> with 
> 
>   if (efi_enabled(EFI_SECURE_BOOT) && !capable(CAP_COMPROMISE_KERNEL)
>      ...EPERM..

Well, or just:

  if (efi_enabled(EFI_SECURE_BOOT))
    ...EPERM...

there's no reason to have the cap check still if you're using efi_enabled, since you already know the cap is unavailable in SB mode (the kernel refuses to grant it to anyone).

Comment 15 Daniel Berrangé 2013-03-27 17:02:22 UTC
> > So you're saying replace
> > 
> >   if (!capable(CAP_COMPROMISE_KERNEL))
> >      ...EPERM..
> > 
> > with 
> > 
> >   if (efi_enabled(EFI_SECURE_BOOT) && !capable(CAP_COMPROMISE_KERNEL)
> >      ...EPERM..
> 
> Well, or just:
> 
>   if (efi_enabled(EFI_SECURE_BOOT))
>     ...EPERM...
> 
> there's no reason to have the cap check still if you're using efi_enabled,
> since you already know the cap is unavailable in SB mode (the kernel refuses
> to grant it to anyone).

Ah ok, that makes sense.

Comment 16 Josh Boyer 2013-03-27 20:25:54 UTC
So I built a kernel with Kyle's patch, and assigned an ethernet controller to the guest I have here.  After booting, it can clearly see the device including the config space:

00:06.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
        Subsystem: Lenovo Device 21ce
        Physical Slot: 6
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 11
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit-
                Address: 00000000  Data: 0000
00: 86 80 02 15 03 00 10 00 04 00 00 02 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 aa 17 ce 21
30: 00 00 00 00 c8 00 00 00 00 00 00 00 0a 01 00 00

The e1000e driver doesn't load in the guest, failing with a -5.  I'm not sure why that is, but it seems the file in question is accessible to qemu after it is started and /proc/<pid>/status clearly does not have any capabilities set.

Laine, you said you had tested Kyle's patch and it didn't work at all.  I'm seeing different results.  Could you or Daniel explain what might possibly be going wrong here, since the config space is readable by qemu?

Comment 17 Laine Stump 2013-03-28 01:51:13 UTC
What I had seen was that the guest booted successfully (with no errors encountered by either libvirt or qemu), but in the end there was no device visible in the guest; it seemed as if all the [reads/writes/whatever qemu is doing] were successful, but apparently either a read returned some different data, or perhaps a write accepted data from qemu but didn't actually do anything with it.

If you have a new kernel handy, I can try it again; it's possible it was some strange anomaly.


(BTW, I'm using an Intel 82576 card (sr-iov using igb driver) on AMD hardware.)

Comment 18 Josh Boyer 2013-03-28 11:57:16 UTC
I got the patch working.  Or at least the device both shows up and the e1000e driver loads in my VM.

Kyle's patch added the cap check for the <sysfs device path>/resourceX files, and the vpd file (if it exists).  Kyle noticed yesterday that while QEMU is passed the fd for the config space, it opens the resourceX files by itself.  With the checks he added there, those open calls were likely failing.

I guess the question becomes is putting the cap check only on config sufficient, or do the resourceX files need it as well?  (The patch also adds the check for legacy_io and legacy_mem files, but that seems to work fine).

Matthew, any thoughts here?

Comment 19 Josh Boyer 2013-03-28 12:00:05 UTC
Oh.  So one other thing I just noticed.  If you are an unprivileged user, you can no longer run lspci because pcilib can't open the config file.  That's probably worse than pci-passthru not working to be honest.  Certainly more people would notice.

[jwboyer@zod ~]$ lspci 
pcilib: Cannot open /sys/bus/pci/devices/0000:0e:00.0/config
lspci: Unable to read the standard configuration space header of device 0000:0e:00.0
pcilib: Cannot open /sys/bus/pci/devices/0000:0d:00.0/config
lspci: Unable to read the standard configuration space header of device 0000:0d:00.0
<etc etc>

Comment 20 Laine Stump 2013-03-28 18:36:43 UTC
1) if the resourceX files were previously readable by an unprivileged qemu, then they need to remain readable by unprivileged qemu. Adding in the checks at open would mean that either libvirt and qemu needed to come up with an mechanism for having them opened by libvirt and passed to qemu, or that libvirt would need to enable CAPS_COMPROMISE_KERNEL for qemu (which is now an *even worse* idea in light of learning that this name is probably not what will end up upstream).

2) can you make a scratch build for me to try out on my F18 system?

3) lspci works from a non-root shell for me here. Is the failure you note in Comment 19 something that's new with the patch that fixes our problem?

Comment 21 Matthew Garrett 2013-03-28 18:45:53 UTC
The resourceX files *need* to be constrained by this capability, so if there's an expectation that an unprivileged qemu can open and read/write these files then qemu is going to need the capability.

Comment 22 Josh Boyer 2013-03-28 18:46:56 UTC
(In reply to comment #20)
> 1) if the resourceX files were previously readable by an unprivileged qemu,
> then they need to remain readable by unprivileged qemu. Adding in the checks
> at open would mean that either libvirt and qemu needed to come up with an
> mechanism for having them opened by libvirt and passed to qemu, or that
> libvirt would need to enable CAPS_COMPROMISE_KERNEL for qemu (which is now
> an *even worse* idea in light of learning that this name is probably not
> what will end up upstream).

Kyle moved the check to open instead of write as it was in the original patch (note: read did not have a check) after you suggested moving the check to open in comment #1.  The resource files we covered via write in the previous patch, and I'm assuming Matthew did it that way on purpose.  I doubt it's acceptable to drop the check for resourceX files.

Unless a new idea presents itself, libvirt/qemu would have to make some change either way.

> 2) can you make a scratch build for me to try out on my F18 system?

I can.  I doubt it's going to do anything more than what the scratch build you already tried did though.

> 3) lspci works from a non-root shell for me here. Is the failure you note in
> Comment 19 something that's new with the patch that fixes our problem?

Yes.  Namely, if you do capabilities checks on file open instead of file write, then lspci can't open the file at all and it fails.

Comment 23 Josh Boyer 2013-03-28 19:17:12 UTC
> > 2) can you make a scratch build for me to try out on my F18 system?
> 
> I can.  I doubt it's going to do anything more than what the scratch build
> you already tried did though.

Here you go (when it finishes building):

http://koji.fedoraproject.org/koji/taskinfo?taskID=5185791

It doesn't include the change I described about dropping the check on the resourcesX files because those checks are needed.  So it is a build with Kyle's patch.

Comment 24 Laine Stump 2013-04-02 14:42:24 UTC
(In reply to comment #22)
> (In reply to comment #20)
> > 1) if the resourceX files were previously readable by an unprivileged qemu,
> > then they need to remain readable by unprivileged qemu. Adding in the checks
> > at open would mean that either libvirt and qemu needed to come up with an
> > mechanism for having them opened by libvirt and passed to qemu, or that
> > libvirt would need to enable CAPS_COMPROMISE_KERNEL for qemu (which is now
> > an *even worse* idea in light of learning that this name is probably not
> > what will end up upstream).
> 
> Kyle moved the check to open instead of write as it was in the original
> patch (note: read did not have a check) after you suggested moving the check
> to open in comment #1.  The resource files we covered via write in the
> previous patch, and I'm assuming Matthew did it that way on purpose.  I
> doubt it's acceptable to drop the check for resourceX files.
> 
> Unless a new idea presents itself, libvirt/qemu would have to make some
> change either way.

Especially in light of the fact that upstream has now refused the name CAPS_COMPROMISE_KERNEL, and is now considering a different name, it's completely not acceptable for libvirt to add CAPS_COMPROMISE_KERNEL to the qemu process.

So that would leave us with only the choice of requiring libvirt to open these files and pass their fd's to qemu. There is no method in qemu to do that though.

My understanding was that these kernel patches which were supposedly put in to enable Secure Boot must not (on a non-Secure Boot machine) prohibit *anything* that was previously allowed; doing otherwise breaks the ABI and applications that depend on it.

Comment 25 Josh Boyer 2013-04-02 14:54:47 UTC
(In reply to comment #24)
> (In reply to comment #22)
> > (In reply to comment #20)
> > > 1) if the resourceX files were previously readable by an unprivileged qemu,
> > > then they need to remain readable by unprivileged qemu. Adding in the checks
> > > at open would mean that either libvirt and qemu needed to come up with an
> > > mechanism for having them opened by libvirt and passed to qemu, or that
> > > libvirt would need to enable CAPS_COMPROMISE_KERNEL for qemu (which is now
> > > an *even worse* idea in light of learning that this name is probably not
> > > what will end up upstream).
> > 
> > Kyle moved the check to open instead of write as it was in the original
> > patch (note: read did not have a check) after you suggested moving the check
> > to open in comment #1.  The resource files we covered via write in the
> > previous patch, and I'm assuming Matthew did it that way on purpose.  I
> > doubt it's acceptable to drop the check for resourceX files.
> > 
> > Unless a new idea presents itself, libvirt/qemu would have to make some
> > change either way.
> 
> Especially in light of the fact that upstream has now refused the name
> CAPS_COMPROMISE_KERNEL, and is now considering a different name, it's
> completely not acceptable for libvirt to add CAPS_COMPROMISE_KERNEL to the
> qemu process.

Yeah, the name is irrelevant.  Some are even advocating for just using CAP_SYS_RAWIO everywhere, which at least eliminates a new cap.

> So that would leave us with only the choice of requiring libvirt to open
> these files and pass their fd's to qemu. There is no method in qemu to do
> that though.
> 
> My understanding was that these kernel patches which were supposedly put in
> to enable Secure Boot must not (on a non-Secure Boot machine) prohibit
> *anything* that was previously allowed; doing otherwise breaks the ABI and
> applications that depend on it.

So, yes the Secure Boot patches aren't supposed to impact things not using Secure Boot.  However, this is really more of a general capabilities issue than anything.  If you introduce a new capability, it is almost impossible not to change how things work.

It should be noted that how libvirt and QEMU are getting by today might very well be a mistake/oversight upstream.  The pciconfig_{read,write} syscalls actually require CAP_SYS_ADMIN.  The fact that the equivalent sysfs files don't have these checks seems rather odd to me.

Comment 26 Daniel Berrangé 2013-04-03 14:40:23 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > So that would leave us with only the choice of requiring libvirt to open
> > these files and pass their fd's to qemu. There is no method in qemu to do
> > that though.
> > 
> > My understanding was that these kernel patches which were supposedly put in
> > to enable Secure Boot must not (on a non-Secure Boot machine) prohibit
> > *anything* that was previously allowed; doing otherwise breaks the ABI and
> > applications that depend on it.
> 
> So, yes the Secure Boot patches aren't supposed to impact things not using
> Secure Boot.  However, this is really more of a general capabilities issue
> than anything.  If you introduce a new capability, it is almost impossible
> not to change how things work.
> 
> It should be noted that how libvirt and QEMU are getting by today might very
> well be a mistake/oversight upstream.  The pciconfig_{read,write} syscalls
> actually require CAP_SYS_ADMIN.  The fact that the equivalent sysfs files
> don't have these checks seems rather odd to me.

If there is a security issue with the way caps checking is done on these sysfs files, then that changes the discussion somewhat. Such an issue probably ought to be dealt with / fixed separately from the SecureBoot work, so we can figure out the best solution without getting distracted by the more controversial SecureBoot proposals.

Comment 27 Paolo Bonzini 2013-04-04 08:32:40 UTC
> The pciconfig_{read,write} syscalls actually require CAP_SYS_ADMIN.  The fact 
> that the equivalent sysfs files don't have these checks seems rather odd to me.

The syscalls are equivalent to opening the sysfs file, doing a read/write and closing the it.  Requiring CAP_SYS_ADMIN or CAP_SYS_RAWIO to open the sysfs file would make sense.  Right now sysfs does not have that functionality (no ->open callback), but adding that infrastructure would help fixing this bug as well.

Of course, this would be a backwards-incompatible change.  I think it would not be acceptable when opening for reads.  It could be acceptable when opening for writes.

Comment 28 Josh Boyer 2013-04-09 14:31:36 UTC
I've started a scratch build with the workaround from comment #12 here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5231855

It isn't fully complete yet, but the x86_64 kernel is done.  Please test and let us know if that allows device assignment to work for you in the non-SB case.

Comment 29 Laine Stump 2013-04-09 18:45:17 UTC
I grabbed the build that Josh points to in Comment 28 and tried it out. It does definitely solve the problem; I can now configure qemu to run as non-root and drop all capabilities (i.e. I don't have to add CAP_COMPROMISE_KERNEL, or anything else), and PCI passthrough works as it did in the past.

Thumbs up from me to push this into F18.

Thanks!

Comment 30 Josh Boyer 2013-04-09 19:02:24 UTC
(In reply to comment #29)
> I grabbed the build that Josh points to in Comment 28 and tried it out. It
> does definitely solve the problem; I can now configure qemu to run as
> non-root and drop all capabilities (i.e. I don't have to add
> CAP_COMPROMISE_KERNEL, or anything else), and PCI passthrough works as it
> did in the past.

Thanks for testing.  I'd say it works around the problem more than solves it.  The work around isn't going to be acceptable to upstream, so the root of the issues still remain.

> Thumbs up from me to push this into F18.

I plan on doing that today.  I'm moving this bug to rawhide and leaving it open until some kind of final solution presents itself though.

Comment 31 Josh Boyer 2013-04-10 13:13:57 UTC
As mentioned, the workaround is in this F18 update:

https://admin.fedoraproject.org/updates/kernel-3.8.6-203.fc18

However, I did not mark it as containing the fix for this bug because bodhi would close it out.

Comment 32 Laine Stump 2013-06-10 17:37:22 UTC
Did these patches not make it into F19? (and rawhide?)

I just got F19 installed on my SRIOV machine and found that, once again, I have to run qemu as root with no capabilities dropped in order for PCI passthrough to work.

Comment 33 Laine Stump 2013-06-13 15:47:07 UTC
This needinfo is here to remind kyle that he was going to provide an F19 kernel build for me to try.

Comment 34 Kyle McMartin 2013-06-25 16:50:48 UTC
Sorry for the hold up, been busy with an internal transition...

http://koji.fedoraproject.org/koji/taskinfo?taskID=5539911

Can you test if this fixes the "secure boot disabled but passthru is still broken" case with this changeset?

Comment 35 Laine Stump 2013-06-26 20:18:04 UTC
Unfortunately that kernel actually makes matters worse :-(.

When I boot with kernel 3.9.7-300.bz908888, PCI device assignment will not work at all, even if the qemu process is running as root with no capabilities dropped.

When I reboot with the previous kernel 3.9.6-301, I am back to the previous situation - PCI device assignment *does* work as long as the qemu process is running as root, with no capabilities dropped. If I run qemu as a non-privileged user, with no extra capabilities, then it still fails.

If there's some info you'd like me to gather, I'd be happy to do that, or if you'd like access to the machine we can figure out how to do that as well. (Do you have access to a machine running F19 that doesn't use secureboot and supports device assignment? I can help you setup libvirt for a quick and simple test so you don't have to keep waiting for me to get results)

Comment 36 Kyle McMartin 2013-06-26 21:00:54 UTC
Can you be more specific with "will not work at all"? Did config space access work?

http://koji.fedoraproject.org/koji/taskinfo?taskID=5549005

How about this build?

Comment 37 Laine Stump 2013-06-26 21:45:35 UTC
Sorry for the lack of detail in my description. I really am coming at this from the outside. Below you'll see the qemu commandline and the error messages that qemu produces. Is there something I can do manually to get the answer to the question "Did config space access work"?

I'll try out the new build and let you know the result.

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/bin/qemu-kvm -name RHEL6 -S -machine pc-1.2,accel=kvm,usb=off -m 3907 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid cf9c00de-8615-9213-2e0d-5edc89e3b369 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/RHEL6.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/var/lib/libvirt/images/RHEL6.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:f9:c5:5e,bus=pci.0,addr=0x6 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device pci-assign,configfd=25,host=02:00.0,id=hostdev0,bus=pci.0,addr=0x3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
Domain id=2 is tainted: high-privileges
char device redirected to /dev/pts/1 (label charserial0)
qemu-system-x86_64: -device pci-assign,configfd=25,host=02:00.0,id=hostdev0,bus=pci.0,addr=0x3: assigned_dev_register_regions: Error: Couldn't mmap 0xfcba0000!
qemu-system-x86_64: -device pci-assign,configfd=25,host=02:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device initialization failed.
qemu-system-x86_64: -device pci-assign,configfd=25,host=02:00.0,id=hostdev0,bus=pci.0,addr=0x3: Device 'kvm-pci-assign' could not be initialized
2013-06-26 18:25:10.670+0000: shutting down

Comment 38 Laine Stump 2013-06-28 06:23:32 UTC
With the 3.9.7-300.bz908888.2 build it fails in a different way:

(commandline is  same as above, qemu is run as root, with all capabilities. selinux is permissive)

I did reboot with 3.9.6-301 again to verify that nothing else has changed.

qemu-system-x86_64: -device pci-assign,configfd=25,host=02:00.0,id=hostdev0,bus=pci.0,addr=0x3: pci-assign: Cannot read from host /sys/bus/pci/devices/0000:02:00.0/rom
Device option ROM contents are probably invalid (check dmesg).
Skip option ROM probe with rombar=0, or load from file with romfile=
((null):2743): SpiceWorker-Warning **: red_worker.c:11477:dev_destroy_primary_surface: double destroy of primary surface
((null):2743): SpiceWorker-Warning **: red_worker.c:9663:red_create_surface: condition `surface->context.canvas' reached
qemu: hardware error: pci write failed, ret = -1 errno = 1

CPU #0:
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000623
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000
2013-06-28 00:33:10.266+0000: shutting down

Comment 40 Kyle McMartin 2013-06-28 21:43:54 UTC
OK. So comment37 is the BARs failing to get MMAP'd because qemu opens them. So we need to fix qemu to get those passed in from the privileged libvirt.

Comment38 is strange, because the kernel you tested there doesn't check the resource files at all, so I have no idea why that's failing.

I'm on PTO Monday & Friday next week, but maybe we can figure out some easier way to test this on Tuesday.

Comment 41 bing 2013-08-14 06:18:51 UTC
The updated F18 kernel-3.10.4-100.fc18.x86_64 seems to re-introduce the issue again, the previous kernel-3.9.11-200.fc18.x86_64 works fine. (i.e. PCI passthrough can be used without being root)

Comment 42 Cole Robinson 2013-08-15 01:16:14 UTC
Alex, this is a lingering bug with PCI passthrough in Fedora, maybe there's something you can add or help with debugging.

Comment 43 MG 2013-08-21 15:37:36 UTC
*** Bug 999482 has been marked as a duplicate of this bug. ***

Comment 44 Cole Robinson 2013-11-17 18:15:22 UTC
Since this bug has been silent for a while, and rawhide libvirt now uses VFIO by default which works with out of the box configuration, closing. If someone is still affected, please reopen.

Comment 45 Daniel Berrangé 2013-11-18 10:00:10 UTC
FYI  CAP_COMPROMISE_KERNEL has been killed in F20, so this should work even with legacy PCI device assignment.