Bug 1247578
| Summary: | [Docs] VFIO/hostdev_passthrough: Host reboot occur when powering off VM with GPU attached. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nisim Simsolo <nsimsolo> |
| Component: | Documentation | Assignee: | rhev-docs <rhev-docs> |
| Status: | CLOSED DUPLICATE | QA Contact: | rhev-docs <rhev-docs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.6.0 | CC: | bugs, chayang, gklein, hhuang, huding, istein, juzhang, knoel, lbopf, lsurette, mavital, mgoldboi, michal.skrivanek, mpoledni, nsimsolo, rbalakri, rhev-docs, virt-maint, xfu, yeylon, ykaul, ylavi |
| Target Milestone: | ovirt-3.6.3 | | |
| Target Release: | 3.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-16 03:56:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Docs | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 825045, 1154205, 1172230 | | |
| Attachments: | | | |

Description
Nisim Simsolo
2015-07-28 11:27:25 UTC
Created attachment 1056999 [details]
log collector
There is not a lot of info regarding the reboot in the logs. One relevant piece of information is that the guest is RHEL 6.7; can you also add results with another guest OS? There is no explicit reboot from VDSM's side; the failure seems to be lower in the stack.

Moving to the qemu-kvm team for investigation of the host crash.

(In reply to Nisim Simsolo from comment #0)
> Description of problem:
> When powering off VM with GPU attached the host reboot occur.
...
> Version-Release number of selected component (if applicable):
> engine: ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch
>
> Host (kernel 3.10.0-229.el7.x86_64):
> vdsm-4.17.0-1054.git562e711.el7.noarch
> sanlock-3.2.2-2.el7.x86_64
> libvirt-client-1.2.8-16.el7_1.3.x86_64
> qemu-kvm-ev-2.1.2-23.el7_1.3.1.x86_64
>
> Hardware:
> AMD based desktop
> GPU - NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
> CPU - AMD FX(tm)-8350 Eight-Core Processor
> Motherboard: Asus SABERTOOTH 990FX R2.0
>
> How reproducible:
> Consistently
>
> Steps to Reproduce:
> 1. Create windows 7 VM.
> 2. Attach GPU to VM (doesn't matter if GPU audio device is attached also or
> not).
> 3. Run VM.
> 4. Wait for VM to run and then power off VM from webadmin UI.
>
> Actual results:
> Host reboot occur.
>
> Expected results:
> VM should be powered off without affecting the host.

Ok, so we have a RHEL7.1 host and a Windows 7 VM...

(In reply to Nisim Simsolo from comment #1)
> Created attachment 1056999 [details]
> log collector

But these are logs from a RHEL6.7 guest. How are they related?

I'm unable to reproduce with a Quadro K4000 in a Gigabyte 990FX/Phenom system. The GPU in the guest works as expected, and abruptly powering off the VM does not cause a host crash or reboot. Please provide the guest XML and the libvirt log for the domain. Log collection on the host would be useful as well.

Created attachment 1065215 [details]
sosreport 7.2
Occurred again. This time I can confirm this bug is relevant only for Linux VMs (so far I have tested it only on RHEL 7). The same issue does not occur with a Windows VM.

Setup versions:
engine: 3.6.0-0.11.master.el6
host:
vdsm-4.17.2-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64
libvirt-client-1.2.17-4.el7.x86_64

Issue occurred at: 2015-Aug-20, 14:06
VM name: rhel7_amd
Host: amd-vfio

There are at least two problems evident here. First, we only support assignment of secondary graphics to a guest; in the provided sosreport, the K2200 is the only graphics device in the host system. Second, we recommend that customers use pci-stub.ids= to prevent host drivers from binding to GPUs intended for assignment. The dmesg clearly shows nouveau in use by the host. In addition to adding another graphics card for host primary graphics, the option pci-stub.ids=10de:13ba,10de:0fbc should be added to the host kernel command line to prevent host drivers from using the assigned GPU.

Also, only the Nvidia proprietary drivers are supported in the guest. The nouveau driver is not supported and should be blacklisted in the guest. There's no indication here in the bz what driver is being used in the guest (which appears to be rhel6, not rhel7).

Tested RHEL6.7 and RHEL7.1 guests on a RHEL7.2 AMD 990FX based host with Quadro K4000 assignment, binding to pci-stub in the host, as recommended to customers, and blacklisting nouveau in the guest, using only the latest driver directly from NVIDIA in the guest. I cannot reproduce a host reboot. When powering off the VM, nothing out of the ordinary happens. If I destroy the VM, I get spurious interrupts on the host from the legacy interrupt, nothing more.

Occurred again, this time using a host with RHEL7.2 Beta (Maipo) and kernel 3.10.0-306.0.1.el7.x86_64.
host builds:
qemu-kvm-rhev-2.3.0-21.el7.x86_64
vdsm-4.17.3-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64
engine: rhevm-3.6.0-0.12.master.el6

The host is an AMD host with an Nvidia Quadro K2200 attached to a Windows 7 VM. Trying to power off the VM ended with a host reboot (this does not always happen).
VM name is: win7_amd
Issue started at:
2015-09-02 14:47:04,434 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-47) [430aa112] Correlation ID: 430aa112, Job ID: 5cfa9708-9601-467c-86fe-289869023034, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM win7_amd (Host: amd-vfio.tlv.redhat.com, User: admin@internal).

engine.log and VDSM log attached.
SOS report uploaded to my google drive: https://drive.google.com/a/redhat.com/folderview?id=0B0zN-i4uOuoBfkVQQTctcEJ2QjVzejlqa0VTbWJXa2pSendleUJ0UWNJOElDR18tWUlwcGs&usp=sharing

As for PCI stub, the GPU is not bound to pci-stub because the feature page does not require it, but the GPU is not bound to nouveau either.

Created attachment 1069416 [details]
engine.log 02/09/2015
Created attachment 1069417 [details]
vdsm.log 02/09/2015
(In reply to Nisim Simsolo from comment #10)
> As for PCI stub, the GPU is not bound to pci-stub because the feature page
> does not require it, but the GPU is not bound to nouveau either.

Then the feature page is wrong: nouveau must be avoided in the host and is unsupported in the guest. Only assignment of secondary devices in the host is supported. Until these configuration issues are resolved, this bug is not worth investigating.

Martin,
According to the feature page (http://www.ovirt.org/Features/hostdev_passthrough):
"The detach_detachable() call takes care of detaching the device from host (unbinding it from current drivers and binding to vfio, or pci-stub if old KVM is used - this behaviour is handled by libvirt's detachFlags call) and correctly setting permissions for /dev/vfio iommu group endpoint."
If the GPU needs to be unbound from the host kernel driver and bound to pci-stub, and nouveau needs to be added to the blacklist, the feature page should be updated accordingly.

GPUs are unique: host drivers are not keen to release the device, and acting as the primary graphics device on the host complicates things further. The nouveau driver in the guest is not supported and really has no valid use case for a customer. Additionally, nouveau has been known to trigger issues resulting in host crashes, especially on newer cards that are not well supported by nouveau.

Finally, if we want to have any hope of diagnosing a host reboot, please provide a serial console log from the host system or at least a crash dump. AFAICT, there is no information relevant to the host kernel reboot in any of the provided logs.

Alex, thanks for the explanation. As far as I understand, everything should be fine except for GPUs, where we will have additional instructions to manually append pci-stub.ids to the cmdline and possibly blacklist the nouveau driver.

Nisim, I'll update the wiki as soon as I'm sure how to formulate it.
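
The manual step discussed above (appending pci-stub.ids to the kernel command line) could be sketched roughly as follows. This is only an illustration, not anything oVirt or VDSM ships: `add_kernel_param` is a hypothetical helper name, and it assumes the host keeps its kernel arguments in a single GRUB_CMDLINE_LINUX="..." line in /etc/default/grub. The device IDs are the ones quoted earlier in this bug.

```python
import re

# Hypothetical helper, not part of oVirt or VDSM: appends one kernel
# parameter to a GRUB_CMDLINE_LINUX="..." line from /etc/default/grub.
def add_kernel_param(grub_line, param):
    match = re.fullmatch(r'(GRUB_CMDLINE_LINUX=")(.*)(")\s*', grub_line)
    if match is None:
        raise ValueError('expected a GRUB_CMDLINE_LINUX="..." line')
    prefix, args, suffix = match.groups()
    tokens = args.split()
    # Skip the append when the exact parameter is already present,
    # so running the helper twice does not duplicate it.
    if param not in tokens:
        tokens.append(param)
    return prefix + " ".join(tokens) + suffix


# Vendor:device IDs of the K2200 and its audio function, as given in this bug.
STUB_PARAM = "pci-stub.ids=10de:13ba,10de:0fbc"
# Sample input line; the existing arguments here are made up for illustration.
print(add_kernel_param('GRUB_CMDLINE_LINUX="rhgb quiet"', STUB_PARAM))
```

The helper only compares whole parameters, so an existing pci-stub.ids= entry with a different ID list would not be merged; that case is left out of the sketch on purpose.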
Added to the wiki with a reference to the vfio-pci blog: http://www.ovirt.org/Features/hostdev_passthrough#GPU_passthrough

No code change needed, but we need to update the docs with this info.

Omer - how do we expect users to perform any of the above in RHEVH?

The host type (rhel/rhevh) doesn't matter; all the configuration is done through rhevm (ui/api). Once the hardware supports IOMMU, the user can see what devices are available on the host and attach them to VMs.

Omer - http://www.ovirt.org/Features/hostdev_passthrough#GPU_passthrough seems to suggest that for GPU a bit of extra effort is required.

(In reply to Yaniv Kaul from comment #21)
> Omer - http://www.ovirt.org/Features/hostdev_passthrough#GPU_passthrough
> seems to suggest that for GPU a bit of extra effort is required.

Yes, there is additional manual configuration, which can be especially tricky on rhevh:
- The iommu kernel parameter needs to be added for any vfio support (we have https://gerrit.ovirt.org/#/c/41507 but there was not much enthusiasm about it, so for now this is manual).
- Specific nvidia options and blacklisting need to be added to both host and guest for nvidia GPU passthrough.

Nisim, can you confirm it works for you following the additional configuration?
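
Since both the iommu parameter and the pci-stub IDs are manual settings, a quick check of the running kernel command line may help before testing. The sketch below is an assumption-laden illustration: `missing_passthrough_params` is a hypothetical name, it expects the text of /proc/cmdline as a string, and it treats `amd_iommu=on` or `intel_iommu=on` as "the iommu kernel parameter", which this thread does not spell out.

```python
# Hypothetical check, not an oVirt tool: reports which of the manual
# passthrough settings discussed above are absent from a kernel cmdline.
def missing_passthrough_params(cmdline, required_ids):
    tokens = cmdline.split()
    problems = []
    # Assumption: one of these two tokens is what enables the IOMMU.
    if not any(t in ("amd_iommu=on", "intel_iommu=on") for t in tokens):
        problems.append("no IOMMU parameter (amd_iommu=on or intel_iommu=on)")
    # Collect every vendor:device ID listed in any pci-stub.ids= token.
    stubbed = set()
    for t in tokens:
        if t.startswith("pci-stub.ids="):
            stubbed.update(t.split("=", 1)[1].split(","))
    for dev_id in required_ids:
        if dev_id not in stubbed:
            problems.append(dev_id + " not listed in pci-stub.ids")
    return problems


# Sample command line for illustration; on a real host the string
# would be read from /proc/cmdline instead.
sample = "rhgb quiet amd_iommu=on pci-stub.ids=10de:13ba"
for problem in missing_passthrough_params(sample, ["10de:13ba", "10de:0fbc"]):
    print(problem)
```

An empty result only means the parameters are present on the command line; it says nothing about whether the firmware actually has the IOMMU enabled.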
The procedure in the feature page is not working and is unclear (I expect a more detailed procedure, such as where the kernel cmdline is located).
Following the next steps, according to the feature page, shows that the GPU is still attached to nouveau:

#lspci -n
- Find GPU controller and audio device:
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
# lspci -n -s 01:00
01:00.0 0300: 10de:13ba (rev a2)
01:00.1 0403: 10de:0fbc (rev a1)
- Add vendor:device ids to kernel cmdline:
#vi /etc/default/grub
- Add the next line to GRUB_CMDLINE_LINUX: pci-stub.ids=10de:13ba,10de:0fbc
- In my case, the whole line looks like this:
GRUB_CMDLINE_LINUX="nofb splash=quiet console=tty0 console=ttyS0,115200 crashkernel=auto biosdevname=0 rhgb quiet amd_iommu=on pci-stub.ids=10de:13ba,10de:0fbc"
- Refresh grub config:
#grub2-mkconfig
- Reboot host
#lspci -nnk:
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Quadro K2200] [10de:13ba] (rev a2)
Subsystem: NVIDIA Corporation Device [10de:1097]
Kernel driver in use: nouveau

As you can see, the GPU is not detached from nouveau as expected.

(In reply to Nisim Simsolo from comment #24)
> As you can see, the GPU is not detached from nouveau as expected.

Hm. I'm afraid there is indeed some issue with either your guest or the procedure. Still, this doesn't really help with passthrough testing, as detaching NVIDIA from the nouveau driver is a prerequisite.

Updated the wiki with a better explanation of driver blacklisting.

Bug verified, this time using the correct pci-stub procedure and nouveau blacklisting (as explained in the wiki) and on different VM OSes (Windows and Linux).
The exact scenario of pci stubbing and blacklisting has also been added to the test plan setup preparation.
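
The driver check in the comments above (reading "Kernel driver in use" from `lspci -nnk`) can be automated along these lines. This is a minimal sketch: `drivers_in_use` is a hypothetical name, and it assumes the usual `lspci -k` layout in which each device starts on an unindented line and its detail lines are indented.

```python
# Hypothetical parser, not an oVirt tool: maps each PCI address in
# `lspci -nnk` output to the kernel driver bound to it (None if no driver).
def drivers_in_use(lspci_nnk):
    drivers = {}
    current = None
    for line in lspci_nnk.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device, first field is the address.
            current = line.split()[0]
            drivers[current] = None
        elif current is not None and "Kernel driver in use:" in line:
            drivers[current] = line.split(":", 1)[1].strip()
    return drivers


# Sample taken from the output quoted in this bug.
sample = (
    "01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL "
    "[Quadro K2200] [10de:13ba] (rev a2)\n"
    "\tSubsystem: NVIDIA Corporation Device [10de:1097]\n"
    "\tKernel driver in use: nouveau\n"
)
for address, driver in drivers_in_use(sample).items():
    # With pci-stub.ids applied, the driver is expected to read pci-stub here.
    print(address, driver)
```

One thing worth checking in the quoted steps: `grub2-mkconfig` run without `-o` prints the generated configuration to standard output instead of writing a file, which may be one reason the parameter did not take effect. That reading is a guess; the thread does not confirm it.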
Verification version:
rhevm-3.6.0-0.18.el6
sanlock-3.2.4-1.el7.x86_64
vdsm-4.17.8-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-24.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Martin,
Would you please add here a link to the relevant documentation, or the documentation content itself, that you would like to be added to the admin guide?
Thanks,
Ilanit.

http://www.ovirt.org/Features/hostdev_passthrough#GPU_passthrough now has more detailed information useful to the doc team.

This is an automated message.
oVirt 3.6.0 RC3 has been released and GA is targeted to next week, Nov 4th 2015.
Please review this bug and if not a blocker, please postpone to a later release.
All bugs not postponed on GA release will be automatically re-targeted to
- 3.6.1 if severity >= high
- 4.0 if severity < high

Documentation around GPU passthrough and related limitations is being tracked in bug 1285799. I am closing this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 1285799 ***