Description of problem:
The most noticeable symptom is that all virsh/libvirt actions time out. After resetting the blade, everything is fine for a couple of hours. When the issue occurs, I see the following backtrace:

Aug 30 15:21:16 compute-15 kernel: vfio_pci_disable: Failed to reset device 0000:08:1d.6 (-11)
Aug 30 15:24:15 compute-15 kernel: INFO: task libvirtd:5027 blocked for more than 120 seconds.
Aug 30 15:24:15 compute-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 15:24:15 compute-15 kernel: libvirtd D ffff8807f56e5c00 0 5027 1 0x00000080
Aug 30 15:24:15 compute-15 kernel: ffff8807e169bb08 0000000000000082 ffff881fecb5e780 ffff8807e169bfd8
Aug 30 15:24:15 compute-15 kernel: ffff8807e169bfd8 ffff8807e169bfd8 ffff881fecb5e780 ffff881fecb5e780
Aug 30 15:24:15 compute-15 kernel: ffff881fe33a1b80 ffff881fe33a1b88 ffffffff00000000 ffff881fe33a1b90
Aug 30 15:24:15 compute-15 kernel: Call Trace:
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163b119>] schedule+0x29/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163c8d5>] rwsem_down_write_failed+0x115/0x220
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81301b63>] call_rwsem_down_write_failed+0x13/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff812f5318>] ? kobject_release+0x98/0x1b0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163a36d>] ? down_write+0x2d/0x30
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abee3>] blocking_notifier_chain_unregister+0x23/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff814f6ec2>] iommu_group_unregister_notifier+0x12/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c226>] vfio_group_unlock_and_free+0x26/0x40 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c2f1>] vfio_group_release+0xb1/0xe0 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c240>] ? vfio_group_unlock_and_free+0x40/0x40 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091e2b6>] kref_put_mutex.part.3+0x36/0x42 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091d41a>] vfio_iommu_group_notifier+0x34a/0x360 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81641b5c>] notifier_call_chain+0x4c/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd1d>] __blocking_notifier_call_chain+0x4d/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd56>] blocking_notifier_call_chain+0x16/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff814f6d1c>] iommu_bus_notifier+0x8c/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81641b5c>] notifier_call_chain+0x4c/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd1d>] __blocking_notifier_call_chain+0x4d/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd56>] blocking_notifier_call_chain+0x16/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f63f0>] __device_release_driver+0xd0/0xf0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f6433>] device_release_driver+0x23/0x30
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f4d9d>] driver_unbind+0xbd/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f42d4>] drv_attr_store+0x24/0x40
Aug 30 15:24:15 compute-15 kernel: [<ffffffff812593d6>] sysfs_write_file+0xc6/0x140
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811de7bd>] vfs_write+0xbd/0x1e0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811eed8d>] ? putname+0x3d/0x60
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811df25f>] SyS_write+0x7f/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81646189>] system_call_fastpath+0x16/0x1b

Hardware: HP gen9; 3 controller + 16 compute
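When gathering data for a report like this, it can help to record which driver the affected device is currently bound to and which IOMMU group it belongs to. The following is only a minimal sketch, not part of the original report: it reads the `driver` and `iommu_group` symlinks that sysfs exposes under `/sys/bus/pci/devices/<address>`. The function name `pci_binding` and the `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
import os

def pci_binding(bdf, sysfs_root="/sys"):
    """Return (driver_name, iommu_group) for a PCI address such as 0000:08:1d.6.

    Either element is None when the corresponding sysfs symlink is absent,
    for example when no driver is bound or the IOMMU is disabled.
    """
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", bdf)

    def link_name(name):
        path = os.path.join(dev, name)
        if not os.path.islink(path):
            return None
        # The last path component of the symlink target is the driver name
        # (for "driver") or the group number (for "iommu_group").
        return os.path.basename(os.readlink(path))

    return link_name("driver"), link_name("iommu_group")
```

Calling `pci_binding("0000:08:1d.6")` on the affected compute node would report the driver and group for the device named in the log above; the actual values depend on the host.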
Looks like this should be fixed in 7.3 (in commit 1b1b4f1518c43d9660a3ab86c9e2fa5698848843). Please retest with kernel-3.10.0-377.el7 or later.
(In reply to Neil Horman from comment #3)
> Looks like this should be fixed in 7.3 (in commit
> 1b1b4f1518c43d9660a3ab86c9e2fa5698848843). Please retest with
> kernel-3.10.0-377.el7 or later.

Neil, this commit only removes the warning message and does not solve the hung libvirtd process.
AFAIK, we do not support suspending a VM (S3/S4) with an assigned device attached. That said, the hung task backtrace is rarely all that useful: it shows that the task is blocked, but not why. libvirt is trying to unbind the device from vfio-pci, an operation that blocks for as long as the vfio device is in use. So who is using it? Is the QEMU process still running?

A workaround might be to change managed='yes' to managed='no' in the VM XML for the assigned device. If the device is pre-bound to vfio-pci using 'virsh nodedev-detach $DEV', libvirt won't try to return it to the host driver, which (hopefully) avoids this whole unbind issue. That said, I don't know whether suspend will then work, because it's not supported.
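One possible way to answer the "who is using it" question is to look for processes that still hold the VFIO group file open. The sketch below is an illustration under assumptions, not a supported tool: it assumes the legacy group interface where users open `/dev/vfio/<group>`, it needs to run as root to read other processes' file descriptors, and the names `vfio_holders` and `proc_root` are hypothetical.

```python
import os

def vfio_holders(group, proc_root="/proc"):
    """Return a sorted list of PIDs that have /dev/vfio/<group> open."""
    target = "/dev/vfio/%s" % group
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(entry))
                    break  # one match is enough for this PID
            except OSError:
                continue  # descriptor was closed while we were scanning
    return sorted(holders)
```

If the returned list includes a QEMU PID, the guest still owns the device and the unbind will keep blocking until that process releases it. An empty list would suggest the holder is elsewhere, for example inside the kernel.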
Thanks Alex, I'm creating a kbase article for it.