Bug 1373805

Summary:	VM suspend times out due to a driver issue
Product:	Red Hat Enterprise Linux 7	Reporter:	Pablo Iranzo Gómez <pablo.iranzo>
Component:	kernel	Assignee:	Ivan Vecera <ivecera>
kernel sub component:	NIC Drivers	QA Contact:	Ma Yuying <yuma>
Status:	CLOSED CANTFIX	Docs Contact:
Severity:	high
Priority:	high	CC:	alex.williamson, ivecera, jshortt, network-qe, nhorman, pablo.iranzo
Version:	7.2
Target Milestone:	rc
Target Release:	7.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-10-21 13:10:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Pablo Iranzo Gómez 2016-09-07 07:42:54 UTC

Description of problem:

The most noticeable symptom is that all virsh/libvirt actions time out. After resetting the blade, everything works for a couple of hours.
When the issue occurs, I see the following backtrace:

Aug 30 15:21:16 compute-15 kernel: vfio_pci_disable: Failed to reset device 0000:08:1d.6 (-11)
Aug 30 15:24:15 compute-15 kernel: INFO: task libvirtd:5027 blocked for more than 120 seconds.
Aug 30 15:24:15 compute-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 15:24:15 compute-15 kernel: libvirtd        D ffff8807f56e5c00     0  5027      1 0x00000080
Aug 30 15:24:15 compute-15 kernel: ffff8807e169bb08 0000000000000082 ffff881fecb5e780 ffff8807e169bfd8
Aug 30 15:24:15 compute-15 kernel: ffff8807e169bfd8 ffff8807e169bfd8 ffff881fecb5e780 ffff881fecb5e780
Aug 30 15:24:15 compute-15 kernel: ffff881fe33a1b80 ffff881fe33a1b88 ffffffff00000000 ffff881fe33a1b90
Aug 30 15:24:15 compute-15 kernel: Call Trace:
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163b119>] schedule+0x29/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163c8d5>] rwsem_down_write_failed+0x115/0x220
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81301b63>] call_rwsem_down_write_failed+0x13/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff812f5318>] ? kobject_release+0x98/0x1b0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff8163a36d>] ? down_write+0x2d/0x30
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abee3>] blocking_notifier_chain_unregister+0x23/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff814f6ec2>] iommu_group_unregister_notifier+0x12/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c226>] vfio_group_unlock_and_free+0x26/0x40 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c2f1>] vfio_group_release+0xb1/0xe0 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091c240>] ? vfio_group_unlock_and_free+0x40/0x40 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091e2b6>] kref_put_mutex.part.3+0x36/0x42 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffffa091d41a>] vfio_iommu_group_notifier+0x34a/0x360 [vfio]
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81641b5c>] notifier_call_chain+0x4c/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd1d>] __blocking_notifier_call_chain+0x4d/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd56>] blocking_notifier_call_chain+0x16/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff814f6d1c>] iommu_bus_notifier+0x8c/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81641b5c>] notifier_call_chain+0x4c/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd1d>] __blocking_notifier_call_chain+0x4d/0x70
Aug 30 15:24:15 compute-15 kernel: [<ffffffff810abd56>] blocking_notifier_call_chain+0x16/0x20
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f63f0>] __device_release_driver+0xd0/0xf0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f6433>] device_release_driver+0x23/0x30
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f4d9d>] driver_unbind+0xbd/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff813f42d4>] drv_attr_store+0x24/0x40
Aug 30 15:24:15 compute-15 kernel: [<ffffffff812593d6>] sysfs_write_file+0xc6/0x140
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811de7bd>] vfs_write+0xbd/0x1e0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811eed8d>] ? putname+0x3d/0x60
Aug 30 15:24:15 compute-15 kernel: [<ffffffff811df25f>] SyS_write+0x7f/0xe0
Aug 30 15:24:15 compute-15 kernel: [<ffffffff81646189>] system_call_fastpath+0x16/0x1b


Hardware: HP gen9; 3 controller + 16 compute

Comment 3 Neil Horman 2016-09-07 15:38:34 UTC

Looks like this should be fixed in 7.3 (in commit 1b1b4f1518c43d9660a3ab86c9e2fa5698848843). Please retest with kernel-3.10.0-377.el7 or later.

Comment 4 Ivan Vecera 2016-09-07 16:00:37 UTC

(In reply to Neil Horman from comment #3)
> Looks like this should be fixed in 7.3 (in commit
> 1b1b4f1518c43d9660a3ab86c9e2fa5698848843).  Please retest with 
> kernel-3.10.0-377.el7 or later.

Neil,
this commit only removes the warning message and does not solve the hung libvirtd process.

Comment 7 Alex Williamson 2016-09-13 19:56:57 UTC

AFAIK, we do not support suspending a VM (S3/S4) with an assigned device attached.

That said, the hung task backtrace is rarely all that useful: it shows that the task is blocked, but not why. libvirt is trying to unbind the device from vfio-pci, an operation that blocks as long as the vfio device is in use. So who is using it? Is the QEMU process still running?
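
One way to approach the "who is using it" question is to look for processes that still hold a VFIO group file open. The following is only a rough sketch, not a supported diagnostic: it assumes the usual Linux layout in which VFIO group nodes live under /dev/vfio/ and open files appear as symlinks in /proc/<pid>/fd. The function name find_vfio_users and its parameters are made up for illustration, and it needs root to inspect other users' processes.

```python
import os


def find_vfio_users(proc_root="/proc", prefix="/dev/vfio/"):
    """Map PID -> set of open file targets that start with `prefix`.

    Hypothetical helper: walks <proc_root>/<pid>/fd and reads each symlink.
    Processes that vanish or cannot be inspected are silently skipped.
    """
    users = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return users  # no procfs available (e.g. not running on Linux)

    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target.startswith(prefix):
                users.setdefault(int(entry), set()).add(target)
    return users


if __name__ == "__main__":
    for pid, targets in sorted(find_vfio_users().items()):
        print(pid, sorted(targets))
```

If a QEMU process shows up in the result while libvirtd is stuck in the unbind, that would be consistent with the device still being in use.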

A workaround might be to change managed='yes' to managed='no' in the VM XML for the assigned device. If the device is pre-bound to vfio-pci using 'virsh nodedev-detach $DEV', libvirt won't try to return it to the host driver, which (hopefully) avoids this whole unbind issue. That said, I don't know whether suspend will then work, because it's not supported.
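
As an illustration of the managed='no' change, the sketch below flips the attribute on PCI hostdev elements in a libvirt domain XML string. It is a minimal example under stated assumptions: the function name unmanage_pci_hostdevs and the sample domain (example-vm) are hypothetical, and the PCI address is simply the one from the log in the description. It only rewrites XML text; it does not talk to libvirt.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def unmanage_pci_hostdevs(domain_xml: str) -> Tuple[str, int]:
    """Return (new_xml, changed): every PCI <hostdev> with managed='yes'
    is switched to managed='no'; `changed` counts the elements touched."""
    root = ET.fromstring(domain_xml)
    changed = 0
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") == "pci" and hostdev.get("managed") == "yes":
            hostdev.set("managed", "no")
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical domain definition; only the <hostdev> part matters here.
SAMPLE = """<domain type='kvm'>
  <name>example-vm</name>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x08' slot='0x1d' function='0x6'/>
      </source>
    </hostdev>
  </devices>
</domain>"""

new_xml, count = unmanage_pci_hostdevs(SAMPLE)
```

With managed='no', the device has to be detached from the host driver beforehand (the 'virsh nodedev-detach $DEV' step above), since libvirt no longer does that on VM start.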

Comment 8 Pablo Iranzo Gómez 2016-09-16 10:50:40 UTC

Thanks Alex, I'm creating a kbase article for it.