Bug 1121540
Summary: | Hot-unplugging a busy virtio-rng device from Linux guest causes rng device stuck | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Amos Kong <akong> |
Component: | kernel | Assignee: | jason wang <jasowang> |
kernel sub component: | KVM | QA Contact: | Virtualization Bugs <virt-bugs> |
Status: | CLOSED DUPLICATE | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | ailan, amit.shah, ghammer, huding, jasowang, juzhang, knoel, mazhang, mkenneth, rbalakri, rpacheco, virt-maint, xfu, xhan |
Version: | 7.0 | Keywords: | Reopened |
Target Milestone: | rc | ||
Target Release: | 7.0 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-09-07 07:50:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 917953, 1127062, 1300916 |
Description
Amos Kong
2014-07-21 07:47:03 UTC
With a RHEL6 guest, when we hot-remove the device from the QEMU monitor, the dd process in the guest exits and the device can be hot-removed from QEMU. So I am moving this bug to the RHEL7 kernel.

Likely a dup of bug 1081431. Note that I can't reproduce this bug upstream, or in my RHEL7.0 VM. Also, from the other bug, RHEL6 rmmod does succeed, but the guest panics after some time passes.

(In reply to Amit Shah from comment #3)
> Likely a dup of bug 1081431. Note that I can't reproduce this bug upstream,
> or in my RHEL7.0 VM. Also, from the other bug, RHEL6 rmmod does succeed,
> but the guest panics after some time passes.

I can't reproduce this bug in the latest upstream kernel. I can reproduce this bug in kernel-3.10.0-123.el7.x86_64 and kernel-3.10.0-140.el7.x86_64.

*** This bug has been marked as a duplicate of bug 1081431 ***

It's not a duplicate of bug 1081431.

Posted a fix to Upstream:
http://lists.linuxfoundation.org/pipermail/virtualization/2014-August/027049.html

When we try to hot-remove a busy virtio-rng device from the QEMU monitor, the device can't be hot-removed, because the virtio-rng driver hangs at wait_for_completion_killable(). This patch fixes the hang by completing the have_data completion before unregistering a virtio-rng device.

Cc: stable.org

I found _another_ hotunplug issue only in the rhel7 kernel (upstream + Amit's 4 patches + PATCH [1] works well). Hot-removing a busy device fails. After killing the reading process (dd), the device still can't be hot-removed. This is the difference from the hotplug issue I fixed in comment #6 with PATCH [1].

I tried to backport multiple dev support + core/rng fixes (Amit's 4 patches + my patch [1]) to the rhel7 kernel. Then those two files (drivers/char/hw_random/core.c, drivers/char/hw_random/virtio-rng.c) are _almost exactly the same_ as Upstream. But this hotplug issue still exists. Strange ;/

At the same time, I also found another bug, Bug 1127062.
[1] [PATCH] virtio-rng: complete have_data completion in removing device

Posted a 2nd version to upstream:
[PATCH v2] virtio-rng: fix stuck of hot-unplugging busy device
http://marc.info/?l=kvm&m=141026125503138&w=2

Test result:
1. Hot-remove virtio-rng0: the dd process exits with an error, "dd: error reading ‘/dev/hwrng’: No such device", and virtio-rng0 disappears from 'info pci'.
2. Re-read with dd, hotplug virtio-rng1: the dd process exits with the same error, and virtio-rng1 disappears.

Test result of 3.10.0-145.el7.x86_64 & 3.10.0-161.el7.x86_64:
(This problem doesn't exist in the latest upstream. core.c and virtio-rng.c are the same as internal, which means we have some internal bug in another part.)

Start the guest and directly hot-unplug the rng device. After waiting some minutes, we get this kernel message:

[ 360.634054] INFO: task kworker/0:4:598 blocked for more than 120 seconds.
[ 360.636207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 360.638569] kworker/0:4 D ffff88007fc14600 0 598 2 0x00000080
[ 360.640817] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 360.642530] ffff88007b8f9a80 0000000000000046 ffff88007b8f9fd8 0000000000014600
[ 360.644658] ffff88007b8f9fd8 0000000000014600 ffff88007b8f0000 ffffffff81991d60
[ 360.646440] ffffffff81991d64 ffff88007b8f0000 00000000ffffffff ffffffff81991d68
[ 360.648233] Call Trace:
[ 360.648804] [<ffffffff815ef479>] schedule_preempt_disabled+0x29/0x70
[ 360.650286] [<ffffffff815ed1a5>] __mutex_lock_slowpath+0xc5/0x1c0
[ 360.651717] [<ffffffff815ec60f>] mutex_lock+0x1f/0x2f
[ 360.652905] [<ffffffff813b0549>] hwrng_unregister+0x19/0x110
[ 360.654236] [<ffffffffa02630b1>] remove_common+0x41/0x70 [virtio_rng]
[ 360.655743] [<ffffffffa026310e>] virtrng_remove+0xe/0x10 [virtio_rng]
[ 360.657245] [<ffffffffa0022093>] virtio_dev_remove+0x23/0x80 [virtio]
[ 360.658750] [<ffffffff813c770f>] __device_release_driver+0x7f/0xf0
[ 360.660181] [<ffffffff813c77a3>] device_release_driver+0x23/0x30
[ 360.661570] [<ffffffff813c6f18>] bus_remove_device+0x108/0x180
[ 360.662951] [<ffffffff813c3475>] device_del+0x135/0x1d0
[ 360.664193] [<ffffffff813c352e>] device_unregister+0x1e/0x60
[ 360.665517] [<ffffffffa00224b6>] unregister_virtio_device+0x16/0x30 [virtio]
[ 360.668633] [<ffffffffa00b456b>] virtio_pci_remove+0x2b/0x70 [virtio_pci]
[ 360.671687] [<ffffffff812fcbfb>] pci_device_remove+0x3b/0xb0
[ 360.674482] [<ffffffff813c770f>] __device_release_driver+0x7f/0xf0
[ 360.677398] [<ffffffff813c77a3>] device_release_driver+0x23/0x30
[ 360.680253] [<ffffffff812f5cd4>] pci_stop_bus_device+0x94/0xa0
[ 360.683016] [<ffffffff812f5dc2>] pci_stop_and_remove_bus_device+0x12/0x20
[ 360.685993] [<ffffffff81313046>] disable_slot+0x76/0xd0
[ 360.688631] [<ffffffff81313e23>] acpiphp_disable_and_eject_slot+0x23/0xa0
[ 360.691589] [<ffffffff81313f4b>] hotplug_event+0xab/0x260
[ 360.694253] [<ffffffff8131412a>] hotplug_event_work+0x2a/0x60
[ 360.696979] [<ffffffff813317d3>] acpi_hotplug_work_fn+0x1c/0x27
[ 360.699724] [<ffffffff81089a1b>] process_one_work+0x17b/0x460
[ 360.702437] [<ffffffff8108a7eb>] worker_thread+0x11b/0x400
[ 360.705092] [<ffffffff8108a6d0>] ? rescuer_thread+0x400/0x400
[ 360.707791] [<ffffffff81091bbf>] kthread+0xcf/0xe0
[ 360.710342] [<ffffffff81091af0>] ? kthread_create_on_node+0x140/0x140
[ 360.713198] [<ffffffff815f90ec>] ret_from_fork+0x7c/0xb0
[ 360.715812] [<ffffffff81091af0>] ? kthread_create_on_node+0x140/0x140

I backported two patches to internal, but the stuck still exists.

[PATCH] virtio-rng: skip reading when we start to remove the device
[PATCH] virtio-rng: fix stuck of hot-unplugging busy device

The stuck message doesn't appear upstream, even before my two patches are applied.

*** Bug 1146437 has been marked as a duplicate of this bug. ***

Bug was fixed in Upstream:

[PATCH 1/2] virtio-rng: fix stuck of hot-unplugging busy device
[PATCH 2/2] virtio-rng: skip reading when we start to remove the device

[PATCH 1/6] hw_random: place mutex around read functions and buffers.
[PATCH 2/6] hw_random: move some code out mutex_lock for avoiding underlying deadlock
[PATCH 3/6] hw_random: use reference counts on each struct hwrng.
[PATCH 4/6] hw_random: fix unregister race.
[PATCH 5/6] hw_random: don't double-check old_rng.
[PATCH 6/6] hw_random: don't init list element we're about to add to list.

[PATCH 1/5] hwrng: core - Use struct completion for cleanup_done
[PATCH 2/5] hwrng: core - Fix current_rng init/cleanup race yet again
[PATCH 3/5] hwrng: core - Do not register device opportunistically
[PATCH 4/5] hwrng: core - Drop current rng in set_current_rng
[PATCH 5/5] hwrng: core - Move hwrng_init call into set_current_rng

(In reply to Amos Kong from comment #16)
> Bug was fixed in Upstream:
>
> [PATCH 1/2] virtio-rng: fix stuck of hot-unplugging busy device
> [PATCH 2/2] virtio-rng: skip reading when we start to remove the device
>
> [PATCH 1/6] hw_random: place mutex around read functions and buffers.
> [PATCH 2/6] hw_random: move some code out mutex_lock for avoiding underlying
> deadlock
> [PATCH 3/6] hw_random: use reference counts on each struct hwrng.
> [PATCH 4/6] hw_random: fix unregister race.
> [PATCH 5/6] hw_random: don't double-check old_rng.
> [PATCH 6/6] hw_random: don't init list element we're about to add to list.
>
> [PATCH 1/5] hwrng: core - Use struct completion for cleanup_done
> [PATCH 2/5] hwrng: core - Fix current_rng init/cleanup race yet again
> [PATCH 3/5] hwrng: core - Do not register device opportunistically
> [PATCH 4/5] hwrng: core - Drop current rng in set_current_rng
> [PATCH 5/5] hwrng: core - Move hwrng_init call into set_current_rng

The commit list here is the same as 1127062. Close as duplicated.

*** This bug has been marked as a duplicate of bug 1127062 ***
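
For readers following the thread, here is a minimal userspace sketch of the idea behind the virtio-rng fix described above: complete have_data before unregistering, so that a reader sleeping in the wait wakes up and sees that the device is gone. This is a toy model in Python, not kernel code. The class ToyRngDevice, the removed flag, and the helper unplug_while_reading are hypothetical names chosen for illustration, and threading.Event only approximates a kernel completion.

```python
import errno
import threading


class ToyRngDevice:
    """Toy stand-in for a virtio-rng device; an illustration, not kernel code."""

    def __init__(self):
        # Plays the role of the driver's have_data completion.
        self.have_data = threading.Event()
        # Hypothetical "device is going away" flag.
        self.removed = False
        self.data = b""

    def read(self, timeout=None):
        """Block until data arrives or the device is removed."""
        if self.removed:
            raise OSError(errno.ENODEV, "No such device")
        # The reader sleeps here, much like the driver does in
        # wait_for_completion_killable().
        if not self.have_data.wait(timeout):
            raise TimeoutError("reader is still waiting for data")
        # Re-check after waking: the wake-up may have come from remove().
        if self.removed:
            raise OSError(errno.ENODEV, "No such device")
        self.have_data.clear()
        return self.data

    def remove(self):
        """Hot-unplug path: mark the device gone, then wake any sleeping reader."""
        self.removed = True
        self.have_data.set()


def unplug_while_reading():
    """Start a reader, unplug the device, and report what the reader saw."""
    dev = ToyRngDevice()
    result = {}

    def reader():
        try:
            result["value"] = dev.read(timeout=5)
        except TimeoutError:
            # Must be caught before OSError, because TimeoutError subclasses it.
            result["stuck"] = True
        except OSError as exc:
            result["errno"] = exc.errno

    thread = threading.Thread(target=reader)
    thread.start()
    dev.remove()
    thread.join()
    return result
```

In this model the reader ends with ENODEV whether remove() runs before or after it starts waiting, which corresponds to the "No such device" error that dd reports in the test result. Without the wake-up in remove(), the reader would sit in the wait until its timeout, which is the toy equivalent of the stuck device.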