Bug 1121540

Summary: Hot-unplugging a busy virtio-rng device from a Linux guest leaves the rng device stuck
Product: Red Hat Enterprise Linux 7 Reporter: Amos Kong <akong>
Component: kernel    Assignee: jason wang <jasowang>
kernel sub component: KVM QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: medium CC: ailan, amit.shah, ghammer, huding, jasowang, juzhang, knoel, mazhang, mkenneth, rbalakri, rpacheco, virt-maint, xfu, xhan
Version: 7.0    Keywords: Reopened
Target Milestone: rc   
Target Release: 7.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-09-07 07:50:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 917953, 1127062, 1300916    

Description Amos Kong 2014-07-21 07:47:03 UTC
Description of problem:
When hot-unplugging a busy rng device from a Linux guest, the device can't be removed; it still can't be removed even after the device is made idle.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.427.el6.x86_64
guest kernel: 3.10.0-123.el7

How reproducible:
100%

Steps to Reproduce:
1. launch guest with one rng device
   qemu-kvm ... -object rng-random,filename=/dev/urandom,id=rng0   -device virtio-rng-pci,id=device-rng0,rng=rng0
2. read random data from guest rng device
   guest) # dd if=/dev/hwrng of=/dev/null &
3. try to hot-unplug device from monitor
   (qemu) device_del device-rng0
4. check if device still exists
   (qemu) info pci
5. kill dd process in guest
6. repeat step 3, 4
7. repeat step 2

Actual results:

step 4: the busy rng device can't be hot-unplugged
step 6: the now-idle device still can't be hot-unplugged, and we can't read data from the device in the guest

Expected results:
the device can be hot-unplugged in step 6

Comment 1 Amos Kong 2014-07-21 07:53:29 UTC
With a RHEL6 guest, when we hot-remove the device from the QEMU monitor, the dd process in the guest exits and the device can be hot-removed from QEMU.

So I am moving this bug to the RHEL7 kernel component.

Comment 3 Amit Shah 2014-07-21 09:41:39 UTC
Likely a dup of bug 1081431.  Note that I can't reproduce this bug upstream, or in my RHEL7.0 VM.  Also, from the other bug, RHEL6 rmmod does succeed, but the guest panics after some time passes.

Comment 4 Amos Kong 2014-07-28 07:28:09 UTC
(In reply to Amit Shah from comment #3)
> Likely a dup of bug 1081431.  Note that I can't reproduce this bug upstream,
> or in my RHEL7.0 VM.  Also, from the other bug, RHEL6 rmmod does succeed,
> but the guest panics after some time passes.

I can't reproduce this bug with the latest upstream kernel.

I can reproduce it with kernel-3.10.0-123.el7.x86_64 and kernel-3.10.0-140.el7.x86_64.

Comment 5 Amos Kong 2014-08-04 04:26:23 UTC

*** This bug has been marked as a duplicate of bug 1081431 ***

Comment 6 Amos Kong 2014-08-05 17:41:21 UTC
This is not a duplicate of bug 1081431.


Posted a fix upstream:
http://lists.linuxfoundation.org/pipermail/virtualization/2014-August/027049.html

When we try to hot-remove a busy virtio-rng device from the QEMU monitor,
the device can't be hot-removed, because the virtio-rng driver hangs at
wait_for_completion_killable().

This patch fixes the hang by completing the have_data completion before
unregistering the virtio-rng device.

Cc: stable.org

Comment 7 Amos Kong 2014-08-06 04:30:40 UTC
I found _another_ hot-unplug issue that occurs only with the RHEL7 kernel (upstream + Amit's 4 patches + PATCH [1] works fine).

Hot-removing a busy device fails, and even after killing the reading process (dd) the device still can't be hot-removed. That is the difference from the hot-unplug issue I fixed in comment #6 with PATCH [1].

I tried to backport multiple-device support + the core/rng fixes (Amit's 4 patches + my patch [1]) to the RHEL7 kernel.

With that, the two files (drivers/char/hw_random/core.c, drivers/char/hw_random/virtio-rng.c) are _almost completely the same_ as upstream, but this hot-unplug issue still exists. Strange ;/


At the same time, I also found another bug, Bug 1127062.


[1] [PATCH] virtio-rng: complete have_data completion in removing device

Comment 8 Amos Kong 2014-09-09 11:35:44 UTC
Posted a second version upstream:

[PATCH v2] virtio-rng: fix stuck of hot-unplugging busy device

http://marc.info/?l=kvm&m=141026125503138&w=2

Test result:

1. Hot-unplug virtio-rng0; the dd process exits with an error:
    "dd: error reading ‘/dev/hwrng’: No such device"
   and virtio-rng0 disappears from 'info pci'.

2. Re-read with dd, then hot-unplug virtio-rng1; the dd process exits
   with the same error and virtio-rng1 disappears.

Comment 9 Amos Kong 2014-09-17 11:52:57 UTC
Test result of 3.10.0-145.el7.x86_64 & 3.10.0-161.el7.x86_64:
(this problem doesn't exist in the latest upstream kernel; core.c and virtio-rng.c are the same as in the internal tree, which means the internal bug is in some other part)

Start the guest and directly hot-unplug the rng device; after a few minutes we get this kernel message:

[  360.634054] INFO: task kworker/0:4:598 blocked for more than 120 seconds.
[  360.636207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.638569] kworker/0:4     D ffff88007fc14600     0   598      2 0x00000080
[  360.640817] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[  360.642530]  ffff88007b8f9a80 0000000000000046 ffff88007b8f9fd8 0000000000014600
[  360.644658]  ffff88007b8f9fd8 0000000000014600 ffff88007b8f0000 ffffffff81991d60
[  360.646440]  ffffffff81991d64 ffff88007b8f0000 00000000ffffffff ffffffff81991d68
[  360.648233] Call Trace:
[  360.648804]  [<ffffffff815ef479>] schedule_preempt_disabled+0x29/0x70
[  360.650286]  [<ffffffff815ed1a5>] __mutex_lock_slowpath+0xc5/0x1c0
[  360.651717]  [<ffffffff815ec60f>] mutex_lock+0x1f/0x2f
[  360.652905]  [<ffffffff813b0549>] hwrng_unregister+0x19/0x110
[  360.654236]  [<ffffffffa02630b1>] remove_common+0x41/0x70 [virtio_rng]
[  360.655743]  [<ffffffffa026310e>] virtrng_remove+0xe/0x10 [virtio_rng]
[  360.657245]  [<ffffffffa0022093>] virtio_dev_remove+0x23/0x80 [virtio]
[  360.658750]  [<ffffffff813c770f>] __device_release_driver+0x7f/0xf0
[  360.660181]  [<ffffffff813c77a3>] device_release_driver+0x23/0x30
[  360.661570]  [<ffffffff813c6f18>] bus_remove_device+0x108/0x180
[  360.662951]  [<ffffffff813c3475>] device_del+0x135/0x1d0
[  360.664193]  [<ffffffff813c352e>] device_unregister+0x1e/0x60
[  360.665517]  [<ffffffffa00224b6>] unregister_virtio_device+0x16/0x30 [virtio]
[  360.668633]  [<ffffffffa00b456b>] virtio_pci_remove+0x2b/0x70 [virtio_pci]
[  360.671687]  [<ffffffff812fcbfb>] pci_device_remove+0x3b/0xb0
[  360.674482]  [<ffffffff813c770f>] __device_release_driver+0x7f/0xf0
[  360.677398]  [<ffffffff813c77a3>] device_release_driver+0x23/0x30
[  360.680253]  [<ffffffff812f5cd4>] pci_stop_bus_device+0x94/0xa0
[  360.683016]  [<ffffffff812f5dc2>] pci_stop_and_remove_bus_device+0x12/0x20
[  360.685993]  [<ffffffff81313046>] disable_slot+0x76/0xd0
[  360.688631]  [<ffffffff81313e23>] acpiphp_disable_and_eject_slot+0x23/0xa0
[  360.691589]  [<ffffffff81313f4b>] hotplug_event+0xab/0x260
[  360.694253]  [<ffffffff8131412a>] hotplug_event_work+0x2a/0x60
[  360.696979]  [<ffffffff813317d3>] acpi_hotplug_work_fn+0x1c/0x27
[  360.699724]  [<ffffffff81089a1b>] process_one_work+0x17b/0x460
[  360.702437]  [<ffffffff8108a7eb>] worker_thread+0x11b/0x400
[  360.705092]  [<ffffffff8108a6d0>] ? rescuer_thread+0x400/0x400
[  360.707791]  [<ffffffff81091bbf>] kthread+0xcf/0xe0
[  360.710342]  [<ffffffff81091af0>] ? kthread_create_on_node+0x140/0x140
[  360.713198]  [<ffffffff815f90ec>] ret_from_fork+0x7c/0xb0
[  360.715812]  [<ffffffff81091af0>] ? kthread_create_on_node+0x140/0x140

Comment 10 Amos Kong 2014-09-17 12:27:16 UTC
I backported the following two patches to the internal tree, but the stuck task still exists.

[PATCH] virtio-rng: skip reading when we start to remove the device
[PATCH] virtio-rng: fix stuck of hot-unplugging busy device

The hung-task message doesn't appear upstream even before my two patches are applied.

Comment 13 Amos Kong 2014-09-25 17:15:50 UTC
*** Bug 1146437 has been marked as a duplicate of this bug. ***

Comment 16 Amos Kong 2015-05-21 23:34:21 UTC
Bug was fixed in Upstream:

[PATCH 1/2] virtio-rng: fix stuck of hot-unplugging busy device
[PATCH 2/2] virtio-rng: skip reading when we start to remove the device

[PATCH 1/6] hw_random: place mutex around read functions and buffers.
[PATCH 2/6] hw_random: move some code out mutex_lock for avoiding underlying deadlock
[PATCH 3/6] hw_random: use reference counts on each struct hwrng.
[PATCH 4/6] hw_random: fix unregister race.
[PATCH 5/6] hw_random: don't double-check old_rng.
[PATCH 6/6] hw_random: don't init list element we're about to add to list.

[PATCH 1/5] hwrng: core - Use struct completion for cleanup_done
[PATCH 2/5] hwrng: core - Fix current_rng init/cleanup race yet again
[PATCH 3/5] hwrng: core - Do not register device opportunistically
[PATCH 4/5] hwrng: core - Drop current rng in set_current_rng
[PATCH 5/5] hwrng: core - Move hwrng_init call into set_current_rng

Comment 19 jason wang 2015-09-07 07:50:54 UTC
(In reply to Amos Kong from comment #16)
> Bug was fixed in Upstream:
> 
> [PATCH 1/2] virtio-rng: fix stuck of hot-unplugging busy device
> [PATCH 2/2] virtio-rng: skip reading when we start to remove the device
> 
> [PATCH 1/6] hw_random: place mutex around read functions and buffers.
> [PATCH 2/6] hw_random: move some code out mutex_lock for avoiding underlying
> deadlock
> [PATCH 3/6] hw_random: use reference counts on each struct hwrng.
> [PATCH 4/6] hw_random: fix unregister race.
> [PATCH 5/6] hw_random: don't double-check old_rng.
> [PATCH 6/6] hw_random: don't init list element we're about to add to list.
> 
> [PATCH 1/5] hwrng: core - Use struct completion for cleanup_done
> [PATCH 2/5] hwrng: core - Fix current_rng init/cleanup race yet again
> [PATCH 3/5] hwrng: core - Do not register device opportunistically
> [PATCH 4/5] hwrng: core - Drop current rng in set_current_rng
> [PATCH 5/5] hwrng: core - Move hwrng_init call into set_current_rng

The commit list here is the same as in bug 1127062. Closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1127062 ***