1367369 – Both guest and qemu hang after doing block stream when guest rebooting

Summary: Both guest and qemu hang after doing block stream when guest rebooting

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.3
Hardware:	All
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	John Snow
QA Contact:	Qianqian Zhu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-16 09:42 UTC by Qianqian Zhu
Modified:	2017-08-02 03:27 UTC
CC List:	11 users
Fixed In Version:	qemu-kvm-rhev-2.9.0-1.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-01 23:32:13 UTC
Target Upstream Version:
Embargoed:

Attachments
complete dmesg (29.00 KB, text/plain) 2017-02-28 02:48 UTC, Qianqian Zhu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:2392	0	normal	SHIPPED_LIVE	Important: qemu-kvm-rhev security, bug fix, and enhancement update	2017-08-01 20:04:36 UTC

Description Qianqian Zhu 2016-08-16 09:42:02 UTC

Description of problem:
Both the guest and QEMU hang when a block stream is started while the guest is rebooting.
Both return to normal once streaming has finished.
It will not hang until guest finish loading virtio-pci 0x6 on boot phase, and the address of my virtio disk is 0x7.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-20.el7.x86_64
qemu-img-rhev-2.6.0-20.el7.x86_64
kernel-devel-3.10.0-491.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Launch guest:
 /usr/libexec/qemu-kvm -name linux -cpu SandyBridge -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 7bef3814-631a-48bb-bae8-2b1de75f7a13 -nodefaults -monitor stdio -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=c,menu=on -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/home/RHEL-Server-7.3-64-virtio.qcow2,if=none,cache=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on -spice port=5901,disable-ticketing -vga qxl -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x3 -qmp tcp::5555,server,nowait

2. Do snapshot:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive-virtio-disk0","snapshot-file": "/home/sn1", "format": "qcow2", "mode": "absolute-paths" } }
3. Reboot guest:
(qemu) system_reset
4. Block stream:
{ "execute": "block-stream", "arguments": { "device": "drive-virtio-disk0", "speed":1000000000, "on-error": "report" } }
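
The QMP part of the steps above can be scripted. The following is a minimal sketch, not part of the original report: it assumes the QMP socket from the command line in step 1 (-qmp tcp::5555) is reachable on localhost, and it uses the QMP "system_reset" command in place of the HMP "(qemu) system_reset" from step 3. The names QMP_ADDR, build_commands, read_reply and run are hypothetical helpers chosen for illustration.

```python
import json
import socket

# Assumed for illustration: QMP listens on localhost:5555 ("-qmp tcp::5555").
QMP_ADDR = ("localhost", 5555)
DEVICE = "drive-virtio-disk0"


def build_commands(device=DEVICE, snapshot_file="/home/sn1"):
    """Return the QMP commands for steps 2-4, in order."""
    return [
        {"execute": "blockdev-snapshot-sync",
         "arguments": {"device": device, "snapshot-file": snapshot_file,
                       "format": "qcow2", "mode": "absolute-paths"}},
        # QMP counterpart of the HMP "system_reset" used in step 3.
        {"execute": "system_reset"},
        {"execute": "block-stream",
         "arguments": {"device": device, "speed": 1000000000,
                       "on-error": "report"}},
    ]


def read_reply(stream):
    """Read QMP messages until a command reply, skipping async events."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)
        if "return" in msg or "error" in msg:
            return msg


def run(addr=QMP_ADDR):
    """Connect to QMP, negotiate capabilities, then issue steps 2-4."""
    with socket.create_connection(addr) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # consume the QMP greeting
        for cmd in [{"execute": "qmp_capabilities"}] + build_commands():
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            print(cmd["execute"], "->", read_reply(stream))
```

Calling run() against a guest launched as in step 1 should issue the snapshot, reset and stream commands back to back, which is the timing the report depends on.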


Actual results:
The guest and QEMU hang during guest boot, after virtio-pci 0x6 finishes loading, and recover only once streaming has completed.

Expected results:
Guest and qemu should not hang during block streaming.

Additional info:
It can also be reproduced with qemu-kvm-rhev-2.3.0-31.el7_2.8.x86_64

Comment 2 John Snow 2016-08-31 19:27:20 UTC

Too late in the 7.3 cycle to fix this for this release. Will investigate for 7.4.

Comment 3 John Snow 2017-02-27 23:49:39 UTC

qianqianzhu, Can you elaborate for me?

What do you mean when you say "It will not hang until guest finish loading virtio-pci 0x6 on boot phase, and the address of my virtio disk is 0x7."?


If I understand you correctly, the timeline looks like this:

1. Make snapshot
2. Reboot
3. Issue "block stream" immediately after step #2.
-- Block stream is now happening while guest tries to boot
-- Guest appears to freeze during bringup (SeaBIOS or Linux freezes?)
-- The guest appears to be frozen after initializing the virtio-pci device?
        (What text output are you using to determine this?)
-- Block stream finishes.
-- Guest unfreezes and boot finishes.

Is that accurate?

Comment 4 Qianqian Zhu 2017-02-28 02:41:22 UTC

(In reply to John Snow from comment #3)
> qianqianzhu, Can you elaborate for me?
> 
> What do you mean when you say "It will not hang until guest finish loading
> virtio-pci 0x6 on boot phase, and the address of my virtio disk is 0x7."?
> 
> 
> If I understand you correctly, the timeline looks like this:
> 
> 1. Make snapshot
> 2. Reboot
> 3. Issue "block stream" immediately after step #2.
> -- Block stream is now happening while guest tries to boot
> -- Guest appears to freeze during bringup (SeaBIOS or Linux freezes?)
> -- The guest appears to be frozen after initializing the virtio-pci device?
>         (What text output are you using to determine this?)
> -- Block stream finishes.
> -- Guest unfreezes and boot finishes.
> 
> Is that accurate?

Hi John,

The steps you listed are correct. The guest is frozen while booting Linux, not in SeaBIOS, and the guest dmesg looks like this:

[    2.771360] [drm:qxl_pci_probe [qxl]] *ERROR* qxl too old, doesn't support client_monitors_config, use xf86-video-qxl in user mode
[    2.773122] qxl: probe of 0000:00:02.0 failed with error -22
[   47.712170] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[   47.719395] virtio-pci 0000:00:06.0: irq 24 for MSI/MSI-X
[   47.719417] virtio-pci 0000:00:06.0: irq 25 for MSI/MSI-X

It freezes after [    2.773122] every time; when the block stream finishes, it continues from [   47.712170] onward. So I think it hangs while trying to load the virtio-pci device at 0x6.

Thanks,
Qianqian
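
The stall described in this comment can be measured from the kernel timestamps. The sketch below is an illustration only: it parses the five dmesg lines quoted above and reports the largest gap between consecutive timestamps. The helper name largest_gap is hypothetical.

```python
import re

# The dmesg excerpt quoted in this comment.
DMESG = """\
[    2.771360] [drm:qxl_pci_probe [qxl]] *ERROR* qxl too old, doesn't support client_monitors_config, use xf86-video-qxl in user mode
[    2.773122] qxl: probe of 0000:00:02.0 failed with error -22
[   47.712170] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[   47.719395] virtio-pci 0000:00:06.0: irq 24 for MSI/MSI-X
[   47.719417] virtio-pci 0000:00:06.0: irq 25 for MSI/MSI-X
"""

# Matches the leading "[ seconds.micros]" kernel timestamp.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(text):
    """Return (gap, before, after) for the widest jump between timestamps."""
    stamps = []
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return max((b - a, a, b) for a, b in zip(stamps, stamps[1:]))


gap, before, after = largest_gap(DMESG)
print(f"largest gap: {gap:.6f}s between {before} and {after}")
```

For this excerpt the widest gap lies between 2.773122 and 47.712170, that is, roughly 45 seconds during which the guest made no progress.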

Comment 5 Qianqian Zhu 2017-02-28 02:48:47 UTC

Created attachment 1258246
complete dmesg

Here is the complete boot messages.

Comment 6 John Snow 2017-02-28 15:52:45 UTC

qianqianzhu: Thank you for the clarification and the boot log! I'll test this out today.

Comment 7 John Snow 2017-03-07 01:40:32 UTC

Wow, yeah, confirmed. Easy to reproduce. Even without issuing further QMP commands on boot, QEMU and the guest will both freeze.

Comment 8 John Snow 2017-03-08 00:42:57 UTC

Problem appears to be that virtio_blk_data_plane_stop is called with the BQL held, and then issues a bdrv_drained_begin->bdrv_drain_recurse which will not resolve until the block_stream job has finished.

The guest writes to the VIRTIO_PCI_COMMON_STATUS register of the virtio-pci device, which triggers virtio_pci_stop_ioeventfd. That in turn starts the drain, which blocks until the block stream job finishes.

This process uses bdrv_drain instead of bdrv_drain_all. bdrv_drain, unlike bdrv_drain_all, does not attempt to pause any relevant jobs, so the drain is not allowed to complete until the job finishes.

Fixing it would hopefully be as simple as adding a job pause around bdrv_drained_begin and bdrv_drained_end, but since these functions are used where jobs are added between the drain begin/end, we need to take care not to attempt to resume newly created jobs.

I'll have to poke at this a little bit, but hopefully it's not too hard.
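
The pause/resume idea in this comment can be shown with a toy model. This is a sketch under stated assumptions, not QEMU code and not the upstream fix: the Job and Node classes are hypothetical stand-ins, and the only point is that drained_end resumes exactly the jobs that the matching drained_begin paused, so a job created inside the drained section is never resumed without a prior pause.

```python
class Job:
    """Toy stand-in for a block job (hypothetical, not QEMU's BlockJob)."""

    def __init__(self, name):
        self.name = name
        self.pause_count = 0

    def pause(self):
        self.pause_count += 1

    def resume(self):
        # Resuming a job that was never paused is the bug to avoid.
        assert self.pause_count > 0, f"{self.name} resumed without a pause"
        self.pause_count -= 1

    @property
    def paused(self):
        return self.pause_count > 0


class Node:
    """Toy stand-in for a block node with jobs attached (hypothetical)."""

    def __init__(self):
        self.jobs = []
        self._paused_sections = []  # one list of paused jobs per nested drain

    def drained_begin(self):
        # Pause only the jobs that exist right now, and remember them.
        paused = list(self.jobs)
        for job in paused:
            job.pause()
        self._paused_sections.append(paused)

    def drained_end(self):
        # Resume exactly the jobs paused by the matching drained_begin.
        for job in self._paused_sections.pop():
            job.resume()


node = Node()
stream = Job("block-stream")
node.jobs.append(stream)

node.drained_begin()
late = Job("created-during-drain")
node.jobs.append(late)  # added between drained_begin and drained_end
node.drained_end()
```

Remembering the paused set per drained section, rather than iterating over node.jobs again at drained_end, is what keeps the late job untouched in this model.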

Comment 9 Paolo Bonzini 2017-03-10 14:14:59 UTC

Yeah, pause/resume would be a good workaround.  The full solution is much more complex, see https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02016.html for some discussion between me and Kevin.

Comment 10 John Snow 2017-03-30 00:06:45 UTC

upstream:

600ac6a0ef5c06418446ef2f37407bddcc51b21c blockjob: add devops to blockjob backends
f4d9cc88ee69a5b04a843424e50f466e36fcad4e block-backend: add drained_begin / drained_end ops
e3796a245ad0efa65ca8d2dc6424562a8fbaeb6a blockjob: add block_job_start_shim

Included in 2.9.0-rc2; will need to be backported unless we rebase to rc2+.

Comment 11 John Snow 2017-04-26 22:56:44 UTC

Branch has been rebased and should now include a fix.

Comment 14 Qianqian Zhu 2017-05-03 10:46:49 UTC

Verified on:
qemu-kvm-rhev-2.9.0-1.el7.x86_64
kernel-3.10.0-640.el7.x86_64

Steps are the same as in comment 0.
Results:
QEMU works well and the guest reboots successfully during the block stream.

Moving to VERIFIED therefore.

Comment 16 errata-xmlrpc 2017-08-01 23:32:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
