Bug 1367369 - Both guest and qemu hang after doing block stream when guest rebooting
Summary: Both guest and qemu hang after doing block stream when guest rebooting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: John Snow
QA Contact: Qianqian Zhu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-08-16 09:42 UTC by Qianqian Zhu
Modified: 2017-08-02 03:27 UTC (History)
CC List: 11 users

Fixed In Version: qemu-kvm-rhev-2.9.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-01 23:32:13 UTC


Attachments
complete dmesg (29.00 KB, text/plain)
2017-02-28 02:48 UTC, Qianqian Zhu


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2392 normal SHIPPED_LIVE Important: qemu-kvm-rhev security, bug fix, and enhancement update 2017-08-01 20:04:36 UTC

Description Qianqian Zhu 2016-08-16 09:42:02 UTC
Description of problem:
Both the guest and QEMU hang when a block stream is started while the guest is rebooting.
Both return to normal once streaming has finished.
It will not hang until guest finish loading virtio-pci 0x6 on boot phase, and the address of my virtio disk is 0x7.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-20.el7.x86_64
qemu-img-rhev-2.6.0-20.el7.x86_64
kernel-devel-3.10.0-491.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Launch the guest:
 /usr/libexec/qemu-kvm -name linux -cpu SandyBridge -m 2048 \
   -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 \
   -uuid 7bef3814-631a-48bb-bae8-2b1de75f7a13 -nodefaults -monitor stdio \
   -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard \
   -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 \
   -boot order=c,menu=on \
   -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 \
   -drive file=/home/RHEL-Server-7.3-64-virtio.qcow2,if=none,cache=none,id=drive-virtio-disk0,format=qcow2 \
   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 \
   -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
   -msg timestamp=on -spice port=5901,disable-ticketing -vga qxl \
   -global qxl-vga.revision=3 -netdev tap,id=hostnet0,vhost=on \
   -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x3 \
   -qmp tcp::5555,server,nowait

2. Take a snapshot:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive-virtio-disk0","snapshot-file": "/home/sn1", "format": "qcow2", "mode": "absolute-paths" } }
3. Reboot the guest:
(qemu) system_reset
4. Start a block stream:
{ "execute": "block-stream", "arguments": { "device": "drive-virtio-disk0", "speed":1000000000, "on-error": "report" } }
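
For scripted reproduction, the snapshot, reset, and block-stream steps can be driven over the QMP socket opened by `-qmp tcp::5555,server,nowait`. The following is a minimal sketch and not part of the original report: the helper names `build_repro_commands` and `run_qmp` are hypothetical, the host and port defaults are assumptions taken from the command line above, and it assumes the QMP `system_reset` command is an acceptable stand-in for the HMP `(qemu) system_reset` used in the reboot step.

```python
import json
import socket


def build_repro_commands(device="drive-virtio-disk0", snapshot_file="/home/sn1"):
    """Return the QMP commands for snapshot, reset, and block stream, in order."""
    return [
        {"execute": "blockdev-snapshot-sync",
         "arguments": {"device": device, "snapshot-file": snapshot_file,
                       "format": "qcow2", "mode": "absolute-paths"}},
        # Assumption: QMP system_reset replaces the HMP command from the report.
        {"execute": "system_reset"},
        {"execute": "block-stream",
         "arguments": {"device": device, "speed": 1000000000,
                       "on-error": "report"}},
    ]


def run_qmp(commands, host="localhost", port=5555):
    """Send commands to a QMP server and return the reply to each one."""
    replies = []
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # consume the QMP greeting banner
        # Capabilities negotiation must happen before any other command.
        for cmd in [{"execute": "qmp_capabilities"}] + list(commands):
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            # Skip asynchronous events; stop at this command's return or error.
            while True:
                msg = json.loads(stream.readline())
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies
```

With the guest from the launch step running, `run_qmp(build_repro_commands())` would issue the three commands back to back. Note that while QEMU is hung the QMP socket itself may stop responding, so a blocked `readline` here is consistent with the reported behaviour rather than a script failure.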


Actual results:
The guest and QEMU hang once the guest has booted far enough to finish loading virtio-pci 0x6, and they do not recover until streaming has completed.

Expected results:
Guest and qemu should not hang during block streaming.

Additional info:
This was also reproduced with qemu-kvm-rhev-2.3.0-31.el7_2.8.x86_64.

Comment 2 John Snow 2016-08-31 19:27:20 UTC
Too late in the 7.3 cycle to fix this for this release. Will investigate for 7.4.

Comment 3 John Snow 2017-02-27 23:49:39 UTC
qianqianzhu, Can you elaborate for me?

What do you mean when you say "It will not hang until guest finish loading virtio-pci 0x6 on boot phase, and the address of my virtio disk is 0x7."?


If I understand you correctly, the timeline looks like this:

1. Make snapshot
2. Reboot
3. Issue "block stream" immediately after step #2.
-- Block stream is now happening while guest tries to boot
-- Guest appears to freeze during bringup (SeaBIOS or Linux freezes?)
-- The guest appears to be frozen after initializing the virtio-pci device?
        (What text output are you using to determine this?)
-- Block stream finishes.
-- Guest unfreezes and boot finishes.

Is that accurate?

Comment 4 Qianqian Zhu 2017-02-28 02:41:22 UTC
(In reply to John Snow from comment #3)

Hi John,

 The steps you listed are correct. The guest is booting Linux, not in SeaBIOS, when it freezes, and the guest dmesg looks like this:

[    2.771360] [drm:qxl_pci_probe [qxl]] *ERROR* qxl too old, doesn't support client_monitors_config, use xf86-video-qxl in user mode
[    2.773122] qxl: probe of 0000:00:02.0 failed with error -22
[   47.712170] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[   47.719395] virtio-pci 0000:00:06.0: irq 24 for MSI/MSI-X
[   47.719417] virtio-pci 0000:00:06.0: irq 25 for MSI/MSI-X

It freezes after [    2.773122] every time. When the block stream finishes, boot continues from [   47.712170] onward. So I think it hangs while trying to load the virtio-pci device at 0x6.

Thanks,
Qianqian
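
The length of the freeze can be read off the dmesg timestamps quoted in this comment. As a small illustration (a sketch, not something used in the original investigation; `largest_gap` is a hypothetical helper), the following finds the largest jump between consecutive kernel timestamps:

```python
import re

# Excerpt copied from the dmesg lines quoted in this comment.
DMESG_EXCERPT = """\
[    2.771360] [drm:qxl_pci_probe [qxl]] *ERROR* qxl too old, doesn't support client_monitors_config, use xf86-video-qxl in user mode
[    2.773122] qxl: probe of 0000:00:02.0 failed with error -22
[   47.712170] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[   47.719395] virtio-pci 0000:00:06.0: irq 24 for MSI/MSI-X
[   47.719417] virtio-pci 0000:00:06.0: irq 25 for MSI/MSI-X
"""

# Kernel timestamps look like "[   47.712170]" at the start of a line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(dmesg_text):
    """Return (gap_seconds, before, after) for the biggest timestamp jump."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    before, after = max(zip(stamps, stamps[1:]), key=lambda p: p[1] - p[0])
    return after - before, before, after


gap, before, after = largest_gap(DMESG_EXCERPT)
print(f"largest gap: {gap:.3f}s between {before:.6f} and {after:.6f}")
```

On this excerpt the largest jump is between 2.773122 and 47.712170, roughly 45 seconds, which matches the window during which the block stream was running.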

Comment 5 Qianqian Zhu 2017-02-28 02:48:47 UTC
Created attachment 1258246 [details]
complete dmesg

Here are the complete boot messages.

Comment 6 John Snow 2017-02-28 15:52:45 UTC
qianqianzhu: Thank you for the clarification and the boot log! I'll test this out today.

Comment 7 John Snow 2017-03-07 01:40:32 UTC
Wow, yeah, confirmed. Easy to reproduce. Even without issuing further QMP commands on boot, QEMU and the guest will both freeze.

Comment 8 John Snow 2017-03-08 00:42:57 UTC
Problem appears to be that virtio_blk_data_plane_stop is called with the BQL held, and then issues a bdrv_drained_begin->bdrv_drain_recurse which will not resolve until the block_stream job has finished.

The guest writes to the VIRTIO_PCI_COMMON_STATUS register for the virtio-pci device to trigger virtio_pci_stop_ioeventfd, which causes the drain which locks until the block stream process finishes.

This process uses bdrv_drain instead of bdrv_drain_all. bdrv_drain, unlike bdrv_drain_all, does not attempt to pause any relevant jobs, so the drain is not allowed to complete until the job finishes.

Fixing it would hopefully be as simple as adding a job pause around bdrv_drained_begin and bdrv_drained_end, but since these functions are used where jobs are added between the drain begin/end, we need to take care not to attempt to resume newly created jobs.

I'll have to poke at this a little bit, but hopefully it's not too hard.
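
The concern about not resuming newly created jobs can be illustrated with a toy model. This is a sketch in Python, not QEMU code: `Job`, `Device`, and their methods are hypothetical names invented for illustration, and it makes no claim about how the eventual fix is structured. The idea shown is only that each drained section remembers exactly which jobs it paused and resumes those and no others.

```python
class Job:
    """Stand-in for a block job; it only tracks how often it was paused."""

    def __init__(self, name):
        self.name = name
        self.pause_count = 0

    def pause(self):
        self.pause_count += 1

    def resume(self):
        # Resuming a job that was never paused is the bug being avoided.
        assert self.pause_count > 0, "resume without matching pause"
        self.pause_count -= 1

    @property
    def running(self):
        return self.pause_count == 0


class Device:
    """Stand-in for a block device with attached jobs."""

    def __init__(self):
        self.jobs = []
        self._paused_sections = []  # stack: jobs paused per drained_begin

    def add_job(self, job):
        self.jobs.append(job)

    def drained_begin(self):
        # Pause only the jobs that exist right now, and remember them.
        paused = list(self.jobs)
        for job in paused:
            job.pause()
        self._paused_sections.append(paused)

    def drained_end(self):
        # Resume exactly the jobs paused by the matching drained_begin;
        # jobs added inside the drained section are left untouched.
        for job in self._paused_sections.pop():
            job.resume()


dev = Device()
stream = Job("stream")
dev.add_job(stream)

dev.drained_begin()          # stream job is paused, so a drain could finish
late = Job("created-inside-drained-section")
dev.add_job(late)
dev.drained_end()            # resumes stream only; late is never resumed

print(stream.running, late.pause_count)
```

A naive version that resumed every job in `dev.jobs` at `drained_end` would trip the assertion in `Job.resume` for the job created inside the drained section, which is the hazard described above.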

Comment 9 Paolo Bonzini 2017-03-10 14:14:59 UTC
Yeah, pause/resume would be a good workaround.  The full solution is much more complex, see https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02016.html for some discussion between me and Kevin.

Comment 10 John Snow 2017-03-30 00:06:45 UTC
upstream:

600ac6a0ef5c06418446ef2f37407bddcc51b21c blockjob: add devops to blockjob backends
f4d9cc88ee69a5b04a843424e50f466e36fcad4e block-backend: add drained_begin / drained_end ops
e3796a245ad0efa65ca8d2dc6424562a8fbaeb6a blockjob: add block_job_start_shim

Included in 2.9.0-rc2; will need to be backported unless we rebase to rc2+.

Comment 11 John Snow 2017-04-26 22:56:44 UTC
Branch has been rebased and should now include a fix.

Comment 14 Qianqian Zhu 2017-05-03 10:46:49 UTC
Verified on:
qemu-kvm-rhev-2.9.0-1.el7.x86_64
kernel-3.10.0-640.el7.x86_64

Steps: same as comment 0.
Results:
QEMU works normally and the guest reboots successfully during the block stream.

Therefore moving to VERIFIED.

Comment 16 errata-xmlrpc 2017-08-01 23:32:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392

