Bug 1609137 - [data plane] Guest hang after the drive-backup
Summary: [data plane] Guest hang after the drive-backup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: Gu Nini
URL:
Whiteboard:
Depends On: 1601212 1637976
Blocks:
 
Reported: 2018-07-27 05:53 UTC by Gu Nini
Modified: 2019-12-23 06:26 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-23 06:26:13 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Gu Nini 2018-07-27 05:53:07 UTC
Description of problem:
When data plane is used, the guest hangs after the drive-backup job completes.

Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-926.el7.x86_64
qemu-kvm-rhev-2.12.0-8.el7.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Boot up a guest with a data-plane-enabled disk:

    -object iothread,id=iothread1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x6,iothread=iothread1 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$1 \
    -device scsi-hd,drive=drive_image1,id=image1 \
    -drive id=drive_image2,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$2 \
    -device scsi-hd,drive=drive_image2,id=image2 \
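
For reference, a minimal sketch of how the fragment above fits into a complete command line on a RHEL 7 host; the binary path is the qemu-kvm-rhev default, while the machine/memory/CPU settings and the QMP socket path are placeholder assumptions, not taken from this report ($1 and $2 remain the guest image names, as above):

    # placeholder invocation; only the iothread/virtio-scsi/drive options come from the report
    /usr/libexec/qemu-kvm -enable-kvm -machine pc -m 4096 -smp 2 \
        -object iothread,id=iothread1 \
        -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x6,iothread=iothread1 \
        -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$1 \
        -device scsi-hd,drive=drive_image1,id=image1 \
        -drive id=drive_image2,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$2 \
        -device scsi-hd,drive=drive_image2,id=image2 \
        -qmp unix:/tmp/qmp.sock,server,nowait \
        -monitor stdio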

2. Do a full drive-backup of the data disk via QMP.
{ "execute": "drive-backup", "arguments": { "device": "drive_image2", "target":"full_backup.img","format":"qcow2","sync":"full"}}


Actual results:
After the drive-backup finishes, the guest hangs, i.e. there is no response in HMP or QMP:
{ "execute": "drive-backup", "arguments": { "device": "drive_image2", "target":"full_backup.img","format":"qcow2","sync":"full"}}
{"timestamp": {"seconds": 1532669038, "microseconds": 281738}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669038, "microseconds": 281850}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image2"}}
{"return": {}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529463}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529535}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529629}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image2", "len": 10737418240, "offset": 10737418240, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529671}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529700}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image2"}}
{ "execute": "query-block"}


Expected results:
The drive-backup should finish without any problem and the guest should remain responsive.

Additional info:
# gdb -batch -ex bt -p 10440
[New LWP 10542]
[New LWP 10504]
[New LWP 10501]
[New LWP 10500]
[New LWP 10499]
[New LWP 10495]
[New LWP 10442]
[New LWP 10441]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f2d5b8022cf in ppoll () from /lib64/libc.so.6
#0  0x00007f2d5b8022cf in ppoll () at /lib64/libc.so.6
#1  0x000055b2a64368eb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055b2a64368eb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055b2a6438635 in aio_poll (ctx=0x55b2a7af17c0, blocking=blocking@entry=true) at util/aio-posix.c:629
#4  0x000055b2a63b213a in bdrv_flush (bs=bs@entry=0x55b2a7d64800) at block/io.c:2560
#5  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:3322
#6  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:3510
#7  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:4558
#8  0x000055b2a63661b4 in block_job_remove_all_bdrv (job=job@entry=0x55b2a7b31b80) at blockjob.c:177
#9  0x000055b2a6366203 in block_job_free (job=0x55b2a7b31b80) at blockjob.c:94
#10 0x000055b2a636764d in job_unref (job=0x55b2a7b31b80) at job.c:367
#11 0x000055b2a6367858 in job_finalize_single (job=0x55b2a7b31b80) at job.c:654
#12 0x000055b2a6367858 in job_finalize_single (job=0x55b2a7b31b80) at job.c:722
#13 0x000055b2a6366ec0 in job_txn_apply (fn=0x55b2a6367750 <job_finalize_single>, lock=true, txn=<optimized out>) at job.c:150
#14 0x000055b2a63c2b7d in backup_complete (job=<optimized out>, opaque=0x55b2a97f4410) at block/backup.c:391
#15 0x000055b2a6366d62 in job_defer_to_main_loop_bh (opaque=0x55b2a99aaaa0) at job.c:968
#16 0x000055b2a6435451 in aio_bh_poll (bh=0x55b2a7eb9950) at util/async.c:90
#17 0x000055b2a6435451 in aio_bh_poll (ctx=ctx@entry=0x55b2a7af17c0) at util/async.c:118
#18 0x000055b2a64384f0 in aio_dispatch (ctx=0x55b2a7af17c0) at util/aio-posix.c:436
#19 0x000055b2a643532e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#20 0x00007f2d73ead049 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#21 0x000055b2a64377f7 in main_loop_wait () at util/main-loop.c:215
#22 0x000055b2a64377f7 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#23 0x000055b2a64377f7 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#24 0x000055b2a60dba67 in main () at vl.c:1963
#25 0x000055b2a60dba67 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4768
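
The backtrace above covers the main QEMU thread only; a hedged sketch for collecting all threads at once (finding the process by name with pidof is an assumption; attaching to the exact PID as above works the same way):

    # dump a backtrace of every thread in the hung QEMU process
    gdb -batch -ex 'thread apply all bt' -p $(pidof qemu-kvm)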

Comment 3 Fam Zheng 2018-08-21 01:39:48 UTC
I cannot reproduce it on my machine. If it's only backup, we can move it to 7.7. Does the hang happen with other block jobs (mirror, commit, etc.)?

Comment 5 Fam Zheng 2018-08-23 07:39:24 UTC
This is not specific to drive-backup. We should fix it. Upstream patch:

https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04234.html

