Bug 1609137

Summary: [data plane] Guest hang after the drive-backup
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.6
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Gu Nini <ngu>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Gu Nini <ngu>
CC: aliang, areis, chayang, coli, juzhang, ngu, qzhang, virt-maint, xianwang
Target Milestone: rc
Target Release: ---
Keywords: TestOnly
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-12-23 06:26:13 UTC
Bug Depends On: 1601212, 1637976

Description Gu Nini 2018-07-27 05:53:07 UTC
Description of problem:
When the data plane is in use, the guest hangs after the drive-backup job completes.

Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-926.el7.x86_64
qemu-kvm-rhev-2.12.0-8.el7.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Boot up a guest with a data-plane enabled disk:

    -object iothread,id=iothread1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x6,iothread=iothread1 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$1 \
    -device scsi-hd,drive=drive_image1,id=image1 \
    -drive id=drive_image2,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/$2 \
    -device scsi-hd,drive=drive_image2,id=image2 \
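
The fragment above shows only the data-plane related options; the rest of the boot command line is omitted in the report. Step 2 additionally assumes a QMP channel, which, if not already present on the full command line, can be exposed with e.g. the standard option (illustrative, not taken from this report):

    -qmp tcp:0:4444,server,nowait \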

2. Do a full drive-backup of the second disk in QMP:
{ "execute": "drive-backup", "arguments": { "device": "drive_image2", "target":"full_backup.img","format":"qcow2","sync":"full"}}


Actual results:
After the drive-backup finishes, the guest hangs, i.e. there is no response in HMP or QMP; note that the final query-block below never gets a reply:
{ "execute": "drive-backup", "arguments": { "device": "drive_image2", "target":"full_backup.img","format":"qcow2","sync":"full"}}
{"timestamp": {"seconds": 1532669038, "microseconds": 281738}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669038, "microseconds": 281850}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image2"}}
{"return": {}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529463}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529535}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529629}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image2", "len": 10737418240, "offset": 10737418240, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529671}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image2"}}
{"timestamp": {"seconds": 1532669178, "microseconds": 529700}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image2"}}
{ "execute": "query-block"}


Expected results:
The drive-backup finishes, and the guest keeps running without any problem.

Additional info:
Backtrace of the main thread of the hung QEMU process (PID 10440):
# gdb -batch -ex bt -p 10440
[New LWP 10542]
[New LWP 10504]
[New LWP 10501]
[New LWP 10500]
[New LWP 10499]
[New LWP 10495]
[New LWP 10442]
[New LWP 10441]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f2d5b8022cf in ppoll () from /lib64/libc.so.6
#0  0x00007f2d5b8022cf in ppoll () at /lib64/libc.so.6
#1  0x000055b2a64368eb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055b2a64368eb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055b2a6438635 in aio_poll (ctx=0x55b2a7af17c0, blocking=blocking@entry=true) at util/aio-posix.c:629
#4  0x000055b2a63b213a in bdrv_flush (bs=bs@entry=0x55b2a7d64800) at block/io.c:2560
#5  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:3322
#6  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:3510
#7  0x000055b2a6362beb in bdrv_unref (bs=0x55b2a7d64800) at block.c:4558
#8  0x000055b2a63661b4 in block_job_remove_all_bdrv (job=job@entry=0x55b2a7b31b80) at blockjob.c:177
#9  0x000055b2a6366203 in block_job_free (job=0x55b2a7b31b80) at blockjob.c:94
#10 0x000055b2a636764d in job_unref (job=0x55b2a7b31b80) at job.c:367
#11 0x000055b2a6367858 in job_finalize_single (job=0x55b2a7b31b80) at job.c:654
#12 0x000055b2a6367858 in job_finalize_single (job=0x55b2a7b31b80) at job.c:722
#13 0x000055b2a6366ec0 in job_txn_apply (fn=0x55b2a6367750 <job_finalize_single>, lock=true, txn=<optimized out>) at job.c:150
#14 0x000055b2a63c2b7d in backup_complete (job=<optimized out>, opaque=0x55b2a97f4410) at block/backup.c:391
#15 0x000055b2a6366d62 in job_defer_to_main_loop_bh (opaque=0x55b2a99aaaa0) at job.c:968
#16 0x000055b2a6435451 in aio_bh_poll (bh=0x55b2a7eb9950) at util/async.c:90
#17 0x000055b2a6435451 in aio_bh_poll (ctx=ctx@entry=0x55b2a7af17c0) at util/async.c:118
#18 0x000055b2a64384f0 in aio_dispatch (ctx=0x55b2a7af17c0) at util/aio-posix.c:436
#19 0x000055b2a643532e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#20 0x00007f2d73ead049 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#21 0x000055b2a64377f7 in main_loop_wait () at util/main-loop.c:215
#22 0x000055b2a64377f7 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#23 0x000055b2a64377f7 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#24 0x000055b2a60dba67 in main () at vl.c:1963
#25 0x000055b2a60dba67 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4768
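
The trace shows the main loop blocked in aio_poll() under bdrv_flush() while finalizing the completed backup job (job_defer_to_main_loop_bh -> backup_complete -> block_job_remove_all_bdrv). For a hang like this it is also useful to capture the stacks of all threads, not just the main one, to see what the iothread is doing; the same gdb batch mode can do that (illustrative invocation, assuming the same PID):

# gdb -batch -ex 'thread apply all bt' -p 10440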

Comment 3 Fam Zheng 2018-08-21 01:39:48 UTC
I cannot reproduce it on my machine. If it's only backup, we can move it to 7.7. Does the hang happen with other block jobs (mirror, commit, etc.)?
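
For reference, an analogous mirror job on the same data-plane disk could be started with something like the following (illustrative command, not taken from this report):

{ "execute": "drive-mirror", "arguments": { "device": "drive_image2", "target": "mirror.img", "format": "qcow2", "sync": "full" } }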

Comment 5 Fam Zheng 2018-08-23 07:39:24 UTC
This is not specific to drive-backup. We should fix it. Upstream patch:

https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg04234.html