Bug 1609561

Summary: qemu-kvm: Failed to lock byte and crashes during migration
Product: Red Hat Enterprise Linux 7
Reporter: Polina <pagranat>
Component: qemu-kvm-rhev
Assignee: Jeff Cody <jcody>
Status: CLOSED DUPLICATE
QA Contact: CongLi <coli>
Severity: medium
Priority: medium
Version: 7.5
CC: chayang, coli, hhan, juzhang, michal.skrivanek, michen, mzamazal, ngu, pkrempa, pvilayat, qzhang, tburke, virt-maint, xianwang, yuhuang
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-15 17:43:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1481022
Attachments:
libvirt_qemu_vdsm_engine_logs.tar.gz (flags: none)

Description Polina 2018-07-29 12:48:55 UTC
Created attachment 1471356 [details]
libvirt_qemu_vdsm_engine_logs.tar.gz

Description of problem:


Version-Release number of selected component (if applicable):
rhv-release-4.2.5-5-001.noarch
libvirt-3.9.0-14.el7_5.6.x86_64
vdsm-4.20.35-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

How reproducible: 100%

Steps to Reproduce:
1. Run a VM associated with a gluster disk.
2. Block the gluster storage (three volumes), e.g.:
    iptables -I INPUT -s gluster01.scl.lab.tlv.redhat.com -j DROP
    iptables -I INPUT -s gluster02.scl.lab.tlv.redhat.com -j DROP
    iptables -I INPUT -s gluster03.scl.lab.tlv.redhat.com -j DROP
3. After a while, the host becomes Non-Operational. At this stage, the engine tries to migrate the VM.

Actual results: the migration process crashes (please see the attached qemu.log)

Expected results: the VM should be reported as paused due to an I/O error while the storage is blocked


Additional info:

Comment 2 Peter Krempa 2018-07-30 08:18:20 UTC
The VM has only one disk with a backing file (the second one is an empty cdrom):

 -drive if=none,id=drive-ua-18beca16-e731-4caa-b05d-f6f758930198,readonly=on,werror=report,rerror=report
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-18beca16-e731-4caa-b05d-f6f758930198,id=ua-18beca16-e731-4caa-b05d-f6f758930198
 -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_compute-ge-he-4-volume1/e0dc227a-3fc9-4454-adbd-e6d502167951/images/bb538f7d-8a8d-4aff-9086-b14366d13050/95b8ad11-5204-4057-ab5c-ccc0a51c5428,format=qcow2,if=none,id=drive-ua-bb538f7d-8a8d-4aff-9086-b14366d13050,serial=bb538f7d-8a8d-4aff-9086-b14366d13050,cache=none,werror=stop,rerror=stop,aio=threads
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-bb538f7d-8a8d-4aff-9086-b14366d13050,id=ua-bb538f7d-8a8d-4aff-9086-b14366d13050,bootindex=1

Since the disk uses a local path, the host appears to be accessing the gluster volume via the glusterfs FUSE driver.
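
One quick way to confirm that inference (my own snippet, not part of the attached logs) is to compare the filesystem magic of the path; a glusterfs volume mounted through the FUSE driver reports FUSE_SUPER_MAGIC:

#include <stdio.h>
#include <sys/vfs.h>
#include <linux/magic.h>

int main(int argc, char **argv)
{
    struct statfs sfs;

    if (argc < 2 || statfs(argv[1], &sfs) != 0) {
        perror("statfs");
        return 1;
    }
    /* glusterfs mounted via the FUSE driver reports FUSE_SUPER_MAGIC */
    printf("%s is %son a FUSE filesystem\n", argv[1],
           sfs.f_type == FUSE_SUPER_MAGIC ? "" : "not ");
    return 0;
}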

QEMU then logs:

2018-07-29 11:25:44.598+0000: initiating migration
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:642:
2018-07-29T11:25:47.638863Z qemu-kvm: Failed to lock byte 100

and aborts. Libvirt then treats the VM as crashed.

2018-07-29 11:25:47.887+0000: shutting down, reason=crashed
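
For context, raw_apply_lock_bytes() implements QEMU's image locking with byte-range advisory locks on the image file (bytes at offset 100 and up). The sketch below, my own illustration rather than QEMU's actual code, shows the underlying Linux OFD-lock pattern and how the fcntl() call surfaces a failure like "Failed to lock byte 100" once the FUSE mount stops servicing requests:

#define _GNU_SOURCE     /* for F_OFD_SETLK */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Take an advisory lock on a single byte of the image file, in the
 * spirit of QEMU's image-locking code. */
static int lock_byte(int fd, off_t byte)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = byte,
        .l_len    = 1,
    };

    /* F_OFD_SETLK: non-blocking open-file-description lock (Linux). */
    if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
        fprintf(stderr, "Failed to lock byte %lld: %s\n",
                (long long)byte, strerror(errno));
        return -errno;
    }
    return 0;
}

int main(int argc, char **argv)
{
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDWR)) < 0) {
        perror("open");
        return 1;
    }
    /* On a healthy filesystem this succeeds instantly; on a dead FUSE
     * mount it can hang or fail, which QEMU treats as fatal here. */
    return lock_byte(fd, 100) ? 1 : 0;
}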

Reassigning to qemu.

Comment 3 Han Han 2018-07-31 05:06:47 UTC
Not reproduced in libvirt.
Versions: libvirt-3.9.0-14.el7_5.6.x86_64, qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64, glusterfs-3.12.2-15.el7rhgs.x86_64

1. Prepare a glusterfs server
2. Mount the glusterfs volume at /mnt on both the migration source and destination hosts
3. Prepare a VM with the following XML:
...
<iothreads>1</iothreads>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/mnt/A.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0d' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver error_policy='report'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
...

4. Use iptables to block the gluster connection:
# iptables -I INPUT -s $GLUSTER_SERVER -j DROP

5. Try to migrate the VM
According to the following vdsm log entry, I used --live --auto-converge to migrate the VM:
migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false'

# virsh migrate fuse qemu+ssh://XX.XX.XX.XX/system --verbose --live --auto-converge
error: internal error: qemu unexpectedly closed the monitor: 2018-07-31T05:04:03.953420Z qemu-kvm: load of migration failed: Input/output error

Checking the qemu logs, no 'crash' is found.
The qemu log of dst:
2018-07-31T05:04:03.621566Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
2018-07-31T05:04:03.953420Z qemu-kvm: load of migration failed: Input/output error
2018-07-31 05:04:03.974+0000: shutting down, reason=failed


The qemu log of src:
2018-07-31 05:04:05.246+0000: initiating migration

Comment 4 Michal Skrivanek 2018-08-03 11:28:50 UTC
maybe you can try to block it only in the middle of migration?

Comment 5 CongLi 2018-08-07 11:32:17 UTC
Tested this bug on:
qemu-kvm-rhev-2.12.0-9.el7.x86_64

Steps:
1. Run a VM associated with a gluster disk:
    -drive id=system_disk,if=none,snapshot=off,aio=threads,cache=writeback,format=qcow2,file=/mnt/gluster/rhel76-64-virtio-scsi.qcow2 \
    -device virtio-blk-pci,id=system,drive=system_disk,bus=pci.0 \

2. Block the gluster storage:
    iptables -I INPUT -s $gluster_server_ip_addr -j DROP

3. Migrate the VM from src to dst, and check the migration status on the src host

4. Quit the src qemu.

Results:
1. After step 3, migration failed due to a connection timeout.
(qemu) qemu-kvm: Failed to connect socket: Connection timed out
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: failed (Failed to connect socket: Connection timed out)
total time: 0 milliseconds

2. After step 4 (quitting qemu on src), qemu core dumps.
(qemu) q
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:680:
qemu-kvm: Failed to lock byte 100
qemu.sh: line 21:  2893 Aborted                 (core dumped) /usr/libexec/qemu-kvm -S -name 'avocado-vt-vm1' -sandbox off -machine pc -nodefaults -vga std -device virtio-net-pci,mac=9a:50:51:52:53:54,id=idQaGkcH,vectors=4,netdev=id1nw46d,bus=pci.0,addr=0x5 -netdev tap,id=id1nw46d,vhost=on -m 8192 -smp 8,cores=4,threads=1,sockets=2 -cpu host,+kvm_pv_unhalt -vnc :0 -rtc base=utc,clock=host,driftfix=slew -boot menu=off,strict=off,order=cdn,once=c -enable-kvm -monitor stdio -qmp tcp:localhost:4444,server,nowait -drive id=system_disk,if=none,snapshot=off,aio=threads,cache=writeback,format=qcow2,file=/mnt/gluster/rhel76-64-virtio-scsi.qcow2 -device virtio-blk-pci,id=system,drive=system_disk,bus=pci.0


Dump info:
(gdb) bt
#0  0x00007f92ab708207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f92ab7098f8 in __GI_abort () at abort.c:90
#2  0x00005653b6739b44 in error_handle_fatal (errp=<optimized out>, err=0x5653b93fb440) at util/error.c:38
#3  0x00005653b6739c1d in error_setv (errp=0x5653b73bacc0 <error_abort>, src=0x5653b68b0d8f "block/file-posix.c", line=680, func=0x5653b68b1730 <__func__.26520> "raw_apply_lock_bytes", err_class=
    ERROR_CLASS_GENERIC_ERROR, fmt=<optimized out>, ap=0x7ffe66b7fdb0, suffix=0x0) at util/error.c:71
#4  0x00005653b6739dae in error_setg_internal (errp=errp@entry=0x5653b73bacc0 <error_abort>, src=src@entry=0x5653b68b0d8f "block/file-posix.c", line=<optimized out>, func=<optimized out>, fmt=<optimized out>)
    at util/error.c:95
#5  0x00005653b66a449d in raw_apply_lock_bytes (fd=19, perm_lock_bits=11, shared_perm_lock_bits=18446744073709551594, unlock=unlock@entry=false, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block/file-posix.c:686
#6  0x00005653b66a461b in raw_handle_perm_lock (bs=<optimized out>, op=
    RAW_PL_PREPARE, new_perm=11, new_shared=21, errp=0x5653b73bacc0 <error_abort>) at block/file-posix.c:775
#7  0x00005653b665b82f in bdrv_check_perm (bs=bs@entry=0x5653b9291400, q=0x0, 
    q@entry=0x28c6efc60b1c3800, cumulative_perms=11, cumulative_shared_perms=21, ignore_children=ignore_children@entry=0x5653b9260320 = {...}, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block.c:1697
#8  0x00005653b665b68b in bdrv_check_update_perm (bs=0x5653b9291400, q=0x28c6efc60b1c3800, 
    q@entry=0x0, new_used_perm=new_used_perm@entry=11, new_shared_perm=new_shared_perm@entry=21, ignore_children=ignore_children@entry=0x5653b9260320 = {...}, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block.c:1883
#9  0x00005653b665b91c in bdrv_check_perm (errp=0x5653b73bacc0 <error_abort>, ignore_children=0x5653b9260320 = {...}, shared=21, perm=11, q=0x0, c=0x5653b9187400) at block.c:1896
#10 0x00005653b665b91c in bdrv_check_perm (bs=bs@entry=0x5653b928e000, q=q@entry=0x0, cumulative_perms=cumulative_perms@entry=0, cumulative_shared_perms=cumulative_shared_perms@entry=31, ignore_children=ignore_children@entry=0x0, errp=0x5653b73bacc0 <error_abort>) at block.c:1713
#11 0x00005653b665bd5b in bdrv_replace_child (child=child@entry=0x5653b9187360, new_bs=new_bs@entry=0x0)
    at block.c:2079
#12 0x00005653b665bdf7 in bdrv_detach_child (child=0x5653b9187360) at block.c:2153
#13 0x00005653b665f249 in bdrv_root_unref_child (child=<optimized out>) at block.c:2164
#14 0x00005653b66a0c01 in blk_remove_bs (blk=blk@entry=0x5653b928c000) at block/block-backend.c:776
#15 0x00005653b66a0c5b in blk_remove_all_bs () at block/block-backend.c:474
#16 0x00005653b665c4ff in bdrv_close_all () at block.c:3380
#17 0x00005653b63dae6b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4776
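
The backtrace shows the mechanism behind the abort: the shutdown path (bdrv_close_all() on quit) hands QEMU's error_abort sentinel down to raw_apply_lock_bytes(), so the lock failure on the dead gluster mount is treated as a programming error rather than propagated. A simplified, self-contained sketch of that pattern (not QEMU's actual code in util/error.c):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Error Error;

/* Global sentinel: passing &error_abort as errp means "this call must
 * not fail" (mirrors the errp=<error_abort> seen in frames #2-#5). */
Error *error_abort;

static void error_setg(Error **errp, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);

    if (errp == &error_abort) {
        abort();    /* SIGABRT -> "Aborted (core dumped)", as seen above */
    }
    /* A real implementation would build an Error object in *errp here. */
}

int main(void)
{
    /* The cleanup chain in the backtrace reaches raw_apply_lock_bytes()
     * with errp = &error_abort, so the lock failure aborts the process. */
    error_setg(&error_abort, "Failed to lock byte %d", 100);
    return 0;
}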

Comment 8 Han Han 2018-08-13 07:30:40 UTC
(In reply to Michal Skrivanek from comment #4)
> maybe you can try to block it only in the middle of migration?

Since CongLi reproduced this on qemu-kvm-rhev in comment 5, I am cancelling the needinfo.

Comment 9 Jeff Cody 2018-08-15 17:43:08 UTC
This appears to be a dupe of BZ #1600599.

*** This bug has been marked as a duplicate of bug 1600599 ***