Bug 1609561 - qemu-kvm: Failed to lock byte and crashes while migration
Summary: qemu-kvm: Failed to lock byte and crashes while migration
Keywords:
Status: CLOSED DUPLICATE of bug 1600599
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jeff Cody
QA Contact: CongLi
URL:
Whiteboard:
Depends On:
Blocks: 1481022
 
Reported: 2018-07-29 12:48 UTC by Polina
Modified: 2021-12-10 16:48 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-15 17:43:08 UTC
Target Upstream Version:
Embargoed:


Attachments
libvirt_qemu_vdsm_engine_logs.tar.gz (13.09 MB, application/x-gzip)
2018-07-29 12:48 UTC, Polina

Description Polina 2018-07-29 12:48:55 UTC
Created attachment 1471356 [details]
libvirt_qemu_vdsm_engine_logs.tar.gz

Description of problem:


Version-Release number of selected component (if applicable):
rhv-release-4.2.5-5-001.noarch
libvirt-3.9.0-14.el7_5.6.x86_64
vdsm-4.20.35-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

How reproducible: 100%

Steps to Reproduce:
1. Run a VM associated with a gluster disk.
2. Block the gluster storage (three volumes), e.g.:
    iptables -I INPUT -s gluster01.scl.lab.tlv.redhat.com -j DROP
    iptables -I INPUT -s gluster02.scl.lab.tlv.redhat.com -j DROP
    iptables -I INPUT -s gluster03.scl.lab.tlv.redhat.com -j DROP
3. After a while, the host becomes Non-Operational. At this stage, the engine tries to migrate the VM.

Actual results: the migration process crashes (see the attached qemu.log).

Expected results: the VM should be reported as paused due to an I/O error while the storage is blocked.


Additional info:

Comment 2 Peter Krempa 2018-07-30 08:18:20 UTC
The VM has only one disk with a backing file (the second one is an empty cdrom):

 -drive if=none,id=drive-ua-18beca16-e731-4caa-b05d-f6f758930198,readonly=on,werror=report,rerror=report
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-18beca16-e731-4caa-b05d-f6f758930198,id=ua-18beca16-e731-4caa-b05d-f6f758930198
 -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_compute-ge-he-4-volume1/e0dc227a-3fc9-4454-adbd-e6d502167951/images/bb538f7d-8a8d-4aff-9086-b14366d13050/95b8ad11-5204-4057-ab5c-ccc0a51c5428,format=qcow2,if=none,id=drive-ua-bb538f7d-8a8d-4aff-9086-b14366d13050,serial=bb538f7d-8a8d-4aff-9086-b14366d13050,cache=none,werror=stop,rerror=stop,aio=threads
 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-bb538f7d-8a8d-4aff-9086-b14366d13050,id=ua-bb538f7d-8a8d-4aff-9086-b14366d13050,bootindex=1

As the file uses a local path, it seems that the host is accessing the gluster volume via the glusterfs FUSE driver.

QEMU then logs:

2018-07-29 11:25:44.598+0000: initiating migration
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:642:
2018-07-29T11:25:47.638863Z qemu-kvm: Failed to lock byte 100

and aborts. Libvirt then treats the VM as crashed.

2018-07-29 11:25:47.887+0000: shutting down, reason=crashed

Reassigning to qemu.
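
For illustration, here is a minimal standalone C sketch of the kind of per-byte advisory locking that raw_apply_lock_bytes() in block/file-posix.c performs. This is a simplified assumption-based sketch, not the actual QEMU code: the helper name try_lock_byte() is hypothetical, and plain POSIX locks are used for portability although QEMU prefers OFD locks where available. The byte offset 100 only mirrors the "Failed to lock byte 100" message in the log above.

/*
 * Sketch: take a shared advisory lock on a single byte of an image file,
 * the operation that fails in the log above when the gluster-backed FUSE
 * mount becomes unreachable.  Not QEMU code; try_lock_byte() is made up.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Place a shared (read) lock on one byte of the file. */
static int try_lock_byte(int fd, off_t byte)
{
    struct flock fl = {
        .l_type   = F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start  = byte,
        .l_len    = 1,
    };
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        return -errno;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    int ret = try_lock_byte(fd, 100);
    if (ret < 0) {
        /* On a blocked FUSE mount this is the failure QEMU reports as
         * "Failed to lock byte 100". */
        fprintf(stderr, "Failed to lock byte 100: %s\n", strerror(-ret));
        return 1;
    }
    printf("lock acquired\n");
    return 0;
}

Run against a file on a mount whose server has been blocked, the fcntl() call fails (or blocks), which is consistent with the error QEMU logs before aborting.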

Comment 3 Han Han 2018-07-31 05:06:47 UTC
Not reproduced in libvirt.
Version: libvirt-3.9.0-14.el7_5.6.x86_64, qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64, glusterfs-3.12.2-15.el7rhgs.x86_64

1. Prepare a glusterfs server
2. Mount the glusterfs volume at /mnt on both the migration src and dst hosts
3. Prepare a VM with the following XML:
...
<iothreads>1</iothreads>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/mnt/A.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0d' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver error_policy='report'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
...

4. Use iptables to block the gluster connection:
# iptables -I INPUT -s $GLUSTER_SERVER -j DROP

5. Try to migrate the VM
According to the following vdsm log entry, I used --live and --auto-converge to migrate the VM:
migrationMethod='ONLINE', tunnelMigration='false', migrationDowntime='0', autoConverge='true', migrateCompressed='false'

# virsh migrate fuse qemu+ssh://XX.XX.XX.XX/system --verbose --live --auto-converge
error: internal error: qemu unexpectedly closed the monitor: 2018-07-31T05:04:03.953420Z qemu-kvm: load of migration failed: Input/output error

Checked the qemu logs; no 'crash' found.
The qemu log of dst:
2018-07-31T05:04:03.621566Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
2018-07-31T05:04:03.953420Z qemu-kvm: load of migration failed: Input/output error
2018-07-31 05:04:03.974+0000: shutting down, reason=failed


The qemu log of src:
2018-07-31 05:04:05.246+0000: initiating migration

Comment 4 Michal Skrivanek 2018-08-03 11:28:50 UTC
Maybe you can try to block it only in the middle of the migration?

Comment 5 CongLi 2018-08-07 11:32:17 UTC
Tested this bug on:
qemu-kvm-rhev-2.12.0-9.el7.x86_64

Steps:
1. Run a VM associated with a gluster disk:
    -drive id=system_disk,if=none,snapshot=off,aio=threads,cache=writeback,format=qcow2,file=/mnt/gluster/rhel76-64-virtio-scsi.qcow2 \
    -device virtio-blk-pci,id=system,drive=system_disk,bus=pci.0 \

2. Block the gluster storage:
    iptables -I INPUT -s $gluster_server_ip_addr -j DROP

3. Migrate the VM from src to dst, and check the migration status on the src host.

4. Quit the src qemu.

Results:
1. After step 3, migration failed due to a connection timeout.
(qemu) qemu-kvm: Failed to connect socket: Connection timed out
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: failed (Failed to connect socket: Connection timed out)
total time: 0 milliseconds

2. After step 4, quitting qemu on the src results in a qemu core dump.
(qemu) q
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:680:
qemu-kvm: Failed to lock byte 100
qemu.sh: line 21:  2893 Aborted                 (core dumped) /usr/libexec/qemu-kvm -S -name 'avocado-vt-vm1' -sandbox off -machine pc -nodefaults -vga std -device virtio-net-pci,mac=9a:50:51:52:53:54,id=idQaGkcH,vectors=4,netdev=id1nw46d,bus=pci.0,addr=0x5 -netdev tap,id=id1nw46d,vhost=on -m 8192 -smp 8,cores=4,threads=1,sockets=2 -cpu host,+kvm_pv_unhalt -vnc :0 -rtc base=utc,clock=host,driftfix=slew -boot menu=off,strict=off,order=cdn,once=c -enable-kvm -monitor stdio -qmp tcp:localhost:4444,server,nowait -drive id=system_disk,if=none,snapshot=off,aio=threads,cache=writeback,format=qcow2,file=/mnt/gluster/rhel76-64-virtio-scsi.qcow2 -device virtio-blk-pci,id=system,drive=system_disk,bus=pci.0


Dump info:
(gdb) bt
#0  0x00007f92ab708207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f92ab7098f8 in __GI_abort () at abort.c:90
#2  0x00005653b6739b44 in error_handle_fatal (errp=<optimized out>, err=0x5653b93fb440) at util/error.c:38
#3  0x00005653b6739c1d in error_setv (errp=0x5653b73bacc0 <error_abort>, src=0x5653b68b0d8f "block/file-posix.c", line=680, func=0x5653b68b1730 <__func__.26520> "raw_apply_lock_bytes", err_class=
    ERROR_CLASS_GENERIC_ERROR, fmt=<optimized out>, ap=0x7ffe66b7fdb0, suffix=0x0) at util/error.c:71
#4  0x00005653b6739dae in error_setg_internal (errp=errp@entry=0x5653b73bacc0 <error_abort>, src=src@entry=0x5653b68b0d8f "block/file-posix.c", line=<optimized out>, func=<optimized out>, fmt=<optimized out>)
    at util/error.c:95
#5  0x00005653b66a449d in raw_apply_lock_bytes (fd=19, perm_lock_bits=11, shared_perm_lock_bits=18446744073709551594, unlock=unlock@entry=false, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block/file-posix.c:686
#6  0x00005653b66a461b in raw_handle_perm_lock (bs=<optimized out>, op=
    RAW_PL_PREPARE, new_perm=11, new_shared=21, errp=0x5653b73bacc0 <error_abort>) at block/file-posix.c:775
#7  0x00005653b665b82f in bdrv_check_perm (bs=bs@entry=0x5653b9291400, q=0x0, 
    q@entry=0x28c6efc60b1c3800, cumulative_perms=11, cumulative_shared_perms=21, ignore_children=ignore_children@entry=0x5653b9260320 = {...}, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block.c:1697
#8  0x00005653b665b68b in bdrv_check_update_perm (bs=0x5653b9291400, q=0x28c6efc60b1c3800, 
    q@entry=0x0, new_used_perm=new_used_perm@entry=11, new_shared_perm=new_shared_perm@entry=21, ignore_children=ignore_children@entry=0x5653b9260320 = {...}, errp=errp@entry=0x5653b73bacc0 <error_abort>) at block.c:1883
#9  0x00005653b665b91c in bdrv_check_perm (errp=0x5653b73bacc0 <error_abort>, ignore_children=0x5653b9260320 = {...}, shared=21, perm=11, q=0x0, c=0x5653b9187400) at block.c:1896
#10 0x00005653b665b91c in bdrv_check_perm (bs=bs@entry=0x5653b928e000, q=q@entry=0x0, cumulative_perms=cumulative_perms@entry=0, cumulative_shared_perms=cumulative_shared_perms@entry=31, ignore_children=ignore_children@entry=0x0, errp=0x5653b73bacc0 <error_abort>) at block.c:1713
#11 0x00005653b665bd5b in bdrv_replace_child (child=child@entry=0x5653b9187360, new_bs=new_bs@entry=0x0)
    at block.c:2079
#12 0x00005653b665bdf7 in bdrv_detach_child (child=0x5653b9187360) at block.c:2153
#13 0x00005653b665f249 in bdrv_root_unref_child (child=<optimized out>) at block.c:2164
#14 0x00005653b66a0c01 in blk_remove_bs (blk=blk@entry=0x5653b928c000) at block/block-backend.c:776
#15 0x00005653b66a0c5b in blk_remove_all_bs () at block/block-backend.c:474
#16 0x00005653b665c4ff in bdrv_close_all () at block.c:3380
#17 0x00005653b63dae6b in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4776
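
To illustrate why this failure ends in abort() rather than an ordinary error, below is a simplified standalone sketch of the error_abort pattern visible in frames #2 through #5 of the backtrace. This is an assumption-based re-implementation for illustration, not QEMU's util/error.c; apply_lock_bytes() is a made-up stand-in for raw_apply_lock_bytes(). The point it shows: callers on the shutdown path pass &error_abort, so a lock failure that a normal caller could handle instead terminates the process, producing the core dump seen after "quit".

/* Sketch of the &error_abort pattern (not QEMU's util/error.c). */
#include <stdio.h>
#include <stdlib.h>

typedef struct Error { const char *msg; } Error;

/* Sentinel: callers pass &error_abort to say "a failure here is a bug". */
static Error *error_abort;

static void error_setg(Error **errp, const char *msg)
{
    if (errp == &error_abort) {
        /* Corresponds to error_handle_fatal() in frame #2 above. */
        fprintf(stderr, "Unexpected error: %s\n", msg);
        abort();                      /* this is what dumps core */
    }
    if (errp) {
        Error *err = malloc(sizeof(*err));
        err->msg = msg;
        *errp = err;
    }
}

/* Stand-in for raw_apply_lock_bytes(): pretend the byte-range lock on the
 * unreachable gluster-backed file fails, as in the scenario above. */
static int apply_lock_bytes(Error **errp)
{
    error_setg(errp, "Failed to lock byte 100");
    return -1;
}

int main(void)
{
    /* A normal caller passes a local Error pointer and can recover... */
    Error *local_err = NULL;
    if (apply_lock_bytes(&local_err) < 0) {
        fprintf(stderr, "recoverable: %s\n", local_err->msg);
        free(local_err);
    }

    /* ...but the permission-update path reached from bdrv_close_all()
     * (frames #5 through #16) passes &error_abort, so the same failure
     * aborts the process. */
    apply_lock_bytes(&error_abort);
    return 0;
}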

Comment 8 Han Han 2018-08-13 07:30:40 UTC
(In reply to Michal Skrivanek from comment #4)
> Maybe you can try to block it only in the middle of the migration?

Since CongLi reproduced this on qemu-kvm-rhev in comment 5, cancelling the needinfo.

Comment 9 Jeff Cody 2018-08-15 17:43:08 UTC
This appears to be a dupe of BZ #1600599.

*** This bug has been marked as a duplicate of bug 1600599 ***

