Bug 1603104

Summary: Qemu Aborted (core dumped) for 'qemu-kvm: Failed to lock byte 100' when remote NFS or GlusterFS volume stopped during the block mirror (or block commit/stream) process
Product: Red Hat Enterprise Linux 7 Reporter: Gu Nini <ngu>
Component: qemu-kvm-rhev Assignee: Hanna Czenczek <hreitz>
Status: CLOSED ERRATA QA Contact: aihua liang <aliang>
Severity: high Docs Contact:
Priority: high    
Version: 7.6CC: areis, chayang, coli, jraju, juzhang, kwolf, nchandek, ngu, qzhang, rhodain, smykhail, victorlee0321, virt-maint, xianwang, yafu, yhong
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.12.0-27.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1981328 (view as bug list) Environment:
Last Closed: 2019-08-22 09:18:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1649160, 1981328    
Attachments:
Description Flags
gdb info for drive mirror on 07192018
none
gdb debug info_all_threads-with_fzheng201807201416_qemu_branch-07232018 none

Description Gu Nini 2018-07-19 07:40:36 UTC
Created attachment 1459915 [details]
gdb info for drive mirror on 07192018

Description of problem:

Mount a GlusterFS volume on the host, start a guest from local storage with a data disk (also on local storage), then run block mirror to copy the data disk onto the GlusterFS mount. If the GlusterFS volume is stopped on the server while the mirror job is running, the qemu process core dumps.

Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-918.el7.x86_64
Qemu-kvm-rhev: qemu-kvm-rhev-2.12.0-7.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Mount a glusterfs storage:
# mount -t glusterfs xx.xx.xx.xx:/gv1 /home/kvm_autotest_root/images

2. Start a guest on localfs with a data disk, drive_image2
3. In the guest, mkfs/mount the data disk and write some data
4. Do block mirror for drive_image2 to the glusterfs storage:
{ "execute": "drive-mirror", "arguments": { "device": "drive_image2", "target": "/home/kvm_autotest_root/images/mirror2", "format": "qcow2", "mode": "absolute-paths", "sync": "full","speed": 100}}

5. Stop the glusterfs volume on the glusterfs server:
# gluster volume stop gv1


Actual results:
The qemu-kvm process core dumped:
# ./vm22.sh rhel76.qcow2  hd1
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) 
(qemu) Formatting '/home/kvm_autotest_root/images/mirror2', fmt=qcow2 size=1073741824 cluster_size=65536 lazy_refcounts=off refcount_bits=16
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:649:
qemu-kvm: Failed to lock byte 100
./vm22.sh: line 29:  1193 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' ......


Expected results:
No core dump

Additional info:
block-commit and block-stream have the same problem
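
For context: QEMU's raw file driver (block/file-posix.c) guards images with fcntl() OFD byte-range locks, and "byte 100" is the first of the byte offsets it uses for permission locks. When the GlusterFS (or NFS) backend disappears, that fcntl() call itself starts failing. Below is a minimal standalone sketch of the failing operation -- illustrative only, not QEMU source; the file name lockbyte.c and the error wording are mine:

/* lockbyte.c -- try to take an OFD read lock on byte 100 of a file, the same
 * kind of lock QEMU's image locking uses on Linux.  On a healthy local
 * filesystem this succeeds; on an NFS/GlusterFS mount whose server has gone
 * away, the fcntl() call fails instead. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start  = 100,          /* the byte offset named in the error above */
        .l_len    = 1,
    };

    if (fcntl(fd, F_OFD_SETLK, &fl) == -1) {
        perror("failed to lock byte 100");
        close(fd);
        return 1;
    }

    printf("lock on byte 100 acquired\n");
    close(fd);
    return 0;
}

In the crash above, the equivalent call in raw_apply_lock_bytes()/raw_check_lock_bytes() fails on a code path that treats any error as fatal, which is why qemu-kvm aborts instead of reporting an ordinary I/O error.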

Comment 2 Fam Zheng 2018-07-20 06:29:35 UTC
Could you test this scratch build?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17262939

It has this patch from upstream and resolves the problem in BZ 1584982:

https://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03750.html

Comment 3 Gu Nini 2018-07-23 06:52:17 UTC
Created attachment 1469886 [details]
gdb debug info_all_threads-with_fzheng201807201416_qemu_branch-07232018

(In reply to Fam Zheng from comment #2)
> Could you test this scratch build?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17262939
> 
> It has this patch from upstream and resolves the problem in BZ 1584982:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03750.html

I have run the test with the scratch build and can still reproduce the bug:

# rpm -qa|grep qemu
qemu-kvm-tools-rhev-2.12.0-7.el7.fzheng201807201416.x86_64
qemu-img-rhev-2.12.0-7.el7.fzheng201807201416.x86_64
qemu-kvm-rhev-2.12.0-7.el7.fzheng201807201416.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-rhev-debuginfo-2.12.0-7.el7.fzheng201807201416.x86_64
qemu-kvm-common-rhev-2.12.0-7.el7.fzheng201807201416.x86_64

# ./vm22.sh rhel76.qcow2 hd1
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) Formatting '/mnt/mirror2', fmt=qcow2 size=10737418240 cluster_size=65536 lazy_refcounts=off refcount_bits=16

(qemu) 
(qemu) Unexpected error in raw_check_lock_bytes() at block/file-posix.c:702:
qemu-kvm: Failed to get "consistent read" lock
./vm22.sh: line 29: 10849 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' ......

Comment 4 Fam Zheng 2018-07-23 07:29:45 UTC
Indeed, there's one more place in the code that could be "optimized" away similarly to avoid this error. I'll write another version of the patch.
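
For illustration, the idea appears to be that a permission update which does not change the set of locked bytes should not issue any fcntl() at all, so an unreachable image cannot turn a no-op update into a fatal error. A rough self-contained sketch of that logic follows; the names PERM_BASE, PERM_BITS, set_byte_lock and apply_lock_bytes are made up for the sketch and are not the real QEMU identifiers:

/* Illustrative only -- not the merged QEMU change.  Skip the fcntl() when a
 * permission byte's lock state does not change; only real transitions can
 * fail, so no-op updates survive a dead storage backend. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

enum { PERM_BASE = 100, PERM_BITS = 5 };   /* made-up values for the sketch */

static int set_byte_lock(int fd, off_t byte, bool locked)
{
    struct flock fl = {
        .l_type   = locked ? F_RDLCK : F_UNLCK,
        .l_whence = SEEK_SET,
        .l_start  = byte,
        .l_len    = 1,
    };
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* old_perm: bits whose bytes are currently locked; new_perm: bits wanted. */
int apply_lock_bytes(int fd, uint64_t old_perm, uint64_t new_perm)
{
    for (int bit = 0; bit < PERM_BITS; bit++) {
        bool held = old_perm & (1ULL << bit);
        bool want = new_perm & (1ULL << bit);

        if (held == want) {
            continue;   /* nothing changes: no fcntl(), no chance to fail */
        }
        if (set_byte_lock(fd, PERM_BASE + bit, want) < 0) {
            return -1;
        }
    }
    return 0;
}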

Comment 5 Yongxue Hong 2018-08-01 01:38:22 UTC
The problem also reproduces with an NFS server.

===============================================================
Description of problem:
Qemu core dumps on a forced quit after the NFS server is stopped.

Version-Release number of selected component (if applicable):
Host: 3.10.0-928.el7.x86_64
Guest: 3.10.0-928.el7.x86_64
Qemu: qemu-kvm-rhev-2.12.0-8.el7

How reproducible:
100%

Steps to Reproduce:
1. Set up and configure an NFS server on localhost.
e.g:
[root@dhcp-8-111 ~]# vim /etc/exports
/var/tmp/mnt_dir/ *(rw,sync,no_root_squash)

[root@dhcp-8-111 ~]# mkdir -p /var/tmp/mnt_dir/

[root@dhcp-8-111 ~]# qemu-img create -f qcow2 /var/tmp/mnt_dir/nfs_corrupt.qcow2 10G
Formatting '/var/tmp/mnt_dir/nfs_corrupt.qcow2', fmt=qcow2 size=10737418240 cluster_size=65536 lazy_refcounts=off refcount_bits=16

[root@dhcp-8-111 ~]# systemctl restart nfs.service

2. Mount the NFS export on localhost.
e.g:
[root@dhcp-8-111 ~]# mount -o rw,soft,timeo=30,retrans=1,vers=3 10.66.8.111:/var/tmp/mnt_dir/ /var/tmp/mnt_dir/

3. Boot a guest with the above NFS image file.
e.g:
[root@dhcp-8-111 cmds]# cat guest.sh
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -device pvpanic,ioport=0x505,id=idLNrmW5  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20180727-191800-6KpwjpqB,server,nowait \
    -device isa-serial,chardev=serial_id_serial0  \
    -chardev socket,id=seabioslog_id_20180727-191800-6KpwjpqB,path=/var/tmp/seabios-20180727-191800-6KpwjpqB,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20180727-191800-6KpwjpqB,iobase=0x402 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
    -drive id=drive_stg,if=none,werror=stop,snapshot=off,aio=threads,cache=none,format=qcow2,file=/var/tmp/mnt_dir/nfs_corrupt.qcow2 \
    -device virtio-blk-pci,id=stg,drive=drive_stg,bus=pci.0,addr=0xa \
    -device virtio-net-pci,mac=9a:62:63:64:65:66,id=idKCHf6S,vectors=4,netdev=idcsqaeC,bus=pci.0,addr=0x5 \
    -netdev tap,id=idcsqaeC,vhost=on \
    -m 4096  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2  \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -monitor stdio \

4. Check the NFS image file info in the guest.
e.g:
[root@dhcp-8-137 ~]# lsblk
NAME                       MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda                        252:0    0  20G  0 disk
├─vda1                     252:1    0   1G  0 part /boot
└─vda2                     252:2    0  19G  0 part
  ├─rhel_dhcp--8--137-root 253:0    0  17G  0 lvm  /
  └─rhel_dhcp--8--137-swap 253:1    0   2G  0 lvm  [SWAP]
vdb                        252:16   0  10G  0 disk

5. Stop the nfs server.
e.g:
[root@dhcp-8-111 ~]# systemctl stop nfs.service

6. Format this device.
e.g:
[root@dhcp-8-137 ~]# mkfs.xfs /dev/vdb
meta-data=/dev/vdb               isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

...

The guest hangs at this point because the NFS server has been stopped; since the data drive was started with werror=stop, the VM pauses on the I/O error (see step 7).

7. Check the QEMU VM status via HMP.
e.g:
(qemu) info status
VM status: paused (io-error)
(qemu) info status
VM status: paused (io-error)

8. Force quit QEMU.
e.g:
(qemu) info status
VM status: paused (io-error)
(qemu) q
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:680:
qemu-kvm: Failed to lock byte 100
guest.sh: line 27:  2641 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox off -machine pc -nodefaults -device VGA,bus=pci.0,addr=0x2 -device pvpanic,ioport=0x505,id=idLNrmW5 -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20180727-191800-6KpwjpqB,server,nowait -device isa-serial,chardev=serial_id_serial0 -chardev socket,id=seabioslog_id_20180727-191800-6KpwjpqB,path=/var/tmp/seabios-20180727-191800-6KpwjpqB,server,nowait -device isa-debugcon,chardev=seabioslog_id_20180727-191800-6KpwjpqB,iobase=0x402 -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2 -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 -drive id=drive_stg,if=none,werror=stop,snapshot=off,aio=threads,cache=none,format=qcow2,file=/var/tmp/mnt_dir/nfs_corrupt.qcow2 -device virtio-blk-pci,id=stg,drive=drive_stg,bus=pci.0,addr=0xa -device virtio-net-pci,mac=9a:62:63:64:65:66,id=idKCHf6S,vectors=4,netdev=idcsqaeC,bus=pci.0,addr=0x5 -netdev tap,id=idcsqaeC,vhost=on -m 4096 -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :0 -rtc base=utc,clock=host,driftfix=slew -boot menu=off,strict=off,order=cdn,once=c -enable-kvm -monitor stdio
[root@dhcp-8-111 cmds]#


Actual results:
Qemu core dumped.

Expected results:
Qemu should quit normally.

Additional info:
gdb info:

Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:680:
qemu-kvm: Failed to lock byte 100

Program received signal SIGABRT, Aborted.
0x00007fffdef5c207 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffdef5c207 in raise () at /lib64/libc.so.6
#1  0x00007fffdef5d8f8 in abort () at /lib64/libc.so.6
#2  0x0000555555ba2814 in error_handle_fatal ()
#3  0x0000555555ba28ed in error_setv ()
#4  0x0000555555ba2a7e in error_setg_internal ()
#5  0x0000555555b0d16d in raw_apply_lock_bytes ()
#6  0x0000555555b0d2eb in raw_handle_perm_lock ()
#7  0x0000555555ac44ff in bdrv_check_perm ()
#8  0x0000555555ac45ec in bdrv_check_perm ()
#9  0x0000555555ac4a2b in bdrv_replace_child ()
#10 0x0000555555ac4ac7 in bdrv_detach_child ()
#11 0x0000555555ac7f19 in bdrv_root_unref_child ()
#12 0x0000555555b098d1 in blk_remove_bs ()
#13 0x0000555555b0992b in blk_remove_all_bs ()
#14 0x0000555555ac51cf in bdrv_close_all ()
#15 0x0000555555840b0b in main () 
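
For reference, the abort comes from QEMU's Error API: frames #2-#5 show raw_apply_lock_bytes() reporting its failure through an error pointer that requests abort-on-error (the detach/shutdown path evidently assumes that dropping permissions cannot fail), so error_handle_fatal() calls abort(). A tiny standalone mimic of that pattern, not QEMU's actual implementation:

/* mimic.c -- sketch of the &error_abort convention seen in the backtrace:
 * reporting an error through that special pointer aborts the process. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Error { const char *msg; } Error;

Error *error_abort;   /* its address, &error_abort, is the "fatal" sentinel */

static void error_setg(Error **errp, const char *msg)
{
    if (errp == &error_abort) {            /* like error_handle_fatal() */
        fprintf(stderr, "Unexpected error: %s\n", msg);
        abort();                           /* SIGABRT -> core dump */
    }
    if (errp) {
        Error *err = malloc(sizeof(*err));
        err->msg = msg;
        *errp = err;
    }
}

static void lock_bytes(Error **errp)
{
    /* pretend the fcntl() on the dead NFS mount failed */
    error_setg(errp, "Failed to lock byte 100");
}

int main(void)
{
    Error *local_err = NULL;
    lock_bytes(&local_err);                /* recoverable: error is returned */
    fprintf(stderr, "got error: %s\n", local_err->msg);

    lock_bytes(&error_abort);              /* fatal: aborts, like the bug */
    return 0;                              /* never reached */
}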
===============================================================
Thanks.

Comment 9 Hanna Czenczek 2019-04-03 17:49:38 UTC
The main fix should probably have been backported to qemu-kvm-rhev-2.12.0-23.el7 for BZ 1551486.  There is another similar crash, though, see the reproducer (2) in BZ 1694148 comment 2 (for which 696aaaed579ac5bf5fa336216909b46d3d8f07a8 needs to be backported).

Max

Comment 11 Miroslav Rezanina 2019-04-25 03:51:46 UTC
Fix included in qemu-kvm-rhev-2.12.0-27.el7

Comment 13 aihua liang 2019-04-28 12:05:35 UTC
Verified on qemu-kvm-rhev-2.12.0-27.el7.x86_64: the issue is fixed, so the bug status is set to "Verified".

Test steps:
 1.Mount gluster volume
     mount.glusterfs intel-5405-32-2.englab.nay.redhat.com:/aliang /mnt/aliang

 2.Start guest with cmds:
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20190423-215834-BzwOjODj,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20190423-215834-BzwOjODj,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idapUGH0  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20190423-215834-BzwOjODj,server,nowait \
    -device isa-serial,chardev=serial_id_serial0  \
    -chardev socket,id=seabioslog_id_20190423-215834-BzwOjODj,path=/var/tmp/seabios-20190423-215834-BzwOjODj,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190423-215834-BzwOjODj,iobase=0x402 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel77-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4,iothread=iothread0 \
    -drive id=drive_data1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/test.img \
    -device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,addr=0x6,iothread=iothread0 \
    -device virtio-net-pci,mac=9a:84:85:86:87:88,id=idc38p8G,vectors=4,netdev=idFM5N3v,bus=pci.0,addr=0x5  \
    -netdev tap,id=idFM5N3v,vhost=on \
    -m 2048  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2  \
    -cpu 'Westmere',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \

 3.Format the data disk and write 900M of data to it
    (guest)#mkfs.ext4 /dev/vdb
           #dd if=/dev/urandom of=/dev/vdb bs=1M count=900 oflag=direct

 4.Do mirror to target
    { "execute": "drive-mirror", "arguments": { "device": "drive_data1", "target": "/mnt/aliang/mirror", "format": "qcow2", "mode": "absolute-paths", "sync": "full","speed": 100}}
    {"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive_data1", "auto-dismiss": true, "busy": false, "len": 34668544, "offset": 720896, "status": "running", "paused": false, "speed": 100, "ready": false, "type": "mirror"}]}

 5.Stop target volume and check block job status:
    (gluster server)# gluster volume stop aliang
    {"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive_data1", "auto-dismiss": true, "busy": false, "len": 34668544, "offset": 720896, "status": "running", "paused": false, "speed": 100, "ready": false, "type": "mirror"}]}
 
 6.Start volume
    (gluster server)# gluster volume start aliang

 7.Wait a few minutes, then check block job status
    {"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive_data1", "auto-dismiss": true, "busy": false, "len": 34668544, "offset": 720896, "status": "running", "paused": false, "speed": 100, "ready": false, "type": "mirror"}]}

 8.Set block job speed to unlimited
     { "execute": "block-job-set-speed", "arguments": { "device": "drive_data1","speed":0}}
{"return": {}}
{"timestamp": {"seconds": 1556452597, "microseconds": 599458}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "drive_data1"}}
{"timestamp": {"seconds": 1556452597, "microseconds": 599580}, "event": "BLOCK_JOB_READY", "data": {"device": "drive_data1", "len": 34668544, "offset": 34668544, "speed": 0, "type": "mirror"}}

 9.Complete mirror job
    { "execute": "block-job-complete", "arguments": { "device": "drive_data1"}}
{"return": {}}
{"timestamp": {"seconds": 1556452635, "microseconds": 827927}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "drive_data1"}}
{"timestamp": {"seconds": 1556452635, "microseconds": 827973}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "drive_data1"}}
{"timestamp": {"seconds": 1556452635, "microseconds": 861664}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_data1", "len": 34668544, "offset": 34668544, "speed": 0, "type": "mirror"}}
{"timestamp": {"seconds": 1556452635, "microseconds": 861748}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_data1"}}
{"timestamp": {"seconds": 1556452635, "microseconds": 861800}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_data1"}}

Comment 20 errata-xmlrpc 2019-08-22 09:18:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2553

Comment 21 victorlee0321 2021-04-25 06:09:31 UTC
I worked around it by mounting with: mount -t nfs -o nolock nfs_address mount_point