Bug 1824042 - QEMU aborts (core dumped) when actively creating/adding disk images during guest boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: ppc64le
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Greg Kurz
QA Contact: Gu Nini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-15 07:19 UTC by Gu Nini
Modified: 2020-11-17 17:48 UTC
CC List: 8 users

Fixed In Version: qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-17 17:48:08 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
gdb_debug_info_ppc64le-04152020 (28.35 KB, text/plain)
2020-04-15 07:19 UTC, Gu Nini
gdb_debug_info_ppc64le-qemu4.1-04212020 (26.90 KB, text/plain)
2020-04-21 04:10 UTC, Gu Nini

Description Gu Nini 2020-04-15 07:19:28 UTC
Created attachment 1678938 [details]
gdb_debug_info_ppc64le-04152020

Description of problem:
Boot a guest, then immediately start creating/adding disk images with blockdev-create/blockdev-add via a script while the guest is still booting. QEMU then aborts with a core dump:

[root@ibm-p8-garrison-05 ngu]# sh vm1.sh
QEMU 4.2.92 monitor - type 'help' for more information
(qemu) qemu-kvm: /builddir/build/BUILD/qemu-5.0.0-rc2/block/block-backend.c:1968: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.
vm1.sh: line 26: 241948 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -m 1024 -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 -cpu 'host' -chardev socket,id=qmp_id_qmpmonitor1,server,nowait,path=/var/tmp/avocado_1 -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=chardev_serial0,server,nowait,path=/var/tmp/avocado_2 -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel820-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on -device virtio-net-pci,mac=9a:fa:14:5e:3d:13,id=idLviSMj,netdev=idlrY22o,bus=pci.0,addr=0x5 -netdev tap,id=idlrY22o,vhost=on -vnc :0 -rtc base=utc,clock=host -boot menu=off,order=cdn,once=c,strict=off -enable-kvm -monitor stdio


Version-Release number of selected component (if applicable):
Host kernel: kernel-4.18.0-193.6.el8.ppc64le
Qemu: qemu-kvm-5.0.0-0.scrmod+el8.2.0+6253+83a14d38.200408.ppc64le
Guest kernel: kernel-4.18.0-193.el8.ppc64le


How reproducible:
4/5

Steps to Reproduce:
1. Boot up a guest with following cmd:

/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pseries,max-cpu-compat=power8  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 1024  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'host' \
    -chardev socket,id=qmp_id_qmpmonitor1,server,nowait,path=/var/tmp/avocado_1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=chardev_serial0,server,nowait,path=/var/tmp/avocado_2 \
    -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel820-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:fa:14:5e:3d:13,id=idLviSMj,netdev=idlrY22o,bus=pci.0,addr=0x5  \
    -netdev tap,id=idlrY22o,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio

2. Immediately run the following script to create the disk images and then hot-plug them:

#!/bin/bash

for i in {1..4}
do
    date
    echo "####create image sn$i####"
    echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'size': 21474836480}, 'job-id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

    echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

    echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_sn$i', 'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'aio': 'threads'}}" | nc -U /var/tmp/avocado_1

    echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_sn$i', 'size': 21474836480}, 'job-id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

    sleep 1

    echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

    echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_sn$i', 'driver': 'qcow2', 'file': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1
done
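Note that the nc-based loop above opens a fresh QMP connection per command and uses `sleep 1` as its only synchronization with the blockdev-create jobs, which is what makes the reproduction timing-sensitive. For reference, the same sequence can be sketched in Python, with job-dismiss gated on the job's JOB_STATUS_CHANGE event reaching 'concluded' instead of a sleep. This is a sketch only: the helper names `image_commands` and `job_concluded` are illustrative and not part of the reproducer, though the QMP commands and the JOB_STATUS_CHANGE event/'concluded' status are real QMP.

```python
import json

def image_commands(node, filename, size):
    """QMP command sequence for one image, mirroring the reproducer:
    create file -> dismiss -> add -> create qcow2 -> dismiss -> add."""
    return [
        {"execute": "blockdev-create",
         "arguments": {"options": {"driver": "file", "filename": filename,
                                   "size": size},
                       "job-id": f"file_{node}"}},
        {"execute": "job-dismiss", "arguments": {"id": f"file_{node}"}},
        {"execute": "blockdev-add",
         "arguments": {"node-name": f"file_{node}", "driver": "file",
                       "filename": filename, "aio": "threads"}},
        {"execute": "blockdev-create",
         "arguments": {"options": {"driver": "qcow2", "file": f"file_{node}",
                                   "size": size},
                       "job-id": f"drive_{node}"}},
        {"execute": "job-dismiss", "arguments": {"id": f"drive_{node}"}},
        {"execute": "blockdev-add",
         "arguments": {"node-name": f"drive_{node}", "driver": "qcow2",
                       "file": f"file_{node}"}},
    ]

def job_concluded(messages, job_id):
    """Scan a stream of QMP messages; return True once a
    JOB_STATUS_CHANGE event reports job_id as 'concluded' --
    the point at which job-dismiss is safe to send."""
    for msg in messages:
        if (msg.get("event") == "JOB_STATUS_CHANGE"
                and msg.get("data", {}).get("id") == job_id
                and msg["data"].get("status") == "concluded"):
            return True
    return False

if __name__ == "__main__":
    for cmd in image_commands("sn1", "/home/ngu/sn1.qcow2", 21474836480):
        print(json.dumps(cmd))
```

In a real session one would send `qmp_capabilities` once on a persistent socket, then call `job_concluded` on the incoming event stream after each blockdev-create before emitting the matching job-dismiss.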


Actual results:
QEMU core dumps as shown in the description.

Expected results:
The guest boots and runs normally.

Additional info:
Also hit the bug on the latest RHEL-AV 8.2 qemu: qemu-kvm-4.2.0-19.module+el8.2.0+6296+6b821950

Comment 1 Gu Nini 2020-04-15 07:27:30 UTC
Hi Aihua, please try to reproduce the bug on x86; if it reproduces there, we can change the Hardware field to 'All' and you can take the bug. Thanks in advance.

I found the bug when running the following case in avocado:
blockdev_commit_reboot

This is the log:
http://10.0.136.47/ngu/04142020-blockdevcommitboot/results.html

Comment 2 Gu Nini 2020-04-15 10:36:37 UTC
Even when the disk images are created/added successfully, subsequently creating snapshots on them can cause the guest to fail to boot and drop into grub rescue. I hit that problem 16 out of 30 times in the avocado test http://10.0.136.47/ngu/04142020-blockdevcommitboot/results.html; please check 'serial-serial0-avocado-vt-vm1.log'.

2020-04-13 10:01:21: Trying to load:  from: /pci@800000020000000/scsi@4/disk@100000000000000 ...
2020-04-13 10:01:21:   Successfully loaded
2020-04-13 10:01:42: SCSI-DISK: Failed to get disk capacity!
2020-04-13 10:02:02: SCSI-DISK: Failed to get disk capacity!
2020-04-13 10:02:02: error: ../../grub-core/kern/disk.c:258:no such partition.
2020-04-13 10:02:02: Entering rescue mode...
2020-04-13 10:02:02: grub rescue>
2020-04-13 10:05:44: 
2020-04-13 10:05:44: grub rescue>

Comment 3 aihua liang 2020-04-16 02:43:31 UTC
Hi,ngu

 Run blockdev_commit_reboot test by auto with virtio_blk for 120 times, don't hit this issue.

Comment 4 aihua liang 2020-04-16 02:45:07 UTC
(In reply to aihua liang from comment #3)
> Hi,ngu
> 
>  Run blockdev_commit_reboot test by auto with virtio_blk for 120 times,
> don't hit this issue.

Sorry, the wrong info,correct it as:
  Run blockdev_commit_reboot test by auto with virtio_scsi for 120 times on x86_64, don't hit this issue.

Comment 5 Gu Nini 2020-04-16 05:18:48 UTC
(In reply to aihua liang from comment #4)
> (In reply to aihua liang from comment #3)
> > Hi,ngu
> > 
> >  Run blockdev_commit_reboot test by auto with virtio_blk for 120 times,
> > don't hit this issue.
> 
> Sorry, the wrong info,correct it as:
>   Run blockdev_commit_reboot test by auto with virtio_scsi for 120 times on
> x86_64, don't hit this issue.

It's very odd. I'll set the Hardware field to ppc64le temporarily.

Comment 6 David Gibson 2020-04-20 01:43:45 UTC
Is this a regression from the qemu-4.2 based qemu-kvm in RHEL-AV-8.2?

Comment 7 aihua liang 2020-04-20 05:57:37 UTC
Tested on qemu-kvm-4.2.0-19.module+el8.2.0+6296+6b821950 on x86; still did not hit this issue.

Comment 8 Gu Nini 2020-04-21 03:51:34 UTC
(In reply to David Gibson from comment #6)
> Is this a regression from the qemu-4.2 based qemu-kvm in RHEL-AV-8.2?

I could reproduce it on qemu-4.1, although the reproduction rate is much lower: only 1 out of 20 times using the steps in the bug description.

[root@ibm-p9wr-04 ngu]# sh vm1.sh 
QEMU 4.1.0 monitor - type 'help' for more information
(qemu) qemu-kvm: block/block-backend.c:1862: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.
vm1.sh: line 26: 248157 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine pseries,max-cpu-compat=power8 -nodefaults -device VGA,bus=pci.0,addr=0x2 -m 1024 -smp 8,maxcpus=8,cores=4,threads=1,sockets=2 -cpu 'host' -chardev socket,id=qmp_id_qmpmonitor1,server,nowait,path=/var/tmp/avocado_1 -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=chardev_serial0,server,nowait,path=/var/tmp/avocado_2 -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on -device virtio-net-pci,mac=9a:fa:14:5e:3d:13,id=idLviSMj,netdev=idlrY22o,bus=pci.0,addr=0x5 -netdev tap,id=idlrY22o,vhost=on -vnc :0 -rtc base=utc,clock=host -boot menu=off,order=cdn,once=c,strict=off -enable-kvm -monitor stdio
[root@ibm-p9wr-04 ngu]# 

Host kernel: kernel-4.18.0-193.8.el8.ppc64le
Qemu: qemu-kvm-4.1.0-23.module+el8.1.1+6238+f5d69f68.3.ppc64le

Comment 9 Gu Nini 2020-04-21 04:10:37 UTC
Created attachment 1680407 [details]
gdb_debug_info_ppc64le-qemu4.1-04212020

I could reproduce the bug on qemu-4.1 multiple times with a faster script.

Comment 10 David Gibson 2020-04-29 02:55:53 UTC
I'm not sure, but this might be related to a qemu/SLOF version mismatch.  Can you retest with Mirek's test SLOF package from http://batcave.lab.eng.brq.redhat.com/repos/test/SLOF-8.3-5.3/

Comment 11 Gu Nini 2020-04-29 07:54:08 UTC
(In reply to David Gibson from comment #10)
> I'm not sure, but this might be related to a qemu/SLOF version mismatch. 
> Can you retest with Mirek's test SLOF package from
> http://batcave.lab.eng.brq.redhat.com/repos/test/SLOF-8.3-5.3/

I don't think so: when I tested on qemu-kvm-4.2.0-10.module+el8.2.0+5740+c3dff59e.ppc64le and qemu-kvm-4.1.0-23.module+el8.1.1+6238+f5d69f68.3.ppc64le, I also updated SLOF to the corresponding version in the same virt module.

I also tried qemu-kvm-5.0.0-0.scrmod+el8.3.0+6312+cee4f348 with Mirek's SLOF, and the bug could still be reproduced.

Comment 12 David Gibson 2020-05-04 03:07:43 UTC
ngu,  sorry I misread and didn't see that you had also reproduced with qemu-4.2.

From the traces there isn't anything obviously ppc related about this.  Can you reproduce it on x86?

Comment 13 Gu Nini 2020-05-06 01:28:59 UTC
(In reply to David Gibson from comment #12)
> ngu,  sorry I misread and didn't see that you had also reproduced with
> qemu-4.2.
> 
> From the traces there isn't anything obviously ppc related about this.  Can
> you reproduce it on x86?

David, I failed to reproduce it on x86 after several tries; it's really weird. Per comment #7, the x86 feature owner Aihua also failed to reproduce it there, and I once tried on her machine with my local script, which failed to reproduce it as well.

Comment 14 Greg Kurz 2020-07-08 10:44:23 UTC
I could reproduce it with upstream QEMU on POWER. It seems there's a
race between the QMP commands and a blk_drain_all() triggered by the
guest. Maybe x86 doesn't trigger the same race?

I've worked in this area on a similar issue in the past (f45280cbf66d
"block: fix QEMU crash with scsi-hd and drive_del") so I guess I can
have a look.
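The failing invariant is that a BlockBackend's cached AioContext must still match its root node's context when blk_get_aio_context() is called; the race lets the two diverge. A toy model of just that invariant (illustrative Python, not QEMU source; the class and field names are simplified stand-ins):

```python
class AioContext:
    """Stand-in for QEMU's AioContext; only identity matters here."""
    def __init__(self, name):
        self.name = name

class BlockNode:
    """Stand-in for a block node: holds the authoritative context."""
    def __init__(self, ctx):
        self.ctx = ctx

class BlockBackend:
    """Stand-in for a BlockBackend: caches the node's context."""
    def __init__(self, node):
        self.node = node
        self.ctx = node.ctx  # cached at attach time

def blk_get_aio_context(blk):
    # Models the assertion from block/block-backend.c: the node's
    # current context must equal the cached blk->ctx.
    assert blk.node.ctx is blk.ctx, "ctx == blk->ctx failed"
    return blk.ctx

if __name__ == "__main__":
    main_loop = AioContext("main-loop")
    iothread = AioContext("iothread")

    blk = BlockBackend(BlockNode(main_loop))
    blk_get_aio_context(blk)  # fine: contexts agree

    # A context switch that updates the node but not the backend's
    # cache reproduces the mismatch the assertion catches:
    blk.node.ctx = iothread
    try:
        blk_get_aio_context(blk)
    except AssertionError as e:
        print("aborted:", e)
```

In the real code both sides are normally updated together; the race described above is a window where a QMP-driven operation sees the node after a drain-triggered context change but before the backend's cache catches up.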

Comment 15 Greg Kurz 2020-07-10 09:37:32 UTC
Posted a patch upstream.

Now in maintainer tree:

https://repo.or.cz/qemu/kevin.git/commit/5463edf7b4fdc27d2b2d745d7f8c9fddb495d140

Comment 28 errata-xmlrpc 2020-11-17 17:48:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

