Bug 1849483

Summary: Failed to boot up guest when hotplugging vcpus on bios stage
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Xujun Ma <xuma>
Component: qemu-kvm Assignee: Laurent Vivier <lvivier>
qemu-kvm sub component: General QA Contact: Xujun Ma <xuma>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: bugproxy, dgibson, hannsj_uhl, jinzhao, juzhang, lvivier, qzhang, virt-maint
Version: 8.3 Keywords: Patch, Triaged
Target Milestone: rc   
Target Release: 8.3   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-5.1.0-7.module+el8.3.0+8099+dba2fe3e Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1895948 Environment:
Last Closed: 2020-11-17 17:49:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1776265, 1854692, 1895948    

Description Xujun Ma 2020-06-22 03:59:41 UTC
Description of problem:
The guest fails to boot when vCPUs are hotplugged during the BIOS (SLOF) stage.

Version-Release number of selected component (if applicable):
qemu-kvm-5.0.0-0.scrmod+el8.3.0+7066+61d99e35.wrb200617.ppc64le
SLOF-20200327-1.git8e012d6f.scrmod+el8.3.0+7066+61d99e35.noarch
How reproducible:
100%

Steps to Reproduce:
1. Boot the guest with the following command:
/usr/libexec/qemu-kvm \
 -smp 2,maxcpus=4,cores=1,threads=2,sockets=1  \
 -m 4096 \
 -nodefaults \
 -device virtio-scsi-pci,bus=pci.0 \
 -device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
 -drive file=rhel830-ppc64le-virtio-scsi.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
 -device spapr-vty,id=serial111,chardev=serial_id_serial0 \
 -mon chardev=serial_id_serial0,mode=readline
2. Hotplug a vCPU core while the SLOF firmware is still running:
(qemu) device_add host-spapr-cpu-core,core-id=2,id=core2
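
The core-id=2 in step 2 follows from the -smp topology in step 1. As a rough sketch, assuming the pseries convention that a core's core-id is the index of its first thread (so valid values are multiples of threads), the remaining hotpluggable core-id values can be computed as below. The function name is hypothetical and not part of QEMU.

```shell
# Print the core-id values still free for device_add, given the -smp
# values for the boot-time CPU count, maxcpus, and threads per core.
# Assumption: core-id is the index of the first thread of each core.
hotpluggable_core_ids() (
    smp=$1 maxcpus=$2 threads=$3
    # First core-id after the cores that are already present at boot.
    id=$(( (smp + threads - 1) / threads * threads ))
    out=""
    while [ "$id" -lt "$maxcpus" ]; do
        out="$out${out:+ }$id"
        id=$(( id + threads ))
    done
    echo "$out"
)
```

For the command line in this report, `hotpluggable_core_ids 2 4 2` yields the single value used in step 2. On a running guest, the monitor command `info hotpluggable-cpus` reports the same information directly.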

Actual results:
The guest stops booting with the following error:

Trying to load:  from: /pci@800000020000000/scsi@0/disk@100000000000000 ... 
 

( 700 ) (qemu) Program Exception [ 7dc6cfff ]


    R0 .. R7           R8 .. R15         R16 .. R23         R24 .. R31
000000007dbf0308   000000007dc63650   0000000000000053   0000000000008000   
000000007e67eff0   000000007dc63658   000000007e747c07   000000000000f003   
000000007dc23100   000000007e47b010   000000007e72c44e   0000000000000006   
000000007dc65000   000000007dc63100   000000007e72c44e   000000007dc19800   
0000000000000000   0000000000000000   000000007e47b010   000000007dc1e040   
000000007dc6cfff   0000000000000000   000000007e748218   0000000000000003   
000000007dc1fcf0   0000000000000000   000000007e747a60   000000000000f001   
fffffffffffffff8   0000000000000000   000000007dc1e210   ffffffffffffffff   

    CR / XER           LR / CTR          SRR0 / SRR1        DAR / DSISR
        80000004   000000007dbf37b0   000000007e748218   0000000000000000   
0000000020040000   000000007e748218   0000000000081000           00000000   


cb > 
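
The dump above can be narrowed down a little. Assuming the second table is read column-wise (first row CR, LR, SRR0, DAR; second row XER, CTR, SRR1, DSISR), SRR1 is 0x81000. The sketch below decodes the program-interrupt reason bits of SRR1 using the masks I believe match the Linux kernel's SRR1_PROG* definitions; the function name is hypothetical and the reading of the dump layout is an assumption.

```shell
# Decode the reason bits SRR1 carries for a 0x700 program interrupt.
# Masks are assumed to match the Linux kernel's SRR1_PROG* values.
program_check_reason() (
    srr1=$(( $1 ))
    if   [ $(( srr1 & 0x100000 )) -ne 0 ]; then echo "floating-point enabled exception"
    elif [ $(( srr1 & 0x080000 )) -ne 0 ]; then echo "illegal instruction"
    elif [ $(( srr1 & 0x040000 )) -ne 0 ]; then echo "privileged instruction"
    elif [ $(( srr1 & 0x020000 )) -ne 0 ]; then echo "trap"
    else echo "unknown"
    fi
)
```

Under these assumptions, `program_check_reason 0x81000` reports an illegal instruction, which would be consistent with a CPU starting to execute at an unexpected address during firmware boot.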

Expected results:
The guest boots normally.

Additional info:

The problem does not occur with threads=1.

Comment 1 Laurent Vivier 2020-06-24 07:14:15 UTC
Hi Xujun,

Could you check if this is a regression?

Comment 2 Xujun Ma 2020-06-24 07:47:02 UTC
(In reply to Laurent Vivier from comment #1)
> Hi Xujun,
> 
> Could you check if this is a regression?

Not a regression.

Comment 3 David Gibson 2020-08-24 05:03:45 UTC
Laurent, this may be a SLOF bug, so I hope you can look at this when you return.

Comment 4 Michael Roth 2020-08-28 19:11:08 UTC
This sounds like it might be related to the issue fixed by the patches for https://bugzilla.redhat.com/show_bug.cgi?id=1854692

In that case only SVM guests seem to trigger it, but maybe SLOF can trigger it in some cases even without SVM in play.

Comment 5 Michael Roth 2020-08-28 22:51:53 UTC
I wasn't able to reproduce the crash, but using the latest rhel-av-8.3.0 tree I can reproduce a permanent hang inside SLOF in about 2 out of 5 tries using the script below:

#!/bin/bash

# After a short delay, send the hotplug command to the monitor socket
# so that it arrives while SLOF is still running.
(sleep .1 && echo "device_add host-spapr-cpu-core,core-id=2,id=core2" | nc -U monitor) &

# Start the guest; the second -mon/-chardev pair exposes the monitor on
# the UNIX socket "monitor" used above.
/usr/libexec/qemu-kvm \
 -m 4096 \
 -smp 2,maxcpus=4,cores=1,threads=2,sockets=1 \
 -nodefaults \
 -chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
 -device spapr-vty,id=serial111,chardev=serial_id_serial0 \
 -mon chardev=serial_id_serial0,mode=readline \
 -device virtio-scsi-pci,bus=pci.0 \
 -device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
 -drive file=rhel8-guest.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none,snapshot=on \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -mon chardev=monitor,mode=readline -chardev socket,path=monitor,id=monitor,server,nowait,signal=off -nographic
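
Because the hang is intermittent, it can help to run the reproducer several times and count the failures. The following is only a sketch: it assumes the script above is saved as ./repro.sh (a hypothetical name), that a successful boot eventually prints a marker such as "login:" on the serial console, and that GNU coreutils `timeout` is available. Neither the marker nor the time limit comes from this report.

```shell
# Run a command several times and count the attempts whose output never
# shows the expected marker before the time limit expires.
# Usage: count_hangs TRIES SECONDS MARKER COMMAND [ARGS...]
count_hangs() (
    tries=$1 secs=$2 marker=$3
    shift 3
    hangs=0
    i=0
    while [ "$i" -lt "$tries" ]; do
        log=$(mktemp)
        # timeout kills the attempt once the time limit is reached.
        timeout "$secs" "$@" >"$log" 2>&1 </dev/null
        # No marker in the captured output counts as a hang.
        if ! grep -q -- "$marker" "$log"; then
            hangs=$(( hangs + 1 ))
        fi
        rm -f "$log"
        i=$(( i + 1 ))
    done
    echo "$hangs"
)
```

A call such as `count_hangs 5 120 "login:" ./repro.sh` would then print how many of five attempts did not reach the login prompt within two minutes. Since stdin is redirected from /dev/null here, the stdio chardev in the reproducer may need adjusting for unattended runs.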

With the following patches for bz1854692 applied I am no longer able to reproduce the issue:

  https://lists.nongnu.org/archive/html/qemu-arm/2020-08/msg00705.html

I've made a brew build with these patches applied, please test:

  http://brewweb.devel.redhat.com/brew/taskinfo?taskID=31009866

Comment 6 Qunfang Zhang 2020-08-31 01:44:59 UTC
Xujun, can you test this bug with the build provided by Michael in comment 5? Thanks.

Comment 7 Xujun Ma 2020-09-01 14:24:32 UTC
(In reply to Qunfang Zhang from comment #6)
> Xujun, can you test this bug with the build provided by Michael in comment
> 5? Thanks.

I tested this build and didn't hit this bug again.

Comment 8 Laurent Vivier 2020-09-04 14:45:24 UTC
The patches are in David's pull request from today:

https://patchew.org/QEMU/20200904034719.673626-1-david@gibson.dropbear.id.au/

bb5d765a8d33 ("target/arm: Move start-powered-off property to generic CPUState")
a79d25aab2c8 ("target/arm: Move setting of CPU halted state to generic code")
695d615e4ac9 ("ppc/spapr: Use start-powered-off CPUState property")

Comment 10 Laurent Vivier 2020-09-08 16:37:41 UTC
Merged upstream:

554c2169e925 ppc/spapr: Use start-powered-off CPUState property
6ad1da667c8e target/arm: Move setting of CPU halted state to generic code
c1b701587e59 target/arm: Move start-powered-off property to generic CPUState

Comment 15 Xujun Ma 2020-09-18 01:43:35 UTC
I have tested it with build qemu-kvm-5.1.0-8.module+el8.3.0+8141+3cd9cd43.ppc64le
and didn't hit this bug again, so the bug has been fixed in this build. Setting it to verified.
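
To check whether a given host already carries the fix, the installed qemu-kvm version-release can be compared against the "Fixed In Version" field (5.1.0-7). The sketch below uses GNU `sort -V` as an approximation; it is not RPM's own comparison algorithm, and the function name is hypothetical.

```shell
# Succeed if version string $1 sorts at or after $2 under GNU sort -V.
# This approximates, but is not identical to, RPM version comparison.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

For example, `version_ge 5.1.0-8 5.1.0-7` succeeds for the build verified above, while `version_ge 5.0.0-0 5.1.0-7` fails for the build in the original report. The installed value can be obtained with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' qemu-kvm`.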

Comment 17 Xujun Ma 2020-11-11 07:35:55 UTC
Reset bug priority to high according to the test result and bug criteria for evaluation.

Comment 19 errata-xmlrpc 2020-11-17 17:49:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137