Bug 1614610 - Guest quit with error when hotunplug cpu
Summary: Guest quit with error when hotunplug cpu
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.6
Hardware: ppc64le
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Laurent Vivier
QA Contact: Xujun Ma
URL:
Whiteboard:
Depends On:
Blocks: 1649160 1665844 1668205
 
Reported: 2018-08-10 03:39 UTC by Xujun Ma
Modified: 2019-08-22 09:20 UTC (History)

Fixed In Version: qemu-kvm-rhev-2.12.0-22.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1665844
Environment:
Last Closed: 2019-08-22 09:18:48 UTC
Target Upstream Version:


Attachments
log (233.59 KB, text/plain), added 2018-09-11 02:10 UTC by Xujun Ma


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1614606 None None None 2019-11-26 09:26:23 UTC
Red Hat Product Errata RHSA-2019:2553 None None None 2019-08-22 09:20:10 UTC

Internal Links: 1614606

Description Xujun Ma 2018-08-10 03:39:34 UTC
Description of problem:
The guest quits with an error when CPUs are hot-unplugged.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-9.el7.ppc64le
guest:3.10.0-931.el7.ppc64


How reproducible:
1/5

Steps to Reproduce:
1. Boot the guest with this command:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_iz0Gna/monitor-qmpmonitor1-20180809-215932-vNtDCgIK,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_iz0Gna/monitor-catch_monitor-20180809-215932-vNtDCgIK,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_iz0Gna/serial-serial0-20180809-215932-vNtDCgIK,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kar/vt_test_images/rhel76-ppc64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
    -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,id=idDVCWwS,vectors=4,netdev=id6XSmsF,bus=pci.0,addr=0x5  \
    -netdev tap,id=id6XSmsF,vhost=on,vhostfd=11,fd=17 \
    -m 8192  \
    -smp 8,maxcpus=64,cores=1,threads=8,sockets=1 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host  \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm

2. Hotplug CPUs up to the maximum of 64.
3. Offline all CPUs, up to the last one:
chcpu -d 0-63
4. Hot-unplug all CPUs.


Actual results:
The guest quits with the following error when a CPU is hot-unplugged:
[qemu output] qemu:qemu_cpu_kick_thread: No such process
[qemu output] (Process terminated with status 1)



Expected results:
The guest does not crash when CPUs are hot-unplugged.

Additional info:

Comment 2 Qunfang Zhang 2018-08-10 07:49:50 UTC
Set needinfo to Xujun for x86 test results.

Comment 3 Xujun Ma 2018-08-21 02:07:17 UTC
(In reply to Qunfang Zhang from comment #2)
> Set needinfo to Xujun for x86 test results.

x86 has the same problem with guest kernel 3.10.0-931.el7.x86_64.

Comment 4 Igor Mammedov 2018-08-22 14:15:19 UTC
(In reply to Xujun Ma from comment #0)
> Description of problem:
> Guest quit with error when hotunplug cpu.
> 
> Version-Release number of selected component (if applicable):
> qemu-kvm-rhev-2.12.0-9.el7.ppc64le
> guest:3.10.0-931.el7.ppc64
> 
> 
> How reproducible:
> 1/5
> 
> Steps to Reproduce:
> 1.Boot up guest with command:
> MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
>     -name 'avocado-vt-vm1'  \
>     -sandbox off  \
>     -machine pseries  \
>     -nodefaults \
>     -device VGA,bus=pci.0,addr=0x2  \
>     -chardev
> socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_iz0Gna/monitor-
> qmpmonitor1-20180809-215932-vNtDCgIK,server,nowait \
>     -mon chardev=qmp_id_qmpmonitor1,mode=control  \
>     -chardev
> socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_iz0Gna/monitor-
> catch_monitor-20180809-215932-vNtDCgIK,server,nowait \
>     -mon chardev=qmp_id_catch_monitor,mode=control  \
>     -chardev
> socket,id=serial_id_serial0,path=/var/tmp/avocado_iz0Gna/serial-serial0-
> 20180809-215932-vNtDCgIK,server,nowait \
>     -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
>     -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
>     -drive
> id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,
> file=/home/kar/vt_test_images/rhel76-ppc64-virtio.qcow2 \
>     -device
> virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
>     -device
> virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,id=idDVCWwS,vectors=4,netdev=id6XSmsF,
> bus=pci.0,addr=0x5  \
>     -netdev tap,id=id6XSmsF,vhost=on,vhostfd=11,fd=17 \
>     -m 8192  \
>     -smp 8,maxcpus=64,cores=1,threads=8,sockets=1 \
>     -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
>     -vnc :0  \
>     -rtc base=utc,clock=host  \
>     -boot menu=off,strict=off,order=cdn,once=c \
>     -enable-kvm
> 
> 2.Hotplug cpus to max 64
> 3.Offline all cpus till latest cpu
> chcpu -d 0-63
> 4.Hotunplug all cpus
Please provide the exact steps here, along with any scripts you use to do it.

> 
> 
> Actual results:
> Guest quit with error as following when hotunplug cpu.
> [qemu output] qemu:qemu_cpu_kick_thread: No such process
> [qemu output] (Process terminated with status 1)
I'd suspect an attempt to unplug cpu0, which is not supported on x86 and probably not supported on ppc either.

Anyway, let's see how the CPUs are unplugged and go from there.

> 
> 
> Expected results:
> Guest no crash when hotunplug cpu.
> 
> Additional info:

Comment 5 David Gibson 2018-09-04 03:55:59 UTC
After discussion with Igor, it looks like these are similar bugs in both the POWER and x86 code, rather than one generic bug. Therefore I'm moving this back to ppc64; Igor will clone it for the x86 version of the bug.

Also moving to RHEL7.7, since it's not urgent enough for 7.6.

Comment 6 Xujun Ma 2018-09-11 02:00:17 UTC
(In reply to Igor Mammedov from comment #4)
> (In reply to Xujun Ma from comment #0)
> > Description of problem:
> > Guest quit with error when hotunplug cpu.
> > 
> > Version-Release number of selected component (if applicable):
> > qemu-kvm-rhev-2.12.0-9.el7.ppc64le
> > guest:3.10.0-931.el7.ppc64
> > 
> > 
> > How reproducible:
> > 1/5
> > 
> > Steps to Reproduce:
> > 1.Boot up guest with command:
> > MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
> >     -name 'avocado-vt-vm1'  \
> >     -sandbox off  \
> >     -machine pseries  \
> >     -nodefaults \
> >     -device VGA,bus=pci.0,addr=0x2  \
> >     -chardev
> > socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_iz0Gna/monitor-
> > qmpmonitor1-20180809-215932-vNtDCgIK,server,nowait \
> >     -mon chardev=qmp_id_qmpmonitor1,mode=control  \
> >     -chardev
> > socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_iz0Gna/monitor-
> > catch_monitor-20180809-215932-vNtDCgIK,server,nowait \
> >     -mon chardev=qmp_id_catch_monitor,mode=control  \
> >     -chardev
> > socket,id=serial_id_serial0,path=/var/tmp/avocado_iz0Gna/serial-serial0-
> > 20180809-215932-vNtDCgIK,server,nowait \
> >     -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
> >     -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
> >     -drive
> > id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,
> > file=/home/kar/vt_test_images/rhel76-ppc64-virtio.qcow2 \
> >     -device
> > virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
> >     -device
> > virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,id=idDVCWwS,vectors=4,netdev=id6XSmsF,
> > bus=pci.0,addr=0x5  \
> >     -netdev tap,id=id6XSmsF,vhost=on,vhostfd=11,fd=17 \
> >     -m 8192  \
> >     -smp 8,maxcpus=64,cores=1,threads=8,sockets=1 \
> >     -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
> >     -vnc :0  \
> >     -rtc base=utc,clock=host  \
> >     -boot menu=off,strict=off,order=cdn,once=c \
> >     -enable-kvm
> > 
> > 2.Hotplug cpus to max 64
> > 3.Offline all cpus till latest cpu
> > chcpu -d 0-63
> > 4.Hotunplug all cpus
> pls provide exact steps here any scripts you use to do it
Sorry for replying so late.
Send command: {'execute': 'device_del', 'arguments': {'id': 'core8'}, 'id': 'PQiEahED'}
....
Send command: {'execute': 'device_del', 'arguments': {'id': 'core56'}, 'id': 'PQiEahED'}
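
The commands logged above are Python dict reprs (single quotes), while QMP itself expects JSON on the wire. A minimal sketch of serializing one such command follows; the helper name qmp_device_del is hypothetical, and only the 'core8' id and request id shown in the log are used.

```python
import json


def qmp_device_del(device_id, request_id=None):
    """Build the JSON text for a QMP device_del command.

    qmp_device_del is a hypothetical helper name used for illustration.
    """
    cmd = {"execute": "device_del", "arguments": {"id": device_id}}
    if request_id is not None:
        # QMP echoes this optional "id" back in the matching response.
        cmd["id"] = request_id
    return json.dumps(cmd)


if __name__ == "__main__":
    print(qmp_device_del("core8", "PQiEahED"))
```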

> 
> > 
> > 
> > Actual results:
> > Guest quit with error as following when hotunplug cpu.
> > [qemu output] qemu:qemu_cpu_kick_thread: No such process
> > [qemu output] (Process terminated with status 1)
> I'd suspect attempt to unplug cpu0 which is not supported on x86 and
> probably not supported on ppc as well.
That's right; actually I meant hot-unplugging all the CPUs that had been hot-plugged.
> 
> Anyways lets see how cpus are unplugged and go from there
I will upload a test log.
> 
> > 
> > 
> > Expected results:
> > Guest no crash when hotunplug cpu.
> > 
> > Additional info:

Comment 7 Xujun Ma 2018-09-11 02:10:59 UTC
Created attachment 1482250
log

Comment 8 Xujun Ma 2018-09-11 02:30:41 UTC
RHEL 8 with qemu-kvm-2.12.0-26.el8+1648+9c120fe6.ppc64le has the same issue.

Comment 9 Laurent Vivier 2018-09-13 16:35:01 UTC
I'm not able to reproduce it with host kernel 3.10.0-944.
Could you try to reproduce the problem with this kernel on the host?

Comment 10 Laurent Vivier 2018-09-17 15:14:15 UTC
I'm not able to reproduce the problem with the exact same versions of QEMU and kernel (host and guest).

1- starting with:

    -smp 8,maxcpus=64,cores=1,threads=8,sockets=1

2- hotplugging CPUs with:

    device_add driver=host-spapr-cpu-core core-id=8 id=core-8
    device_add driver=host-spapr-cpu-core core-id=16 id=core-16
    device_add driver=host-spapr-cpu-core core-id=24 id=core-24
    device_add driver=host-spapr-cpu-core core-id=32 id=core-32
    device_add driver=host-spapr-cpu-core core-id=40 id=core-40
    device_add driver=host-spapr-cpu-core core-id=48 id=core-48
    device_add driver=host-spapr-cpu-core core-id=56 id=core-56
    device_add driver=host-spapr-cpu-core core-id=64 id=core-64

3- disabling CPUs with:

    chcpu -d 8-63

4- unplugging CPUs with:

    device_del id=core-8
    device_del id=core-16
    device_del id=core-24
    device_del id=core-32
    device_del id=core-40
    device_del id=core-48
    device_del id=core-56
    device_del id=core-64
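
The command lists above follow a regular pattern, which can be sketched in Python. This is an illustration only: the function names are hypothetical, and it assumes that a core-id is the index of the core's first thread and must stay below maxcpus. Under that assumption, -smp 8,maxcpus=64,threads=8 yields core-ids 8 through 56; the sketch does not check whether QEMU accepts core-id=64.

```python
def hotplug_core_ids(boot_cpus, max_cpus, threads):
    """Core ids not present at boot that could be hot-plugged.

    Assumes boot_cpus is a multiple of threads and that a core id is the
    index of the first thread in the core (an assumption, not taken from QEMU).
    """
    return list(range(boot_cpus, max_cpus, threads))


def qmp_shell_commands(core_ids, driver="host-spapr-cpu-core"):
    """Return (plug, unplug) command lines in the key=value style used above."""
    adds = [f"device_add driver={driver} core-id={i} id=core-{i}" for i in core_ids]
    dels = [f"device_del id=core-{i}" for i in core_ids]
    return adds, dels


if __name__ == "__main__":
    adds, dels = qmp_shell_commands(hotplug_core_ids(8, 64, 8))
    print("\n".join(adds + dels))
```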

Comment 11 Xujun Ma 2018-09-18 02:23:34 UTC
(In reply to Laurent Vivier from comment #9)
> I'm not able to reproduce it with host kernel 3.10.0-944
> Could you try to reproduce the problem with this kernel on the host.

It can't be reproduced every time; the issue occurs roughly 1 time in 5.

Comment 12 Laurent Vivier 2018-11-26 19:48:17 UTC
Could you re-test with kernel-3.10.0-967 on the host?

Comment 13 Xujun Ma 2018-11-30 00:00:46 UTC
test env:
host:kernel-3.10.0-967.el7.ppc64le
qemu-kvm-rhev-2.12.0-18.el7.ppc64le
guest:kernel-3.10.0-967.el7.ppc64le

Ran this case 100 times and hit the problem 6 times, so the bug hasn't been fixed.

Comment 14 Laurent Vivier 2018-12-04 11:39:03 UTC
Xujun,

could you provide the script to reproduce the problem?

I tried to replay the commands I can see in attachment 1482250, but I'm not able to reproduce the problem in 1000 attempts.

Comment 15 Laurent Vivier 2018-12-06 10:26:45 UTC
The following script has been running for 2 days (5000 loops) and hasn't triggered any problem:

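#!/bin/bash
# Stress loop: hot-plug cores 8..56, offline their CPUs in the guest,
# hot-unplug the cores again, and repeat while the plug step succeeds.

IP=root@192.168.122.80                  # guest SSH target
QMP=$HOME/qemu/scripts/qmp/qmp-shell    # qmp-shell from the QEMU source tree

# Dump the vCPU list as seen by QEMU (helper, not called below).
function query_cpu
{
        echo "query-cpus" | sudo $QMP /tmp/qmp0
}

# Hot-plug one core per core-id 8, 16, ... up to $1.
function plug
{
        for i in $(seq 8 8 $1); do
                echo "query-cpus"
                echo "device_add driver=host-spapr-cpu-core core-id=$i id=core-$i";
        done | sudo $QMP /tmp/qmp0
}

# Hot-unplug the cores added by plug().
function unplug
{
        for i in $(seq 8 8 $1); do
                echo "query-cpus"
                echo "device_del id=core-$i";
        done | sudo $QMP /tmp/qmp0
}

# Run getconf for variable $1 inside the guest.
function remote_getconf
{
        ssh $IP getconf $1
}

# Number of CPUs present in the guest, parsed from the "CPU(s):" line of lscpu.
function get_nb_plugged_cpus
{
        ssh $IP lscpu | sed -n "/^CPU(s)/s/^CPU(s):[^[:digit:]]*\(.*\)/\1/p"
}

# Keep the guest busy while CPUs come and go.
ssh -f $IP "stress-ng --cpu 64 --io 4 --vm 2 --vm-bytes 256M -l 100"

l=0
while plug 63; do
        # Wait until all 64 CPUs are online in the guest.
        while [ "$(remote_getconf _NPROCESSORS_ONLN)" != 64 ] ; do
                :
        done
        ssh $IP lscpu
        ssh $IP chcpu -d 1-63
        ssh $IP lscpu
        unplug 63
        # Wait until only the 8 boot CPUs remain present.
        while [ "$(get_nb_plugged_cpus)" != 8 ] ; do
                :
        done
        ssh $IP lscpu
        ssh $IP chcpu -e 0-7
        ssh $IP lscpu
        l=$((l+1))
        echo "LOOP $l DONE"
done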
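
The two inner wait loops in the script above poll with no upper bound, so a guest that never reaches the expected CPU count would make the script spin forever. A bounded poll is one alternative, sketched here in Python; wait_until and its parameters are hypothetical names, not part of any tool used in this report.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True when the condition was met, False on timeout.
    The predicate is always evaluated at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could pass a predicate that compares the guest's online CPU count with the expected value, and treat a False return as a test failure rather than hanging.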

Comment 16 Laurent Vivier 2018-12-06 15:04:45 UTC
(In reply to Xujun Ma from comment #0)
> Description of problem:
...
> 4.Hotunplug all cpus
> 
> 
> Actual results:
> Guest quit with error as following when hotunplug cpu.
> [qemu output] qemu:qemu_cpu_kick_thread: No such process
> [qemu output] (Process terminated with status 1)

The only explanation I can find about this error is a race condition between qemu_kvm_cpu_thread_fn() that releases the thread and qemu_cpu_kick_thread() that could be using the same thread.

Paolo, any idea?

Comment 17 Laurent Vivier 2018-12-13 19:09:45 UTC
(In reply to Laurent Vivier from comment #16)
> (In reply to Xujun Ma from comment #0)
> > Description of problem:
> ...
> > 4.Hotunplug all cpus
> > 
> > 
> > Actual results:
> > Guest quit with error as following when hotunplug cpu.
> > [qemu output] qemu:qemu_cpu_kick_thread: No such process
> > [qemu output] (Process terminated with status 1)
> 
> The only explanation I can find about this error is a race condition between
> qemu_kvm_cpu_thread_fn() that releases the thread and qemu_cpu_kick_thread()
> that could be using the same thread.

I think in this case we could ignore the error by doing:

--- a/cpus.c
+++ b/cpus.c
@@ -1700,7 +1700,7 @@ static void qemu_cpu_kick_thread(CPUState *cpu)
     }
     cpu->thread_kicked = true;
     err = pthread_kill(cpu->thread->thread, SIG_IPI);
-    if (err) {
+    if (err && err != ESRCH) {
         fprintf(stderr, "qemu:%s: %s", __func__, strerror(err));
         exit(1);
     }
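
The policy in the patch above (treat ESRCH as "target already gone", keep every other error fatal) can be illustrated by analogy. This is not QEMU code: the sketch uses processes and os.kill() on a POSIX system instead of threads and pthread_kill(), and the names kick and kick_exited_child are hypothetical.

```python
import errno
import os
import subprocess
import sys


def kick(pid, sig=0):
    """Signal pid; return False if it no longer exists, re-raise other errors.

    Signal 0 only checks that the target exists (POSIX behaviour).
    """
    try:
        os.kill(pid, sig)
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            # Target already exited: nothing left to kick, so not an error.
            return False
        raise
    return True


def kick_exited_child():
    """Start a short-lived child, reap it, then try to kick it."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # after this the pid no longer refers to a live process
    return kick(child.pid)
```

The design point is the same as in the diff: a kick that arrives after its target has exited is harmless, so only unexpected errors should terminate the caller.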

Comment 20 Xujun Ma 2018-12-18 01:46:48 UTC
Hi Laurent

I'm very sorry; I was too late to fetch the scratch build and it has been closed.
Could you provide a new one?

Comment 23 Xujun Ma 2018-12-24 02:34:24 UTC
Tested 100 times and did not hit this problem with build qemu-kvm-rhev-2.12.0-20.el7.BZ1614610.ppc64le.

Comment 24 Laurent Vivier 2019-01-02 14:19:55 UTC
Patch sent upstream:

  cpus: ignore ESRCH in qemu_cpu_kick_thread()
  https://patchwork.ozlabs.org/patch/1020005/

Comment 26 Miroslav Rezanina 2019-02-05 17:08:21 UTC
Fix included in qemu-kvm-rhev-2.12.0-22.el7

Comment 28 Xujun Ma 2019-03-29 00:02:08 UTC
Verified this issue with qemu-kvm-rhev-2.12.0-25.el7.ppc64le.
Tested this scenario 100 times and didn't hit this problem again.
Based on the test result, the bug has been fixed, so setting the status to verified.
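
As a rough check on how much 100 clean runs say about an intermittent bug, the sketch below computes the chance of seeing no failure at all if the bug were still present. It assumes independent runs and takes the 6-in-100 failure rate from comment 13 as the pre-fix rate; both are assumptions, and prob_all_pass is a hypothetical name.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass,
    given a per-run failure probability `failure_rate`."""
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    p = prob_all_pass(6 / 100, 100)
    print(f"chance of 100 clean runs with the bug still present: {p:.2%}")
```

Under these assumptions the chance is about 0.2%, so 100 clean runs are reasonable evidence that the fix works, though not proof.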

Comment 30 errata-xmlrpc 2019-08-22 09:18:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2553

