Bug 1284324

Summary: qemu aborted with host under high pressure while guest running ltp.
Product: Red Hat Enterprise Linux 7 Reporter: Zhengtong <zhengtli>
Component: qemu-kvm-rhev Assignee: Thomas Huth <thuth>
Status: CLOSED NOTABUG QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.2 CC: jstancek, mazhang, michen, qzhang, shuyu, thuth, virt-maint, xuhan, xuma, zhengtli
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64le   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-30 10:58:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Zhengtong 2015-11-23 03:45:13 UTC
Description of problem:
Boot up several guests on one host and run the ltp test in each guest in an endless loop. After running for hours or days, qemu aborted.

Version-Release number of selected component (if applicable):
host kernel and guest kernel: 3.10.0-327.2.1.el7.ppc64le
qemu-kvm-rhev: qemu-kvm-rhev-2.3.0-31.el7

How reproducible:
2 of the 15 guests tested so far

Steps to Reproduce:
1. Boot up 5 guests on the host:
#boot_guest_gdb.sh 1/2/3/4/5
2. In each guest, run the ltp test:
screen -d -m -S ltp_test /root/run_ltp.sh

[root@dhcp70-169 ~]# cat /root/run_ltp.sh
#!/bin/sh
cd /mnt/tests/kernel/RHEL6/ltp-lite/; while  true ; do make run; done

3. On the host, run the mem_alloc memory bomb application:

[root@ibm-p8-rhevm-17 home]#i=1 ;while true ; do echo $i ; ./mem_alloc ; i=$((${i}+1)); sleep 180 ; done

4. Wait for the result.


Actual results:
One guest aborted.

Expected results:
No guest aborts, and all guests work well.

Additional info:

[root@ibm-p8-rhevm-14 virtio-scsi]# cat boot_guest_gdb.sh 
gdb --args /usr/libexec/qemu-kvm \
-name liuzt-ltp_${1} \
-machine pseries,accel=kvm,usb=off \
-m 4096 \
-realtime mlock=off \
-smp 4,sockets=1,cores=2,threads=2 \
-monitor stdio \
-rtc base=localtime,clock=host \
-no-shutdown \
-boot strict=on \
-device usb-ehci,id=usb,bus=pci.0,addr=0x1 \
-device pci-ohci,id=usb1,bus=pci.0,addr=0x2 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 \
-drive file=/home/vdisk/sn${1}.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
-device usb-kbd,id=input0 \
-device usb-mouse,id=input1 \
-device usb-tablet,id=input2 \
-vnc 0:${1} \
-device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-msg timestamp=on \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device spapr-vlan,netdev=hostnet0,id=net0,mac=52:54:00:f4:ee:$((${1}+10)),reg=0x2000

mem_alloc source code (written by Thomas Huth):

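    #include <stdio.h>
    #include <stdlib.h>

    /* Memory bomb: allocate 64 KiB blocks forever and never free them. */
    int main()
    {
        char *p;

        while(1) {
            p = malloc(65536);
            /* Write to the block so the kernel really has to back it with
             * memory. malloc() is deliberately not checked for NULL. */
            *(long*)p = rand()<<16 | rand();
        }
        return 0;   /* never reached */
    }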

Comment 2 Zhengtong 2015-11-23 03:47:25 UTC
The component may not be correct; please change it if it is not suitable.

The gdb backtrace log is as follows:
[New Thread 0x3ffeae50eaf0 (LWP 8259)]
qemu: qemu_thread_create: Resource temporarily unavailable

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x3ffeae50eaf0 (LWP 8259)]
0x00003fffb6c6e578 in raise () from /lib64/power8/libc.so.6
Missing separate debuginfos, use: debuginfo-install alsa-lib-1.0.28-2.el7.ppc64le bzip2-libs-1.0.6-13.el7.ppc64le cyrus-sasl-lib-2.1.26-19.2.el7.ppc64le dbus-libs-1.6.12-13.el7.ppc64le elfutils-libelf-0.163-3.el7.ppc64le elfutils-libs-0.163-3.el7.ppc64le flac-libs-1.3.0-5.el7_1.ppc64le glib2-2.42.2-5.el7.ppc64le glibc-2.17-105.el7.ppc64le gmp-6.0.0-11.el7.ppc64le gnutls-3.3.8-12.el7_1.1.ppc64le gperftools-libs-2.4-7.el7.ppc64le gsm-1.0.13-11.el7.ppc64le json-c-0.11-4.el7_0.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.13.2-10.el7.ppc64le libICE-1.0.9-2.el7.ppc64le libSM-1.2.2-2.el7.ppc64le libX11-1.6.3-2.el7.ppc64le libXau-1.0.8-2.1.el7.ppc64le libXext-1.3.3-3.el7.ppc64le libXi-1.7.4-2.el7.ppc64le libXtst-1.2.2-2.1.el7.ppc64le libaio-0.3.109-13.el7.ppc64le libasyncns-0.8-7.el7.ppc64le libattr-2.4.46-12.el7.ppc64le libcap-2.22-8.el7.ppc64le libcom_err-1.42.9-7.el7.ppc64le libcurl-7.29.0-25.el7.ppc64le libdb-5.3.21-19.el7.ppc64le libfdt-1.4.0-2.el7.ppc64le libffi-3.0.13-16.el7.ppc64le libgcc-4.8.5-4.el7.ppc64le libgcrypt-1.5.3-12.el7_1.1.ppc64le libgpg-error-1.12-3.el7.ppc64le libibverbs-1.1.8-8.el7.ppc64le libidn-1.28-4.el7.ppc64le libiscsi-1.9.0-6.el7.ppc64le libnl3-3.2.21-10.el7.ppc64le libogg-1.3.0-7.el7.ppc64le libpng-1.5.13-5.el7.ppc64le librdmacm-1.0.21-1.el7.ppc64le libselinux-2.2.2-6.el7.ppc64le libsndfile-1.0.25-10.el7.ppc64le libssh2-1.4.3-10.el7.ppc64le libstdc++-4.8.5-4.el7.ppc64le libtasn1-3.8-2.el7.ppc64le libusbx-1.0.15-4.el7.ppc64le libuuid-2.23.2-26.el7.ppc64le libvorbis-1.3.3-8.el7.ppc64le libxcb-1.11-4.el7.ppc64le lzo-2.06-8.el7.ppc64le nettle-2.7.1-4.el7.ppc64le nspr-4.10.8-2.el7_1.ppc64le nss-3.19.1-18.el7.ppc64le nss-softokn-freebl-3.16.2.3-14.el7.ppc64le nss-util-3.19.1-4.el7_1.ppc64le numactl-libs-2.0.9-5.el7_1.ppc64le openldap-2.4.40-8.el7.ppc64le openssl-libs-1.0.1e-42.el7_1.9.ppc64le p11-kit-0.20.7-3.el7.ppc64le pcre-8.32-15.el7.ppc64le pixman-0.32.6-3.el7.ppc64le pulseaudio-libs-6.0-7.el7.ppc64le snappy-1.1.0-3.el7.ppc64le 
systemd-libs-219-19.el7.ppc64le tcp_wrappers-libs-7.6-77.el7.ppc64le trousers-0.3.13-1.el7.ppc64le xz-libs-5.1.2-12alpha.el7.ppc64le zlib-1.2.7-15.el7.ppc64le
(gdb) bt
#0  0x00003fffb6c6e578 in raise () from /lib64/power8/libc.so.6
#1  0x00003fffb6c706fc in abort () from /lib64/power8/libc.so.6
#2  0x000000005d9a088c in error_exit (err=<optimized out>, msg=0x5dde4278 <__func__.6592> "qemu_thread_create") at util/qemu-thread-posix.c:48
#3  0x000000005dd61dc4 in qemu_thread_create (thread=0x3ffeae50e0c0, name=0x5ddd1ce8 "worker", start_routine=0x5dcb7800 <worker_thread>, arg=0x5edc0ea0, mode=<optimized out>)
    at util/qemu-thread-posix.c:473
#4  0x000000005dcb7770 in do_spawn_thread (pool=<optimized out>) at thread-pool.c:135
#5  0x000000005dcb7860 in worker_thread (opaque=0x5edc0ea0) at thread-pool.c:83
#6  0x00003fffb7bf8728 in start_thread () from /lib64/power8/libpthread.so.0
#7  0x00003fffb6d47ae0 in clone () from /lib64/power8/libc.so.6
(gdb)

Comment 3 Thomas Huth 2015-11-23 05:27:32 UTC
Thanks a lot for the backtrace, that is very valuable! So in that I can see that QEMU aborts right after trying to do a pthread_create() in qemu_thread_create(), by calling the function error_exit() after pthread_create() returned an error code.

I assume this is just due to the out-of-memory condition that is caused by the malloc bomb. I've experienced these a couple of times, too, while doing test runs for BZ 1269467. pthread_create() seems to be quite sensitive when the host is running out of memory. IIRC, the pthread_create() returned EAGAIN in my case, which means "insufficient resources to create another thread" for this function (see http://man7.org/linux/man-pages/man3/pthread_create.3.html for example). I just never noticed that it is aborting instead of exiting normally since I was using libvirt and that seems to hide that information. But I got the error output of QEMU in my log that it failed due to pthread_create().

By the way, the error_exit() function that is called right after the pthread_create() should have printed something to stderr before aborting, do you still have that text by any chance? It should give a better indication of what exactly went wrong during pthread_create.
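
The failure path described in this comment can be sketched in C as follows. This is a minimal illustration under assumptions, not QEMU's actual source: the names `report_and_abort`, `spawn_or_abort` and `set_flag` are hypothetical, and only the general shape (check the return value of pthread_create(), print a message to stderr, then abort) is taken from the backtrace above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: print "qemu: <caller>: <reason>" to stderr and
 * abort. abort() raises SIGABRT, matching frames #0 and #1 above. */
void report_and_abort(int err, const char *caller)
{
    fprintf(stderr, "qemu: %s: %s\n", caller, strerror(err));
    abort();
}

/* Hypothetical wrapper: pthread_create() returns the error number
 * directly instead of setting errno, so the return value is tested. */
void spawn_or_abort(pthread_t *thread, void *(*fn)(void *), void *arg)
{
    assert(thread != NULL && fn != NULL);

    int err = pthread_create(thread, NULL, fn, arg);
    if (err != 0) {
        report_and_abort(err, __func__);
    }
}

/* Trivial thread body that can be used to exercise the wrapper. */
void *set_flag(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

Calling `spawn_or_abort(&t, set_flag, &flag)` followed by `pthread_join(t, NULL)` exercises the success path. Under memory pressure the same call would be expected to take the error branch and terminate the process with SIGABRT rather than a normal exit.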

Comment 4 Zhengtong 2015-11-30 03:01:49 UTC
Hi Thomas, I think the only thing printed to stderr is:

"[root@ibm-p8-rhevm-17 virtio-scsi]# ./boot_guest.sh 1
QEMU 2.3.0 monitor - type 'help' for more information
(qemu) 
(qemu) 
(qemu) vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
vscsi_process_tsk_mgmt 01
qemu: qemu_thread_create: Resource temporarily unavailable
./boot_guest.sh: line 24: 12326 Aborted"

The line "qemu: qemu_thread_create: Resource temporarily unavailable" may mean something. Is that what you wanted?

Comment 5 Thomas Huth 2015-11-30 09:51:28 UTC
Yes, thanks, that "qemu_thread_create: Resource temporarily unavailable" was what I was interested in!
"Resource temporarily unavailable" is the error message for the EAGAIN error code - so that means QEMU simply terminated here because the host was running out of memory.
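
The mapping from the error code to the logged text can be checked with a short C sketch. The helper name `format_error_line` is hypothetical and the "qemu: " prefix is copied from the log line above; the text returned by strerror() depends on the C library, so the result is an expectation for glibc-based hosts such as the one in this report, not a guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: lay out an error the way the log line looks,
 * "qemu: <function>: <strerror text>". Returns what snprintf() returns. */
int format_error_line(char *buf, size_t len, const char *func, int err)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "qemu: %s: %s", func, strerror(err));
}
```

Formatting EAGAIN with the function name "qemu_thread_create" should reproduce the line from the log on such a system. On Linux, EAGAIN and EWOULDBLOCK share the same value, so both yield the same message.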

That means there's not much we can fix here, and I'd suggest closing this ticket as NOTABUG if you agree. The only thing that might be possible is improving the error message a little bit, but I'm not sure whether this is worth the effort. What do you think?
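
As an illustration of what a slightly improved error message could look like, here is a hedged C sketch. It is not a proposed QEMU patch: `thread_error_hint` and `format_thread_error` are hypothetical names, and the hint texts are paraphrased from the error descriptions in the pthread_create(3) man page.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return an extra hint for the error codes that
 * pthread_create(3) documents. */
const char *thread_error_hint(int err)
{
    switch (err) {
    case EAGAIN:
        return "insufficient resources or a thread limit was reached; "
               "the host may be out of memory";
    case EINVAL:
        return "invalid thread attributes";
    case EPERM:
        return "no permission to set the requested scheduling policy";
    default:
        return "no further hint available";
    }
}

/* Hypothetical helper: build "qemu: <function>: <strerror text> (<hint>)". */
int format_thread_error(char *buf, size_t len, const char *func, int err)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "qemu: %s: %s (%s)",
                    func, strerror(err), thread_error_hint(err));
}
```

With EAGAIN, the formatted line would carry the hint about resource limits and host memory next to the standard text, which is the kind of extra context that might have made the cause of this abort easier to spot.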

Comment 6 Zhengtong 2015-11-30 09:59:45 UTC
Hi Thomas,

I agree with you. If this is still caused by OOM, that makes sense. I think it's OK to close this bug as NOTABUG. qemu-kvm-rhev is not a product the customer touches directly; I mean libvirt or a higher layer may handle this. So that's OK for me.