Description of problem:
With the default libvirtd.conf, running the multiple concurrent operation script frequently produces errors such as:

error: Failed to save domain win7-32 to /test/save-dir/win7-32.save
error: server closed connection: error: cannot send data: Broken pipe
error: failed to connect to the hypervisor
Domain XP-sp3 started
error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Connection refused
error: failed to connect to the hypervisor

Setting the following values in libvirtd.conf still does not get rid of the "closed connection" and "failed to connect to the hypervisor" errors:

max_clients = 100
max_workers = 100
max_requests = 100
max_client_requests = 100

Version-Release number of selected component (if applicable):
libvirt-0.8.2-20.el5

How reproducible:
Always

Steps to Reproduce:
1. sh multi_lifecycle.sh

Actual results:
The errors shown above.

Expected results:
The concurrent lifecycle operations complete without connection errors.

Additional info:
The host has an 8-core CPU and 32 GB of memory. There are 8 guests defined on it.
Created attachment 499565 [details] libvirtd.conf
Created attachment 499566 [details] libvirtd.log and messages log
Created attachment 499567 [details] actual result
Created attachment 499568 [details] reproducer scripts
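The exact reproducer is attachment 499568; as a rough illustration only, the concurrent lifecycle load looks something like the sketch below (guest names, save directory, and operation mix are assumptions, not the attachment's contents):

  #!/bin/sh
  # Illustrative only: fire save/restore cycles at several guests in parallel.
  SAVEDIR=/test/save-dir
  for guest in win7-32 XP-sp3; do
      (
          virsh save "$guest" "$SAVEDIR/$guest.save" &&
          virsh restore "$SAVEDIR/$guest.save"
      ) &
  done
  wait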
We will run the test on RHEL 5.6 to check whether this is a regression, then add flags against the exact versions and keywords.
Tested with libvirt-0.8.2-15.el5 on the released RHEL 5.6 and did not hit this issue, so it is a regression bug. Adding the keyword; should the flag be set to 5.7.0?
Created attachment 504434 [details]
gdb.log

(In reply to comment #10)
> (In reply to comment #9)
> > QE tried this issue with Osier today, but still cannot reproduce it (pls see
> > comment 8). The reporter of this bug, vbian, is taking PTO these days, so we'll
> > retest this bug once she comes back next Mon.
>
> Thanks for looking into it. If it still does not reproduce on Monday, I think
> the correct resolution would be WORKSFORME.

Tried again on 5.7.

How reproducible: 4/6

Steps:
1. Enlarge the save/restore count to more than 150 iterations in multi_lifecycle.sh. I got the libvirtd crash near the 150th iteration.
2. Test it with a Windows guest. With a RHEL guest I tried twice and did not encounter this bug, so in order to pinpoint the problem, please try it with a Windows guest first.

The gdb debug info is attached.
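For reference, a minimal sketch of the enlarged save/restore loop from step 1 (guest name, save path, and iteration count are illustrative, not the exact contents of the attached multi_lifecycle.sh):

  #!/bin/sh
  # Repeat save/restore until a command fails or the count is reached.
  GUEST=win7-32
  SAVEDIR=/test/save-dir
  i=0
  while [ "$i" -lt 160 ]; do
      virsh save "$GUEST" "$SAVEDIR/$GUEST.save" || break
      virsh restore "$SAVEDIR/$GUEST.save" || break
      i=$((i+1))
  done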
Please get the backtrace stack.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 27272)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb) bt
(In reply to comment #12)
> Please get the backtrace stack.
>
> Program received signal SIGPIPE, Broken pipe.
> [Switching to Thread 0x40a00940 (LWP 27272)]
> 0x000000392860d8fb in write () from /lib64/libpthread.so.0
> (gdb) bt

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 17080)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000392860d8fb in write () from /lib64/libpthread.so.0
#1  0x00000000004632c9 in qemuMonitorIOWrite (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:413
#2  qemuMonitorIO (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:519
#3  0x0000000000415c53 in virEventDispatchHandles () at event.c:479
#4  virEventRunOnce () at event.c:606
#5  0x0000000000417e6c in qemudOneLoop (opaque=0x6c5930) at libvirtd.c:2221
#6  qemudRunLoop (opaque=0x6c5930) at libvirtd.c:2330
#7  0x000000392860673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003927ed44bd in clone () from /lib64/libc.so.6
SIGPIPE is ignored by libvirtd (libvirt-0.8.2-20); however, gdb stops on SIGPIPE by default, so the backtrace above is not actually the backtrace of the crash. To keep gdb from stopping on SIGPIPE:

(gdb) handle SIGPIPE nostop print pass

However, I see another backtrace, which should be the actual cause of the libvirtd crash:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x4a766940 (LWP 10004)]
0x0000003927e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003927e30265 in raise () from /lib64/libc.so.6
#1  0x0000003927e31d10 in abort () from /lib64/libc.so.6
#2  0x0000003927e6a99b in __libc_message () from /lib64/libc.so.6
#3  0x0000003927e706d3 in malloc_consolidate () from /lib64/libc.so.6
#4  0x0000003927e72bbc in _int_malloc () from /lib64/libc.so.6
#5  0x0000003927e74aad in calloc () from /lib64/libc.so.6
#6  0x0000003930a2e4dc in virAllocN (ptrptr=0x4a763048, size=10004, count=6) at util/memory.c:128
#7  0x0000003930a33254 in virStorageFileGetMetadataFromFD (path=0xd6dcc50 "/var/lib/libvirt/images/kvm--10", fd=35, format=0, meta=0xffffffffffffffff) at util/storage_file.c:722
#8  0x0000003930a3e60e in virDomainDiskDefForeachPath (disk=0xd6e9d30, allowProbing=false, ignoreOpenFailure=false, iter=0x3930af7b40 <SELinuxSetSecurityFileLabel>, opaque=0xd6ec570) at conf/domain_conf.c:7642
#9  0x0000003930af8db9 in SELinuxSetSecurityAllLabel (drv=0x3930d4eec0, vm=0x2714, stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save") at security/security_selinux.c:1001
#10 0x000000000044a671 in qemuSecurityStackedSetSecurityAllLabel (drv=<value optimized out>, vm=0xd680ac0, stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_security_stacked.c:231
#11 0x00000000004327a4 in qemudStartVMDaemon (conn=0xd6efca0, driver=0xd662c00, vm=0xd680ac0, migrateFrom=0x493263 "stdio", start_paused=true, stdin_fd=30, stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_driver.c:3607
#12 0x000000000043e07b in qemudDomainSaveImageStartVM (conn=0xd6efca0, driver=0xd662c00, vm=0xd680ac0, fd=0x4a764d2c, read_pid=0x4a764d28, header=0x4a764d30, path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_driver.c:6479
#13 0x000000000043ebe0 in qemudDomainRestore (conn=0xd6efca0, path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_driver.c:6606
#14 0x0000003930a7a4d8 in virDomainRestore (conn=0xd6efca0, from=0xd6fb410 "/test/save-dir/kvm--10.save") at libvirt.c:2525
#15 0x0000000000423fb4 in remoteDispatchDomainRestore (server=<value optimized out>, client=<value optimized out>, conn=0x6, hdr=0xffffffffffffffff, rerr=0x80, args=0x101010101010101, ret=0x4a765f00) at remote.c:2157
#16 0x00000000004275ee in remoteDispatchClientCall (server=0xd651930, client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:504
#17 0x0000000000427a12 in remoteDispatchClientRequest (server=0xd651930, client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:386
#18 0x0000000000419df8 in qemudWorker (data=<value optimized out>) at libvirtd.c:1574
#19 0x000000392860673d in start_thread () from /lib64/libpthread.so.0
#20 0x0000003927ed44bd in clone () from /lib64/libc.so.6
(gdb)

Vibian, if your libvirtd also crashed while restoring the domain, then we are at the same point.
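For anyone retrying this, a minimal sketch of attaching gdb to the running daemon with the SIGPIPE handler preset (the pidof lookup is illustrative; substitute the actual libvirtd PID if needed):

  # Attach to the running libvirtd and pass SIGPIPE through without stopping.
  gdb -p "$(pidof libvirtd)" \
      -ex 'handle SIGPIPE nostop print pass' \
      -ex 'continue'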
I have tried during the whole day: with gdb, I can't reproduce this bug, but without gdb, libvirtd crashed at the restore point. So we should have been at the same point?
Okay, Vivian, you said it's a regression because you could not reproduce it in 5.6. This is obviously a race condition, and now you say you can't reproduce it for 24 hours in 5.7 with gdb. For how long did you try to reproduce it in 5.6? You may just not have hit the race!

So I question the fact that it's labelled as a regression,

Daniel
(In reply to comment #16)
> Okay, Vivian, you said it's a regression because you could not reproduce it
> in 5.6, this is obviously a race condition, and now you said you can't
> reproduce it for 24 hours in 5.7 with gdb. For how long have you tried to
> reproduce it in 5.6, because you may just not have hitten the race !
>
> So I question the fact that it's labelled as a regression,
>
> Daniel

Good point, Daniel. At the time the Regression keyword was added, RHEL 5.6 had been running the scalability testing for about 36 hours. But as you mention, this is a race condition, so 36 hours still cannot give a definitive result, only a lower hit rate than on RHEL 5.7: on RHEL 5.7 the bug shows up within a period of time, and we could reproduce it in 4 out of 6 runs. So, any suggestion on the Regression keyword?
Were those 2 tests (5.6 and 5.7) done on the same physical machine?

Daniel
(In reply to comment #18)
> Were those 2 tests (5.6 and 5.7) done on the same physical machine ?
>
> Daniel

Yes, they were both tested on the same physical machine.
Created attachment 505789 [details]
multi_lifecycle execution result, and the gdb info against it

Retested.

Version-Release number of selected component (if applicable):
libvirt-0.8.2-20.el5
kernel-2.6.18-268.el5
kvm-83-237.el5
tree: RHEL5.7-Server-20110615.0/tree-x86_64

How reproducible: 1/6

Steps:
[Preparation]
1. Install 10 Windows guests (mine are Win2008-R2-2.raw, Win2k3-32bit.raw, Win2k8-32bit.raw, Win7-i386.raw, WinXP-i386-2.raw, Win2008-R2.raw, Win2k3-64bit.raw, Win2k8-64bit.raw, Win7-x86_64.raw, WinXP-i386.raw).
2. Modify /etc/libvirt/libvirtd.conf with (see the config sketch after this comment):
max_clients = 100
max_workers = 100
max_requests = 100
max_client_requests = 100

[Action]
1. Run sh multi_lifecycle.sh over and over, manually.

Actual result:
Please see the attachment:
virsh-command --- the multi_lifecycle.sh execution result
gdb-info --- the gdb info against the script's execution
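As referenced in preparation step 2, a minimal sketch of applying the tuning values and restarting the daemon (RHEL5 init script assumed; append only if the keys are not already set in the file):

  # Append the tuning values to the daemon config and restart it.
  cat >> /etc/libvirt/libvirtd.conf <<'EOF'
  max_clients = 100
  max_workers = 100
  max_requests = 100
  max_client_requests = 100
  EOF
  service libvirtd restart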
Additional info: from repeated testing, we found that if we do not hit

# virsh save guest /dir/guest.save
error: Failed to save domain kvm--1 to /test/save-dir/kvm--1.save
error: operation failed: cannot stop CPU execution

(gdb)
17:15:51.036: error : qemuMonitorCommandWithHandler:362 : cannot send monitor command 'stop': Connection reset by peer
17:15:51.036: error : qemuMonitorTextStopCPUs:498 : operation failed: cannot stop CPU execution

we do not hit the libvirtd crash. But even when the above error is shown, we do not hit the problem 100% of the time.
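A minimal sketch of watching for that monitor error while the loop runs (the log path is an assumption; it depends on the log_outputs setting in libvirtd.conf):

  # Follow the daemon log and flag the monitor error that precedes the crash.
  tail -F /var/log/libvirt/libvirtd.log | \
      grep --line-buffered 'cannot send monitor command'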
Getting more context based on comment #21: the test consists of doing various operations simultaneously on 10 domains. The part of the test leading to the failure is a repeated

  save state of guest
  restore guest from saved state

loop. The segfault only happens if we got the error messages

  qemuMonitorCommandWithHandler:362 : cannot send monitor command 'stop': Connection reset by peer
  qemuMonitorTextStopCPUs:498 : operation failed: cannot stop CPU execution

The first one comes from the libvirt monitor code detecting a failure, always during the stop command during a save; the second one is a generic error emitted by the libvirt wrapper for the stop qemu monitor command. If the error is not seen in the loop, the daemon does not crash during the test. If the error is seen, then it may crash.

Based on 2 stack traces obtained from gdb on a crashed libvirtd, the crash always comes from a malloc() error at a random point: the memory in the allocator has been corrupted in some way by one of the operations. The 2 operations done repeatedly on the 10 domains are just save and restore for Qemu/KVM domains, so those are the 2 code paths that need extra attention.

When the test on 5.6 was done to try to estimate whether this is a regression, the same errors were seen but they never led to a daemon crash (which could be pure luck, i.e. no regression, or simply mean the bug didn't occur).

Investigations are going on...

Daniel
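Since the aborts point at heap corruption, one hedged way to catch it closer to its origin is to run the daemon by hand under glibc's allocator checks (MALLOC_CHECK_ is standard glibc; the daemon path and flag are the RHEL5 ones):

  # Stop the service and run libvirtd with strict malloc checking;
  # glibc then aborts at the first detected heap inconsistency.
  service libvirtd stop
  MALLOC_CHECK_=2 /usr/sbin/libvirtd --daemon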
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.