Bug 705737 - multiple concurrent virsh operations cause libvirtd to close the connection or report "connection refused" while saving guests
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: libvirt
Version: 5.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Daniel Veillard
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-18 10:13 UTC by Vivian Bian
Modified: 2011-08-15 19:45 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-15 19:45:10 UTC
Target Upstream Version:


Attachments
libvirtd.conf (10.56 KB, text/plain), 2011-05-18 10:13 UTC, Vivian Bian
libvirtd.log and messages log (46.24 KB, application/x-gzip), 2011-05-18 10:14 UTC, Vivian Bian
actual result (3.60 KB, text/plain), 2011-05-18 10:14 UTC, Vivian Bian
reproducer scripts (564 bytes, application/x-gzip), 2011-05-18 10:15 UTC, Vivian Bian
gdb.log (108.77 KB, text/plain), 2011-06-13 12:41 UTC, Vivian Bian

Description Vivian Bian 2011-05-18 10:13:16 UTC
Description of problem:
With the default libvirtd.conf, running the multiple concurrent operation script frequently produces the following errors:

error: Failed to save domain win7-32 to /test/save-dir/win7-32.save
error: server closed connection:
error: cannot send data: Broken pipe
error: failed to connect to the hypervisor
Domain XP-sp3 started

error: unable to connect to '/var/run/libvirt/libvirt-sock', libvirtd may need to be started: Connection refused
error: failed to connect to the hypervisor


Setting the following values in libvirtd.conf still does not get rid of the "server closed connection" and "failed to connect to the hypervisor" errors.


max_clients = 100
max_workers = 100
max_requests = 100
max_client_requests = 100



Version-Release number of selected component (if applicable):
libvirt-0.8.2-20.el5

How reproducible:
Always

Steps to Reproduce:
1. sh multi_lifecycle.sh
  
Actual results:


Expected results:


Additional info:
The host has an 8-core CPU and 32 GB of memory. There are 8 guests defined on it.
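
For illustration only (the actual reproducer is in the attached scripts): a loop of roughly the following shape exercises the same concurrent save/restore path. The guest names, save directory, and iteration count below are placeholders, not the exact contents of multi_lifecycle.sh.

#!/bin/sh
# Hypothetical sketch of a concurrent guest lifecycle loop -- not the attached script.
SAVEDIR=/test/save-dir             # placeholder save directory
GUESTS="win7-32 XP-sp3"            # placeholder guest names

for g in $GUESTS; do
    (
        i=0
        while [ $i -lt 50 ]; do
            virsh save "$g" "$SAVEDIR/$g.save" || break
            virsh restore "$SAVEDIR/$g.save"   || break
            i=$((i + 1))
        done
    ) &                            # each guest's save/restore loop runs in parallel
done
wait

Each background loop opens its own connections to libvirtd, which is presumably why the max_clients/max_workers tuning above was attempted.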

Comment 1 Vivian Bian 2011-05-18 10:13:47 UTC
Created attachment 499565 [details]
libvirtd.conf

Comment 2 Vivian Bian 2011-05-18 10:14:24 UTC
Created attachment 499566 [details]
libvirtd.log and messages log

Comment 3 Vivian Bian 2011-05-18 10:14:49 UTC
Created attachment 499567 [details]
actual result

Comment 4 Vivian Bian 2011-05-18 10:15:20 UTC
Created attachment 499568 [details]
reproducer scripts

Comment 5 Vivian Bian 2011-05-26 02:43:24 UTC
We will run the test on RHEL 5.6 to check whether it's a regression, then add flags against the exact versions and keywords.

Comment 6 Vivian Bian 2011-05-31 05:31:36 UTC
Tested with libvirt-0.8.2-15.el5 on the released RHEL 5.6. Did not hit this issue, so it is a regression bug. Add the keyword and set the flag to 5.7.0?

Comment 11 Vivian Bian 2011-06-13 12:41:12 UTC
Created attachment 504434 [details]
gdb.log

(In reply to comment #10)
> (In reply to comment #9)
> > QE tried this issue with Osier today, but still cannot reproduce it(pls see
> > comment 8). The reporter of this bug, vbian is taking PTO those days, so we'll
> > retest this bug once she comes back next Mon.
> 
> Thanks for looking into it.  If it still does not reproduce on Monday, I think
> the correct resolution would be WORKSFORME.

Tried again on 5.7.

How reproducible:
4/6

Steps:
1. Increase the number of save/restore iterations in multi_lifecycle.sh to more than 150; libvirtd crashed near the 150th iteration (see the sketch below).
2. Test with Windows guests. With RHEL guests I tried twice and did not encounter this bug, so to pinpoint the buggy code path, please try with a Windows guest first.

gdb debuginfo is attached
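
Assumed shape of the enlarged loop from step 1 (a sketch only; the guest name, save path, and exact iteration count are placeholders):

GUEST=Win7-i386                      # placeholder Windows guest name
SAVE=/test/save-dir/$GUEST.save
i=0
while [ $i -lt 200 ]; do             # more than 150 iterations, where the crash appeared
    virsh save "$GUEST" "$SAVE" || break
    virsh restore "$SAVE"       || break
    i=$((i + 1))
done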

Comment 12 Osier Yang 2011-06-13 13:00:12 UTC
Please get the backtrace stack.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 27272)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb)bt

Comment 13 Vivian Bian 2011-06-14 02:01:38 UTC
(In reply to comment #12)
> Please get the backtrace stack.
> 
> Program received signal SIGPIPE, Broken pipe.
> [Switching to Thread 0x40a00940 (LWP 27272)]
> 0x000000392860d8fb in write () from /lib64/libpthread.so.0
> (gdb)bt

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 17080)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000392860d8fb in write () from /lib64/libpthread.so.0
#1  0x00000000004632c9 in qemuMonitorIOWrite (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:413
#2  qemuMonitorIO (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:519
#3  0x0000000000415c53 in virEventDispatchHandles () at event.c:479
#4  virEventRunOnce () at event.c:606
#5  0x0000000000417e6c in qemudOneLoop (opaque=0x6c5930) at libvirtd.c:2221
#6  qemudRunLoop (opaque=0x6c5930) at libvirtd.c:2330
#7  0x000000392860673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003927ed44bd in clone () from /lib64/libc.so.6

Comment 14 Osier Yang 2011-06-15 03:26:08 UTC
SIGPIPE is ignored by libvirtd (libvirt-0.8.2-20); however, gdb catches SIGPIPE by default, so the backtrace above is not actually the backtrace of the crash. To avoid gdb stopping on SIGPIPE:

(gdb) handle SIGPIPE nostop print pass
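
A session along these lines should then capture the real crash backtrace (for reference only; attaching to the running libvirtd via pidof is just one way to do it):

# gdb -p $(pidof libvirtd)
(gdb) handle SIGPIPE nostop print pass
(gdb) continue
... run the reproducer and wait for the crash ...
(gdb) bt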

However, I see another backtrace, which should be the actual cause of the libvirtd crash.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x4a766940 (LWP 10004)]
0x0000003927e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003927e30265 in raise () from /lib64/libc.so.6
#1  0x0000003927e31d10 in abort () from /lib64/libc.so.6
#2  0x0000003927e6a99b in __libc_message () from /lib64/libc.so.6
#3  0x0000003927e706d3 in malloc_consolidate () from /lib64/libc.so.6
#4  0x0000003927e72bbc in _int_malloc () from /lib64/libc.so.6
#5  0x0000003927e74aad in calloc () from /lib64/libc.so.6
#6  0x0000003930a2e4dc in virAllocN (ptrptr=0x4a763048, size=10004, count=6)
    at util/memory.c:128
#7  0x0000003930a33254 in virStorageFileGetMetadataFromFD (
    path=0xd6dcc50 "/var/lib/libvirt/images/kvm--10", fd=35, format=0, 
    meta=0xffffffffffffffff) at util/storage_file.c:722
#8  0x0000003930a3e60e in virDomainDiskDefForeachPath (disk=0xd6e9d30, 
    allowProbing=false, ignoreOpenFailure=false, 
    iter=0x3930af7b40 <SELinuxSetSecurityFileLabel>, opaque=0xd6ec570)
    at conf/domain_conf.c:7642
#9  0x0000003930af8db9 in SELinuxSetSecurityAllLabel (drv=0x3930d4eec0, 
    vm=0x2714, stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
    at security/security_selinux.c:1001
#10 0x000000000044a671 in qemuSecurityStackedSetSecurityAllLabel (
    drv=<value optimized out>, vm=0xd680ac0, 
    stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
    at qemu/qemu_security_stacked.c:231
#11 0x00000000004327a4 in qemudStartVMDaemon (conn=0xd6efca0, 
---Type <return> to continue, or q <return> to quit---
    driver=0xd662c00, vm=0xd680ac0, migrateFrom=0x493263 "stdio", 
    start_paused=true, stdin_fd=30, 
    stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
    at qemu/qemu_driver.c:3607
#12 0x000000000043e07b in qemudDomainSaveImageStartVM (conn=0xd6efca0, 
    driver=0xd662c00, vm=0xd680ac0, fd=0x4a764d2c, read_pid=0x4a764d28, 
    header=0x4a764d30, path=0xd6fb410 "/test/save-dir/kvm--10.save")
    at qemu/qemu_driver.c:6479
#13 0x000000000043ebe0 in qemudDomainRestore (conn=0xd6efca0, 
    path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_driver.c:6606
#14 0x0000003930a7a4d8 in virDomainRestore (conn=0xd6efca0, 
    from=0xd6fb410 "/test/save-dir/kvm--10.save") at libvirt.c:2525
#15 0x0000000000423fb4 in remoteDispatchDomainRestore (
    server=<value optimized out>, client=<value optimized out>, conn=0x6, 
    hdr=0xffffffffffffffff, rerr=0x80, args=0x101010101010101, ret=0x4a765f00)
    at remote.c:2157
#16 0x00000000004275ee in remoteDispatchClientCall (server=0xd651930, 
    client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:504
#17 0x0000000000427a12 in remoteDispatchClientRequest (server=0xd651930, 
    client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:386
#18 0x0000000000419df8 in qemudWorker (data=<value optimized out>)
    at libvirtd.c:1574
#19 0x000000392860673d in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#20 0x0000003927ed44bd in clone () from /lib64/libc.so.6
(gdb) 

Vivian, if your libvirtd also crashed while restoring the domain, then we are hitting the same point.

Comment 15 Vivian Bian 2011-06-16 11:21:53 UTC
I have tried for the whole day; with gdb I can't reproduce this bug, but without gdb libvirtd crashed at the restore point. So we should be hitting the same point?

Comment 16 Daniel Veillard 2011-06-21 02:43:30 UTC
Okay, Vivian, you said it's a regression because you could not reproduce it
in 5.6. This is obviously a race condition, and now you say you can't reproduce it for 24 hours in 5.7 with gdb. How long did you try to reproduce it
in 5.6? You may just not have hit the race!

  So I question whether it should be labelled as a regression,

Daniel

Comment 17 Vivian Bian 2011-06-21 03:28:11 UTC
(In reply to comment #16)
> Okay, Vivian, you said it's a regression because you could not reproduce it
> in 5.6. This is obviously a race condition, and now you say you can't
> reproduce it for 24 hours in 5.7 with gdb. How long did you try to reproduce
> it in 5.6? You may just not have hit the race!
> 
>   So I question whether it should be labelled as a regression,
> 
> Daniel

Good point, Daniel. At the time the Regression keyword was added, RHEL 5.6 had been under scalability testing for about 36 hours. Still, as you point out, this is a race condition, so 36 hours cannot give a definitive answer, only a lower reproduction rate than on RHEL 5.7. On RHEL 5.7 the bug showed up within a short period of testing, and we could reproduce it in 4 out of 6 runs.

So any suggestion on the Regression keyword?

Comment 18 Daniel Veillard 2011-06-21 04:27:53 UTC
Were those 2 tests (5.6 and 5.7) done on the same physical machine ?

Daniel

Comment 19 Vivian Bian 2011-06-21 04:30:23 UTC
(In reply to comment #18)
> Were those 2 tests (5.6 and 5.7) done on the same physical machine ?
> 
> Daniel

Yes, they were both tested on the same physical machine.

Comment 20 Vivian Bian 2011-06-21 09:58:57 UTC
Created attachment 505789 [details]
multi_lifecycle execution result, and the gdb info for that run

Retested.

Version-Release number of selected component (if applicable):
libvirt-0.8.2-20.el5
kernel-2.6.18-268.el5
kvm-83-237.el5
tree RHEL5.7-Server-20110615.0/tree-x86_64

How reproducible :
1/6 

Steps 
[Preparation]
1. Install 10 Windows guests (mine are Win2008-R2-2.raw  Win2k3-32bit.raw  Win2k8-32bit.raw  Win7-i386.raw    WinXP-i386-2.raw Win2008-R2.raw    Win2k3-64bit.raw  Win2k8-64bit.raw  Win7-x86_64.raw  WinXP-i386.raw)
2. Modify /etc/libvirt/libvirtd.conf with
max_clients = 100
max_workers = 100
max_requests = 100
max_client_requests = 100
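
After editing the file, libvirtd has to be restarted for the new limits to take effect; on RHEL 5 that is:

# service libvirtd restart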

Action
1. Run sh multi_lifecycle.sh manually, one run after another.

Actual result 
Please see the attachment 
virsh-command --- the multi_lifecycle.sh execution result
gdb-info --- the gdb info for the scripts' execution

Comment 21 Vivian Bian 2011-06-21 10:02:53 UTC
Additional info:
From repeated testing, we found that if we do not hit

# virsh save guest /dir/guest.save
error: Failed to save domain kvm--1 to /test/save-dir/kvm--1.save
error: operation failed: cannot stop CPU execution

(gdb)
17:15:51.036: error : qemuMonitorCommandWithHandler:362 : cannot send monitor command 'stop': Connection reset by peer
17:15:51.036: error : qemuMonitorTextStopCPUs:498 : operation failed: cannot stop CPU execution

then we do not hit the libvirtd crash. But even when the above error is shown, we do not hit this problem 100% of the time.

Comment 22 Daniel Veillard 2011-06-21 12:22:22 UTC
Getting more context based on comment #21.

The test consists of performing various operations simultaneously on 10 domains.
The part of the test leading to the failure is a repeated loop:
  save the state of a guest
  restore the guest from the saved state

The segfault only happens if we get the error messages:
  qemuMonitorCommandWithHandler:362 : cannot send monitor
       command 'stop': Connection reset by peer
  qemuMonitorTextStopCPUs:498 : operation failed: cannot
       stop CPU execution

The first error comes from the libvirt monitor code detecting a failure,
always on the 'stop' command during a save; the second is a generic
error emitted by the libvirt wrapper for the 'stop' qemu monitor command.

If the error is not seen in the loop, the daemon does not crash during the
test. If the error is seen, then it may crash.

Based on 2 stack traces obtained from gdb on a crashed libvirtd, the
crash always comes from a malloc() error at a random point. The memory
managed by the allocator has been corrupted in some way by one of the operations.
The 2 operations done repeatedly on the 10 domains are just save and restore
for QEMU/KVM domains, so those are the 2 code paths that need extra attention.

When the test on 5.6 was done to try to estimate whether this is a regression,
the same errors were seen but they never led to a daemon crash (which could be
pure luck, i.e. the bug is present there too and is not a regression, it simply
did not trigger).

Investigations are going on ...

Daniel

Comment 30 RHEL Program Management 2011-08-15 19:45:10 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

