| Summary: | multiple concurrent virsh operations cause libvirtd to close the connection or report "connection refused" while saving guests |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | libvirt |
| Version: | 5.7 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | high |
| Reporter: | Vivian Bian <vbian> |
| Assignee: | Daniel Veillard <veillard> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | ajia, dallan, dyuan, eblake, jdenemar, llim, mshao, rwu, veillard |
| Keywords: | Regression |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-08-15 19:45:10 UTC |
Description
Vivian Bian
2011-05-18 10:13:16 UTC
Created attachment 499565 [details]
libvirtd.conf
Created attachment 499566 [details]
libvirtd.log and messages log
Created attachment 499567 [details]
actual result
Created attachment 499568 [details]
reproducer scripts
We will do the test on RHEL 5.6 to check whether it is a regression, then add flags against the exact versions and keywords.

Tested with libvirt-0.8.2-15.el5 on the released RHEL 5.6. Did not hit this issue, so it is a regression. Adding the keyword; set the flag to 5.7.0?

Created attachment 504434 [details]
gdb.log

(In reply to comment #10)
> (In reply to comment #9)
> > QE tried this issue with Osier today, but still cannot reproduce it (pls see
> > comment 8). The reporter of this bug, vbian, is taking PTO these days, so we'll
> > retest this bug once she comes back next Mon.
>
> Thanks for looking into it. If it still does not reproduce on Monday, I think
> the correct resolution would be WORKSFORME.

Tried again on 5.7.

How reproducible: 4/6

Steps:
1. Increase the number of save/restore iterations in multi_lifecycle.sh to more than 150. libvirtd crashed at around 150 iterations.
2. Test with a Windows guest. With a RHEL guest I tried twice and did not hit this bug, so to narrow the problem down please try a Windows guest first.

gdb debuginfo is attached.

Please get the backtrace stack.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 27272)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb) bt

(In reply to comment #12)
> Please get the backtrace stack.
>
> Program received signal SIGPIPE, Broken pipe.
> [Switching to Thread 0x40a00940 (LWP 27272)]
> 0x000000392860d8fb in write () from /lib64/libpthread.so.0
> (gdb) bt

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x40a00940 (LWP 17080)]
0x000000392860d8fb in write () from /lib64/libpthread.so.0
(gdb) bt
#0  0x000000392860d8fb in write () from /lib64/libpthread.so.0
#1  0x00000000004632c9 in qemuMonitorIOWrite (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:413
#2  qemuMonitorIO (watch=<value optimized out>, fd=59, events=11, opaque=0x2aaab002ddd0) at qemu/qemu_monitor.c:519
#3  0x0000000000415c53 in virEventDispatchHandles () at event.c:479
#4  virEventRunOnce () at event.c:606
#5  0x0000000000417e6c in qemudOneLoop (opaque=0x6c5930) at libvirtd.c:2221
#6  qemudRunLoop (opaque=0x6c5930) at libvirtd.c:2330
#7  0x000000392860673d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003927ed44bd in clone () from /lib64/libc.so.6

SIGPIPE is ignored by libvirtd (libvirt-0.8.2-20); however, gdb catches SIGPIPE by default, so the backtrace above is not actually the backtrace of the crash. To avoid gdb stopping on SIGPIPE:
(gdb) handle SIGPIPE nostop print pass
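For reference, a full session along these lines might look like the following sketch; the pid lookup and the availability of libvirt debuginfo are assumptions for illustration, not details taken from this report:

  # attach to the running daemon (assumes libvirt debuginfo is installed)
  gdb -p $(pidof libvirtd)
  (gdb) handle SIGPIPE nostop print pass
  (gdb) continue
  # ... run the reproducer until libvirtd aborts, then:
  (gdb) bt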
However, I see another backtrace, which should be the actual cause of the libvirtd crash.
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x4a766940 (LWP 10004)]
0x0000003927e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003927e30265 in raise () from /lib64/libc.so.6
#1 0x0000003927e31d10 in abort () from /lib64/libc.so.6
#2 0x0000003927e6a99b in __libc_message () from /lib64/libc.so.6
#3 0x0000003927e706d3 in malloc_consolidate () from /lib64/libc.so.6
#4 0x0000003927e72bbc in _int_malloc () from /lib64/libc.so.6
#5 0x0000003927e74aad in calloc () from /lib64/libc.so.6
#6 0x0000003930a2e4dc in virAllocN (ptrptr=0x4a763048, size=10004, count=6)
at util/memory.c:128
#7 0x0000003930a33254 in virStorageFileGetMetadataFromFD (
path=0xd6dcc50 "/var/lib/libvirt/images/kvm--10", fd=35, format=0,
meta=0xffffffffffffffff) at util/storage_file.c:722
#8 0x0000003930a3e60e in virDomainDiskDefForeachPath (disk=0xd6e9d30,
allowProbing=false, ignoreOpenFailure=false,
iter=0x3930af7b40 <SELinuxSetSecurityFileLabel>, opaque=0xd6ec570)
at conf/domain_conf.c:7642
#9 0x0000003930af8db9 in SELinuxSetSecurityAllLabel (drv=0x3930d4eec0,
vm=0x2714, stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
at security/security_selinux.c:1001
#10 0x000000000044a671 in qemuSecurityStackedSetSecurityAllLabel (
drv=<value optimized out>, vm=0xd680ac0,
stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
at qemu/qemu_security_stacked.c:231
#11 0x00000000004327a4 in qemudStartVMDaemon (conn=0xd6efca0,
driver=0xd662c00, vm=0xd680ac0, migrateFrom=0x493263 "stdio",
start_paused=true, stdin_fd=30,
stdin_path=0xd6fb410 "/test/save-dir/kvm--10.save")
at qemu/qemu_driver.c:3607
#12 0x000000000043e07b in qemudDomainSaveImageStartVM (conn=0xd6efca0,
driver=0xd662c00, vm=0xd680ac0, fd=0x4a764d2c, read_pid=0x4a764d28,
header=0x4a764d30, path=0xd6fb410 "/test/save-dir/kvm--10.save")
at qemu/qemu_driver.c:6479
#13 0x000000000043ebe0 in qemudDomainRestore (conn=0xd6efca0,
path=0xd6fb410 "/test/save-dir/kvm--10.save") at qemu/qemu_driver.c:6606
#14 0x0000003930a7a4d8 in virDomainRestore (conn=0xd6efca0,
from=0xd6fb410 "/test/save-dir/kvm--10.save") at libvirt.c:2525
#15 0x0000000000423fb4 in remoteDispatchDomainRestore (
server=<value optimized out>, client=<value optimized out>, conn=0x6,
hdr=0xffffffffffffffff, rerr=0x80, args=0x101010101010101, ret=0x4a765f00)
at remote.c:2157
#16 0x00000000004275ee in remoteDispatchClientCall (server=0xd651930,
client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:504
#17 0x0000000000427a12 in remoteDispatchClientRequest (server=0xd651930,
client=0x2aaaac19ac80, msg=0x2aaaac482270) at dispatch.c:386
#18 0x0000000000419df8 in qemudWorker (data=<value optimized out>)
at libvirtd.c:1574
#19 0x000000392860673d in start_thread () from /lib64/libpthread.so.0
#20 0x0000003927ed44bd in clone () from /lib64/libc.so.6
(gdb)
Vivian, if your libvirtd also crashed when restoring the domain, then we are hitting the same point.
I have tried for the whole day; with gdb I cannot reproduce this bug, but without gdb libvirtd crashed at the restore point. So we should be at the same point?

Okay, Vivian, you said it is a regression because you could not reproduce it in 5.6. This is obviously a race condition, and now you say you cannot reproduce it for 24 hours in 5.7 with gdb. For how long did you try to reproduce it in 5.6? You may just not have hit the race. So I question the fact that it is labelled as a regression.

Daniel

(In reply to comment #16)
> Okay, Vivian, you said it's a regression because you could not reproduce it
> in 5.6, this is obviously a race condition, and now you said you can't
> reproduce it for 24 hours in 5.7 with gdb. For how long have you tried to
> reproduce it in 5.6, because you may just not have hit the race!
>
> So I question the fact that it's labelled as a regression,
>
> Daniel

Good point, Daniel. At the time the Regression keyword was added, RHEL 5.6 had been running the scalability test for about 36 hours. But, as you point out, this is a race condition, so 36 hours still cannot give a definitive answer, only a lower hit rate than on RHEL 5.7. On RHEL 5.7 the bug shows up within a reasonable test period, and we could reproduce it in 4 out of 6 runs. Any suggestion on the Regression keyword?

Were those two tests (5.6 and 5.7) done on the same physical machine?

Daniel

(In reply to comment #18)
> Were those two tests (5.6 and 5.7) done on the same physical machine?
>
> Daniel

Yes, they were both run on the same physical machine.

Created attachment 505789 [details]
multi_lifecycle execution result, and the gdb info against it
retested
Version-Release number of selected component (if applicable):
libvirt-0.8.2-20.el5
kernel-2.6.18-268.el5
kvm-83-237.el5
tree RHEL5.7-Server-20110615.0/tree-x86_64
How reproducible:
1/6
Steps
[Preparation]
1. Install 10 Windows guests (mine are Win2008-R2-2.raw, Win2k3-32bit.raw, Win2k8-32bit.raw, Win7-i386.raw, WinXP-i386-2.raw, Win2008-R2.raw, Win2k3-64bit.raw, Win2k8-64bit.raw, Win7-x86_64.raw, WinXP-i386.raw)
2. Modify /etc/libvirt/libvirtd.conf with the following settings (consolidated in the snippet after this list):
max_clients = 100
max_workers = 100
max_requests = 100
max_client_requests = 100
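Put together, the tuning amounts to the following; restarting libvirtd afterwards so the new limits take effect is an assumed step, not one stated in this report:

  # /etc/libvirt/libvirtd.conf -- raise client/worker limits for the stress test
  max_clients = 100
  max_workers = 100
  max_requests = 100
  max_client_requests = 100

  # service libvirtd restart   (assumed, to apply the new limits)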
Action
1. Run multi_lifecycle.sh manually, one run after another.
Actual result
Please see the attachment
virsh-command --- the multi_lifecycle.sh execution result
gdb-info --- the gdb info against the scripts' execution
Additional info: from repeated testing, we found that if we do not hit

# virsh save guest /dir/guest.save
error: Failed to save domain kvm--1 to /test/save-dir/kvm--1.save
error: operation failed: cannot stop CPU execution

(gdb) 17:15:51.036: error : qemuMonitorCommandWithHandler:362 : cannot send monitor command 'stop': Connection reset by peer
17:15:51.036: error : qemuMonitorTextStopCPUs:498 : operation failed: cannot stop CPU execution

then we do not hit the libvirtd crash. But even when the above error is shown, we do not hit this problem 100% of the time.

Getting more context based on comment #21:
The test consists of doing various operations simultaneously on 10 domains.
The part of the test leading to the failure is a repeated
  save state of guest
  restore guest from saved state
loop (a rough sketch of this loop is shown below).
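A minimal sketch of that loop, assuming the domain names (kvm--1 .. kvm--10) and the /test/save-dir path that appear in the backtraces above; the actual reproducer is the attached multi_lifecycle.sh and may differ in detail:

  #!/bin/sh
  # stress save/restore on 10 running guests concurrently
  for n in $(seq 1 10); do
      (
          for i in $(seq 1 150); do
              virsh save "kvm--$n" "/test/save-dir/kvm--$n.save"
              virsh restore "/test/save-dir/kvm--$n.save"
          done
      ) &
  done
  wait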
The segfault only happens if we get the error messages

  qemuMonitorCommandWithHandler:362 : cannot send monitor command 'stop': Connection reset by peer
  qemuMonitorTextStopCPUs:498 : operation failed: cannot stop CPU execution

The first comes from the libvirt monitor code detecting a failure, always on the stop command issued as part of a save; the second is a generic error emitted by the libvirt wrapper for the qemu monitor stop command.
If the error is not seen in the loop, the daemon does not crash during the test. If the error is seen, then it may crash.
Based on 2 stack traces obtained from gdb on a crashed libvirtd, the crash always comes from a malloc() error at a random point. The memory managed by the allocator has been corrupted in some way by one of the operations.
The 2 operations done repeatedly on the 10 domains are just save and restore for Qemu/KVM domains, so those are the 2 code paths that need extra attention.
When the test on 5.6 was done to try to estimate whether this is a regression, the same errors were seen but they never led to a daemon crash (which could be pure luck, i.e. no regression, or simply that the bug did not occur).
Investigation is ongoing ...
Daniel
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request. |