Bug 746555 - sometimes guest start will hang and the status is ambiguous when starting 512 guests
Summary: sometimes guest start will hang and the status is ambiguous when starting 512 guests
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Osier Yang
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-10-17 04:38 UTC by weizhang
Modified: 2011-10-21 10:08 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-21 09:44:19 UTC
Target Upstream Version:


Attachments: none

Description weizhang 2011-10-17 04:38:40 UTC
Description of problem:
I start 512 guests in a loop. When starting the 396th guest, virsh start hangs without returning, but libvirtd is still running and, on another console, virsh list still works. When I check the guest status with virsh list (without --all or --inactive), it shows that the 307th guest is in shut off state but still appears in the active domain list.
# virsh list |grep "396"
396 rhel6u1-x86_646     shut off

When I do
# virsh start rhel5u7-x86_6464
error: Domain is already active

# virsh destroy rhel6u1-x86_646
error: Failed to destroy domain rhel6u1-x86_646
error: Requested operation is not valid: domain is not running

After the destroy, the guest returns to the normal shut off state and can be started again.
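
A minimal recovery sketch for this situation, assuming the example domain name from this report and relying only on virsh list, virsh domstate and virsh destroy/start; the destroy may error out once, as shown above:

dom=rhel6u1-x86_646                        # example domain name from this report
state=$(virsh domstate "$dom" 2>/dev/null)
if virsh list | awk 'NR > 2 && NF {print $2}' | grep -qx "$dom" && [ "$state" = "shut off" ]; then
    # listed as active but reported as shut off: force it back to a clean state
    virsh destroy "$dom" || true           # may fail with "domain is not running", as above
    virsh start "$dom"
fi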

Version-Release number of selected component (if applicable):
libvirt-0.9.4-17.el6.x86_64
kernel-2.6.32-206.el6.x86_64
qemu-kvm-0.12.1.2-2.196.el6.x86_64

How reproducible:
sometimes

Steps to Reproduce:
1. Start 512 guests with the following command (a more instrumented variant of this loop is sketched below):
# for i in {1..512}; do virsh start guest$i; done
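
A sketch of a more instrumented variant of the loop (not part of the original report; the guest$i naming scheme and the 120-second timeout are assumptions) that records which guest the start hangs or fails on:

for i in {1..512}; do
    echo "$(date '+%T') starting guest$i"
    if ! timeout 120 virsh start "guest$i"; then
        echo "guest$i did not start cleanly (timeout or error)" >&2
        virsh domstate "guest$i" >&2     # check for the ambiguous "shut off" state
        break
    fi
done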
  
Actual results:
Starting one of the guests may hang, and virsh list shows incorrect status information for that guest.

Expected results:
virsh start does not hang, and virsh list shows the correct status.

Additional info:
# free -g
             total       used       free     shared    buffers     cached
Mem:           992        865        127          0          2        688
-/+ buffers/cache:        174        818
Swap:            0          0          0

# top -p `pidof libvirtd`
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7202 root      20   0  994m  32m 5236 S 26.8  0.0   6:46.61 libvirtd  

I don't know if it is helpful, but I found that libvirtd.log reports errors like:
23:30:12.323: 7202: error : qemuMonitorIO:583 : internal error End of file from monitor
23:31:19.956: 7202: error : qemuMonitorIO:583 : internal error End of file from monitor
10:08:59.096: 7202: error : virNetSocketReadWire:911 : End of file while reading data: Input/output error
10:09:00.844: 7202: error : virNetSocketReadWire:911 : End of file while reading data: Input/output error
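
A one-liner to pull these monitor and socket end-of-file errors out of the daemon log, assuming the log is written to the commonly used /var/log/libvirt/libvirtd.log location:

# grep -E 'qemuMonitorIO|virNetSocketReadWire' /var/log/libvirt/libvirtd.log | tail -n 20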

Comment 2 weizhang 2011-10-18 06:17:43 UTC
(In reply to comment #0)
> Description of problem:
> # virsh list |grep "396"
> 396 rhel6u1-x86_646     shut off
> 
> When do
> # virsh start rhel5u7-x86_6464
Here I mean the same guest, rhel6u1-x86_646; it should be
# virsh start rhel6u1-x86_646
error: Domain is already active

> error: Domain is already active
> 
> #virsh destroy rhel6u1-x86_646
> error: Failed to destroy domain rhel6u1-x86_646
> error: Requested operation is not valid: domain is not running
> 
> After destroy, the guest return to normal shut off status and can be started
> again
>

Comment 4 Eric Blake 2011-10-18 20:24:47 UTC
The fact that the domain is listed means it has been added to the hash table of started domains, although the actual start process has not yet progressed far enough to reach the point where the domain is marked as running.  We have to drop the mutex to call into the domain monitor to verify that the domain started, which explains why there is a window where a domain can show up in the active list while still being shut off.  But until I know the root cause of why the creation seems to hang, I'm not sure it is worth tweaking the code to try to prevent this data race.
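
A small polling sketch, assuming the window described here stays open long enough to be caught with one-second polling, that flags any domain present in the active list while virsh domstate still reports "shut off":

while true; do
    virsh list | awk 'NR > 2 && NF {print $2}' | while read -r dom; do
        state=$(virsh domstate "$dom" 2>/dev/null)
        if [ "$state" = "shut off" ]; then
            echo "$(date '+%T') $dom is in the active list but domstate reports: $state"
        fi
    done
    sleep 1
done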

