Bug 511151
Summary:          hung virtual machine, spinning qemu-kvm process
Product:          Red Hat Enterprise Linux 5
Component:        kvm
Version:          5.4
Status:           CLOSED DUPLICATE
Severity:         high
Priority:         high
Reporter:         Aron Griffis <aron.griffis>
Assignee:         Gleb Natapov <gleb>
QA Contact:       Lawrence Lim <llim>
CC:               adaora.onyia, alex_williamson, dwa, jjarvis, knoel, linda.knippers, llim, martine.silbermann, mra, mtosatti, oramraz, rick.hester, rpacheco, shengliang.lv, stillwell, tburke, tools-bugs, virt-maint, yeylon, ykaul
Target Milestone: rc
Hardware:         All
OS:               Linux
Whiteboard:       hp:dl785solblk
Doc Type:         Bug Fix
Last Closed:      2009-08-05 13:23:04 UTC
I'll leave this running overnight in case there's any further state you'd like me to capture. Tomorrow morning I need to kill the guest to proceed with other testing.

I attempted to capture a core file from the process. You can fetch it from
http://free.linux.hp.com/~agriffis/rhel5/bz511151/kvm-tile15-idle1.qemu-kvm-spinning.core.bz2

I say "attempted" because gdb emitted some warnings as it started:

    [Thread debugging using libthread_db enabled]
    [New Thread 0x2b51082b8f90 (LWP 17004)]
    ../../gdb/linux-nat.c:977: internal-error: linux_nat_post_attach_wait: Assertion `pid == new_pid && WIFSTOPPED (status)' failed.
    A problem internal to GDB has been detected, further debugging may prove unreliable.
    Quit this debugging session? (y or n) n
    ../../gdb/linux-nat.c:977: internal-error: linux_nat_post_attach_wait: Assertion `pid == new_pid && WIFSTOPPED (status)' failed.
    A problem internal to GDB has been detected, further debugging may prove unreliable.
    Create a core file of GDB? (y or n) n

Updating the "how reproducible" question. I've seen this a few times now, about four of them yesterday.

Aron,

Can you capture serial console output for these guests, so we can see what's in the crashed guest console?

Also, you mentioned the guests are started in sequence. Can you provide more details on the exact timing? What is the delay between starting two guests?

Hi Marcelo,

Regarding the sequence, the timing is determined by when "virsh start" returns. Right now there's some code to retry a couple of times if we hit a race condition in libvirtd; see bug 511241. If you prefer to read code...
    for g; do
        if virsh start $g || {
            warn "pausing 5 seconds before trying again"
            sleep 5
            virsh start $g
        } || {
            warn "pausing 5 seconds before trying once more"
            sleep 5
            virsh start $g
        }
        then
            echo "logfile /var/log/libvirt/qemu/$g-console.log" > /root/screenrc-tiler
            screen -c /root/screenrc-tiler -S $g-console -L -d -m virsh console $g
            continue
        fi
        die "failed to start $g"
    done

So you can see that I've added code to capture the serial console output for all the guests. I've just started the run now. All 256 guests are running and the associated qemu-kvm processes are normal at around 2% CPU. Based on past experience, eventually a qemu-kvm process will spike to 100% and remain there while the guest hangs. Gradually this will happen to more guests. When this happens, I'll provide the associated console logs.

First screen of top output, showing 5 hung guests...

    top - 14:16:23 up  2:15,  2 users,  load average: 68.23, 74.68, 75.38
    Tasks: 1276 total,   1 running, 1275 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.1%us, 23.7%sy,  0.0%ni, 76.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  264284396k total, 124553136k used, 139731260k free, 48556476k buffers
    Swap:   4194296k total,        0k used,  4194296k free,   266704k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    22591 root      15   0  726m 268m 2328 S 99.9  0.1  65:44.37 qemu-kvm
     2061 root      15   0 1238m 279m 2328 S 99.5  0.1 119:40.70 qemu-kvm
    14062 root      15   0 1238m 278m 2328 S 99.5  0.1 117:19.56 qemu-kvm
    11262 root      15   0 1236m  76m 2316 S 99.2  0.0 118:25.95 qemu-kvm
    12065 root      15   0 1238m 280m 2328 S 98.6  0.1  56:00.34 qemu-kvm
     6130 root      15   0 1238m 279m 2328 S  3.9  0.1   1:38.05 qemu-kvm
    11995 root      15   0 1238m 278m 2328 S  3.6  0.1   1:42.48 qemu-kvm
    10223 root      15   0  723m 263m 2328 S  3.3  0.1   1:29.59 qemu-kvm
    20829 root      15   0 1239m 280m 2328 S  3.3  0.1   1:37.21 qemu-kvm
     3689 root      15   0 1237m 278m 2328 S  2.9  0.1   1:32.71 qemu-kvm
    14740 root      15   0  723m 263m 2328 S  2.9  0.1   1:24.51 qemu-kvm
     3788 root      15   0 1238m 279m 2328 S  2.6  0.1   1:38.78 qemu-kvm
     9579 root      15   0 1239m 280m 2328 S  2.6  0.1   1:35.80 qemu-kvm
     1846 root      15   0 1238m 279m 2328 S  2.3  0.1   1:25.69 qemu-kvm
     4195 root      15   0 1238m 280m 2328 S  2.3  0.1   1:30.39 qemu-kvm
     4926 root      15   0 1237m 279m 2328 S  2.3  0.1   1:26.70 qemu-kvm
    11807 root      15   0 1238m 277m 2328 S  2.3  0.1   1:42.26 qemu-kvm

I'll attach the console output from all 5.

Created attachment 353868 [details]
kvm-tile1-webserver1-console.log
Created attachment 353870 [details]
kvm-tile1-aim1.log
Created attachment 353871 [details]
kvm-tile1-aim1.log
Created attachment 353872 [details]
kvm-tile1-aim1.log
Created attachment 353873 [details]
kvm-tile16-postfix1-console.log
Created attachment 353882 [details]
kvm-tile17-specweb1-console.log
Created attachment 353884 [details]
kvm-tile20-mysql1-console.log
Created attachment 353887 [details]
kvm-tile32-idle1-console.log
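The symptom described above — a qemu-kvm process spiking to 100% CPU and staying there — can also be flagged programmatically instead of by eyeballing top. A minimal sketch; the sampling interval and the 95% threshold are my own assumptions, not anything used in this report:

```python
# Sketch: flag qemu-kvm processes pegged near 100% CPU between two samples of
# utime+stime (fields 14 and 15 of /proc/<pid>/stat, in clock ticks). The
# /proc sampling itself is left out so the arithmetic is a pure, testable helper.

def cpu_percent(before, after, elapsed_ticks):
    """CPU% for one process, given (utime, stime) at two sample points."""
    used = (after[0] + after[1]) - (before[0] + before[1])
    return 100.0 * used / elapsed_ticks

def spinning_pids(samples, elapsed_ticks, threshold=95.0):
    """samples: {pid: (before, after)}; returns pids at/above the threshold."""
    return sorted(pid for pid, (b, a) in samples.items()
                  if cpu_percent(b, a, elapsed_ticks) >= threshold)

# PIDs and tick counts below are made up for illustration.
samples = {22591: ((100, 50), (195, 54)),   # ~99% busy over 100 ticks
           6130:  ((100, 50), (102, 52))}   # ~4% busy
assert spinning_pids(samples, elapsed_ticks=100) == [22591]
```

Running this against two snapshots of every qemu-kvm pid would reproduce the hung-guest list from the top output above.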
Aron, can you attempt to reproduce with an upstream kernel host? (2.6.30)

Gleb, can you take a look at this please? It's an AMD host.

    Unable to handle kernel NULL pointer dereference at 0000000000000046
    RIP: [<0000000000000046>]
    PGD 3f5d9067 PUD 3f5b8067 PMD 0
    Oops: 0000 [1] SMP
    last sysfs file: /block/ram0/dev
    CPU 0
    Modules linked in: dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd
    Pid: 318, comm: nash-hotplug Not tainted 2.6.18-156.el5 #1
    RIP: 0010:[<0000000000000046>]  [<0000000000000046>]
    RSP: 0000:ffffffff8043dfa8  EFLAGS: 00010096
    RAX: ffff81003f6bffd8 RBX: 0000000000000046 RCX: ffffffff8043df58
    RDX: ffff810081142000 RSI: 00000000005188b0 RDI: ffffffff80492f80
    RBP: 0000000000000000 R08: ffffffff80012a0c R09: ffffffff8043df98
    R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 00000000005188b0 R15: 0000000000518870
    FS:  0000000007bc7930(0063) GS:ffffffff803c1000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000046 CR3: 000000003f5d8000 CR4: 00000000000006e0
    Process nash-hotplug (pid: 318, threadinfo ffff81003f6be000, task ffff81003f5a10c0)
    Stack:  00000000005188b0 ffffffff8005f2fc ffffffff8043df98 <EOI> 0000000000000000
            0000000000000046 00000000005188b0 ffffffff8005f2fc ffffffff8043df98
            <EOI> 0000000000000000 0000000000000046 00000000005188b0 ffffffff8005f2fc
    Call Trace:
     <IRQ> [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
     <EOI> [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
    Code: Bad RIP value.
    RIP [<0000000000000046>] RSP <ffffffff8043dfa8>
    CR2: 0000000000000046
    <0>Kernel panic - not syncing: Fatal exception

Disassembly of call_softirq:

    0xffffffff8005f2e0 <call_softirq+0>:   push   %rbp
    0xffffffff8005f2e1 <call_softirq+1>:   mov    %rsp,%rbp
    0xffffffff8005f2e4 <call_softirq+4>:   incl   %gs:0x28
    0xffffffff8005f2ec <call_softirq+12>:  cmove  %gs:0x30,%rsp
    0xffffffff8005f2f6 <call_softirq+22>:  push   %rbp
    0xffffffff8005f2f7 <call_softirq+23>:  callq  0xffffffff80012983 <__do_softirq>
    0xffffffff8005f2fc <call_softirq+28>:  leaveq
    0xffffffff8005f2fd <call_softirq+29>:  decl   %gs:0x28
    0xffffffff8005f305 <call_softirq+37>:  retq

I _guess_ for some reason RBX is crap on return from __do_softirq, so leaveq restores a bogus RSP and retq jumps to 0000000000000046. Note __do_softirq enables interrupts (but disables them before returning).

Anything unusual in a host dmesg when this happens?

(In reply to comment #16)
> Anything unusual in a host dmesg when this happens?

No

Can you run these two commands in the qemu monitor of the problematic VM:

    info cpus
    x/20i $pc-10

    (qemu) info cpus
    * CPU #0: pc=0xffffffff8000d077 thread_id=19375
    (qemu) x/20i $pc-10
    0xffffffff8000d06d:  add    %al,%bl
    0xffffffff8000d06f:  rdtsc
    0xffffffff8000d071:  mov    %eax,%ecx
    0xffffffff8000d073:  repz nop
    0xffffffff8000d075:  rdtsc
    0xffffffff8000d077:  sub    %ecx,%eax
    0xffffffff8000d079:  cmp    %rdi,%rax
    0xffffffff8000d07c:  jb     0xffffffff8000d073
    0xffffffff8000d07e:  retq
    0xffffffff8000d07f:  push   %r12
    0xffffffff8000d081:  cmpl   $0x0,4071336(%rip)        # 0xffffffff803ef030
    0xffffffff8000d088:  mov    %rsi,%r12
    0xffffffff8000d08b:  push   %rbp
    0xffffffff8000d08c:  mov    %rdi,%rbp
    0xffffffff8000d08f:  push   %rbx
    0xffffffff8000d090:  je     0xffffffff8000d0e5
    0xffffffff8000d092:  lea    0x8(%rdi),%rdi
    0xffffffff8000d096:  callq  0xffffffff80065a55
    0xffffffff8000d09b:  mov    0x28(%rbp),%rbx
    0xffffffff8000d09f:  mov    0x10(%rbx),%rax

By the way, of approximately 250 guests running on this machine, I see approximately 1-2 panics per hour.
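As an aside, the guess above — a clobbered frame on return from __do_softirq, so that leaveq restores a bogus RSP and retq jumps into the weeds — can be illustrated with a toy model. This is only stack bookkeeping written in Python, deliberately simplified; it is not real x86-64 semantics:

```python
# Toy model: leaveq does mov %rbp,%rsp; pop %rbp -- retq then pops the "return
# address" from wherever RSP now points. memory maps addresses to 8-byte values.

def leaveq_retq(rbp, memory):
    rsp = rbp                 # leaveq: mov %rbp,%rsp
    new_rbp = memory[rsp]     # leaveq: pop %rbp
    rsp += 8
    new_rip = memory[rsp]     # retq: pop into %rip
    return new_rbp, new_rip

# If %rbp came back corrupted, pointing at stack garbage whose second slot
# happens to hold 0x46, execution "returns" to 0x46 -- matching the oops RIP.
garbage = {0x8043dfa0: 0x0, 0x8043dfa8: 0x46}
assert leaveq_retq(0x8043dfa0, garbage) == (0x0, 0x46)
```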
So it's not really "the problematic VM"; in fact the results I'm feeding you are typically from separate runs, since I need to continue with my testing despite the problem.

Another one, with a slightly different $pc:

    (qemu) info cpus
    * CPU #0: pc=0xffffffff8000d079 thread_id=30713
    (qemu) x/20i $pc-10
    0xffffffff8000d069:  jmpq   0xffffffff800c9512
    0xffffffff8000d06e:  retq
    0xffffffff8000d06f:  rdtsc
    0xffffffff8000d071:  mov    %eax,%ecx
    0xffffffff8000d073:  repz nop
    0xffffffff8000d075:  rdtsc
    0xffffffff8000d077:  sub    %ecx,%eax
    0xffffffff8000d079:  cmp    %rdi,%rax
    0xffffffff8000d07c:  jb     0xffffffff8000d073
    0xffffffff8000d07e:  retq
    0xffffffff8000d07f:  push   %r12
    0xffffffff8000d081:  cmpl   $0x0,4071336(%rip)        # 0xffffffff803ef030
    0xffffffff8000d088:  mov    %rsi,%r12
    0xffffffff8000d08b:  push   %rbp
    0xffffffff8000d08c:  mov    %rdi,%rbp
    0xffffffff8000d08f:  push   %rbx
    0xffffffff8000d090:  je     0xffffffff8000d0e5
    0xffffffff8000d092:  lea    0x8(%rdi),%rdi
    0xffffffff8000d096:  callq  0xffffffff80065a55
    0xffffffff8000d09b:  mov    0x28(%rbp),%rbx

Both of them are at the same function, and it appears to be __delay(). Were those VMs stuck with the oops message you posted before when you retrieved this output? Did they get stuck during boot (it looks like __delay() is called only during boot)?

(In reply to comment #21)
> Were those VMs stuck with the oops message you posted before when you
> retrieved this output?

Yes

> Did they get stuck during boot (it looks like __delay() is called only
> during boot)?

They panicked some time after boot finished, including providing the login prompt on the console. Note that I've switched to RHEL 5.4 snapshot 2, and comments 19 and 20 are using kernel 2.6.18-157.el5. I don't know if the kernel has changed sufficiently to affect your analysis.

Can you try with an ide interface, just to rule virtio out?

I changed the configuration and I'm starting 248 guests now, will let you know. That's 31 tiles of 8 guests.
I'm using one tile presently to continue my work.

    cd /etc/libvirt/qemu
    sed -i "/bus='virtio'/{s/vd/hd/;s/virtio/ide/};/model type='virtio'/d" kvm-tile{2..32}-*

Just to make sure, are the guests rhel5.4 or rhel5.3?

Host and guests are RHEL 5.4 snapshot 2 presently. They were all snapshot 1 when I first filed the report.

Created attachment 354079 [details]
kvm-tile17-webclient1-console.log (NOT virtio)
Happened this evening to a non-virtio guest. Console log attached, and here's the guest xml:
$ virsh dumpxml kvm-tile17-webclient1
<domain type='kvm' id='136'>
  <name>kvm-tile17-webclient1</name>
  <uuid>00624e31-a378-7ef7-65c9-90aade99a5da</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <source dev='/dev/msa1/kvm-tile17-webclient1-root'/>
      <target dev='hda' bus='ide'/>
    </disk>
    <disk type='block' device='disk'>
      <source dev='/dev/msa1/kvm-tile17-webclient1-usr'/>
      <target dev='hdb' bus='ide'/>
    </disk>
    <disk type='block' device='disk'>
      <source dev='/dev/msa1/kvm-tile17-webclient1-swap'/>
      <target dev='hdc' bus='ide'/>
    </disk>
    <disk type='block' device='disk'>
      <source dev='/dev/msa1/kvm-tile17-webclient1-data'/>
      <target dev='hdd' bus='ide'/>
    </disk>
    <interface type='bridge'>
      <mac address='00:01:01:11:06:71'/>
      <source bridge='br0'/>
      <target dev='vnet135'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/407'/>
      <target port='0'/>
    </serial>
    <console type='pty' tty='/dev/pts/407'>
      <source path='/dev/pts/407'/>
      <target port='0'/>
    </console>
  </devices>
</domain>
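For clarity, the sed one-liner used earlier to convert guests from virtio to IDE can be mirrored in Python (a sketch; note that sed's s/// replaces only the first match per line while str.replace is global, which is equivalent for these single-occurrence stanzas):

```python
import xml.etree.ElementTree as ET

def virtio_to_ide(line):
    """Mirror of: sed "/bus='virtio'/{s/vd/hd/;s/virtio/ide/};/model type='virtio'/d" """
    if "model type='virtio'" in line:
        return None                                  # delete virtio NIC model lines
    if "bus='virtio'" in line:
        line = line.replace("vd", "hd").replace("virtio", "ide")
    return line

assert virtio_to_ide("<target dev='vda' bus='virtio'/>") == "<target dev='hda' bus='ide'/>"
assert virtio_to_ide("<model type='virtio'/>") is None

# Sanity check against a trimmed fragment of the dumpxml output above:
xml = """<devices>
  <disk type='block' device='disk'><target dev='hda' bus='ide'/></disk>
  <disk type='block' device='disk'><target dev='hdb' bus='ide'/></disk>
</devices>"""
assert {t.get('bus') for t in ET.fromstring(xml).iter('target')} == {'ide'}
```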
Aron,

There have been a number of fixes to AMD's interrupt injection code upstream, some of which have not been backported to the RHEL codebase. I can't say for sure, but perhaps one of them has influence on the issue in question. Since Gleb is not working today, perhaps it would be helpful if you can try to reproduce the issue with a 2.6.30 kernel installed on the host.

Can you please run this on the qemu monitor after a failure:

    x/6x 0xffffffff803bfe60

Can you please try to run with kvm-88?

(In reply to comment #29)
> Can you please run this on the qemu monitor after a failure:
> x/6x 0xffffffff803bfe60

I will do this the next time I see a failure.

(In reply to comment #28)
> perhaps it would be helpful if you can try to
> reproduce the issue with a 2.6.30 kernel installed on the host.

(In reply to comment #30)
> Can you please try to run with kvm-88?

Regarding both of these requests, unfortunately I don't have the time to test newer bits because my time and this machine are consumed by a project related to RHEL 5.4. However, if you would like to backport patches to the RHEL 5.4 base and provide me with drop-in rpms to test, I'm willing to do that.

I'm not sure about this, but I think it should be possible to reproduce this problem on a smaller AMD box. I think the reason I'm hitting it is the increased chance from running lots of guests. If you can construct a setup inside RH with a couple hundred idle guests, you might hit it too.

(In reply to comment #31)
> However, if you would like to backport patches to the RHEL 5.4
> base and provide me with drop-in rpms to test, I'm willing to do that.

Would a tar.gz with modules and a qemu-kvm binary be good enough?

> If you can construct a setup inside RH with a couple hundred idle guests,
> you might hit it too.

I am going to do that. I don't have such a huge machine here though, and I may not have enough memory for a hundred guests. BTW do you run KSM?

(In reply to comment #32)
> Would a tar.gz with modules and a qemu-kvm binary be good enough?

Sure, but I would appreciate it if you've tested them on RHEL 5.4 before I try.

> BTW do you run KSM?

Not intentionally. Is that available on RHEL 5.4?

Aron,

We will attempt to reproduce the problem internally. If you get some available time on the machine, please see if you can reproduce with the kvm_amd.ko npt=0 module parameter (you should see "Nested Paging Disabled" in dmesg).

Aron, I've sent you kvm modules backported to 2.6.18-157.el5 from the latest kvm git. Can you give them a try please?

Marcelo, Gleb,

I was able to reproduce the problem on a second machine. This is a ProLiant BL465c G5 with 8 Barcelona cores and 32G RAM. The host is RHEL 5.4 snapshot 2 and I'm running 128 guests, also RHEL 5.4 snapshot 2. The guests are idle other than minor daemon activity.

I've only seen the problem on this machine (aka barcelona) once, so I'm leaving it again overnight with the base configuration to get a better handle on the probability of seeing the panic.
On the DL785 G5 (aka octagon) with 32 Barcelona cores and 256G RAM, I typically run 256 guests and I usually see between 2 and 5 guests panic overnight. So on octagon I loaded kvm_amd with NPT disabled for tonight's run. Depending on how things go tonight and what you advise next, I'll plan to try the backported kvm modules tomorrow.

Thanks,
Aron

Aron, I was able to reproduce the problem on my (much smaller) machine. It happens rarely, though. I am looking into it. Meanwhile please try the backported modules; it may help to narrow the problem.

(In reply to comment #36)
> I was able to reproduce the problem on a second machine. This is a ProLiant
> BL465c G5 with 8 Barcelona cores and 32G RAM. The host is RHEL 5.4 snapshot 2
> and I'm running 128 guests, also RHEL 5.4 snapshot 2. The guests are idle
> other than minor daemon activity.
>
> I've only seen the problem on this machine (aka barcelona) once, so I'm
> leaving it again overnight with the base configuration to get a better handle
> on the probability of seeing the panic.

Unfortunately it only happened once in a 24-hour period, so it's hard to consider this a useful data point. Still, I'll run the new kvm modules on barcelona over the weekend to see what happens.

> On the DL785 G5 (aka octagon) with 32 Barcelona cores and 256G RAM, I
> typically run 256 guests and I usually see between 2 and 5 guests panic
> overnight. So on octagon I loaded kvm_amd with NPT disabled for tonight's run.

I didn't see any problems running with npt=0 for about 24 hours. I'm switching octagon now to running with the new kvm modules, same as on barcelona.

Why is the Priority on this bug marked Low? Shouldn't this be a RHEL 5.4 kitstopper?

Gleb, note that the ksm module no longer loads now that I've updated to the kvm modules you sent me.

(In reply to comment #39)
> Gleb, note that the ksm module no longer loads now that I've updated to the
> kvm modules you sent me.

This is OK. You haven't used KSM anyway.
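As an aside, the overnight panic counts mentioned above can be tallied from the serial console logs captured by the screen setup earlier in this report. A sketch; the marker string is an assumption based on the panic text quoted earlier, and the guest names are illustrative:

```python
# Sketch: count panicked guests by scanning captured console logs (the screen -L
# setup writes /var/log/libvirt/qemu/<guest>-console.log for each guest).

PANIC_MARKER = "Kernel panic - not syncing"

def panicked_guests(logs):
    """logs: {guest_name: console log text}; returns guests whose log shows a panic."""
    return sorted(name for name, text in logs.items() if PANIC_MARKER in text)

# Illustrative log contents, not real captures.
logs = {
    "kvm-tile15-idle1": "login:\nKernel panic - not syncing: Fatal exception\n",
    "kvm-tile1-aim1": "login:\n",
}
assert panicked_guests(logs) == ["kvm-tile15-idle1"]
```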
I ran both machines with the new kvm modules over the weekend. No problems.

Another update: I've successfully run 256 VMs on a DL785 for a few days with the updated modules, with no problems.

Aron, thanks for the update. I am working on this bug, but things go slowly since it is very rarely reproducible for me.

Aron, can you try the kmod-kvm-83-104.el5.x86_64.rpm I've sent you by email? It has a TLB-related bug fix for AMD.

Hi Gleb,

I'm now running kmod-kvm-83-104.el5 on RHEL 5.4 snapshot 5 with a total of 416 idle VMs on barcelona and octagon. I will be adding a couple more machines to the mix with another couple hundred VMs, so we should know in about 24 hours if the modules do the trick. Thanks!

Aron, any news (good or bad) on this?

Gleb, I ran kmod-kvm-83-104.el5 on RHEL 5.4 snapshot 5 with over 600 guests on four machines for 36 hours with no problems. I think this patch solves the bug! (clearing NEEDINFO state)

*** This bug has been marked as a duplicate of bug 513394 ***
Created attachment 351530 [details]
kvm-tile15-idle1.qemu-kvm-spinning.strace

Description of problem:

One of my (supposedly idle) RHEL 5.4 snapshot 1 KVM guests is hung. Running top on the host shows that the associated qemu-kvm process is spinning.

    $ ps -Fwwp 17004
    UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
    root     17004     1 45 185378 270324 24 17:03 ?       00:17:10 /usr/libexec/qemu-kvm -S -M pc -m 512 -smp 1 -name kvm-tile15-idle1 -uuid cdf00302-4df4-2717-2b6e-8bcad1c0d99c -nographic -monitor pty -pidfile /var/run/libvirt/qemu//kvm-tile15-idle1.pid -boot c -drive file=/dev/msa15/kvm-tile15-idle1-root,if=virtio,index=0,boot=on -drive file=/dev/msa15/kvm-tile15-idle1-usr,if=virtio,index=1 -net nic,macaddr=00:01:01:11:08:0f,vlan=0,model=virtio -net tap,fd=243,script=,vlan=0,ifname=vnet113 -serial pty -parallel none -usb

    $ cat /var/log/libvirt/qemu/kvm-tile15-idle1.log
    LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/ /usr/libexec/qemu-kvm -S -M pc -m 512 -smp 1 -name kvm-tile15-idle1 -uuid cdf00302-4df4-2717-2b6e-8bcad1c0d99c -nographic -monitor pty -pidfile /var/run/libvirt/qemu//kvm-tile15-idle1.pid -boot c -drive file=/dev/msa15/kvm-tile15-idle1-root,if=virtio,index=0,boot=on -drive file=/dev/msa15/kvm-tile15-idle1-usr,if=virtio,index=1 -net nic,macaddr=00:01:01:11:08:0f,vlan=0,model=virtio -net tap,fd=243,script=,vlan=0,ifname=vnet113 -serial pty -parallel none -usb
    char device redirected to /dev/pts/229
    char device redirected to /dev/pts/230

    $ ping kvm-tile15-idle1
    PING kvm-tile15-idle1.nashua (10.202.8.15) 56(84) bytes of data.
    From octagon.nashua (10.202.2.120) icmp_seq=2 Destination Host Unreachable
    From octagon.nashua (10.202.2.120) icmp_seq=3 Destination Host Unreachable
    From octagon.nashua (10.202.2.120) icmp_seq=4 Destination Host Unreachable

I'll also attach an strace capture of the qemu-kvm process.
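The ping check above generalizes to a sweep over all guests to find the hung ones. A small sketch of the bookkeeping; actually invoking ping is omitted so the logic stays testable, a nonzero exit status is treated as unreachable, and the hostnames are illustrative:

```python
# Sketch: given ping exit statuses for each guest (0 = answered), list the
# unreachable ones -- with 256 guests, the hung VMs stand out as the minority.

def unreachable(ping_results):
    """ping_results: {hostname: exit_status}; returns hosts that did not answer."""
    return sorted(host for host, status in ping_results.items() if status != 0)

results = {"kvm-tile15-idle1": 1, "kvm-tile15-idle2": 0, "kvm-tile16-idle1": 0}
assert unreachable(results) == ["kvm-tile15-idle1"]
```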
Version-Release number of selected component (if applicable):

RHEL 5.4 snapshot 1
kernel-2.6.18-156.el5.x86_64
kmod-kvm-83-82.el5.x86_64
kvm-83-82.el5.x86_64
kvm-tools-83-82.el5.x86_64
kvm-qemu-img-83-82.el5.x86_64
python-virtinst-0.400.3-4.el5.noarch
libvirt-0.6.3-13.el5.x86_64
virt-manager-0.6.1-5.el5.x86_64
virt-viewer-0.0.2-3.el5.x86_64
libvirt-python-0.6.3-13.el5.x86_64

How reproducible:
unknown

Additional info:
This is on an HP DL785 with 32 cores, 256G RAM, and lots of storage. I'm running 256 guests simultaneously, started in sequence, all idle after boot. The bad one is in the middle of the pack. All the rest of the guests are running normally.