Bug 557044

Summary: Kdump causes guest hang and qemu takes up 100% CPU
Product: Red Hat Enterprise Linux 5 Reporter: Yolkfull Chow <yzhou>
Component: kvm Assignee: Zachary Amsden <zamsden>
Status: CLOSED WORKSFORME QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 5.5 CC: christian, llim, ndai, tburke, virt-maint, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-04-16 02:53:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580948    

Description Yolkfull Chow 2010-01-20 08:56:25 UTC
Description of problem:
Setting up kdump in the guest and triggering a crash through the /proc/sysrq-trigger interface causes the guest to hang, and the qemu process takes up nearly 100% CPU.

`strace' output of the qemu process:

...
select(19, [4 6 8 11 12 14 16 18], [], [], {0, 999000}) = 1 (in [16], left {0, 995000})
read(16, "\16\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4079b0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x3a4960e4c0}, 8) = 0
write(5, "\0", 1)                       = 1
read(16, 0x7fff4e7abe50, 128)           = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {1129889, 697487611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 697542611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 697597611}) = 0
select(19, [4 6 8 11 12 14 16 18], [], [], {1, 0}) = 1 (in [4], left {1, 0})
read(4, "\0", 512)                      = 1
read(4, 0x7fff4e7abcd0, 512)            = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {1129889, 697856611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 697935611}) = 0
timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 3686000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 698134611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 698207611}) = 0
select(19, [4 6 8 11 12 14 16 18], [], [], {1, 0}) = 1 (in [16], left {0, 996000})
read(16, "\16\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4079b0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x3a4960e4c0}, 8) = 0
write(5, "\0", 1)                       = 1
read(16, 0x7fff4e7abe50, 128)           = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {1129889, 702510611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 702569611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 702623611}) = 0
select(19, [4 6 8 11 12 14 16 18], [], [], {1, 0}) = 1 (in [4], left {1, 0})
read(4, "\0", 512)                      = 1
read(4, 0x7fff4e7abcd0, 512)            = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {1129889, 702883611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 702956611}) = 0
timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 3686000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 703135611}) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 703189611}) = 0
...
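
The excerpt above can be summarized mechanically. The following is a minimal Python 3 sketch, not part of the original report, that counts syscalls and measures the time spanned by the CLOCK_MONOTONIC readings. `TRACE` holds a shortened copy of lines from the excerpt; the helper names `syscall_counts` and `monotonic_ns` are made up for illustration. It only describes the traced thread and does not identify which qemu thread is consuming the CPU.

```python
import re
from collections import Counter

# Shortened copy of lines from the strace excerpt above (raw string keeps "\0" literal).
TRACE = r"""
select(19, [4 6 8 11 12 14 16 18], [], [], {0, 999000}) = 1 (in [16], left {0, 995000})
write(5, "\0", 1)                       = 1
clock_gettime(CLOCK_MONOTONIC, {1129889, 697487611}) = 0
select(19, [4 6 8 11 12 14 16 18], [], [], {1, 0}) = 1 (in [4], left {1, 0})
read(4, "\0", 512)                      = 1
timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 3686000}}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {1129889, 698207611}) = 0
select(19, [4 6 8 11 12 14 16 18], [], [], {1, 0}) = 1 (in [16], left {0, 996000})
clock_gettime(CLOCK_MONOTONIC, {1129889, 702510611}) = 0
"""


def syscall_counts(trace):
    """Count how often each syscall name appears at the start of a line."""
    return Counter(re.findall(r"^(\w+)\(", trace, re.MULTILINE))


def monotonic_ns(trace):
    """Return every CLOCK_MONOTONIC reading in the trace, in nanoseconds."""
    pattern = r"clock_gettime\(CLOCK_MONOTONIC, \{(\d+), (\d+)\}\)"
    return [int(sec) * 10**9 + int(nsec) for sec, nsec in re.findall(pattern, trace)]


if __name__ == "__main__":
    counts = syscall_counts(TRACE)
    times = monotonic_ns(TRACE)
    span_ms = (times[-1] - times[0]) / 1e6
    print(dict(counts))
    print("clock readings span %.3f ms" % span_ms)
```

Running the same functions over a full `strace -f` capture would show per-syscall totals for every thread, which may help separate the I/O loop from the vcpu thread.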

Could this be related to timer IRQs? The following message appeared while the guest was booting:

WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.

Some information from guest:

# cat /proc/cmdline
ro root=LABEL=/ rhgb quiet crashkernel=128M@16M 3 console=tty0 console=ttyS0,115200
# dmesg |grep -i memory
Memory: 1925200k/2097088k available (2575k kernel code, 171436k reserved, 1298k data, 212k init)
Freeing initrd memory: 2565k freed
Total HugeTLB memory allocated, 0
Non-volatile memory driver v1.2
Freeing unused kernel memory: 212k freed
[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          1882        252       1630          0         14        167
-/+ buffers/cache:         70       1812
Swap:         2047          0       2047
[root@localhost ~]#

NOTE: the output of `free -m' shows that 128M of memory has been reserved for the capture kernel.
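
As a rough consistency check on the figures quoted above, the following Python 3 sketch (not part of the original report) parses the `crashkernel=` option from the guest command line and compares it with the "reserved" figure from dmesg. The helper name `parse_crashkernel` is made up for illustration, and it handles only the simple `SIZE[@OFFSET]` form used here. It assumes the crashkernel region is counted in the dmesg "reserved" number, which also covers other reservations, so this is a plausibility check rather than proof.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

# Values copied from the guest output above.
CMDLINE = "ro root=LABEL=/ rhgb quiet crashkernel=128M@16M 3 console=tty0 console=ttyS0,115200"
RESERVED_KB = 171436  # "171436k reserved" from the dmesg "Memory:" line


def parse_crashkernel(cmdline):
    """Return (size_bytes, offset_bytes) for crashkernel=SIZE[@OFFSET], or None if absent."""
    m = re.search(r"\bcrashkernel=(\d+)([KMG])(?:@(\d+)([KMG]))?", cmdline)
    if m is None:
        return None
    size = int(m.group(1)) * UNITS[m.group(2)]
    # The offset is optional; report None when it is not given.
    offset = int(m.group(3)) * UNITS[m.group(4)] if m.group(3) else None
    return size, offset


if __name__ == "__main__":
    size, offset = parse_crashkernel(CMDLINE)
    size_kb = size // 1024
    print("crashkernel size: %dk, offset: %s bytes" % (size_kb, offset))
    print("reserved minus crashkernel: %dk" % (RESERVED_KB - size_kb))
```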


Version-Release number of selected component (if applicable):
kvm-83-147.el5
kmod-kvm-83-147.el5
kvm-tools-83-147.el5
etherboot-zroms-kvm-5.4.4-13.el5
kvm-qemu-img-83-147.el5
kvm-debuginfo-83-147.el5

Guest kernel: 2.6.18-185.el5


How reproducible:
Always

Steps to Reproduce:
1. Boot the guest:
#/root/devel/features/sr-iov/client/tests/kvm/qemu -name vm1 -monitor tcp:0:6001,server,nowait -drive file=/root/devel/features/sr-iov/client/tests/kvm/images/RHEL-Server-5.4-64.qcow2,if=ide,boot=on -net nic,vlan=0,model=e1000,macaddr=00:AE:70:2A:9D:00 -net tap,vlan=0,ifname=e1000_0_6001,script=/root/devel/features/sr-iov/client/tests/kvm/scripts/qemu-ifup-switch,downscript=no -m 2048 -smp 1 -soundhw ac97 -usbdevice tablet -rtc-td-hack -no-hpet -cpu qemu64,+sse2 -no-kvm-pit-reinjection -vnc :0
2. Set up kdump in the guest and reboot.
3. Trigger a crash through /proc/sysrq-trigger: echo c > /proc/sysrq-trigger
  
Actual results:
The guest hangs before actually dumping, and the qemu process takes up nearly 100% of a host CPU.
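
One way to put a number on the CPU usage described above is to sample the qemu process's accumulated CPU time from /proc on the host. The following is a minimal Python 3 sketch, not part of the original report; the helper names `cpu_ticks`, `cpu_percent`, and `sample` are made up for illustration, and the qemu PID has to be supplied by the user. The figure covers all threads of the process, so it can exceed 100% on a multi-CPU host.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, seconds, hz):
    """Convert a tick delta over an interval into a CPU percentage."""
    return 100.0 * (ticks_after - ticks_before) / hz / seconds


def sample(pid, seconds=1.0):
    """Measure the CPU usage of process `pid` over `seconds` of wall time."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this host
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, seconds, hz)


if __name__ == "__main__":
    # Demonstrates the call on the current process; substitute the qemu PID.
    print("%.1f%% CPU" % sample(os.getpid(), 0.5))
```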

Expected results:
The system should boot into the capture kernel.

Additional info:

Comment 2 Zachary Amsden 2010-04-13 01:55:37 UTC
Based on the information in this bug report, it isn't clear whether a 32-bit or 64-bit guest kernel was tried.

It is clear, however, that this is a kexec bug and NOT a timer IRQ related issue.  I'll attempt to reproduce on 32/64 regardless.

Comment 3 Zachary Amsden 2010-04-15 00:12:49 UTC
Tested 32-bit: the guest appeared to boot into the crash kernel.  I've never actually done this before, so I'm not 100% sure what to expect, but the system rebooted again right after that.  Unfortunately, I can't log in because of an unrelated disk space issue, but the system did not get stuck.

Comment 4 Zachary Amsden 2010-04-15 00:14:43 UTC
Worth noting: my guest kernel is 2.6.18-164.el5, so it is possible this is some kind of regression.

Comment 5 Zachary Amsden 2010-04-15 01:41:30 UTC
Tested 64-bit: crash kernel is working just fine.

I'll try with RHEL-5.5 beta guest kernels to rule out a regression, but it looks like this bug might have already been squashed - could have been a reboot issue or something fixed in KVM since 83-147.

Comment 6 Zachary Amsden 2010-04-16 02:53:51 UTC
32-bit crash kernel works fine after upgrade to RHEL-5.5 beta (2.6.18-194.el5).  So does 64-bit.

So it's not a kernel regression, nor is the bug found in recent KVM.  Closing as unable to reproduce.