Bug 591231
Summary: [CRM 2012430] - Running KVM guests crashes host

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Issue Tracker <tao> |
| Component: | kvm | Assignee: | Marcelo Tosatti <mtosatti> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.5 | CC: | amit.shah, herrold, llim, sgrinber, tao, virt-maint, ykaul |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-08-04 01:58:26 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 580949 | | |
Description
Issue Tracker
2010-05-11 17:29:47 UTC
Event posted on 04-27-2010 04:41am EDT by nbansal

Description of problem:
The customer has a number of KVM installations on various non-RHEL stacks, and they have been stable. Recently they have been testing the KVM included in RHEL (both 5.5 and 5.4), and it has been very unstable from their point of view. Most recently they ran 24 Ubuntu 9.10 VMs (default install, idling only), and the host panicked after about 16 hours. The invocation lines were like so:

```
/usr/libexec/qemu-kvm -drive file=/var/lib/kvm/ubuntu-9.10.img,if=virtio,boot=on -m 512 -net nic,macaddr=52:54:00:22:45:02,model=e1000 -net tap,downscript=no -monitor stdio -usbdevice tablet -snapshot -vnc :0
/usr/libexec/qemu-kvm -drive file=/var/lib/kvm/ubuntu-9.10.img,if=virtio,boot=on -m 512 -net nic,macaddr=52:54:00:22:45:03,model=e1000 -net tap,downscript=no -monitor stdio -usbdevice tablet -snapshot -vnc :1
# 22 more in the same pattern
```

Panic message:

```
NMI Watchdog detected LOCKUP on CPU 3
CPU 3
Modules linked in: iptable_filter ip_tables tun autofs4 lockd sunrpc bridge ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm ide_cd i2c_i801 snd_timer snd_page_alloc i2c_core shpchp snd_hwdep cdrom snd sg skge soundcore serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache raid10 ahci libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3973, comm: qemu-kvm Tainted: G 2.6.18-194.el5 #1
RIP: 0010:[<ffffffff883c183d>]  [<ffffffff883c183d>] :kvm_intel:vmx_vcpu_run+0x365/0x4c1
RSP: 0018:ffff81021c1e5d98  EFLAGS: 00000046
RAX: 0000000000000200 RBX: 00000000002c1ff4 RCX: 0000000000000000
RDX: 0000000080000202 RSI: 000000000c031970 RDI: ffff81021baf81c0
RBP: ffff81021baf81c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000041a25940(0000) GS:ffff810107be7640(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 000000000c0359c4 CR3: 000000022014d000 CR4: 00000000000026e0
Process qemu-kvm (pid: 3973, threadinfo ffff81021c1e4000, task ffff81021df80080)
Stack:  ffff81021cf0e4c0 ffffffff8839ab7a 0000000000000096 00000000ffffffff
 ffff81021baf81c0 ffff81021be90000 0000000000000000 000000000000000f
 0000000000001000 ffffffff8838cb16 fffffffe7ffbfeff ffff810200000001
Call Trace:
 [<ffffffff8839ab7a>] :kvm:kvm_apic_has_interrupt+0x45/0x6a
 [<ffffffff8838cb16>] :kvm:kvm_arch_vcpu_ioctl_run+0x393/0x61e
 [<ffffffff883880ec>] :kvm:kvm_vcpu_ioctl+0xf2/0x448
 [<ffffffff80042489>] do_ioctl+0x21/0x6b
 [<ffffffff800304d6>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800b878c>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8004cbb7>] sys_ioctl+0x59/0x78
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Code: ba 04 44 00 00 0f 78 d0 f6 05 20 1b 01 00 20 74 1a f6 c4 10
```

Backtrace from the vmcore:

```
crash> bt
PID: 3973   TASK: ffff81021df80080  CPU: 3   COMMAND: "qemu-kvm"
 #0 [ffff81022fc3bdc0] crash_kexec at ffffffff800aeb6b
 #1 [ffff81022fc3be80] die_nmi at ffffffff800662a5
 #2 [ffff81022fc3bea0] nmi_watchdog_tick at ffffffff80066a0b
 #3 [ffff81022fc3bef0] default_do_nmi at ffffffff80066629
 #4 [ffff81022fc3bf40] do_nmi at ffffffff80066896
 #5 [ffff81022fc3bf50] nmi at ffffffff80065eef
    [exception RIP: vmx_vcpu_run+869]
    RIP: ffffffff883c183d  RSP: ffff81021c1e5d98  RFLAGS: 00000046
    RAX: 0000000000000200  RBX: 00000000002c1ff4  RCX: 0000000000000000
    RDX: 0000000080000202  RSI: 000000000c031970  RDI: ffff81021baf81c0
    RBP: ffff81021baf81c0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff81021c1e5d98] vmx_vcpu_run at ffffffff883c183d
 #7 [ffff81021c1e5de0] kvm_arch_vcpu_ioctl_run at ffffffff8838cb16
 #8 [ffff81021c1e5e20] kvm_vcpu_ioctl at ffffffff883880ec
 #9 [ffff81021c1e5eb0] do_ioctl at ffffffff80042489
#10 [ffff81021c1e5ed0] vfs_ioctl at ffffffff800304d6
#11 [ffff81021c1e5f40] sys_ioctl at ffffffff8004cbb7
#12 [ffff81021c1e5f80] tracesys at ffffffff8005e28d (via system_call)
    RIP: 00000031486cc647  RSP: 0000000041a24f58  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005e28d  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000ae80  RDI: 000000000000000f
    RBP: 0000000000000000   R8: 00000000004feb90   R9: 0000000000000f85
    R10: 00002aaaaaedd450  R11: 0000000000000246  R12: 0000000000000001
    R13: 0000000000000001  R14: 00000000176b8500  R15: 0000000000000000
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
```

Another crash occurred while running 16 Ubuntu guests; the host panicked in 13 hours. The vmcore has a partial panic message:

```
NMI Watchdog detected LOCKUP on CPU 3
CPU 3
Modules linked in: tun autofs4 lockd sunrpc bridge ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd serio_raw skge ide_cd cdrom i2c_i801 i2c_core shpchp sg soundcore pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache raid10 ahci libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 23522, comm: qemu-kvm Tainted: G 2.6.18-194.el5 #1
RIP: 0033:[<00000000005222e7>]  [<00000000005222e7>]
RSP: 002b:00000000416e6f20  EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000008edc500 RCX: ffffffffffffffff
RDX: 0000000008e55740 RSI: 0000000000000000 RDI: 0000000008e55740
RBP: 0000000000000000 R08: 0000000000b41da0 R09: 0000000000005be2
R10: 00002aaaaaeddca8 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000001000
FS:  00000000416e7940(0063) GS:ffff810107be7640(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000b2e9b000 CR3: 00000001df698000 CR4: 00000000000026e0
Process qemu-kvm (pid: 23522, threadinfo ffff8101ddae0000, task ffff81021e68b100)
```

There are a few messages like these in the vmcore log prior to the panic:

```
INFO: task md2_resync:9057 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md2_resync    D ffff8101e5b2fe30     0  9057     89          9059  2961 (L-TLB)
 ffff8101e5b2fd70 0000000000000046 0000000000000000 ffff81022e8e940c
 ffff81022e99c00c 000000000000000a ffff81022fbd10c0 ffff81022fbf60c0
 00009200296ab830 0000000000008790 ffff81022fbd12a8 000000018008c597
Call Trace:
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021bacf>] md_do_sync+0x1d8/0x833
 [<ffffffff8008db07>] enqueue_task+0x41/0x56
 [<ffffffff8008db72>] __activate_task+0x56/0x6d
 [<ffffffff8008d957>] dequeue_task+0x18/0x37
 [<ffffffff80063ff8>] thread_return+0x62/0xfe
 [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021c4a3>] md_thread+0xf8/0x10e
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8021c3ab>] md_thread+0x0/0x10e
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11
```

Both crashes occurred on CPU 3.

How reproducible:
Every 13-16 hours.

Steps to Reproduce (as per customer):
Run 24 instances of Ubuntu. I am having difficulty reproducing it.
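The 24 invocation lines quoted above differ only in the MAC-address suffix and the VNC display number. As an editorial sketch (not part of the original report), a small shell loop can regenerate all 24 command lines; guest i gets MAC suffix 02+i in hex and VNC display :i:

```shell
#!/bin/sh
# Sketch: print the 24 qemu-kvm command lines from the report.
# Only the MAC suffix and VNC display vary between guests.
# Commands are printed rather than executed, since launching them
# requires the guest image and a KVM-capable host.
i=0
while [ "$i" -lt 24 ]; do
    mac=$(printf '52:54:00:22:45:%02x' $((2 + i)))
    echo "/usr/libexec/qemu-kvm -drive file=/var/lib/kvm/ubuntu-9.10.img,if=virtio,boot=on -m 512 -net nic,macaddr=$mac,model=e1000 -net tap,downscript=no -monitor stdio -usbdevice tablet -snapshot -vnc :$i"
    i=$((i + 1))
done
```

Note that the report's lines all pass `-monitor stdio`; for a scripted launch one would normally pick a non-interactive monitor, but the flags are kept verbatim here.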
I tried with some 8-9 RHEL 5.4 guests, but the host continues to run fine.

Actual results:
Host panics.

Expected results:
Host should not panic.

Additional info:
Attaching sosreport. vmcore details:

```
megatron.gsslab.rdu.redhat.com
Login with kerberos name/password
$ cd /cores/20100419053127/work
/cores/20100419053127/work$ ./crash

megatron.gsslab.rdu.redhat.com
Login with kerberos name/password
$ cd /cores/20100427005455/work
/cores/20100427005455/work$ ./crash
```

This event sent from IssueTracker by sgrinber [RHEV Specialty Group] issue 814313

Event posted on 04-27-2010 04:41am EDT by nbansal

File uploaded: 30-sosreport-JWong.2012430-65205-793d0b.tar.bz2

This event sent from IssueTracker by sgrinber [RHEV Specialty Group] issue 814313
it_file 606423

Event posted on 05-04-2010 09:55am EDT by sgrinber

Assigning this to rallton so we have an owner for follow-up. However, the guest OS is not officially supported on KVM. Please refer the customer to the following page:
http://www.redhat.com/rhel/server/advanced/virt.html

If your customer is using the open-source kvm, you can refer them to:
http://www.linux-kvm.org/page/Guest_Support_Status#UNIX_Family:_Solaris.2FOpenSolaris
However, this is only supported by the community. Nevertheless, devel might be interested in this info and may want to investigate; I'll try to grab somebody's attention.

Regards,
Simon.

rallton assigned to issue for RHEV Specialty Group.

This event sent from IssueTracker by sgrinber [RHEV Specialty Group] issue 814313

Event posted on 05-10-2010 02:06am EDT by nbansal

I have informed the customer about guest OS support, and from the sosreport the customer appears to be using our kvm:

```
kmod-kvm-83-164.el5.x86_64
kvm-83-164.el5.x86_64
```

Thanks,

This event sent from IssueTracker by sgrinber [RHEV Specialty Group] issue 814313

To rule out guest issues, can you let us know if this can be reproduced on RHEL guests?
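For anyone picking up the vmcores listed above, the `./crash` wrapper opens them in the crash utility. A typical session looks roughly like the following (the vmlinux debuginfo path is illustrative and must match the crashed kernel, 2.6.18-194.el5 here):

```
$ crash /usr/lib/debug/lib/modules/2.6.18-194.el5/vmlinux vmcore
crash> bt        # backtrace of the panicked task
crash> log       # kernel ring buffer, including the partial panic message
crash> mod -t    # list tainted modules (ksm, kvm, kvm_intel show as (U) here)
crash> sys       # one-line system summary: kernel version, panic string, CPUs
```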
Event posted on 05-18-2010 04:01am EDT by sgrinber

Hi Amit,

If you look at the customer's original message: "I tried with some 8-9 rhel5.4 guests but host continues to run fine."

So no, it is not reproduced with RHEL 5.4, though with Ubuntu they used 24 guests.

Regards,
Simon

Internal Status set to 'Waiting on Engineering'
This event sent from IssueTracker by sgrinber issue 814313

To isolate whether this is a guest kernel issue or an issue with running lots of guests at a time, can they try reproducing with 24 (or a similar number of) RHEL 5 guests?

The NMI seems to be spurious, since KVM appears to be running normally. Please disable the NMI watchdog with the "nmi_watchdog=0" kernel option. Also pass the "processor.max_cstate=1" option to keep the TSC from stopping on ACPI C-states. It might also be useful to test RHEL on a different host machine.

Event posted on 06-11-2010 05:22am EDT by nbansal

The older machine (selenium.polldev.com) hasn't crashed yet with "nmi_watchdog=0 processor.max_cstate=1"; however, it is not working properly. crond is running:

```
[root@selenium ~]# /etc/init.d/crond status
crond (pid 3491) is running...
```
But cron tasks abruptly stopped running on June 3rd:

```
[root@selenium ~]# tail /var/log/cron
Jun  3 17:42:01 selenium crond[7207]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:44:01 selenium crond[7233]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:46:01 selenium crond[7255]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:48:01 selenium crond[7283]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:50:01 selenium crond[7310]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:52:01 selenium crond[7337]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:54:01 selenium crond[7359]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:56:01 selenium crond[7381]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 17:58:01 selenium crond[7405]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jun  3 18:00:01 selenium crond[7430]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
```

KSM is also enabled:

```
[root@selenium ~]# ksmctl info
flags 1, pages_to_scan 1024, sleep_time 200
```

But it also has not actually done anything (consumed CPU time) since June 3rd or so. I also had a top instance open at the time, and it abruptly stopped updating around the same time as well.

The customer also took a new machine and is running RHEL guests on it. The newer machine appears to be running the same workload without issue on a default install.

Internal Status set to 'Waiting on SEG'
This event sent from IssueTracker by sgrinber issue 814313

Event posted on 06-11-2010 05:27am EDT by nbansal

Please let me know if any specific information is needed from the new machine, i.e. hardware details, sosreport, etc. Could this cron and top issue be unrelated? Please let me know if there is anything needed from this machine.

Thanks,

This event sent from IssueTracker by sgrinber issue 814313

Event posted on 06-18-2010 03:09am EDT by nbansal

File uploaded: messages

This event sent from IssueTracker by sgrinber issue 814313
it_file 783013

Event posted on 06-18-2010 03:09am EDT by nbansal

/var/log/messages of the old machine (selenium) is also filling up with the following kind of messages:

```
May 30 04:24:18 selenium kernel: INFO: task md2_resync:13075 blocked for more than 120 seconds.
May 30 04:24:18 selenium kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 30 04:24:18 selenium kernel: md2_resync    D ffff81000901d7a0     0 13075     89         13077    31    32 (L-TLB)
May 30 04:24:18 selenium kernel:  ffff81021683fd70 0000000000000046 0000000000000000 ffff81022f9dd40c
May 30 04:24:18 selenium kernel:  ffff81022f9dd00c 000000000000000a ffff81022d7d6860 ffff810107bf4100
May 30 04:24:18 selenium kernel:  00007908b6f4232b 000000000000399c ffff81022d7d6a48 000000038008c597
May 30 04:24:18 selenium kernel: Call Trace:
May 30 04:24:18 selenium kernel:  [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
May 30 04:24:18 selenium kernel:  [<ffffffff8021bacf>] md_do_sync+0x1d8/0x833
May 30 04:24:18 selenium kernel:  [<ffffffff80063ff8>] thread_return+0x62/0xfe
May 30 04:24:18 selenium kernel:  [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
May 30 04:24:18 selenium kernel:  [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
May 30 04:24:18 selenium kernel:  [<ffffffff8021c4a3>] md_thread+0xf8/0x10e
May 30 04:24:19 selenium kernel:  [<ffffffff8021c3ab>] md_thread+0x0/0x10e
May 30 04:24:19 selenium kernel:  [<ffffffff80032bdc>] kthread+0xfe/0x132
May 30 04:24:19 selenium kernel:  [<ffffffff8005efb1>] child_rip+0xa/0x11
May 30 04:24:19 selenium kernel:  [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
May 30 04:24:19 selenium kernel:  [<ffffffff80032ade>] kthread+0x0/0x132
May 30 04:24:19 selenium kernel:  [<ffffffff8005efa7>] child_rip+0x0/0x11
```

Attaching /var/log/messages.

Thanks

Internal Status set to 'Waiting on SEG'
This event sent from IssueTracker by sgrinber issue 814313
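Editorial note: the repeated md2_resync hung-task warnings above suggest the host's software-RAID resync had stalled while the guests were running. On a RHEL 5 host, that would typically be checked with something like the following session (array name md2 taken from the trace; output omitted):

```
[root@selenium ~]# cat /proc/mdstat
[root@selenium ~]# mdadm --detail /dev/md2
```

As the kernel message itself notes, `echo 0 > /proc/sys/kernel/hung_task_timeout_secs` only silences the warning; it does not resolve the stalled resync.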