Bug 612436
Summary: | udevd report unexpected exit when guest boot up with nmi_watchdog = 1 and using debugfs tracing KVM (AMD) | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Joy Pu <ypu> |
Component: | kernel | Assignee: | Karen Noel <knoel> |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 6.0 | CC: | gleb, knoel, lihuang, tburke, yacui |
Target Milestone: | rc | ||
Target Release: | 6.1 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-2.6.32-121.el6 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-05-19 12:55:01 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 580954 |
Description
Joy Pu
2010-07-08 08:11:39 UTC
Avi, worth fixing?

Yes.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Not reproduced on a Nehalem. But I see this was reported on an AMD, will try again.

Reproduced on AMD. Most strange.

Same guest, on upstream host, works fine.

2.6.32.27 works. Older 2.6.32 kernels don't boot at all.

Perhaps we leak eflags.tf, which ends up in guest registers, which ends up sending a #DB exception, which ends up as a SIGTRAP (we did see a SIGTRAP killing a script).

Does not reproduce with kernel-2.6.32-94.el6.x86_64. Reason unknown. Joy, can you confirm?

Actually, it does reproduce.

Hm, with qemu-kvm upstream, it does not. With RHEL qemu-kvm, it does.

Patches posted upstream: http://www.spinics.net/lists/kvm/msg49588.html

Patch(es) available on kernel-2.6.32-121.el6

I have run the nmi_watchdog test on the host with kernel versions 2.6.32-120.el6.x86_64 and 2.6.32-121.el6.x86_64. On the 120 kernel, the test result is the same as above (no root device and kernel panic). On the 121 kernel, I met the following cases:

Guest: RHEL5.6-ide-e1000-raw
Kvm version:
qemu-kvm-debuginfo-0.12.1.2-2.150.el6.x86_64
qemu-kvm-0.12.1.2-2.150.el6.x86_64
qemu-kvm-tools-0.12.1.2-2.150.el6.x86_64

When the test is run with "-smp 1" (to rule out SMP as a cause), I tested it 100 times and the guest can boot up after adding nmi_watchdog=1 to the kernel line.
When the test is run with "-smp 4", the following failure may occur: the guest hangs during the startup process (reproduction rate: 7 out of 200).

Last few lines of serial output in this scenario:
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: UHCI Host Controller
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: new USB bus registered, assigned bus number 1
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: irq 11, io base 0x0000c020
2011-03-16 22:25:56: usb usb1: configuration #1 chosen from 1 choice
2011-03-16 22:25:56: hub 1-0:1.0: USB hub found
2011-03-16 22:25:56: hub 1-0:1.0: 2 ports detected
2011-03-16 22:25:56: input: ImExPS/2 Generic Explorer Mouse as /class/input/input1
2011-03-16 22:25:56: usb 1-1: new full speed USB device using uhci_hcd and address 2
2011-03-16 22:25:56: SCSI subsystem initialized
2011-03-16 22:25:56: usb 1-1: configuration #1 chosen from 1 choice
2011-03-16 22:25:56: input: QEMU 0.12.1 QEMU USB Tablet as /class/input/input2
2011-03-16 22:25:56: input: USB HID v0.01 Pointer [QEMU 0.12.1 QEMU USB Tablet] on usb-0000:00:01.2-1
2011-03-16 22:25:56: device-mapper: uevent: version 1.0.3
2011-03-16 22:25:56: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
2011-03-16 22:25:57: device-mapper: dm-raid45: initialized v0.2594l
2011-03-16 22:26:18: kjournald starting. Commit interval 5 seconds
2011-03-16 22:26:18: EXT3-fs: mounted filesystem with ordered data mode.
2011-03-16 22:26:18: type=1404 audit(1300285577.849:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
2011-03-16 22:26:18: type=1403 audit(1300285578.112:3): policy loaded auid=4294967295 ses=4294967295

Host CPU info:
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 67
model name : Dual-Core AMD Opteron(tm) Processor 1216
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2009.10
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

As our test results in comment 18 show, the guest still hangs when nmi_watchdog=1 is set and smp > 1 (7 out of 200). But the details are not the same. Is this a new problem, or still the same one?

Likely a different problem. Please file a new bug with as much information as possible.

If you get a message something like "NMI received for unknown reason" then this is a known problem (but I can't find the BZ). The only solution we can offer for rhel6 is to not use nmi_watchdog. I think we have a tech note about that.

(In reply to comment #21)
> If you get message something like "NMI received for unknown reason" then this
> is known problem (but I can't find the BZ). The only solution we can offer for
> rhel6 is to not use nmi_watchdog. I think we have tech note about that.

Have checked that. The "NMI received for unknown reason" message only happened in a RHEL 6.1 guest (Bug 688547 - RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown reason), so we didn't use it to verify this bug. The test results are based on a RHEL 5 guest.
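The two test runs reported above (0 hangs in 100 boots with "-smp 1", 7 hangs in 200 boots with "-smp 4") can be compared with a little arithmetic. The following is only a rough sketch, and it assumes every boot is an independent trial with the same hang probability, which the report does not establish. The variable names are illustrative and not taken from any test harness.

```python
# Counts taken from the test report in this bug.
failures_smp4, runs_smp4 = 7, 200
failures_smp1, runs_smp1 = 0, 100

# Observed hang rate with "-smp 4".
rate_smp4 = failures_smp4 / runs_smp4

# Assumption: boots are independent and share one hang probability.
# If "-smp 1" had the same hang rate as "-smp 4", this is the chance
# of seeing no hang at all in the 100 "-smp 1" boots.
p_clean_smp1 = (1 - rate_smp4) ** runs_smp1

print(f"observed -smp 4 hang rate: {rate_smp4:.1%}")
print(f"chance of 0 hangs in {runs_smp1} boots at that rate: {p_clean_smp1:.1%}")
```

Under those assumptions the clean "-smp 1" run would happen by luck only about 3% of the time, which suggests, without proving, that the hang depends on SMP.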
(In reply to comment #20)
> Likely a different problem. Please file a new bug with as much information as
> possible.

OK. We will file a new bug for this, and we are trying to capture the debugfs info and thread info from the hung guests. As it is hard to reproduce, we will update the info as soon as we get it. Thanks for your help.

(In reply to comment #22)
> (In reply to comment #21)
> > If you get message something like "NMI received for unknown reason" then this
> > is known problem (but I can't find the BZ). The only solution we can offer for
> > rhel6 is to not use nmi_watchdog. I think we have tech note about that.
>
> Have checked that, the "NMI received for unknown reason" only happened in RHEL
> 6.1 guest(Bug 688547 - RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown
> reason), so we didn't use it to verify this bug. And the test results are based
> in RHEL 5 guest

Bug 688547 is not related to the problem I am talking about. Fixing it will not fix the problem in KVM. It makes no difference which guest you are running. A RHEL5 guest with nmi_watchdog will have the same problem as a RHEL6 guest.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
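For anyone applying the workaround mentioned in the comments (not using nmi_watchdog in the guest), a quick way to see whether a guest was booted with the parameter is to inspect its kernel command line. The helper below is a minimal sketch: the function name and the sample command line are hypothetical, and returning the last occurrence is a choice of this sketch rather than a statement about how the kernel parses repeated parameters.

```python
def nmi_watchdog_setting(cmdline):
    """Return the value of the last nmi_watchdog= token, or None if absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            value = token.split("=", 1)[1]
    return value


# Hypothetical command line for illustration only; on a running Linux
# guest the real one can be read from /proc/cmdline.
sample = "ro root=/dev/sda1 quiet nmi_watchdog=1"
print(nmi_watchdog_setting(sample))
```

On a live guest, the string would come from reading /proc/cmdline instead of the hard-coded sample.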