Red Hat Bugzilla – Bug 612436
udevd report unexpected exit when guest boot up with nmi_watchdog = 1 and using debugfs tracing KVM (AMD)
Last modified: 2013-01-09 17:50:12 EST
Description: In the RHEL6.0 host, boot up a vm with nmi_watchdog=1. And try to tracing kvm by debugfs, udevd will report lots of error message. This will happened in the 2.6.32-37 kernel too. Version-Release number of selected component (if applicable): host kernel: 2.6.32-43.el6.x86_64 guest kernel: 2.6.32-37.el6.x86_64 # rpm -qa |grep qemu qemu-img-0.12.1.2-2.86.el6.x86_64 qemu-kvm-0.12.1.2-2.86.el6.x86_64 qemu-kvm-debuginfo-0.12.1.2-2.86.el6.x86_64 qemu-kvm-tools-0.12.1.2-2.86.el6.x86_64 gpxe-roms-qemu-0.9.7-6.3.el6.noarch How reproducible: Most of time Steps to Reproduce: 1. Mount debugfs in host. And trace the kvm with this cmdline: # echo 1 > /sys/kernel/debug/tracing/tracing_on # echo kvm > /sys/kernel/debug/tracing/set_event 2. Set nmi_watchdog=1 in boot cmdline. Boot up the guest. 3. Listen to the serial port with nc nc -U /tmp/serial-20100707-112343-aRGs Actual results: There are lots of error message reported by udevd from serial port, vm can not boot up and sometime there is a kernel panic Expected results: Can boot up a kernel in RHEL 6.0 without error message Additional info: 1. The command line: # /root/autotest/client/tests/kvm/qemu -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20100707-112343-aRGs',server,nowait -drive file=/home/kvm_autotest_root/images/RHEL-Server-6.0-64.qcow2 ,if=virtio,cacheone,boot=on,aio=native -net nic,vlan=0,netdev=2THD,model=virtio,macaddr='02:30:7C:AF:4b:36' -netdev tap,id=2THD,ifname=virtio_0_8000,script=/root/autotest/client/tests/kvm/scripts/qemu-ifup,downscript=no,vhost=on -m 4096 -smp 1 -vnc :0 -spice port=8000,disable-ticketing -usbdevice tablet -rtc-td-hack -cpu qemu64,+sse2 -no-kvm-pit-reinjection -serial unix:/tmp/serial-20100707-112343-aRGs,server,nowait 2.Host cpuinfo processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 107 model name : AMD Athlon(tm) Dual Core Processor 5400B stepping : 2 cpu MHz : 1000.000 cache size : 512 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch lbrv bogomips : 2004.25 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc 100mhzsteps 3. Serial output 3.1 kernel panic udevd-work[138]: 'path_id /devices/virtual/block/loop1' unexpected exit with status 0x0005 scsi0 : ata_piix scsi1 : ata_piix ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc000 irq 14 ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc008 irq 15 udevd-work[126]: '/sbin/modprobe -b pci:v00008086d00007000sv00001AF4sd00001100bc06sc01i00' unexpected exit with status 0x0005 udevd-work[145]: 'path_id /devices/virtual/block/loop5' unexpected exit with status 0x0005 udevd-work[144]: 'path_id /devices/virtual/block/loop4' unexpected exit with status 0x0005 udevd-work[123]: '/sbin/blkid -o udev -p /dev/loop0' unexpected exit with status 0x0005 udevd-work[145]: '/sbin/blkid -o udev -p /dev/loop5' unexpected exit with status 0x0005 udevd-work[153]: 'path_id /devices/virtual/block/loop7' unexpected exit with status 0x0005 ata2.00: ATAPI: QEMU DVD-ROM, 0.12.1, max UDMA/100 ata1.00: ATA-7: QEMU HARDDISK, 0.12.1, max UDMA/100 ata1.00: 41943040 sectors, multi 16: LBA48 ata2.00: configured for MWDMA2 ata1.00: configured for MWDMA2 scsi 0:0:0:0: Direct-Access ATA QEMU HARDDISK 0.12 PQ: 0 ANSI: 5 %Gscsi 1:0:0:0: CD-ROM QEMU QEMU DVD-ROM 0.12 PQ: 0 ANSI: 5 ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10 virtio-pci 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 10 (level, high) -> IRQ 10 sd 0:0:0:0: [sda] 41943040 512-byte logical blocks: (21.4 GB/20.0 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 0:0:0:0: [sda] Attached SCSI disk sr0: scsi3-mmc drive: 4x/4x xa/form2 tray Uniform CD-ROM driver Revision: 3.20 udevd-work[133]: 'scsi_id --whitelisted --replace-whitespace -p0x80 -d/dev/sda' unexpected exit with status 0x0005 dracut: Scanning devices sda2 for LVM logical volumes VolGroup/lv_root VolGroup/lv_swap dracut: inactive '/dev/VolGroup/lv_root' [15.57 GiB] inherit dracut: inactive '/dev/VolGroup/lv_swap' [3.94 GiB] inherit EXT4-fs (dm-0): INFO: recovery required on readonly filesystem EXT4-fs (dm-0): write access will be enabled during recovery EXT4-fs (dm-0): recovery complete EXT4-fs (dm-0): mounted filesystem with ordered data mode Kernel panic - not syncing: Attempted to kill init! Pid: 1, comm: init Not tainted 2.6.32-37.el6.x86_64 #1 Call Trace: [<ffffffff814d751d>] panic+0x78/0x137 [<ffffffff81107da7>] ? perf_event_exit_task+0x37/0x160 [<ffffffff8106f7c2>] do_exit+0x832/0x840 [<ffffffff81080a8d>] ? __sigqueue_free+0x3d/0x50 [<ffffffff8106f828>] do_group_exit+0x58/0xd0 [<ffffffff810844f6>] get_signal_to_deliver+0x1f6/0x460 [<ffffffff8105e812>] ? enqueue_entity+0x122/0x320 [<ffffffff8103c908>] ? pvclock_clocksource_read+0x58/0xd0 [<ffffffff81012355>] do_signal+0x75/0x800 [<ffffffff810556b0>] ? __dequeue_entity+0x30/0x50 [<ffffffff810117bc>] ? __switch_to+0x1ac/0x320 [<ffffffff814d7c76>] ? thread_return+0x4e/0x768 [<ffffffff81166cc7>] ? fd_install+0x47/0x90 [<ffffffff81012b70>] do_notify_resume+0x90/0xc0 [<ffffffff814daaf2>] paranoid_userspace+0x46/0x50 2. no root device udevd-work[125]: '/sbin/modprobe -b pci:v00008086d00007010sv00001AF4sd00001100bc01sc01i80' unexpected exit with status 0x0005 generic-usb 0003:0627:0001.0001: input,hidraw0: USB HID v0.01 Pointer [QEMU 0.12.1 QEMU USB Tablet] on usb-0000:00:01.2-1/input0 udevd-work[124]: '/sbin/modprobe -b pci:v00008086d00007000sv00001AF4sd00001100bc06sc01i00' unexpected exit with status 0x0005 udevd-work[136]: '/sbin/blkid -o udev -p /dev/loop0' unexpected exit with status 0x0005 udevd-work[123]: '/sbin/blkid -o udev -p /dev/loop6' unexpected exit with status 0x0005 udevd-work[150]: '/sbin/blkid -o udev -p /dev/loop5' unexpected exit with status 0x0005 %GACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10 virtio-pci 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 10 (level, high) -> IRQ 10 No root device found %Gudevd-work[125]: '/sbin/modprobe -b pci:v00008086d00007010sv00001AF4sd00001100bc01sc01i80' unexpected exit with status 0x0005 udevd-work[124]: '/sbin/modprobe -b pci:v00008086d00007000sv00001AF4sd00001100bc06sc01i00' unexpected exit with status 0x0005 udevd-work[136]: '/sbin/blkid -o udev -p /dev/loop0' unexpected exit with status 0x0005 udevd-work[123]: '/sbin/blkid -o udev -p /dev/loop6' unexpected exit with status 0x0005 udevd-work[150]: '/sbin/blkid -o udev -p /dev/loop5' unexpected exit with status 0x0005 %G No root device found Boot has failed, sleeping forever. udevd[69]: worker [131] unexpectedly returned with status 0x0005
Avi, worth fixing?
Yes.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Not reproduced on a Nehalem. But I see this was reported on an AMD, will try again.
Reproduced on AMD. Most strange.
Same guest, on upstream host, works fine.
2.6.32.27 works. Older 2.6.32 kernels don't boot at all.
Perhaps we leak eflags.tf, which ends up in guest registers, which ends up sending a #DB exception, which ends up as a SIGTRAP (we did see a SIGTRAP killing a script).
Does not reproduce with kernel-2.6.32-94.el6.x86_64. Reason unknown. Joy, can you confirm?
Actually, it does reproduce.
Hm, with qemu-kvm upstream, it does not. With RHEL qemu-kvm, it does.
Patches posted upstream: http://www.spinics.net/lists/kvm/msg49588.html
Patch(es) available on kernel-2.6.32-121.el6
I have conducted nmi_watch dog test on the host with kernel version of 2.6.32-120.el6.x86_64 and 2.6.32-121.el6.x86_64, and on the kernel of 120, the test result is the same with the above(no root device and kernel panic), and on the 121 version,i met the following cases: Guest: RHEL5.6-ide-e1000-raw Kvm version: qemu-kvm-debuginfo-0.12.1.2-2.150.el6.x86_64 qemu-kvm-0.12.1.2-2.150.el6.x86_64 qemu-kvm-tools-0.12.1.2-2.150.el6.x86_64 When the test is conducted with configuration of "-smp 1"(to ignore the causes of SMP), I have tested it for 100 times and guest can bootup after adding the nmi_watchdog=1 in the kernel line. When the test is conducted with configuration of "-smp 4", the following fail case might be met: the guest hang during the startup process (reproduce rate "7 out of 200") Last few lines of serial output in this scenario: 2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: UHCI Host Controller 2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: new USB bus registered, assigned bus number 1 2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: irq 11, io base 0x0000c020 2011-03-16 22:25:56: usb usb1: configuration #1 chosen from 1 choice 2011-03-16 22:25:56: hub 1-0:1.0: USB hub found 2011-03-16 22:25:56: hub 1-0:1.0: 2 ports detected 2011-03-16 22:25:56: input: ImExPS/2 Generic Explorer Mouse as /class/input/input1 2011-03-16 22:25:56: usb 1-1: new full speed USB device using uhci_hcd and address 2 2011-03-16 22:25:56: SCSI subsystem initialized 2011-03-16 22:25:56: usb 1-1: configuration #1 chosen from 1 choice 2011-03-16 22:25:56: input: QEMU 0.12.1 QEMU USB Tablet as /class/input/input2 2011-03-16 22:25:56: input: USB HID v0.01 Pointer [QEMU 0.12.1 QEMU USB Tablet] on usb-0000:00:01.2-1 2011-03-16 22:25:56: device-mapper: uevent: version 1.0.3 2011-03-16 22:25:56: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel@redhat.com 2011-03-16 22:25:57: device-mapper: dm-raid45: initialized v0.2594l 2011-03-16 22:26:18: kjournald starting. Commit interval 5 seconds 2011-03-16 22:26:18: EXT3-fs: mounted filesystem with ordered data mode. 2011-03-16 22:26:18: type=1404 audit(1300285577.849:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 2011-03-16 22:26:18: type=1403 audit(1300285578.112:3): policy loaded auid=4294967295 ses=4294967295 Host CPU info: processor : 1 vendor_id : AuthenticAMD cpu family : 15 model : 67 model name : Dual-Core AMD Opteron(tm) Processor 1216 stepping : 3 cpu MHz : 1000.000 cache size : 1024 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 2009.10 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp tm stc
As our test results in comment 18, the guest still hang when set nmi_watchdog=1 and smp > 1 (7 out of 200). But the details are not same. Is this a new problem, or still the same one?
Likely a different problem. Please file a new bug with as much information as possible.
If you get message something like "NMI received for unknown reason" then this is known problem (but I can't find the BZ). The only solution we can offer for rhel6 is to not use nmi_watchdog. I think we have tech note about that.
(In reply to comment #21) > If you get message something like "NMI received for unknown reason" then this > is known problem (but I can't find the BZ). The only solution we can offer for > rhel6 is to not use nmi_watchdog. I think we have tech note about that. Have checked that, the "NMI received for unknown reason" only happened in RHEL 6.1 guest(Bug 688547 - RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown reason), so we didn't use it to verify this bug. And the test results are based in RHEL 5 guest.
(In reply to comment #20) > Likely a different problem. Please file a new bug with as much information as > possible. OK. We will file a new bug for this, and we are trying to catch the debugfs info and thread info from the hang guests. As it is hard to reproduce, we will update the infos as soon as we get them. Thanks for your help.
(In reply to comment #22) > (In reply to comment #21) > > If you get message something like "NMI received for unknown reason" then this > > is known problem (but I can't find the BZ). The only solution we can offer for > > rhel6 is to not use nmi_watchdog. I think we have tech note about that. > > Have checked that, the "NMI received for unknown reason" only happened in RHEL > 6.1 guest(Bug 688547 - RHEL6.1-20110316.1 dell-pe2800 NMI received for unknown > reason), so we didn't use it to verify this bug. And the test results are based > in RHEL 5 guest Bug 688547 is not related to the problem I am talking about. Fixing it will not fix the problem in KVM. There is no difference what guest you are running. RHEL5 guest with nmi_watchdog will have the same problem as RHEL6 guest.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html