Bug 524153 - dom0 freeze during kernel startup [rhel-5.4.z]
Summary: dom0 freeze during kernel startup [rhel-5.4.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jiri Pirko
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 518338
Blocks:
 
Reported: 2009-09-18 06:52 UTC by RHEL Program Management
Modified: 2015-05-05 01:17 UTC (History)

Fixed In Version: kernel-2.6.18-164.6.1.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-11-03 19:34:19 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHSA-2009:1548
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: kernel security and bug fix update
Last Updated: 2009-11-03 19:33:33 UTC

Description RHEL Program Management 2009-09-18 06:52:29 UTC
This bug has been copied from bug #518338 and has been proposed
to be backported to 5.4 z-stream (EUS).

Comment 11 Jiri Pirko 2009-10-27 18:59:27 UTC
in kernel-2.6.18-164.6.1.el5

Comment 12 Sandy Garza 2009-10-27 19:28:23 UTC
Kernel kernel-2.6.18-170.el5 is posted in BZ 518338. Is there a change in the -164 kernel posted in comment 11? If so, what has changed?

Comment 13 James G. Brown III 2009-10-27 19:31:16 UTC
Sandy, this BZ backports the fix from -170 to the current RHEL 5.4 update kernel (-164) so that customers can get it before RHEL 5.5.

- James

Comment 16 Vivian Bian 2009-11-03 07:30:46 UTC
(In reply to comment #13)
> Sandy, this BZ backports the fix from -170 to the current RHEL 5.4 update
> kernel (-164) so that customers can get it before RHEL 5.5.
> 
> - James  

Tested on hp-dl160g6-01.rhts.bos.redhat.com with 2.6.18-164.6.1.el5xen x86_64; I still get a dom0 freeze during kernel startup. I set the crontab entry "*/5 * * * * root reboot" on the HP box and let it run from 12:36 to 14:20. At about 13:55 the dom0 freezes began. However, when a freeze occurred, I could still ping hp-dl160g6-01.rhts.bos.redhat.com without packet loss, and could ssh to the box after a while. The serial console, though, always stops at the output below, so I can't set the status to VERIFIED.

    Booting 'Red Hat Enterprise Linux Server (2.6.18-164.6.1.el5xen)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /xen.gz-2.6.18-164.6.1.el5 console=com2,com1,vga dom0_mem=1572864 nmi=do
m0 com1=9600,8n1 com2=9600,8n1
   [Multiboot-elf, <0x100000:0xe95c8:0x168a38>, shtab=0x352078, entry=0x100000]
module /vmlinuz-2.6.18-164.6.1.el5xen ro root=/dev/VolGroup00/LogVol00 console=
ttyS0,115200 rhgb quiet
   [Multiboot-module @ 0x353000, 0x96d790 bytes]
module /initrd-2.6.18-164.6.1.el5xen.img
   [Multiboot-module @ 0xcc1000, 0x801600 bytes]

 __  __            _____  _   ____     _  __   _  _    __    _       _ ____  
 \ \/ /___ _ __   |___ / / | |___ \   / |/ /_ | || |  / /_  / |  ___| | ___| 
  \  // _ \ '_ \    |_ \ | |   __) |__| | '_ \| || |_| '_ \ | | / _ \ |___ \ 
  /  \  __/ | | |  ___) || |_ / __/|__| | (_) |__   _| (_) || ||  __/ |___) |
 /_/\_\___|_| |_| |____(_)_(_)_____|  |_|\___/   |_|(_)___(_)_(_)___|_|____/ 
                                                                             
 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 3.1.2-164.6.1.el5 (mockbuild) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) Tue Oct 27 11:24:47 EDT 2009
 Latest ChangeSet: unavailable

(XEN) Command line: console=com2,com1,vga dom0_mem=1572864 nmi=dom0 com1=9600,8n1 com2=9600,8n1
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
(XEN)  EDID info not retrieved because no DDC retrieval method detected
(XEN) Disc information:
(XEN)  Found 1 MBR signatures
(XEN)  Found 1 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009fc00 (usable)
(XEN)  000000000009fc00 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000bf760000 (usable)
(XEN)  00000000bf76e000 - 00000000bf770000 type 9
(XEN)  00000000bf770000 - 00000000bf77e000 (ACPI data)
(XEN)  00000000bf77e000 - 00000000bf7d0000 (ACPI NVS)
(XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
(XEN)  00000000bf7ed000 - 00000000c0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000ffa00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000140000000 (usable)
(XEN) System RAM: 4086MB (4185084kB)
(XEN) Xen heap: 13MB (13844kB)
(XEN) Domain heap initialised: DMA width 32 bits
(XEN) Processor #16 7:10 APIC version 21
(XEN) Processor #18 7:10 APIC version 21
(XEN) Processor #20 7:10 APIC version 21
(XEN) Processor #22 7:10 APIC version 21
(XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 3, version 32, address 0xfec8a000, GSI 24-47
(XEN) Enabling APIC mode:  Flat.  Using 2 I/O APICs
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Detected 2000.115 MHz processor.
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging detected and enabled.
(XEN) VMX: MSR intercept bitmap enabled
(XEN) CPU0: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 1/18 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU1: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 2/20 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU2: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 3/22 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU3: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Total of 4 processors activated.
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) Platform timer overflows in 14998 jiffies.
(XEN) Platform timer is 14.318MHz HPET
(

Comment 17 Vivian Bian 2009-11-03 08:07:36 UTC
Tested with 2.6.18-170.el5xen again, and I still get the dom0 freeze during kernel startup.
  Booting 'Red Hat Enterprise Linux Server (2.6.18-170.el5xen)'

root (hd0,0)
 Filesystem type is ext2fs, partition type 0x83
kernel /xen.gz-2.6.18-170.el5 console=com2,com1,vga dom0_mem=1572864 nmi=dom0 c
om1=9600,8n1 com2=9600,8n1
   [Multiboot-elf, <0x100000:0xe95c8:0x168a38>, shtab=0x352078, entry=0x100000]
module /vmlinuz-2.6.18-170.el5xen ro root=/dev/VolGroup00/LogVol00 console=ttyS
0,115200 rhgb quiet
   [Multiboot-module @ 0x353000, 0x96d8e0 bytes]
module /initrd-2.6.18-170.el5xen.img
   [Multiboot-module @ 0xcc1000, 0x801800 bytes]

 __  __            _____  _   ____     _ _____ ___       _ ____  
 \ \/ /___ _ __   |___ / / | |___ \   / |___  / _ \  ___| | ___| 
  \  // _ \ '_ \    |_ \ | |   __) |__| |  / / | | |/ _ \ |___ \ 
  /  \  __/ | | |  ___) || |_ / __/|__| | / /| |_| |  __/ |___) |
 /_/\_\___|_| |_| |____(_)_(_)_____|  |_|/_/  \___(_)___|_|____/ 
                                                                 
 http://www.cl.cam.ac.uk/netos/xen
 University of Cambridge Computer Laboratory

 Xen version 3.1.2-170.el5 (mockbuild) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) Tue Oct 20 18:26:46 EDT 2009
 Latest ChangeSet: unavailable

(XEN) Command line: console=com2,com1,vga dom0_mem=1572864 nmi=dom0 com1=9600,8n1 com2=9600,8n1
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
(XEN)  EDID info not retrieved because no DDC retrieval method detected
(XEN) Disc information:
(XEN)  Found 1 MBR signatures
(XEN)  Found 1 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009fc00 (usable)
(XEN)  000000000009fc00 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 00000000bf760000 (usable)
(XEN)  00000000bf76e000 - 00000000bf770000 type 9
(XEN)  00000000bf770000 - 00000000bf77e000 (ACPI data)
(XEN)  00000000bf77e000 - 00000000bf7d0000 (ACPI NVS)
(XEN)  00000000bf7d0000 - 00000000bf7e0000 (reserved)
(XEN)  00000000bf7ed000 - 00000000c0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000ffa00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000140000000 (usable)
(XEN) System RAM: 4086MB (4185084kB)
(XEN) Xen heap: 13MB (13844kB)
(XEN) Domain heap initialised: DMA width 32 bits
(XEN) Processor #16 7:10 APIC version 21
(XEN) Processor #18 7:10 APIC version 21
(XEN) Processor #20 7:10 APIC version 21
(XEN) Processor #22 7:10 APIC version 21
(XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 3, version 32, address 0xfec8a000, GSI 24-47
(XEN) Enabling APIC mode:  Flat.  Using 2 I/O APICs
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Detected 2000.133 MHz processor.
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging detected and enabled.
(XEN) VMX: MSR intercept bitmap enabled
(XEN) CPU0: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 1/18 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU1: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 2/20 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU2: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Booting processor 3/22 eip 90000
(XEN) , L1 D cache: 32K
(XEN) VMX: VPID is available.
(XEN) CPU3: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz stepping 05
(XEN) Total of 4 processors activated.
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using new ACK method
(XEN) Platform timer overflows in 14998 jiffies.
(XEN) Platform timer is 14.318MHz HPET
(

Comment 18 Chris Lalancette 2009-11-03 08:53:59 UTC
(In reply to comment #17)
> tested with 2.6.18-170.el5xen again, and still have the dom0 freeze during
> kernel starting up .

I should have been much more clear about the effects of this bug, and how to test it out.  What you are getting below is expected, but let me explain in more detail.

The fundamental problem that caused us to see this bug is that the HP serial console hardware on this type of machine is buggy.  In particular, it can get into a state where it doesn't drain the TX buffer in a timely manner.  That means that the operating system (in this case, the hypervisor + dom0) keeps filling the serial TX buffer with data to send, but the serial hardware isn't picking that data up and rendering it out to the IPMI fast enough.

Before this patch, when the TX buffer got full, the hypervisor would just hang until the buffer emptied.  This was not a good thing to do; we don't want to stop booting the machine just because we encountered buggy serial hardware.  You could tell that the whole machine was hung by the fact that ping and/or ssh into the machine failed.

After this patch, we *still* have the buggy serial hardware, but the machine finishes booting.  That means that the TX buffer fills up, the hardware fails to drain it, but instead of hanging, the hypervisor now just drops serial data.  This means that the machine can finish booting, even though we lose some serial data along the way.  You can tell that it finished booting by ssh'ing or ping'ing the machine, which still should be up.

So, I think that we have proved that the HP serial hardware is still buggy.  Assuming that we can ssh into the machine with the -164.6.1 kernel, even though we don't get all of the serial output, then I think that we've proved that this patch works.  I'm going to flip it back to ON_QA for the time being.

Chris Lalancette
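
The difference between the two behaviors described in comment 18 can be illustrated with a small model. This is only a toy sketch in Python, not the hypervisor patch itself (the patch is not shown in this report). The class StalledUart, the functions write_blocking and write_dropping, the buffer capacity, and the max_polls cutoff are all hypothetical names and values chosen for illustration.

```python
from collections import deque


class StalledUart:
    """Toy model of a serial port whose hardware never drains its TX buffer."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.tx_buffer = deque()

    def has_room(self):
        return len(self.tx_buffer) < self.capacity

    def drain(self):
        # A healthy UART would remove characters here.
        # The stalled one in this model does nothing.
        pass


def write_blocking(uart, text, max_polls=1000):
    """Pre-patch style: wait for room before queuing each character.

    Returns False if the writer is still waiting after max_polls polls.
    That cutoff is an artifact of the model; it stands in for a writer
    that would otherwise wait indefinitely.
    """
    for ch in text:
        polls = 0
        while not uart.has_room():
            uart.drain()
            polls += 1
            if polls >= max_polls:
                return False
        uart.tx_buffer.append(ch)
    return True


def write_dropping(uart, text):
    """Post-patch style: never wait; discard characters that do not fit.

    Returns the number of characters dropped.
    """
    dropped = 0
    for ch in text:
        if uart.has_room():
            uart.tx_buffer.append(ch)
        else:
            dropped += 1
    return dropped
```

With a stalled port, write_blocking never gets past the first character that does not fit, while write_dropping always returns and simply reports how much output was lost. That matches the symptom seen on the serial console: truncated output, but a machine that keeps booting.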

Comment 19 Vivian Bian 2009-11-03 09:24:35 UTC
(In reply to comment #18)
 
So the real problem delaying verification is the buggy serial console.
With both the -164.6.1 kernel and the -170 kernel, I could ping the HP host and ssh into it successfully when I got the hang output. So we can say the kernel is still booting at that point, and goes on to boot successfully. I also removed the crontab entry after I ssh'ed into the HP host, and could run some commands successfully.

Now setting the status to VERIFIED.
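
The check used above (the serial console looks stuck, but the host still answers on the network) could be scripted roughly as follows. This is a minimal sketch, assuming that a successful TCP connection to the ssh port is an acceptable stand-in for "the machine finished booting". The function name tcp_port_open and the default timeout are hypothetical, not part of any test suite referenced in this bug.

```python
import socket


def tcp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, tcp_port_open("hp-dl160g6-01.rhts.bos.redhat.com") could be polled after each scheduled reboot. A True result while the serial console output is truncated would point to the console hardware rather than to a hung dom0.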

Comment 22 errata-xmlrpc 2009-11-03 19:34:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1548.html

