Bug 603776 - RHEL6 Xen DomU won't boot on xen 3.1.2-128.1.6.el5
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: x86_64 Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Andrew Jones
QA Contact: Red Hat Kernel QE team
Reported: 2010-06-14 10:46 EDT by Kai Meyer
Modified: 2011-01-05 05:10 EST (History)
Doc Type: Bug Fix
Last Closed: 2010-11-16 04:10:30 EST
Description Kai Meyer 2010-06-14 10:46:22 EDT
Description of problem:
RHEL6 Xen DomU won't boot on xen 3.1.2-128.1.6.el5


Version-Release number of selected component (if applicable):


How reproducible:
Boot from a RHEL6 DVD, or from the PXE vmlinuz and initrd.img files. 


Steps to Reproduce:
1. Create a DomU
2. Boot from RHEL6 DVD
  
Actual results:
The guest does not get through the boot sequence. Here is the complete console output:

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-19.el6.x86_64 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.4.3 20100121 (Red Hat 4.4.3-1) (GCC) ) #1 SMP Tue Mar 9 17:48:46 EST 2010
Command line: ip=209.90.101.133 netmask=255.255.255.192 gateway=209.90.101.129 nameserver=216.83.130.2,216.83.130.7 hostname=mysql.fibernetdesign.com ks=http://mirror.fiber.net/ks/mysql.fibernetdesign.com.ks vnc console=ttyS0,9600
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fffac00 (usable)
 BIOS-e820: 000000003fffac00 - 0000000040000000 (reserved)
DMI 2.4 present.
last_pfn = 0x3fffa max_arch_pfn = 0x400000000
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
total RAM covered: 12280M
Found optimal setting for mtrr clean up
 gran_size: 64K 	chunk_size: 16M 	num_reg: 6  	lose cover RAM: 0G
init_memory_mapping: 0000000000000000-000000003fffa000
RAMDISK: 363ca000 - 37fefe28
ACPI: RSDP 00000000000eb0b0 00024 (v02    Xen)
ACPI: XSDT 00000000000eb020 00044 (v01    Xen      HVM 00000000 HVML 00000000)
ACPI: FACP 00000000000eae30 000F4 (v04    Xen      HVM 00000000 HVML 00000000)
ACPI: DSDT 00000000000ea040 00D67 (v02    Xen      HVM 00000000 INTL 20060707)
ACPI: FACS 00000000000ea000 00040
ACPI: APIC 00000000000eaf30 00072 (v02    Xen      HVM 00000000 HVML 00000000)
ACPI: HPET 00000000000eafb0 00038 (v01    Xen      HVM 00000000 HVML 00000000)
ACPI: SSDT 00000000000eafe8 00038 (v02    Xen      HVM 00000000 HVML 00000000)
No NUMA configuration found
Faking a node at 0000000000000000-000000003fffa000
Bootmem setup node 0 0000000000000000-000000003fffa000
  NODE_DATA [0000000000009000 - 000000000003cfff]
  bootmap [000000000003d000 -  0000000000044fff] pages 8
(7 early reservations) ==> bootmem [0000000000 - 003fffa000]
  #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
  #1 [0000006000 - 0000008000]       TRAMPOLINE ==> [0000006000 - 0000008000]
  #2 [0001000000 - 0001c1bd18]    TEXT DATA BSS ==> [0001000000 - 0001c1bd18]
  #3 [00363ca000 - 0037fefe28]          RAMDISK ==> [00363ca000 - 0037fefe28]
  #4 [000009fc00 - 0000100000]    BIOS reserved ==> [000009fc00 - 0000100000]
  #5 [0001c1c000 - 0001c1c0a1]              BRK ==> [0001c1c000 - 0001c1c0a1]
  #6 [0000008000 - 0000009000]          PGTABLE ==> [0000008000 - 0000009000]
found SMP MP-table at [ffff8800000fccd0] fccd0
Zone PFN ranges:
  DMA      0x00000000 -> 0x00001000
  DMA32    0x00001000 -> 0x00100000
  Normal   0x00100000 -> 0x00100000
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0: 0x00000000 -> 0x0000009f
    0: 0x00000100 -> 0x0003fffa
ACPI: PM-Timer IO Port: 0x1f48
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 low level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 7 global_irq 7 low level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 low level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 low level)
Using ACPI (MADT) for SMP configuration information
ACPI: HPET id: 0x8086a201 base: 0xfed00000
SMP: Allowing 1 CPUs, 0 hotplug CPUs
Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
Booting paravirtualized kernel on bare hardware
NR_CPUS:4096 nr_cpumask_bits:1 nr_cpu_ids:1 nr_node_ids:1
PERCPU: Embedded 30 pages/cpu @ffff880001e00000 s92888 r8192 d21800 u2097152
pcpu-alloc: s92888 r8192 d21800 u2097152 alloc=1*2097152
pcpu-alloc: [0] 0 
Built 1 zonelists in Node order, mobility grouping on.  Total pages: 258357
Policy zone: DMA32
Kernel command line: ip=209.90.101.133 netmask=255.255.255.192 gateway=209.90.101.129 nameserver=216.83.130.2,216.83.130.7 hostname=mysql.fibernetdesign.com ks=http://mirror.fiber.net/ks/mysql.fibernetdesign.com.ks vnc console=ttyS0,9600
PID hash table entries: 4096 (order: 3, 32768 bytes)
Checking aperture...
No AGP bridge found
Memory: 991976k/1048552k available (4903k kernel code, 388k absent, 56188k reserved, 3902k data, 1144k init)
Hierarchical RCU implementation.
NR_IRQS:33024 nr_irqs:256
Console: colour VGA+ 80x25
console [ttyS0] enabled
allocated 10485760 bytes of page_cgroup
please try 'cgroup_disable=memory' option if you don't want memory cgroups
HPET: 3 timers in total, 0 timers will be used for per-cpu timer
Fast TSC calibration using PIT
Detected 2000.051 MHz processor.
Calibrating delay loop (skipped), value calculated using timer frequency.. 4000.10 BogoMIPS (lpj=2000051)
Security Framework initialized
SELinux:  Initializing.
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Mount-cache hash table entries: 256
Initializing cgroup subsys ns
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys blkio


Expected results:
Boot into the anaconda installer.
Comment 2 RHEL Product and Program Management 2010-06-14 11:02:57 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Andrew Jones 2010-06-14 11:17:10 EDT
Does any other guest boot on this host? Can you get a stack trace with the following command?

/usr/lib64/xen/bin/xenctx -s <rhel6-kernel-symfile> <domain#> [cpu#]
Comment 4 Kai Meyer 2010-06-14 12:15:36 EDT
Yes, these Xen Hosts are running many HVM guests.

# /usr/lib64/xen/bin/xenctx -s symvers-2.6.32-19.el6.x86_64.gz 63
rip: ffffffff81038bf6 
rsp: ffffffff81705df0
rax: 00000001	rbx: 00000001	rcx: 00000100	rdx: 00000016
rsi: ffffffff81705e50	rdi: ffffffff81705e54	rbp: ffffffff81705df8
 r8: ffffffff81705e4c	 r9: ffffffff81705e48	r10: ffff88003e5b2748	r11: 00000070
r12: da7a8dcc	r13: ffffffff81705e54	r14: ffffffff81705e50	r15: ffffffff81705e4c
 cs: 00000010	 ds: 00000000	 fs: 00000000	 gs: 00000000

I'm not very familiar with symbol tables, but I think this is what you want.

I think that the host's 'xm info' may be useful as well. These systems are getting a little old, so I'm OK if the answer is "upgrade your hypervisor". I just want to make sure that is the case.

host                   : ####
release                : 2.6.18-128.1.6.el5xen
version                : #1 SMP Wed Apr 1 09:53:14 EDT 2009
machine                : x86_64
nr_cpus                : 8
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2000
hw_caps                : bfebfbff:28100800:00000000:00000140:009ce3bd:00000000:00000001
total_memory           : 12279
free_memory            : 2620
node_to_cpu            : node0:0-7
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-128.1.6.el5
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
cc_compile_by          : mockbuild
cc_compile_domain      : centos.org
cc_compile_date        : Wed Apr  1 09:00:24 EDT 2009
xend_config_format     : 2
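
For reference, the hypervisor version in the bug title can be read back out of the `xm info` fields above. The following is a minimal Python 3 sketch, assuming the output keeps the `key : value` layout shown here; the helper names `parse_xm_info` and `xen_version` are made up for illustration and are not part of any Xen tool.

```python
def parse_xm_info(text):
    """Parse 'key : value' lines, as printed by `xm info`, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as 'node0:0-7' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def xen_version(info):
    """Join xen_major, xen_minor and xen_extra into one version string."""
    return "{}.{}{}".format(info["xen_major"], info["xen_minor"], info["xen_extra"])


# Excerpt of the output pasted above.
sample = """\
release                : 2.6.18-128.1.6.el5xen
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-128.1.6.el5
node_to_cpu            : node0:0-7
"""

info = parse_xm_info(sample)
print(xen_version(info))  # 3.1.2-128.1.6.el5
```

The joined string matches the version quoted in the summary of this bug, which is a quick way to confirm which hypervisor build a report refers to.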
Comment 5 Andrew Jones 2010-06-15 03:34:45 EDT
Actually I meant System.map-2.6.32-19.el6.x86_64, sorry for that confusion.

You're correct though that the answer is likely just to update your HV. The RIP you captured is pointing at a cpuid call, and we're far enough along in the boot that we're most likely at identify_boot_cpu. I'm guessing RHEL6 is trying to probe or enable some hardware feature on your system that the old HV either doesn't support or hides from the guest.

Let's try to get another stack, since I'm curious what exactly we're choking on. Then you can update to 5.5 and test again, which will very likely work.
Comment 6 Kai Meyer 2010-06-15 10:56:29 EDT
I find it interesting that I see something different between when I attach to the console during the boot sequence, and when I don't. 

If I create the VM without attaching to the console, I get the following.
# /usr/lib64/xen/bin/xenctx -s System.map-2.6.32-19.el6.x86_64 67
rip: ffffffff81317977 io_serial_out+0x17
rsp: ffffffff81705d78
rax: 00000030	rbx: ffffffff81c05460	rcx: 00000000	rdx: 000003f8
rsi: 00000000	rdi: ffffffff81c05460	rbp: ffffffff81705d78
 r8: ffffffff81892b80	 r9: 00000000	r10: 0000007c	r11: 000024c0
r12: 00000030	r13: 0000004e	r14: 0000000c	r15: ffffffff81318000
 cs: 00000010	 ds: 00000000	 fs: 00000000	 gs: 00000000

Stack:
failed to map page.

If I create the VM with the '-c' flag to attach to the console, then exit the console after the guest crashes and check again, I get the following.
# /usr/lib64/xen/bin/xenctx -s System.map-2.6.32-19.el6.x86_64 68
rip: ffffffff81038bf6 native_cpuid+0x16
rsp: ffffffff81705df0
rax: 00000001	rbx: 00000001	rcx: 00000100	rdx: 00000010
rsi: ffffffff81705e50	rdi: ffffffff81705e54	rbp: ffffffff81705df8
 r8: ffffffff81705e4c	 r9: ffffffff81705e48	r10: ffff88003e5b2748	r11: 00000070
r12: 00aa9c71	r13: ffffffff81705e54	r14: ffffffff81705e50	r15: ffffffff81705e4c
 cs: 00000010	 ds: 00000000	 fs: 00000000	 gs: 00000000

Stack:
failed to map page.

I'm not quite sure what the output is supposed to look like, but this output doesn't look right either.
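
The `symbol+offset` annotations in the xenctx output above come from looking the RIP up in the symbol file. A rough Python 3 sketch of that lookup follows, assuming System.map-style lines of the form `<hex address> <type> <name>`. The two map entries are not copied from the real System.map: their addresses are inferred by subtracting the printed offsets from the RIPs above, and the type letter `t` is a placeholder. The helper names are hypothetical.

```python
import bisect


def load_symbols(lines):
    """Parse System.map-style lines into a list of (address, name), sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address, else None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return "{}+0x{:x}".format(name, address - base)


# Assumed entries: addresses derived from the RIPs and offsets shown above.
system_map = [
    "ffffffff81038be0 t native_cpuid",
    "ffffffff81317960 t io_serial_out",
]

symbols = load_symbols(system_map)
print(resolve(symbols, 0xFFFFFFFF81038BF6))  # native_cpuid+0x16
print(resolve(symbols, 0xFFFFFFFF81317977))  # io_serial_out+0x17
```

A lookup like this only works with a file that maps addresses to names, which is consistent with the first dump in comment 4 showing a bare RIP with no symbol.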
Comment 7 Andrew Jones 2010-06-15 11:21:41 EDT
Hmm, I was hoping that passing in the wrong file, as was done for comment 4, was the reason we didn't have a backtrace. It looks like even with the correct file we don't get any more information on how we got here.

Possibly you ran the xenctx command too soon for the case where you didn't try connecting the console? Or, how many vcpus does the guest have? If it's more than one, then you should run the xenctx command once per vcpu, where the last parameter of the command is the vcpu id, e.g. 0,1,2... If one vcpu was doing io while the other attempted a cpuid that triggered an unsupported code path and killed the guest, then that would explain the different outputs (i.e. guest boot (1) had serial io on vcpu 0 and then try (2) had the cpuid).
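
The once-per-vcpu invocation could be scripted along these lines. This is only a Python 3 sketch that builds the command lines from the syntax given in comment 3; it does not run anything, and the function name `xenctx_commands` and the vcpu count of 2 are made up for illustration.

```python
import shlex

XENCTX = "/usr/lib64/xen/bin/xenctx"  # path as used elsewhere in this report


def xenctx_commands(symfile, domain_id, vcpu_count):
    """Build one xenctx argument list per vcpu id (0 .. vcpu_count - 1)."""
    return [
        [XENCTX, "-s", symfile, str(domain_id), str(vcpu)]
        for vcpu in range(vcpu_count)
    ]


# Domain 68 and the symbol file name are taken from comment 6; 2 vcpus is an assumption.
for argv in xenctx_commands("System.map-2.6.32-19.el6.x86_64", 68, 2):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

On the Xen host, each argument list could be passed to `subprocess.run` to collect one register dump per vcpu, so that a vcpu stuck in serial io can be told apart from one stuck in cpuid.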
Comment 8 Kai Meyer 2010-06-15 13:01:30 EDT
Only 1 vcpu. With the console vs. non-console output, it looks like I was a little impatient. I let it sit a lot longer this time, and I get the native_cpuid rip on the first line. 

I'll just have to wait till our next iteration of the production environment is ready. It does run RHEL6 in testing. 

Thanks for your time.
Comment 9 RHEL Product and Program Management 2010-07-15 10:50:42 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 10 Andrew Jones 2010-08-02 11:54:07 EDT
Setting to 6.1. Hopefully we'll get more information for this bug, otherwise we'll have to close it as not-enough-info.

Drew
Comment 12 Andrew Jones 2010-11-16 04:10:30 EST
Haven't heard anything new on this bug in quite a while, so I'll close it as not-enough-info. If more information becomes available, then please reopen.
Comment 13 Kai Meyer 2010-11-16 10:08:59 EST
Sorry I couldn't give you more information. With the release of RHELv6 we aren't interested in solving this problem on our existing 5.3 Xen servers.
