239766 – [QC] kernel fails to boot on LS41 with maxcpus=1

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Steven Rostedt
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-11 07:09 UTC by IBM Bug Proxy
Modified:	2008-02-27 19:57 UTC (History)
CC List:	0 users
Fixed In Version:	2.6.21-14.el5rt
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-05-31 22:42:00 UTC
Target Upstream Version:
Embargoed:

Attachments
maxcpus-ignore-offline-cpus.patch (4.93 KB, text/plain) 2007-05-21 16:50 UTC, IBM Bug Proxy

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	34431	0	None	None	None	Never

Description IBM Bug Proxy 2007-05-11 07:09:46 UTC

LTC Owner is: dvhltc.com
LTC Originator is: dvhltc.com


Reported by perf team, needs to be validated and possibly fixed.  This does not
seem to be a problem on an x460, as reported by the BULL team.

 -Darren 
----------------------------------------------------------------------------------

I was able to boot with maxcpus=1 on elm3b102 (LS41):

dvhart@elm3b102:~$ uname -a
Linux elm3b102.beaverton.ibm.com 2.6.16-rtj12.11.3smp #1 SMP PREEMPT Tue Apr 24
14:08:21 PDT 2007 i686 athlon i386 GNU/Linux

dvhart@elm3b102:~$ cat /proc/cmdline 
ro root=LABEL=/ console=tty0 console=ttyS1,19200 crashkernel=64M@16M maxcpus=1

dvhart@elm3b102:~$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8212
stepping        : 2
cpu MHz         : 2000.276
cache size      : 1024 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext
3dnow pni cx16 lahf_lm cmp_legacy svm cr8legacy ts fid vid ttp tm stc
bogomips        : 4002.97
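
For anyone repeating this check on other machines, the comparison between the maxcpus= boot parameter and the processors listed in /proc/cpuinfo can be scripted. The following is a minimal sketch, not part of the original report. The function names are hypothetical, and it assumes the usual "processor : N" layout shown above.

```python
import re


def parse_maxcpus(cmdline):
    """Return the integer value of maxcpus= on a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("maxcpus="):
            try:
                return int(token.split("=", 1)[1])
            except ValueError:
                return None
    return None


def count_online_cpus(cpuinfo):
    """Count 'processor : N' entries in /proc/cpuinfo-style text."""
    pattern = r"^processor[ \t]*:[ \t]*\d+[ \t]*$"
    return len(re.findall(pattern, cpuinfo, flags=re.MULTILINE))


def maxcpus_respected(cmdline, cpuinfo):
    """True if no more CPUs are listed than maxcpus= allows.

    The boot CPU is always online, so a limit below 1 is treated as 1.
    If maxcpus= is absent there is no limit to violate.
    """
    limit = parse_maxcpus(cmdline)
    if limit is None:
        return True
    return count_online_cpus(cpuinfo) <= max(limit, 1)
```

On a running system the two inputs would be the contents of /proc/cmdline and /proc/cpuinfo, read as text.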


I'll try to get Mark Peloquin to describe the approach he took that failed, and
how it failed.

 -Darren 
----------------------------------------------------------------------------------

maxcpus=2 also works

 -Darren

-----------------------------------------------------------------------------------

Setting maxcpus=1 on elm3b210 causes the system to hang on boot with the last
message being:

pci_hotplug: PCI Hot Plug PCI Core Version: 0.5

The entry as seen in Grub on boot is:

kernel /boot/vmlinuz-2.6.20-0119.rt8 ro root=LABEL=/ console=tty0,
console=ttyS1,19200 maxcpus=1

This appears to be quite a different kernel, the uname -a output is:

Linux elm3b210.beaverton.ibm.com 2.6.20-0119.rt8 #1 SMP PREEMPT Thu Feb 15
15:53:15 CET 2007 x86_64 x86_64 x86_64 GNU/Linux

It looks like the tests done by Darren were on an x86 2.6.16 base while my system
has an x86_64 2.6.20 base.

  -KARL
---------------------------------------------------------------------------------------

Confirmed on rhel5-rt 2.6.20-0119.rt8; trying with 2.6.21-rt now.

 -Darren
-----------------------------------------------------------------------------------------

2.6.21-2.el5rt fails in the same place, trying stock rhel5 kernel.

 -Darren
----------------------------------------------------------------------------------

maxcpus=1 works on stock RHEL5 (2.6.18-8.el5).  This limitation with the -rt
kernels is blocking -rt scalability analysis.

 -Darren
--------------------------------------------------------------------------------------

Also reproduced on an LS21. Adding initcall_debug to the boot line got the
following extra output:

pci_hotplug: PCI Hot Plug PCI Core version: 0.5
initcall 0xffffffff817a42a4: pci_hotplug_init+0x0/0x5a() returned 0.
initcall 0xffffffff817a42a4 ran for 24 msecs: pci_hotplug_init+0x0/0x5a()
Calling initcall 0xffffffff817a46f2: fb_console_init+0x0/0x12b()
initcall 0xffffffff817a46f2: fb_console_init+0x0/0x12b() returned 0.
initcall 0xffffffff817a46f2 ran for 0 msecs: fb_console_init+0x0/0x12b()
Calling initcall 0xffffffff817a4c37: acpi_reserve_resources+0x0/0xeb()
initcall 0xffffffff817a4c37: acpi_reserve_resources+0x0/0xeb() returned 0.
initcall 0xffffffff817a4c37 ran for 0 msecs: acpi_reserve_resources+0x0/0xeb()
Calling initcall 0xffffffff817a5abd: acpi_fan_init+0x0/0x5e()
initcall 0xffffffff817a5abd: acpi_fan_init+0x0/0x5e() returned 0.
initcall 0xffffffff817a5abd ran for 0 msecs: acpi_fan_init+0x0/0x5e()
Calling initcall 0xffffffff817a5bf8: irqrouter_init_sysfs+0x0/0x38()
initcall 0xffffffff817a5bf8: irqrouter_init_sysfs+0x0/0x38() returned 0.
initcall 0xffffffff817a5bf8 ran for 0 msecs: irqrouter_init_sysfs+0x0/0x38()
Calling initcall 0xffffffff817a5d8f: acpi_processor_init+0x0/0xdf()

 -John
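
With initcall_debug, the hang shows up as a "Calling initcall" line that never gets a matching "returned" line. Picking that out of a long console capture can be automated. The following is a minimal sketch, not part of the original report. It assumes the line format shown in the output above; the function name is hypothetical.

```python
import re

# Patterns modelled on the initcall_debug lines quoted in this report.
CALLING = re.compile(r"Calling initcall (0x[0-9a-fA-F]+): (\S+)\(\)")
RETURNED = re.compile(r"initcall (0x[0-9a-fA-F]+): (\S+)\(\) returned")


def find_unfinished_initcalls(log_text):
    """Return symbols of initcalls that were called but never returned."""
    pending = {}  # address -> symbol, kept in call order
    for line in log_text.splitlines():
        called = CALLING.search(line)
        if called:
            pending[called.group(1)] = called.group(2)
            continue
        returned = RETURNED.search(line)
        if returned:
            pending.pop(returned.group(1), None)
    return list(pending.values())
```

Fed the output above, this should report only the acpi_processor_init entry, which matches where the boot stops.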

Comment 1 IBM Bug Proxy 2007-05-19 07:15:34 UTC

----- Additional Comments From dvhltc.com  2007-05-19 03:12 EDT -------
I tested mainline 2.6.21 and it does boot with maxcpus=1.  2.6.21-rt1, 2, and 4
all hang at:

Calling initcall 0xffffffff817a41df: acpi_processor_init+0x0/0xdf()

when booted with initcall_debug.  I traced this only as far as the call to
acpi_bus_register_driver.  So this was definitely introduced by the -rt patch. 
I'm not sure if I should try and see when it was introduced (as 2.6.16-rt22 does
not fail) or if I should head "down the acpi rabbit hole" as John S. put it...

Comment 2 IBM Bug Proxy 2007-05-21 16:50:46 UTC

Created attachment 155114 [details]
maxcpus-ignore-offline-cpus.patch

Comment 3 IBM Bug Proxy 2007-05-21 16:50:50 UTC

----- Additional Comments From dvhltc.com  2007-05-21 12:43 EDT -------
 
Ignore bogus acpi info

Thomas Gleixner provided the attached patch.  When I first booted with this
patch I received the following in a loop:

irq 9: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 [<ffffffff8106d5a4>] dump_trace+0xaa/0x32a
 [<ffffffff8106d865>] show_trace+0x41/0x5c
 [<ffffffff8106d895>] dump_stack+0x15/0x17
 [<ffffffff810c50b8>] __report_bad_irq+0x38/0x87
 [<ffffffff810c52cb>] note_interrupt+0x1c4/0x1fc
 [<ffffffff810c458d>] thread_simple_irq+0x6c/0x7e
 [<ffffffff810c4dc3>] do_irqd+0x14a/0x3e4
 [<ffffffff81033d3a>] kthread+0xf5/0x128
 [<ffffffff8105ff68>] child_rip+0xa/0x12

handlers:
[<ffffffff8117736e>] (acpi_irq+0x0/0x1b)

I then tried to boot with acpi=noirq and I got all the way to a login prompt. 
As we have seen these "nobody cared" and child_rip dump issues before, I think
they are independent issues that should be tracked in their own bugs.

Comment 4 IBM Bug Proxy 2007-05-21 17:05:49 UTC

----- Additional Comments From dvhltc.com  2007-05-21 12:58 EDT -------
Ingo has included tglx's patch in 2.6.21-rt5

Comment 6 IBM Bug Proxy 2007-05-24 17:10:48 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |FIXEDAWAITINGTEST
         Resolution|                            |FIX_ALREADY_AVAIL




------- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-05-24 13:04 EDT -------
Verified fixed in 2.6.21-14.el5rt.
