Bug 448715 - Kdump failing on LS21
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
beta
x86_64 All
low Severity urgent
: ---
: ---
Assigned To: Red Hat Real Time Maintenance
:
Depends On:
Blocks:
Reported: 2008-05-28 08:33 EDT by IBM Bug Proxy
Modified: 2009-09-23 12:27 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-09-23 12:27:24 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Console log while triggering kdump on LS21 (5.00 KB, text/plain)
2008-06-05 02:00 EDT, IBM Bug Proxy
Fix kdump crash kernel boot memory reservation for NUMA machines (471 bytes, patch)
2008-06-05 09:27 EDT, Jeff Burke
Console log while triggering kdump on LS21 for -65 kernel (23.65 KB, text/plain)
2008-06-11 06:08 EDT, IBM Bug Proxy
kdump boot log without irqpoll option (7.24 KB, text/plain)
2008-07-17 12:51 EDT, IBM Bug Proxy
Boot log of kdump kernel with CONFIG_PCI_DEBUG on and 'debug' parameter (16.00 KB, text/plain)
2008-08-01 09:21 EDT, IBM Bug Proxy
kdump kernel boot log with CONFIG_ACPI_DEBUG on (18.20 KB, text/plain)
2008-08-04 03:00 EDT, IBM Bug Proxy
Dump while triggering kdump on MRG (2.25 KB, text/plain)
2008-11-07 04:03 EST, IBM Bug Proxy


External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 45134 None None None Never

Description IBM Bug Proxy 2008-05-28 08:33:09 EDT
=Comment: #0=================================================
Chirag H. Jog1 <chirag.jog@in.ibm.com> - 2008-05-28 02:46 EDT
Problem description:
Triggering a manual dump on the alpha14 kernel fails to create a core file.
The kernel goes into an infinite loop with the following logs dumped to the console:

SysRq : Trigger a crashdump
irq 9: nobody cared (try booting with the "irqpoll" option)
irq 9: Some systems using an IO-APIC require a special quirk to workaround
irq 9: problems with interrupt routing. If your system requires such a quirk,
irq 9: please try booting with the "ioapic_level_quirk=1" option.
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
turning off IO-APIC fast mode.
irq 9: nobody cared (try booting with the "irqpoll" option)
irq 9: Some systems using an IO-APIC require a special quirk to workaround
irq 9: problems with interrupt routing. If your system requires such a quirk,
irq 9: please try booting with the "ioapic_level_quirk=1" option.
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
irq 9: nobody cared (try booting with the "irqpoll" option)
irq 9: Some systems using an IO-APIC require a special quirk to workaround
irq 9: problems with interrupt routing. If your system requires such a quirk,
irq 9: please try booting with the "ioapic_level_quirk=1" option.
handlers:



Hardware Environment
LS21: a 4-way AMD Opteron blade with 8 GB RAM

Is this reproducible?
echo c > /proc/sysrq-trigger

Is the system (not just the application) hung?
Yes
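
A minimal sketch, not part of the original report, for spotting this loop in a captured console log. It counts the "irq N: nobody cared" reports per IRQ. The function names and the threshold of 3 are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Matches the kernel's spurious-interrupt report, e.g. "irq 9: nobody cared (...)".
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def count_nobody_cared(log_text):
    """Return a Counter mapping IRQ number -> number of 'nobody cared' reports."""
    return Counter(int(m.group(1)) for m in NOBODY_CARED.finditer(log_text))


def looks_like_irq_loop(log_text, threshold=3):
    """Heuristic: treat `threshold` or more reports for one IRQ as a loop.

    The threshold is an arbitrary illustrative value, not a kernel constant.
    """
    counts = count_nobody_cared(log_text)
    return any(n >= threshold for n in counts.values())
```

Feeding it the console output above would report IRQ 9 three times, which is enough to trip the heuristic.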
=Comment: #1=================================================
Chirag H. Jog1 <chirag.jog@in.ibm.com> - 2008-05-28 02:49 EDT
The common thing between this bug and bug 45105 and bug 45111 is hardware. All
are AMD Opterons.
=Comment: #2=================================================
Chirag H. Jog1 <chirag.jog@in.ibm.com> - 2008-05-28 03:03 EDT
I booted the kernel with irqpoll and ioapic_level_quirk=1. The kernel still
cycles in a loop with the following logs:

[root@llm54 ~]# cat /proc/cmdline
root=/dev/sda3 ro rhgb quiet crashkernel=128M@16M  console=tty1
console=ttyS1,19200 irqpoll ioapic_level_quirk=1
[root@llm54 ~]# echo c> /SysRq : Trigger a crashdump
irq 9: nobody cared (try booting with the "irqpoll" option)
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
turning off IO-APIC fast mode.
irq 9: nobody cared (try booting with the "irqpoll" option)
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
irq 9: nobody cared (try booting with the "irqpoll" option)
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)
irq 9: nobody cared (try booting with the "irqpoll" option)
handlers:
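
The crashkernel=128M@16M parameter on the command line above asks for 128 MB reserved at a 16 MB offset. A small sketch for decoding that size@offset form follows. The helper name is hypothetical, and only K/M/G suffixes with an explicit offset are handled.

```python
import re

# crashkernel=<size>[KMG]@<offset>[KMG], the form used on the command line above.
CRASHKERNEL = re.compile(r"\bcrashkernel=(\d+)([KMG])@(\d+)([KMG])\b")
UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_crashkernel(cmdline):
    """Return (size_bytes, offset_bytes), or None if no size@offset form is found."""
    m = CRASHKERNEL.search(cmdline)
    if m is None:
        return None
    size = int(m.group(1)) * UNIT[m.group(2)]
    offset = int(m.group(3)) * UNIT[m.group(4)]
    return size, offset
```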

=Comment: #4=================================================
Chirag H. Jog1 <chirag.jog@in.ibm.com> - 2008-05-28 06:19 EDT
Triggered Kdump with RHEL5.1 base kernel as kdump kernel. 
This message is dumped on the console:
 SysRq : Trigger a crashdump
 Memory for crash kernel (0x0 to 0x0) notwithin permissible range

But the kdump kernel boots, and the core is captured correctly.
Comment 1 IBM Bug Proxy 2008-05-29 07:24:39 EDT
------- Comment From ankigarg@in.ibm.com 2008-05-29 07:21 EDT-------
Comment #3 of bug #44627 indicates that the 'irq nobody cared' exists in our
R2/MRG kernels. This is showing up with kdump as well. So, am guessing that this
issue with kdump will be fixed once we have the above resolved? Just a thought...
Comment 2 IBM Bug Proxy 2008-06-02 16:32:37 EDT
------- Comment From dvhltc@us.ibm.com 2008-06-02 16:27 EDT-------
Let's test with 5.2 (RH says this is a known issue with 5.1)
Comment 3 IBM Bug Proxy 2008-06-05 02:00:33 EDT
------- Comment From chirag.jog@in.ibm.com 2008-06-05 01:59 EDT-------
Triggering a kdump on RHEL5.2 on LS21 causes the system to hang.
The hang occurs while booting the kdump kernel.
Attaching the console log.
Comment 4 IBM Bug Proxy 2008-06-05 02:00:35 EDT
Created attachment 308412 [details]
Console log while triggering kdump on LS21
Comment 5 IBM Bug Proxy 2008-06-05 05:48:37 EDT
------- Comment From chirag.jog@in.ibm.com 2008-06-05 05:43 EDT-------
Adding acpi=noirq to the cmdline of the main kernel solves the problem.
This is verified on two LS21 boxes.
Comment 6 IBM Bug Proxy 2008-06-05 07:08:35 EDT
------- Comment From ankigarg@in.ibm.com 2008-06-05 07:02 EDT-------
Also, passing acpi=noirq only to the second kernel resolves the issue..the
trick was to restart the kdump service after changing the kdump kernel
commandline parameter specification in /etc/sysconfig/kdump file while trying
out the option!! So, we have a fix !!!!!
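
A rough sketch of the config change described here, for anyone scripting it. It assumes the kdump kernel's extra parameters live in a quoted KDUMP_COMMANDLINE_APPEND variable in /etc/sysconfig/kdump; the comment above does not name the variable, so treat that name and the helper as assumptions. The function only edits text; the kdump service still has to be restarted afterwards, as noted above.

```python
import re


def append_kdump_option(config_text, option, var="KDUMP_COMMANDLINE_APPEND"):
    """Return config_text with `option` added to the quoted value of `var`.

    If the option is already present the text is returned unchanged; if the
    variable is missing, a new assignment line is appended.
    """
    pattern = re.compile(r'^%s="([^"]*)"[ \t]*$' % re.escape(var), re.MULTILINE)
    m = pattern.search(config_text)
    if m is None:
        sep = "" if config_text.endswith("\n") or not config_text else "\n"
        return '%s%s%s="%s"\n' % (config_text, sep, var, option)
    words = m.group(1).split()
    if option in words:
        return config_text  # already configured, nothing to do
    words.append(option)
    new_line = '%s="%s"' % (var, " ".join(words))
    return config_text[:m.start()] + new_line + config_text[m.end():]
```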
Comment 7 Jeff Burke 2008-06-05 09:25:25 EDT
I went back to test kexec/kdump on the LS21 we have here. This time I used the
RHEL5.2 base distro with the 2.6.24.7-62.el5rt kernel. The kernel booted fine. When
I added crashkernel=128M@16M to the kernel command line, the system would panic.

-----------------------------------------------------------------------------
 Bootmem setup node 1 0000000000000000-000000007ffa4000
 Reserving 128MB of memory at 16MB for crashkernel (System RAM: 2047MB)
 PANIC: early exception rip ffffffff814bc4e9 error 0 cr2 73c8
 Pid: 0, comm: swapper Not tainted 2.6.24.7-62.el5rt #1

 Call Trace:
  [<ffffffff814bc4e9>] ? reserve_bootmem+0x14/0x22
  [<ffffffff814aeee2>] ? setup_arch+0x45c/0x4e4
  [<ffffffff814a88c8>] ? start_kernel+0x76/0x329
  [<ffffffff814a8119>] ? _sinittext+0x119/0x120

 RIP reserve_bootmem+0x14/0x22
-----------------------------------------------------------------------------

After some digging and discussions with Dave Anderson and Vivek Goyal it was
determined that we needed a patch to correct the panic. Here is what the issue was.

The code in arch/x86_64/kernel/setup_64.c to reserve boot memory for the crash
kernel uses the non-NUMA-aware reserve_bootmem function instead of the
NUMA-aware "reserve_bootmem_generic".

It was changed in commit 5c3391f9f749023a49c64d607da4fb49263690eb. Reverting
with the patch attached to the BZ fixes the issue for me.

Once I got by that part I was able to kexec/kdump into a RHEL5.2 kernel without
issue. However I do see the message:

-----------------------------------------------------------------------------
irq 9: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff800b7aba>] __report_bad_irq+0x30/0x7d
 [<ffffffff800b7ced>] note_interrupt+0x1e6/0x227
 [<ffffffff800b71f7>] __do_IRQ+0xbd/0x103
 [<ffffffff8006c3f4>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff80010970>] handle_IRQ_event+0x1b/0x58
 [<ffffffff800b71de>] __do_IRQ+0xa4/0x103
 [<ffffffff8006c3f4>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800018c8>] _stext+0x8c8/0x1000
 [<ffffffff80061f7b>] memcpy_c+0xb/0x14
 [<ffffffff8003bfac>] mm_init+0x1e6/0x227
 [<ffffffff8003e84c>] do_execve+0x88/0x243
 [<ffffffff80054760>] sys_execve+0x36/0x4c
 [<ffffffff8005d4d3>] stub_execve+0x67/0xb0

handlers:
[<ffffffff801672e8>] (acpi_irq+0x0/0x1b)
Disabling IRQ #9
-----------------------------------------------------------------------------

The message appears once, but the kdump works successfully.


Comment 8 Jeff Burke 2008-06-05 09:27:52 EDT
Created attachment 308435 [details]
Fix kdump crash kernel boot memory reservation for NUMA machines

Not sure why the LS21 here in the Westford office comes up with a single NUMA
node 1 and not node 0.
Comment 9 IBM Bug Proxy 2008-06-11 06:08:43 EDT
------- Comment From chirag.jog@in.ibm.com 2008-06-11 06:04 EDT-------
Tried the -65 MRG kernel. A lot of dump messages are thrown, after which the
system just freezes.
I am attaching the console log.
Ankita also observed the same.
Comment 10 IBM Bug Proxy 2008-06-11 06:08:45 EDT
Created attachment 308914 [details]
Console log while triggering kdump on LS21 for -65 kernel
Comment 11 IBM Bug Proxy 2008-06-12 01:56:46 EDT
------- Comment From ankigarg@in.ibm.com 2008-06-12 01:53 EDT-------
(In reply to comment #24)
> ------- Comment From jburke@redhat.com 2008-06-05 09:25 EST-------
> However I do see the message:
>
> -----------------------------------------------------------------------------
> irq 9: nobody cared (try booting with the "irqpoll" option)
>
> Call Trace:
> <IRQ>  [<ffffffff800b7aba>] __report_bad_irq+0x30/0x7d
> [<ffffffff800b7ced>] note_interrupt+0x1e6/0x227
> [<ffffffff800b71f7>] __do_IRQ+0xbd/0x103
> [<ffffffff8006c3f4>] do_IRQ+0xe7/0xf5
> [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> [<ffffffff80010970>] handle_IRQ_event+0x1b/0x58
> [<ffffffff800b71de>] __do_IRQ+0xa4/0x103
> [<ffffffff8006c3f4>] do_IRQ+0xe7/0xf5
> [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> <EOI>  [<ffffffff800018c8>] _stext+0x8c8/0x1000
> [<ffffffff80061f7b>] memcpy_c+0xb/0x14
> [<ffffffff8003bfac>] mm_init+0x1e6/0x227
> [<ffffffff8003e84c>] do_execve+0x88/0x243
> [<ffffffff80054760>] sys_execve+0x36/0x4c
> [<ffffffff8005d4d3>] stub_execve+0x67/0xb0
>
> handlers:
> [<ffffffff801672e8>] (acpi_irq+0x0/0x1b)
> Disabling IRQ #9
> -----------------------------------------------------------------------------
>

This is the issue we have seen on the LS21s we have been trying on. For this,
passing "acpi=noirq" to the second kernel has so far resolved the issue for us.
But something to note: with these messages, we have seen that the kdump kernel
hangs.

> Once but the kdump works successfully.
Comment 12 IBM Bug Proxy 2008-06-13 08:24:55 EDT
------- Comment From nivedita@us.ibm.com 2008-06-13 08:20 EDT-------
What I mean is, Ankita, can you give me the verbiage that
needs to go as instructions for kdump being set up
correctly on LS21 and HS21XM?
Comment 13 IBM Bug Proxy 2008-06-13 08:40:45 EDT
------- Comment From ankigarg@in.ibm.com 2008-06-13 08:34 EDT-------
(In reply to comment #31)
> What I mean is, Ankita, can you give me the verbiage that
> needs to go as instructions for kdump being set up
> correctly on LS21 and HS21XM?

Niv,

For MRG, the best and easiest documentation would be :

"For kdump, the stock RHEL kernel should be used as the kdump kernel. This can be
set up by setting the KDUMP_KERNELVER string in /etc/sysconfig/kdump to the
RHEL5.2 kernel version string. For example, KDUMP_KERNELVER="2.6.18-53.el5"."

The above would work for both LS21 and HS21. Coming to R2, as Darren
mentioned, we have a workaround, but we will continue to investigate the
issue till the code freeze for RC1. If we do not get a fix by then, we will
have to use the workaround. Hope that makes it clearer. Let me know if you need
any more clarification.
Comment 14 IBM Bug Proxy 2008-06-13 09:08:49 EDT
------- Comment From ankigarg@in.ibm.com 2008-06-13 09:03 EDT-------
(In reply to comment #32)
> (In reply to comment #31)
> > What I mean is, Ankita, can you give me the verbiage that
> > needs to go as instructions for kdump being set up
> > correctly on LS21 and HS21XM?
>
> Niv,
>
> For MRG, the best and easiest documentation would be :
>
> "For kdump, stock RHEL kernel should be used as the kdump kernel. This can be
> setup by making the KDUMP_KERNELVER string in /etc/sysconfig/kdump to the
> RHEL5.2 kernel version string. For example, KDUMP_KERNELVER="2.6.18-53.el5". "
>
> The above would work for both LS21 as well as for HS21. Coming to R2, as Darren
> mentioned, we have a workaround, but we will continue to investigate into the
> issue till the code freeze for RC1. If unfortunate in getting the fix, we will
> have to use the work-around. Hope that makes it clearer. Let me know if you need
> any more clarification.

err..I meant "2.6.18-92.el5" as kernel version string for RHEL5.2 kernel.
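
For anyone automating this workaround, a minimal sketch that rewrites the KDUMP_KERNELVER line in the text of /etc/sysconfig/kdump. The helper name is hypothetical and the function works on a string, so reading and writing the file (and restarting kdump) is left to the caller.

```python
import re


def set_kdump_kernelver(config_text, version):
    """Return config_text with KDUMP_KERNELVER set to `version`."""
    new_line = 'KDUMP_KERNELVER="%s"' % version
    pattern = re.compile(r"^KDUMP_KERNELVER=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace only the first assignment; a lambda avoids backslash
        # interpretation in the replacement string.
        return pattern.sub(lambda _m: new_line, config_text, count=1)
    # Variable not present yet: append a new assignment line.
    sep = "" if config_text.endswith("\n") or not config_text else "\n"
    return config_text + sep + new_line + "\n"
```

With the corrected version string, the call would be set_kdump_kernelver(text, "2.6.18-92.el5").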
Comment 15 IBM Bug Proxy 2008-06-17 02:57:13 EDT
------- Comment From chirag.jog@in.ibm.com 2008-06-17 02:52 EDT-------
I tried 2.6.24.4 (mainline) as the first and second kernel. This worked fine.
But with 2.6.24.4-rt4 (as 1st and 2nd kernel), the reported problem is still
seen.
This is all without acpi=noirq added anywhere.
So this looks like an RT problem?
Comment 16 IBM Bug Proxy 2008-06-17 03:57:03 EDT
------- Comment From chirag.jog@in.ibm.com 2008-06-17 03:55 EDT-------
Tried with 2.6.24.4-rt4 as 1st and 2.6.24.4 as 2nd.
kdump works fine with this permutation.
Comment 17 IBM Bug Proxy 2008-06-17 04:40:51 EDT
------- Comment From ankigarg@in.ibm.com 2008-06-17 04:32 EDT-------
(In reply to comment #37)
> Tried with 2.6.24.4-rt4 as 1st and 2.6.24.4 as 2nd.
> kdump works fine with this permutation.

Thanks a lot for doing this Chirag. So, while we are at it, how about having RT
kernel as the kdump kernel and mainline as the first kernel ?
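
To keep track of which first-kernel/kdump-kernel combinations have been covered, a small sketch along these lines may help. The outcomes in the table are only the ones reported in the comments above; the names are made up for illustration.

```python
from itertools import product

KERNELS = ("2.6.24.4", "2.6.24.4-rt4")  # mainline and RT, as tested above

# (first kernel, kdump kernel) -> outcome reported in this thread so far.
REPORTED = {
    ("2.6.24.4", "2.6.24.4"): "works",
    ("2.6.24.4-rt4", "2.6.24.4-rt4"): "fails",
    ("2.6.24.4-rt4", "2.6.24.4"): "works",
}


def untested_pairs(kernels=KERNELS, reported=REPORTED):
    """Return every (first, kdump) combination with no reported outcome."""
    return [pair for pair in product(kernels, repeat=2) if pair not in reported]
```

The only pair left over is mainline as the first kernel with RT as the kdump kernel, which is the permutation requested above.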
Comment 18 IBM Bug Proxy 2008-07-02 02:17:00 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 02:08 EDT-------
Just as a start point, the issue is reproducible on the latest R2 kernel (-68
kernel).
Comment 19 IBM Bug Proxy 2008-07-02 03:08:40 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 03:01 EDT-------
Checked that RT kernel can still be used as the kdump kernel on LS21 if
'acpi=noirq' option is passed to the second kernel (as in comment #20). So, for
kdump on SAN, we still do have a work around (for R2).
Comment 20 IBM Bug Proxy 2008-07-02 04:40:49 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 04:34 EDT-------
Now trying out with SR3 kernel as the kdump kernel.
Comment 21 IBM Bug Proxy 2008-07-02 05:00:38 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 04:59 EDT-------
R2 first kernel and SR3 as second kernel, kdump works fine without the
'acpi=noirq' param. Trying the next level of kernel version with relocatable
support.
Comment 22 IBM Bug Proxy 2008-07-02 05:56:38 EDT
------- Comment From chirag.jog@in.ibm.com 2008-07-02 05:51 EDT-------
Reassigning to Ankita, as she will be working on it.
Comment 23 IBM Bug Proxy 2008-07-02 06:00:40 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 05:56 EDT-------
While I was not able to capture the dump with 2.6.23.1-rt4 as the kdump kernel,
the reason for the hang was not the 'nobody cared irq' issue. Planning to try
a higher version of the RT kernel.
Comment 24 IBM Bug Proxy 2008-07-02 06:16:39 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-02 06:12 EDT-------
Oh! Turns out I was able to capture dump using 2.6.23.1-rt4 as kdump kernel
(should have waited 2 minutes longer!!). So, this implies I should look at a
higher version of the RT kernel.
Comment 25 IBM Bug Proxy 2008-07-03 03:00:49 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 02:57 EDT-------
To be doubly sure that 2.6.23 based RT kernel as kdump kernel was working fine,
I tried again with 2.6.23.1-rt11 and it did work fine. Trying now 2.6.23.11
kernel (this is the last 23 based RT kernel).
Comment 26 IBM Bug Proxy 2008-07-03 04:24:57 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 04:19 EDT-------
So ruling out 2.6.23 based RT kernels as 2.6.23.11-rt14 also worked fine. Trying
2.6.24 now. I purposely avoided having to git-bisect up until now as it is very
time consuming and too cumbersome. Will try a couple of 2.6.24 based RT kernels
and then decide on the best way to go forward.

------- Comment From ankigarg@in.ibm.com 2008-07-03 04:21 EDT-------
One thing to note is that I have been using SR3 config to build the kernels (not
yet sure if that makes any difference).
Comment 27 IBM Bug Proxy 2008-07-03 04:40:52 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 04:36 EDT-------
This is interesting. kdump worked fine with 2.6.24-rt1 as kdump kernel. From the
previous comment #36, we see that it did not for 2.6.24.4-rt4 kernel. So maybe
something in between these kernels? Worth trying out 2.6.24.4-rt4 again with the
config I have been using thus far.
Comment 28 IBM Bug Proxy 2008-07-03 08:24:56 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 08:16 EDT-------
So 2.6.24-rt1 kernel worked fine. But 2.6.24.4-rt4 kernel seems to be stuck
trying to copy the vmcore...however, note that this kernel still does not report
the 'nobody cared for irq' issue seen earlier by Chirag on the same kernel with
R2 config. So either this is a config issue or some kernel patch. Looking
further into this.

------- Comment From ankigarg@in.ibm.com 2008-07-03 08:21 EDT-------
(In reply to comment #55)
> So 2.6.24-rt1 kernel worked fine. But 2.6.24.4-rt4 kernel seems to be stuck
> trying to copy the vmcore...however, note that this kernel still does not report
> the 'nobody cared for irq' issue seen earlier by Chirag on the same kernel with
> R2 config. So either this is a config issue or some kernel patch. Looking
> further into this.

Now this is irritating..if only i had waited for 2 more minutes !! the copy
finished and the kernel is rebooting back into first kernel now! So this is
turning out to be a config issue? will try the same kernel with R2 config and
confirm.
Comment 29 IBM Bug Proxy 2008-07-03 11:41:14 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 11:37 EDT-------
So it turns out that some config diff between SR3 and R2 could be causing this.
2.6.24.4-rt4 with R2 config throws the error reported in comment #1, while it
doesn't when built with SR3 config. To confirm, will try to build the R2 kernel
with SR3 config and try that as kdump kernel.
Comment 30 IBM Bug Proxy 2008-07-03 14:08:45 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-03 14:05 EDT-------
Got the following error when building R2 kernel with SR3 config.

Building modules, stage 2.
MODPOST 1151 modules
ERROR: "scsi_dh_activate" [drivers/md/dm-multipath.ko] undefined!
ERROR: "scsi_dh_handler_exist" [drivers/md/dm-multipath.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2

For this, enabled CONFIG_SCSI_DH=m and CONFIG_SCSI_DH_RDAC=m to remove this
(this is how it is in R2 as well). With this I do not get the irq errors, but
only an incomplete vmcore was captured. Got to look further.
Comment 31 IBM Bug Proxy 2008-07-04 03:49:10 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-04 03:47 EDT-------
(In reply to comment #56)
> (In reply to comment #55)
> > So 2.6.24-rt1 kernel worked fine. But 2.6.24.4-rt4 kernel seems to be stuck
> > trying to copy the vmcore...however, note that this kernel still does not report
> > the 'nobody cared for irq' issue seen earlier by Chirag on the same kernel with
> > R2 config. So either this is a config issue or some kernel patch. Looking
> > further into this.
>
> Now this is irritating..if only i had waited for 2 more minutes !! the copy
> finished and the kernel is rebooting back into first kernel now! So this is
> turning out to be a config issue? will try the same kernel with R2 config and
> confirm.

This was a mistake. With so many kernels that I am trying out, I made a mistake
here. I tried 2.6.24-rt4 with the SR3 config and found that the kdump kernel
hung. So, now looking at the kernel diff between 2.6.24-rt1 and 2.6.24.4-rt4.
Comment 32 IBM Bug Proxy 2008-07-04 04:57:04 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-04 04:51 EDT-------
2.6.24.2-rt2 kernel hung. Since there is no RT kernel release between these two
kernels, got to look at the patch difference now. Sripathi suggested looking at
the broken out patchset between the two kernels.
Comment 33 IBM Bug Proxy 2008-07-07 09:56:59 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-07 09:48 EDT-------
Last time I found that the 2.6.24.2-rt2 kernel hung. But when I tried again, it
did not. To my surprise, even 2.6.24.3-rt3 and 2.6.24.4-rt4 are working fine as
well !! However, on the R2 kernel, the problem is consistently reproducible.
Another interesting point is that alpha01 (which has been known to have issues)
is based on 2.6.24.3 RT kernel !! So am kind of confused here. I am going to try
out alpha01 now (I will be doing this for the first time) and decide on what to
do next depending on the result.
Comment 34 IBM Bug Proxy 2008-07-07 10:00:54 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-07 09:57 EDT-------
The alpha01 kernel does hang, though can't say if it is the same issue. The
kdump kernel is printing the following messages:

/etc/rEXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode
block - inode=212241, block=425991
sd 0:0:0:0: rejecting I/O to offline device
EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block -
inode=3084891, block=6193169
sd 0:0:0:0: rejecting I/O to offline device
EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block -
inode=1387674, block=2785299c.d/rc.sysinit:
sd 0:0:0:0: rejecting I/O to offline device
Comment 35 IBM Bug Proxy 2008-07-08 06:00:45 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-08 05:58 EDT-------
(In reply to comment #62)
> The alpha01 kernel does hang, though can't say if it is the same issue. The
> kdump kernel is printing the following messages:
>
> /etc/rEXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode
> block - inode=212241, block=425991
> sd 0:0:0:0: rejecting I/O to offline device
> EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block -
> inode=3084891, block=6193169
> sd 0:0:0:0: rejecting I/O to offline device
> EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block -
> inode=1387674, block=2785299c.d/rc.sysinit:
> sd 0:0:0:0: rejecting I/O to offline device
>

On some machines, I do not get the above.
Comment 36 IBM Bug Proxy 2008-07-08 08:16:43 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-08 08:11 EDT-------
Ok so here is the result of all my findings so far. kdump works absolutely fine
till 2.6.24-rt1. With 2.6.24.2 and onwards, kdump kernel either hangs midway or
ext3 errors are thrown.

This could point to the x86 merge that happened around the same time.
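
The narrowing-down described in these comments amounts to a bisection over the ordered list of RT kernels. A hedged sketch follows, assuming each kernel gives a repeatable pass/fail result when used as the kdump kernel. is_good is a hypothetical callback that would boot the kernel and report whether a vmcore was captured; here it is replaced by a lookup of the findings summarised above.

```python
def first_bad(kernels, is_good):
    """Return the index of the first kernel for which is_good() is False.

    Assumes kernels are ordered oldest to newest, kernels[0] is good,
    kernels[-1] is bad, and results are repeatable (one good->bad transition).
    """
    lo, hi = 0, len(kernels) - 1  # invariant: kernels[lo] good, kernels[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(kernels[mid]):
            lo = mid
        else:
            hi = mid
    return hi


# RT kernels tried in this thread, oldest first.
KERNELS = [
    "2.6.23.1-rt4", "2.6.23.1-rt11", "2.6.23.11-rt14", "2.6.24-rt1",
    "2.6.24.2-rt2", "2.6.24.3-rt3", "2.6.24.4-rt4",
]
# Stand-in for real test runs: kernels reported as working up to 2.6.24-rt1.
GOOD = set(KERNELS[:4])
```

Calling first_bad(KERNELS, GOOD.__contains__) on this data points at 2.6.24.2-rt2. If results are not repeatable from run to run, the invariant breaks and the answer cannot be trusted.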
Comment 37 IBM Bug Proxy 2008-07-10 05:10:52 EDT
------- Comment From ssant@in.ibm.com 2008-07-10 05:00 EDT-------
Ankita can you please upload dmesg log for kdump kernel as well as first kernel?
Comment 38 IBM Bug Proxy 2008-07-11 00:40:43 EDT
------- Comment From ssant@in.ibm.com 2008-07-11 00:31 EDT-------
I tried kdump with 2.6.24.7-68ibmrt2.5 (as first kernel as well as dump capture
kernel) and was able to recreate the bug. During kdump boot, lots of irq-related
messages (as follows) were printed and no vmcore was captured.

irq 9: nobody cared (try booting with the "irqpoll" option)
irq 9: Some systems using an IO-APIC require a special quirk to workaround
irq 9: problems with interrupt routing. If your system requires such a quirk,
irq 9: please try booting with the "ioapic_level_quirk=1" option.
handlers:
[<ffffffff811660a0>] (acpi_irq+0x0/0x1b)

I tried adding the ioapic_level_quirk=1 command line option for the dump capture
kernel but that did not help. Looks like interrupts in the dump capture kernel do
not get routed properly. I remember there was a similar problem with x86-64 boxes
with 2.6.16ish kernels, which was eventually fixed.

Will debug further.
Comment 39 IBM Bug Proxy 2008-07-11 01:10:43 EDT
------- Comment From ssant@in.ibm.com 2008-07-11 01:00 EDT-------
I went through the boot log of 2.6.24.7-68ibmrt2.5 kernel. I saw lots of
messages related to IO-APIC as follows.

Fusion MPT SAS Host driver 3.04.06
IO-APIC: Detected an IO-APIC needing a special quirk. If you experience unusual
interrupt problems, try booting with "ioapic_level_quirk=0", or "noapic", if tha
t doesn't work either.
ACPI: PCI Interrupt 0000:03:04.0[A] -> GSI 19 (level, low) -> IRQ 19

Not sure if this is contributing to interrupt related problems in the kdump boot.

I will try adding the above mentioned option to the kernel command line and see
if that helps.
Comment 40 IBM Bug Proxy 2008-07-17 12:50:58 EDT
------- Comment From ssant@in.ibm.com 2008-07-17 12:42 EDT-------
Here is the summary so far.

I have tried booting the kdump kernel with pci=routeirq and a few other related
options, but without success.

While going through the kdump boot log I came across this message

irqpoll boot option not supported w/ CONFIG_PREEMPT_RT

So I tried removing the irqpoll option from the kdump kernel command line. But
that also did not help. The kdump kernel hangs at

ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
AC

Have attached the full boot log.
Comment 41 IBM Bug Proxy 2008-07-17 12:51:01 EDT
Created attachment 312062 [details]
kdump boot log without irqpoll option
Comment 42 Clark Williams 2008-07-19 00:05:57 EDT
looks a bit like it's going south in the ACPI code; have you tried pci=noacpi?
Comment 43 IBM Bug Proxy 2008-07-21 02:00:52 EDT
------- Comment From ssant@in.ibm.com 2008-07-21 01:53 EDT-------
In reply to previous comment.

> looks a bit like it's going south in the ACPI code; have you tried pci=noacpi?

Yes pci=noacpi does work (and acpi=noirq as well).
Comment 44 IBM Bug Proxy 2008-07-22 02:20:51 EDT
------- Comment From sripathi@in.ibm.com 2008-07-22 02:17 EDT-------
A status update:

Sachin and I were trying out various options yesterday to narrow down this
problem, based on Ankita's earlier observation that the problem does not exist
in 2.6.24-rt1 but exists in 2.6.24.2-rt2. We pulled down broken out patches for
these two versions and tried to identify offending patches in 2.6.24.2-rt2. We
pinned our suspicion on a couple of patches, compiled kernels without these
patches and tested kdump. We were not able to get kdump working.

At this stage we wanted to reconfirm that 2.6.24-rt1 works. We compiled that
kernel and to our surprise, that kernel too hung up during kdump booting, just
like 2.6.24.2-rt2!! So we are not sure about which kernel can be taken as
baseline which works.

Later yesterday evening Sachin tried to narrow down the problem by again looking
at some more broken out patches in 2.6.24.2-rt2. He could not narrow down to a
patch that causes the problem. He has now gone back to looking at the code,
rather than narrowing down the problem.
Comment 45 IBM Bug Proxy 2008-07-22 02:30:43 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-22 02:27 EDT-------
(In reply to comment #73)
> A status update:
>
> Sachin and I were trying out various options yesterday to narrow down this
> problem, based on Ankita's earlier observation that the problem does not exist
> in 2.6.24-rt1 but exists in 2.6.24.2-rt2. We pulled down broken out patches for
> these two versions and tried to identify offending patches in 2.6.24.2-rt2. We
> pinned our suspicion on a couple of patches, compiled kernels without these
> patches and tested kdump. We were not able to get kdump working.
>
> At this stage we wanted to reconfirm that 2.6.24-rt1 works. We compiled that
> kernel and to our surprise, that kernel too hung up during kdump booting, just
> like 2.6.24.2-rt2!! So we are not sure about which kernel can be taken as
> baseline which works.
>

Yikes! 2.6.24-rt1 has always worked for me as the kdump kernel. I guess the only
difference in our setup might have been the first kernel. I stuck with our
latest R2 kernel at that time as the first kernel and the other test kernels as
kdump kernel. Not sure if this should impact the results..but can't think of any
other diff in the setup! Strange!!!!
Comment 46 IBM Bug Proxy 2008-07-22 07:11:15 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-22 07:01 EDT-------
(In reply to comment #74)
> (In reply to comment #73)
> > A status update:
> >
> > Sachin and I were trying out various options yesterday to narrow down this
> > problem, based on Ankita's earlier observation that the problem does not exist
> > in 2.6.24-rt1 but exists in 2.6.24.2-rt2. We pulled down broken out patches for
> > these two versions and tried to identify offending patches in 2.6.24.2-rt2. We
> > pinned our suspicion on a couple of patches, compiled kernels without these
> > patches and tested kdump. We were not able to get kdump working.
> >
> > At this stage we wanted to reconfirm that 2.6.24-rt1 works. We compiled that
> > kernel and to our surprise, that kernel too hung up during kdump booting, just
> > like 2.6.24.2-rt2!! So we are not sure about which kernel can be taken as
> > baseline which works.
> >
>
> Yikes! 2.6.24-rt1 has always worked for me as the kdump kernel. I guess the only
> difference in our setup might have been the first kernel. I stuck with our
> latest R2 kernel at that time as the first kernel and the other test kernels as
> kdump kernel. Not sure if this should impact the results..but can't think of any
> other diff in the setup !Strange !!!!

I just tried 2.6.24-rt1 as the kdump kernel with MRG as the first kernel and the
kdump kernel failed to boot !! This is inconsistent with what I had found last
time. I wonder where things could be different...
Comment 47 IBM Bug Proxy 2008-07-22 08:10:53 EDT
------- Comment From ssant@in.ibm.com 2008-07-22 08:05 EDT-------
So here is the current status.

During kdump boot machine hangs while inside acpi_sleep_init()
(drivers/acpi/sleep/main.c).

This function has checks related to suspend and hibernation. One of them is
dmi_check_system() [ drivers/firmware/dmi_scan.c ]. This function walks through
various dmi_system_id structures. In this particular case it checks
acpisleep_dmi_table.

The machine just hangs while executing above code.

Will continue to debug further.

If someone has any inputs related to this piece of code or ACPI in general
please chime in.
Comment 48 IBM Bug Proxy 2008-07-31 00:01:43 EDT
------- Comment From ankigarg@in.ibm.com 2008-07-30 23:58 EDT-------
Sent mail to the kexec list regarding this issue.

http://lists.infradead.org/pipermail/kexec/2008-July/002262.html
Comment 49 IBM Bug Proxy 2008-08-01 09:21:23 EDT
Created attachment 313196 [details]
Boot log of kdump kernel with CONFIG_PCI_DEBUG on and 'debug' parameter

Attaching the boot log of the kdump kernel. I find that the kernel hung in the
acpi_init() call. I checked that the ACPI memory areas are passed correctly to
the kdump kernel.
Comment 50 IBM Bug Proxy 2008-08-04 03:00:55 EDT
Created attachment 313320 [details]
kdump kernel boot log with CONFIG_ACPI_DEBUG on

Attaching kdump kernel boot log with CONFIG_ACPI_DEBUG on. Looking into this
further.
Comment 51 IBM Bug Proxy 2008-08-04 06:11:42 EDT
(In reply to comment #76)
> So here is the current status.
>
> During kdump boot machine hangs while inside acpi_sleep_init()
> (drivers/acpi/sleep/main.c).
>
> This function has check related to Suspend and Hibernation. One of them is
> dmi_check_system() [ drivers/firmware/dmi_scan.c ]. This function walks through
> various dmi_system_id structures. In this particular it checks for
> acpisleep_dmi_table.
>
> The machine just hangs while executing above code.
>
> Will continue to debug further.
>
> If someone has any inputs related to this piece of code or ACPI in general
> please chime in.

Hi Sachin,

So I was trying to instrument the kernel to narrow down the location of the
kernel hang. Found that the kernel does not return from acpi_enable_subsystem().

acpi_init -> acpi_bus_init -> acpi_enable_subsystem ->
acpi_ev_install_xrupt_handlers

In this code path, an IRQ is requested for acpi_irq. ACPI continues to
remain a black box..so got to dig in deeper...
Comment 52 IBM Bug Proxy 2008-08-04 07:32:46 EDT
Not being sure if the driver code is even called at the time of pci subsystem
initialization, tried kdump without the nvidia driver being compiled as a module
and passing a generic vga argument to the kdump kernel. The kdump kernel still
hung. So, maybe some part of pci code is not able to init the card properly. Not
at all sure...we should seek help from nvidia folks.


(In reply to comment #82)
> Not being sure if the driver code is even called at the time of pci subsystem
> initialization, tried kdump without the nvidia driver being compiled as a module
> and passing a generic vga argument to the kdump kernel. The kdump kernel still
> hung. So, maybe some part of pci code is not able to init the card properly. Not
> at all sure...we should seek help from nvidia folks.

The above update is for bug #45111. So please ignore.
Comment 53 IBM Bug Proxy 2008-08-04 12:11:08 EDT
With an instrumented kernel, found that the kdump kernel seems to be hanging in
the setup_irq() routine, soon after an irq thread is created for threaded
interrupts. Now this is also one of the differences between mainline and RT
kernels..so my guess right now would be that it's the threaded nature of the acpi
interrupt that could be somehow leading to this...got to dig in further though.
Comment 54 IBM Bug Proxy 2008-08-05 02:20:55 EDT
(In reply to comment #84)
> With an instrumented kernel, found that the kdump kernel seems to be hanging in
> the setup_irq routine(), soon after an irq thread is created for threaded
> interrupts. Now this is also one of the differences between mainline and RT
> kernels..so my guess right now would be that its the threaded nature of the acpi
> interrupt that could be somehow leading to this...got to dig in further though.

One catch here however is that the same piece of code is working for Intel machines.
Comment 55 IBM Bug Proxy 2008-08-22 05:01:23 EDT
We have decided to go with the workaround of using RHEL kernel as kdump kernel
for the current release. Hence I am rejecting this bug as ALT_SOLUTION_AVAIL. We
are going to work on the problem of using real-time kernel as kdump kernel, but
we will raise a new bug for that.
Comment 56 IBM Bug Proxy 2008-11-07 04:03:56 EST
Created attachment 322829 [details]
Dump while triggering kdump on MRG
Comment 57 Clark Williams 2009-09-23 12:27:24 EDT
closing
