Bug 443364 - [5.2][kdump] kdump not work on ibm-l4b-lp1 anymore
Summary: [5.2][kdump] kdump not work on ibm-l4b-lp1 anymore
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: ppc64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Steve Best
QA Contact: Han Pingtian
URL:
Whiteboard:
Duplicates: 592231
Depends On:
Blocks:
Reported: 2008-04-21 06:51 UTC by Qian Cai
Modified: 2011-04-25 03:05 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-25 03:05:13 UTC
Target Upstream Version:
Embargoed:


Attachments
kdump works fine with RHEL5U1 kernel (14.03 KB, text/plain), 2008-04-21 06:51 UTC, Qian Cai
sosreport (2.23 MB, application/octet-stream), 2008-04-21 06:52 UTC, Qian Cai
lsmod output from the power5 box where kdump works (3.28 KB, text/plain), 2008-11-17 08:51 UTC, IBM Bug Proxy
lspci output from the power5 box where kdump works (135 bytes, text/plain), 2008-11-17 08:51 UTC, IBM Bug Proxy
Contents of /proc/device-tree/ (18.54 KB, application/octet-stream), 2008-12-18 17:03 UTC, IBM Bug Proxy
Debug patch (3.12 KB, text/plain), 2008-12-22 06:11 UTC, IBM Bug Proxy


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 49876 0 None None None Never

Description Qian Cai 2008-04-21 06:51:38 UTC
Description of problem:
The capture kernel fails to start on one ppc64 box, ibm-l4b-lp1.test.redhat.com:

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Apr 15 18:43:25 EDT 2008
-----------------------------------------------------
ppc64_pft_size                = 0x17
physicalMemorySize            = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address                  = 0x0000000000000000
htab_hash_mask                = 0xffff
physical_start                = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-90.el5kdump (brewbuilder.redhat.com) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 15 18:43:25 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
    sp: c000000002583bb0
   msr: 8000000000001000
  current = 0xc000000002464430
  paca    = 0xc000000002464f00
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

It works fine, though, with the RHEL5U1 kernel (-53.el5).
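For readers decoding the trace above, the msr value 8000000000001000 can be broken into individual flag bits. The sketch below assumes the usual 64-bit PowerPC MSR bit layout. The bit table and the decode_msr helper are written for illustration, list only a handful of bits, and are not taken from this bug.

```python
# Bit positions assume the usual 64-bit PowerPC MSR layout; the kernel's
# asm/reg.h is the authoritative list. Only a few bits are shown here.
MSR_BITS = {
    "SF": 63,  # 64-bit mode
    "EE": 15,  # external interrupts enabled
    "PR": 14,  # problem (user) state
    "ME": 12,  # machine check enabled
    "IR": 5,   # instruction address translation
    "DR": 4,   # data address translation
    "RI": 1,   # recoverable interrupt
}


def decode_msr(value):
    """Return the names of the known MSR bits set in value, high bit first."""
    return [name for name, bit in MSR_BITS.items() if value & (1 << bit)]


msr = 0x8000000000001000  # "msr:" value from the trace above
print(decode_msr(msr))          # ['SF', 'ME']
print("RI" in decode_msr(msr))  # False
```

Only SF and ME appear set. RI being clear would be consistent with the "exception is not recoverable" warning, and IR/DR being clear suggests address translation was off at that point, though that reading is an interpretation rather than something stated in the report.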

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080416.0
kernel-2.6.18-90.el5
kexec-tools-1.102pre-21.el5

How reproducible:
Always on ibm-l4b-lp1.test.redhat.com

Steps to Reproduce:
1. Configure kdump with crashkernel=256M@32M
2. Trigger a crash with SysRq-C
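
For reference, crashkernel=256M@32M reserves 256 MiB starting at physical offset 32 MiB. A minimal sketch of that arithmetic follows. The helper name parse_crashkernel is made up for illustration, and only the simple size@offset form with K/M/G suffixes is handled.

```python
import re

# Hypothetical helper, for illustration only: parse a "size@offset"
# crashkernel= value such as "256M@32M" into byte counts.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def _to_bytes(text):
    match = re.fullmatch(r"(\d+)([KMG])", text.upper())
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_crashkernel(value):
    """Return (size_in_bytes, offset_in_bytes) for a 'size@offset' string."""
    size_text, _, offset_text = value.partition("@")
    return _to_bytes(size_text), _to_bytes(offset_text)


size, offset = parse_crashkernel("256M@32M")
print(hex(offset))         # 0x2000000, start of the reserved region
print(hex(offset + size))  # 0x12000000, end of the reserved region
```

These two values appear to line up with physical_start = 0x2000000 and physicalMemorySize = 0x12000000 printed by the capture kernel in the log above, which suggests the reservation itself was set up as requested.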

Comment 1 Qian Cai 2008-04-21 06:51:38 UTC
Created attachment 303113
kdump works fine with RHEL5U1 kernel

Comment 2 Qian Cai 2008-04-21 06:52:17 UTC
Created attachment 303114
sosreport

Comment 3 Ed Pollard 2008-04-30 15:54:17 UTC
Assigning to Brad as he works most of the Power issues.

Comment 4 Qian Cai 2008-10-22 11:22:37 UTC
It hangs with the RHEL 5.3 Beta kernel:

kernel-2.6.18-120.el5
kexec-tools-1.102pre-46.el5

Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga)
Kernel 2.6.18-120.el5 on an ppc64

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...


Although I have only seen it on this particular PPC64 machine, it is a regression, so I am proposing an exception to see whether we can fix this in RHEL 5.3.

Comment 5 RHEL Program Management 2008-10-22 11:30:33 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 6 IBM Bug Proxy 2008-11-13 08:41:44 UTC
To Red Hat:

What kind of hardware is this? Is there any specific reason for trying the -90.el5 kernel?
I have a couple of Power boxes with 5.2 GA (-92.el5) where kdump works fine, so this could be specific to this hardware.

Comment 7 Qian Cai 2008-11-13 09:00:43 UTC
The -90.el5 kernel was the version I was testing for RHEL 5.2 before release. It also did not work with RHEL 5.3 Beta; please see comment #4 (it was a private comment before). The only information I have for this system is:

LOCATION  	RDU, Lab 331
CPUSPEED  	1656
MODEL           IBM,9117-570

Do you think that information is sufficient? If not, we may need to ask the system administrators for more information.

Comment 8 IBM Bug Proxy 2008-11-13 09:51:44 UTC
In reply to previous comment

Thanks, Cai, for the information. It seems to be a Power5 box. I will try to set up a similar box with 5.3 Beta and try out kdump.

Comment 9 Qian Cai 2008-11-13 10:39:49 UTC
While we are on this, there might be another Kdump bug on Power5 machines.

 Bug 471204 -  [5.3] Kdump Kernel Hangs for ipr Device Driver

Comment 10 IBM Bug Proxy 2008-11-14 18:53:10 UTC
I just tried kdump with the latest RHEL 5.3 kernel on a Power5 box and did not hit any problem. I was able to capture a vmcore file. Here are the details:

Kernel version  :
Linux xxxxxxxx.xx.xxx.xxx  2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008 ppc64 ppc64 ppc64 GNU/Linux

cpuinfo :

cpu             : POWER5 (gr)
clock           : 1656.384000MHz
revision        : 2.3 (pvr 003a 0203)

timebase        : 207048000
platform        : pSeries
machine         : CHRP IBM,9117-570

Cai @RH, can you enable early boot debug messages and see whether that provides any more information?

Comment 11 IBM Bug Proxy 2008-11-17 08:51:40 UTC
Created attachment 323746
lsmod output from the power5 box where kdump works.

Comment 12 IBM Bug Proxy 2008-11-17 08:51:44 UTC
Created attachment 323747
lspci output from the power5 box where kdump works

Comment 14 RHEL Program Management 2008-11-18 13:11:23 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 15 IBM Bug Proxy 2008-11-18 14:43:05 UTC
In case it helps, here is the firmware level from the Power5 box where kdump works fine:

# cat /proc/device-tree/openprom/ibm,fw-vernum_encoded

SF225_096

#

Comment 16 IBM Bug Proxy 2008-11-18 17:07:14 UTC
Cai,

Any luck reproducing on a system with FW mentioned above?

Comment 17 Qian Cai 2008-11-19 11:03:03 UTC
The machine is unavailable to me at the moment. I'll get back to you when I have it.

Comment 18 Qian Cai 2008-11-24 02:18:52 UTC
(In reply to comment #10)
> I just tried kdump with RHEL 5.3 latest kernel on a power5 box and did not face
> any problem. I was able to capture a vmcore file. Here are the details
> 
> Kernel version  :
> Linux xxxxxxxx.xx.xxx.xxx  2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008
> ppc64 ppc64 ppc64 GNU/Linux
> 
> cpuinfo :
> 
> cpu             : POWER5 (gr)
> clock           : 1656.384000MHz
> revision        : 2.3 (pvr 003a 0203)
> 
> timebase        : 207048000
> platform        : pSeries
> machine         : CHRP IBM,9117-570
> 
> Cai @RH, can you enable early boot debug messages and see if that provides any
> more information ?

I have tried both of the following with the latest kexec-tools (1.102pre-50.el5) and kernel (2.6.18-124.el5) packages:

earlyprintk=serial,ttyS0,115200

earlyprintk=serial,hvc0

The failure was the same:

# SysRq : Trigger a crashdump
Sending IPI to other cpus...
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
    sp: c000000002593e70
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

Some quick information is below. For anything else, please see the "sosreport" in comment #2.

# cat /proc/device-tree/openprom/ibm,fw-vernum_encoded
SF240_358

# cat /proc/cpuinfo 
processor	: 0
cpu		: POWER5 (gr)
clock		: 1654.344000MHz
revision	: 2.1 (pvr 003a 0201)

processor	: 1
cpu		: POWER5 (gr)
clock		: 1654.344000MHz
revision	: 2.1 (pvr 003a 0201)

processor	: 2
cpu		: POWER5 (gr)
clock		: 1654.344000MHz
revision	: 2.1 (pvr 003a 0201)

processor	: 3
cpu		: POWER5 (gr)
clock		: 1654.344000MHz
revision	: 2.1 (pvr 003a 0201)

timebase	: 207050000
platform	: pSeries
machine		: CHRP IBM,9117-570

# lsmod
Module                  Size  Used by
autofs4                52601  2 
hidp                   59969  2 
rfcomm                 93689  0 
l2cap                  68313  10 hidp,rfcomm
bluetooth             113173  5 hidp,rfcomm,l2cap
sunrpc                317257  1 
ipv6                  494273  30 
xfrm_nalgo             29653  1 ipv6
crypto_api             30905  1 xfrm_nalgo
dm_multipath           49945  0 
scsi_dh                29981  1 dm_multipath
snd_powermac           97457  0 
snd_seq_dummy          23189  0 
snd_seq_oss            75441  0 
snd_seq_midi_event     28009  1 snd_seq_oss
snd_seq               107193  5 snd_seq_dummy,snd_seq_oss,snd_seq_midi_event
snd_seq_device         29405  3 snd_seq_dummy,snd_seq_oss,snd_seq
snd_pcm_oss            84001  0 
snd_mixer_oss          43297  1 snd_pcm_oss
snd_pcm               144501  2 snd_powermac,snd_pcm_oss
snd_page_alloc         32401  1 snd_pcm
snd_timer              53025  2 snd_seq,snd_pcm
snd                   117457  8 snd_powermac,snd_seq_oss,snd_seq,snd_seq_device,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer
soundcore              28601  1 snd
i2c_core               47409  1 snd_powermac
parport_pc             63913  0 
lp                     36425  0 
parport                75197  2 parport_pc,lp
ibmveth                45721  0 
sg                     69609  0 
dm_snapshot            46313  0 
dm_zero                20017  0 
dm_mirror              46481  0 
dm_log                 35469  1 dm_mirror
dm_mod                119889  10 dm_multipath,dm_snapshot,dm_zero,dm_mirror,dm_log
ibmvscsic              51921  3 
sd_mod                 48985  5 
scsi_mod              242121  4 scsi_dh,sg,ibmvscsic,sd_mod
ext3                  210033  2 
jbd                   110825  1 ext3
uhci_hcd               57241  0 
ohci_hcd               55085  0 
ehci_hcd               69897  0 

I have the machine in question reserved at the moment. Feel free to grab it.

Comment 20 IBM Bug Proxy 2008-11-27 07:22:43 UTC
In reply to previous comment :

> # SysRq : Trigger a crashdump
> Sending IPI to other cpus...
> cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
> pc: 0000000000000200
> lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
> sp: c000000002593e70

So this looks like a different call trace than the one reported previously. The earlier call trace showed that the kdump kernel started booting and then crashed.

........
Linux version 2.6.18-90.el5kdump (brewbuilder.redhat.com)
(gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 15 18:43:25 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
pc: 0000000000000200
lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
sp: c000000002583bb0

.......

So from the latest trace it seems that the kdump kernel did not boot (at least going by the messages).

Since I am still not able to recreate this on a local box, can you please add some instrumentation to the kexec/kdump-specific code of the first kernel and see whether that gives any more information? Also, please enable the early debugging kernel config option.
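
To make the comparison between the two traces easier to repeat on further console captures, the "lr:" lines can be pulled out mechanically. This is only a sketch: the lr_symbols helper and its regular expression are written for illustration and assume the xmon-style layout seen in the traces in this bug.

```python
import re

# Matches lines such as
#   "    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0"
LR_LINE = re.compile(
    r"^\s*lr:\s*[0-9a-f]+:\s*\.?(\w+)\+(0x[0-9a-f]+)/0x[0-9a-f]+",
    re.MULTILINE,
)


def lr_symbols(console_text):
    """Return (symbol, offset) for every 'lr:' line in a console capture."""
    return [(m.group(1), m.group(2)) for m in LR_LINE.finditer(console_text)]


first_report = """\
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
"""
latest_report = """\
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
"""

print(lr_symbols(first_report))   # [('rtas_progress', '0x54')]
print(lr_symbols(latest_report))  # [('of_find_node_by_path', '0x30')]
```

Both traces show the same machine check vector and pc, and only the link register symbol differs, which is the difference described above.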

Comment 21 IBM Bug Proxy 2008-12-03 06:31:57 UTC
Cai, were you able to gather more information on this?

Comment 22 Qian Cai 2008-12-03 07:32:24 UTC
(In reply to comment #20)
> Since i am still not able to recreate this on a local box , can you please add
> some instrumentation in the kexec/kdump specific code of first kernel and see
> if that gives any more information. 

What debug code do you want me to add there?

> Also enable early debugging kernel config option.

Can you let me know which option you want me to add to the second kernel? As I mentioned before, I have tried:

earlyprintk=serial,ttyS0,115200

earlyprintk=serial,hvc0

Unfortunately, they did not give me any obvious indication of the underlying problem.

In addition, since the problematic machine has firmware version SF240_358, maybe you could update the firmware on your box to see whether that lets you reproduce the problem?
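
A small sketch for comparing the two firmware strings quoted in this bug. How the fields are read here (a platform prefix, then a release number, then a level within that release) is an assumption about the naming scheme and not something confirmed in this report. parse_fw_level is a hypothetical helper.

```python
import re


def parse_fw_level(text):
    """Split a string like 'SF240_358' into (prefix, release, level).

    The meaning given to the three fields is an assumption.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)_(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised firmware string: {text!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


failing = parse_fw_level("SF240_358")  # ibm-l4b-lp1, from comment #18
working = parse_fw_level("SF225_096")  # IBM's Power5 box, from comment #15

print(failing)                  # ('SF', 240, 358)
print(working)                  # ('SF', 225, 96)
print(failing[1] > working[1])  # True
```

Under that reading, the failing machine is on a later release than the box where kdump works, which is why matching the firmware seems worth trying.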

Comment 23 IBM Bug Proxy 2008-12-03 15:33:42 UTC
I tried kdump on another machine (9117 p570) with a similar firmware level, SF240, and kdump worked just fine. I was able to save a vmcore file without any problems. I tried with the 2.6.18-120.el5 kernel.

Cai, can you try to recreate this on a different machine?

Also, can you make sure CONFIG_PPC_EARLY_DEBUG is enabled in both the first and second kernels? That should print some extra messages during boot.

Another thing you could try is to compile a kdump kernel with the following change:

#undef DEBUG =======>>  #define DEBUG

in files

arch/powerpc/kernel/prom.c
arch/powerpc/kernel/prom_init.c

This should print a few more debug messages during kdump boot. That way we will know whether the kdump kernel has started booting or not.
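
The edit described above can be scripted when it has to be applied to more than one tree. The following is a minimal sketch: enable_debug is a hypothetical helper, and the sample text is invented rather than copied from prom.c. The macro name is a parameter because a later comment in this bug mentions DEBUG for prom.c and DEBUG_PROM for prom_init.c.

```python
import re


def enable_debug(source, macro="DEBUG"):
    """Turn '#undef MACRO' lines into '#define MACRO' in C source text."""
    pattern = re.compile(
        r"^([ \t]*)#[ \t]*undef[ \t]+" + re.escape(macro) + r"[ \t]*$",
        re.MULTILINE,
    )
    # A function is used as the replacement so the macro name is never
    # interpreted as a regex replacement template.
    return pattern.sub(lambda m: f"{m.group(1)}#define {macro}", source)


# Invented sample, not the real file contents.
sample = "#include <stdarg.h>\n#undef DEBUG\n#undef DEBUG_PROM\n"

print(enable_debug(sample))
print(enable_debug(sample, macro="DEBUG_PROM"))
```

The pattern matches the whole macro name only, so enabling DEBUG leaves an "#undef DEBUG_PROM" line untouched and vice versa.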

Comment 24 Qian Cai 2008-12-04 06:13:52 UTC
(In reply to comment #23)
> I tried kdump on another machine [ 9117 - p570 ] with similar firmware level
> SF240 and kdump worked just fine. I was able to save vmcore file without any
> problems. I tried with 2.6.18-120.el5 level of kernel.
> 
> Cai, can you try to recreate this on a different machine ?
> 

I have tried on a different machine,

MODEL  	IBM,9117-570
VENDOR  IBM,0210D815C
PROCESSORS  8
MEMORY  8192
CPUSPEED 2200

but have not seen any problem there.

> Also can you make sure CONFIG_PPC_EARLY_DEBUG is enabled in both first and
> second kernel ? That should print some extra messages during boot.
> 
> Another thing you could try is to compile a kdump kernel and make the following
> change
> 
> #undef DEBUG =======>>  #define DEBUG
> 
> in files
> 
> arch/powerpc/kernel/prom.c
> arch/powerpc/kernel/prom_init.c
> 
> This should print few more debug messages during kdump boot. That way we will
> know if the kdump kernel has started booting or not.

OK, I'll try to enable debug options and re-test it as soon as possible.

Comment 25 IBM Bug Proxy 2008-12-10 10:12:05 UTC
Cai, did you get a chance to add some debug code and recreate this bug?

Comment 26 Qian Cai 2008-12-10 10:56:44 UTC
Not yet. I put in a reservation for the machine, but have had no luck after nearly a week.

Comment 27 Qian Cai 2008-12-11 08:54:48 UTC
I have compiled a new kernel (call it 2.6.18-126.el5.ppcdebug) with CONFIG_PPC_EARLY_DEBUG enabled and with DEBUG (prom.c) and DEBUG_PROM (prom_init.c) defined. I have tried the following combinations of normal and kdump kernels.

------------------------------
kernel-2.6.18-126.el5.ppcdebug
kernel-kdump-2.6.18-126.el5.ppcdebug
Result: the kdump kernel hung without further messages,

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...


------------------------------
kernel-2.6.18-126.el5.ppcdebug
kernel-kdump-2.6.18-92.el5
Result: the kdump kernel reset,

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Apr 29 13:56:48 EDT 2008
-----------------------------------------------------
ppc64_pft_size                = 0x17
physicalMemorySize            = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address                  = 0x0000000000000000
htab_hash_mask                = 0xffff
physical_start                = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-92.el5kdump (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:56:48 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
    sp: c000000002583bb0
   msr: 8000000000001000
  current = 0xc000000002464430
  paca    = 0xc000000002464f00
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue


--------------------
kernel-2.6.18-92.el5
kernel-kdump-2.6.18-126.el5.ppcdebug
Result: the kdump kernel reset, with either of the following,

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Wed Dec 10 23:44:53 EST 2008
-----------------------------------------------------
ppc64_pft_size                = 0x17
physicalMemorySize            = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address                  = 0x0000000000000000
htab_hash_mask                = 0xffff
physical_start                = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-126.el5.ppcdebugandnoxwkdump (mockbuild.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Dec 10 23:44:53 EST 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593930]
    pc: 0000000000000200
    lr: c00000000201e368: .rtas_progress+0x54/0x3e0
    sp: c000000002593bb0
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

or,

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d7c: .of_find_node_by_path+0x30/0xac
    sp: c000000002593e70
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue


--------------------
kernel-2.6.18-92.el5
kernel-kdump-2.6.18-92.el5
This is the only combination that works.

Comment 28 IBM Bug Proxy 2008-12-16 06:11:59 UTC
> ------------------------------
> kernel-2.6.18-126.el5.ppcdebug
> kernel-kdump-2.6.18-126.el5.ppcdebug
> Result in kdump kernel hung without further message,
>
> ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
> Sending IPI to other cpus...
>
> ------------------------------

Hm, so this does not provide any clue about the problem. I think we will need to add extra printks in prom.c and prom_init.c to find the root cause. Ameet, can you help Cai debug this issue?

Also, can you upload the device tree (the contents of /proc/device-tree) from that machine? I can then compare it with the device tree on the system where I have successfully tested kdump.

Comment 29 Ameet Paranjape 2008-12-16 15:14:56 UTC
> Ameet can you help Cai debug this issue ?

Yes, I am taking a look.

Comment 33 IBM Bug Proxy 2008-12-18 17:03:51 UTC
Created attachment 327340
Contents of /proc/device-tree/

Attached as per the request in comment #37.

Comment 34 IBM Bug Proxy 2008-12-19 22:31:56 UTC
I have tried this a couple of times now, but my custom kernel with debug params and printks just seems to sit here:

[root@ibm-l4b-lp1 ~]# echo "c" > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Sending IPI to other cpus...

I have given this third attempt 1.5 hours to display something. I will see what it says overnight, because I am not sure what else to do.

Comment 35 IBM Bug Proxy 2008-12-22 06:11:58 UTC
Created attachment 327615
Debug patch

Ameet, here is a debug patch I have created for this issue. It just adds some printks to the code. Please compile both kernels (the first kernel as well as the kdump kernel) with this patch. Please make sure you have enabled the CONFIG_PPC_EARLY_DEBUG option. Also make sure the earlyprintk= command line option is added to the kdump boot.

Let's hope this gives us some information about the problem.

Comment 36 IBM Bug Proxy 2008-12-22 06:21:50 UTC
In reply to previous comment :

> Also make sure  earlyprintk=  command line option is added to kdump boot.

The option should be console=<serial console related option>. I don't know whether the earlyprintk option is supported on ppc64; it is probably only supported on ia32/x86_64.

Comment 38 RHEL Program Management 2009-02-16 15:26:52 UTC
Updating PM score.

Comment 42 Qian Cai 2009-05-01 16:26:55 UTC
This machine is currently unavailable because it fails to install. I'll close this bug for now and re-open it when I see another occurrence.

Comment 43 Neil Horman 2010-05-18 10:26:58 UTC
*** Bug 592231 has been marked as a duplicate of this bug. ***

Comment 44 Neil Horman 2010-05-18 10:27:41 UTC
Looks like Han has access to this system again and can reproduce this failure, as per bz 592231

Comment 46 Neil Horman 2010-06-02 14:02:07 UTC
Ameet, what's the status of this bz?

Comment 51 Steve Best 2010-11-30 16:21:04 UTC
CAI Qian,

Does this still fail on the latest nightly for RHEL 5.6?

-Steve

Comment 52 Han Pingtian 2010-12-01 02:38:32 UTC
(In reply to comment #51)
> CAI Qian,
> 
> Does this still fail on the latest nightly for RHEL 5.6?
> 
> -Steve

Yeah, it failed on snapshot3. I have to exclude all ibm-l4b-lp* from our kdump testing.

Comment 55 Han Pingtian 2011-04-25 02:57:46 UTC
Kdump tests passed with RHEL5.7-20110409.3 on ibm-l4b-lp1.rhts.eng.rdu.redhat.com.

