Description of problem:
Capture kernel failed to start on one ppc64 box, ibm-l4b-lp1.test.redhat.com.

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Apr 15 18:43:25 EDT 2008
-----------------------------------------------------
ppc64_pft_size = 0x17
physicalMemorySize = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0xffff
physical_start = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-90.el5kdump (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 15 18:43:25 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
    sp: c000000002583bb0
   msr: 8000000000001000
  current = 0xc000000002464430
  paca    = 0xc000000002464f00
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

However, it works fine with the RHEL5U1 kernel -53.el5.

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080416.0
kernel-2.6.18-90.el5
kexec-tools-1.102pre-21.el5

How reproducible:
Always on ibm-l4b-lp1.test.redhat.com

Steps to Reproduce:
1. Configure kdump with crashkernel=256M@32M
2. SysRq-C
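As a sanity check on the banner above, the crashkernel=256M@32M reservation can be converted to addresses and compared with the physical_start and physicalMemorySize values the capture kernel printed. The following is a minimal Python sketch; `parse_crashkernel` is a hypothetical helper written for illustration (it is not part of kexec-tools) and it only handles the simple size@offset form with K/M/G suffixes.

```python
import re

# Hypothetical helper for illustration only; supports just "size@offset"
# with K/M/G suffixes, which is the form used in this report.
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_crashkernel(value):
    """Return (start, end) byte addresses for a 'size@offset' crashkernel value."""
    m = re.fullmatch(r"(\d+)([KMG])@(\d+)([KMG])", value)
    if m is None:
        raise ValueError("unsupported crashkernel syntax: %r" % value)
    size = int(m.group(1)) * _UNITS[m.group(2)]
    start = int(m.group(3)) * _UNITS[m.group(4)]
    return start, start + size

start, end = parse_crashkernel("256M@32M")
print(hex(start), hex(end))  # 0x2000000 0x12000000
```

The start address equals the physical_start = 0x2000000 line in the banner, and the end address equals the reported physicalMemorySize = 0x12000000, so the capture kernel appears to have picked up the reserved region as configured before the machine check.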
Created attachment 303113 [details] kdump works fine with RHEL5U1 kernel
Created attachment 303114 [details] sosreport
Assigning to Brad as he works most of the Power issues.
It hangs with the RHEL 5.3 Beta kernel.

kernel-2.6.18-120.el5
kexec-tools-1.102pre-46.el5

Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga)
Kernel 2.6.18-120.el5 on an ppc64

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...

Although I have only seen it on this particular PPC64 machine, it is a regression, so I propose an exception to see if we can fix this in RHEL 5.3.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
To Red Hat: What kind of hardware is this? Is there any specific reason for trying out the -90.el5 kernel? I have a couple of Power boxes with 5.2 GA (-92.el5) where kdump works fine, so this could be specific to this hardware.
The -90.el5 kernel was the version I was testing for RHEL 5.2 before release. kdump also did not work with the RHEL 5.3 Beta. Please see comment #4 (it was a private comment before). The only information I have for this system is:

LOCATION RDU, Lab 331
CPUSPEED 1656
MODEL IBM,9117-570

Do you think that information is sufficient? If not, we may need to ask the system administrators to give us more information.
In reply to previous comment: Thanks, Cia, for the information. This seems to be a Power5 box. I will try to set up a similar box with the 5.3 beta and try out kdump.
While we are on this, there might be another Kdump bug on Power5 machines. Bug 471204 - [5.3] Kdump Kernel Hangs for ipr Device Driver
I just tried kdump with the latest RHEL 5.3 kernel on a Power5 box and did not face any problem. I was able to capture a vmcore file. Here are the details.

Kernel version:
Linux xxxxxxxx.xx.xxx.xxx 2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008 ppc64 ppc64 ppc64 GNU/Linux

cpuinfo:
cpu : POWER5 (gr)
clock : 1656.384000MHz
revision : 2.3 (pvr 003a 0203)

timebase : 207048000
platform : pSeries
machine : CHRP IBM,9117-570

Cia @RH, can you enable early boot debug messages and see if that provides any more information?
Created attachment 323746 [details] lsmod output from the power5 box where kdump works.
Created attachment 323747 [details] lspci output from the power5 box where kdump works
If this can be of any help, here is the firmware level from the Power5 box where kdump works fine.

# cat /proc/device-tree/openprom/ibm,fw-vernum_encoded
SF225_096
#
Cia, Any luck reproducing on a system with FW mentioned above?
The machine is unavailable to me at the moment. I'll get back to you when I have it.
(In reply to comment #10)
> I just tried kdump with RHEL 5.3 latest kernel on a power5 box and did not face
> any problem. I was able to capture a vmcore file. Here are the details
>
> Kernel version :
> Linux xxxxxxxx.xx.xxx.xxx 2.6.18-122.el5 #1 SMP Mon Nov 3 18:23:41 EST 2008
> ppc64 ppc64 ppc64 GNU/Linux
>
> cpuinfo :
>
> cpu : POWER5 (gr)
> clock : 1656.384000MHz
> revision : 2.3 (pvr 003a 0203)
>
> timebase : 207048000
> platform : pSeries
> machine : CHRP IBM,9117-570
>
> Cia @RH, can you enable early boot debug messages and see if that provides any
> more information ?

I have tried both of these with the latest kexec-tools (1.102pre-50.el5) and kernel (2.6.18-124.el5) packages:

earlyprintk=serial,ttyS0,115200
earlyprintk=serial,hvc0

It was the same failure:

# SysRq : Trigger a crashdump
Sending IPI to other cpus...
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
    sp: c000000002593e70
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

Some quick information is below. Otherwise, please see "sosreport" in comment #2.
# cat /proc/device-tree/openprom/ibm,fw-vernum_encoded
SF240_358

# cat /proc/cpuinfo
processor : 0
cpu : POWER5 (gr)
clock : 1654.344000MHz
revision : 2.1 (pvr 003a 0201)

processor : 1
cpu : POWER5 (gr)
clock : 1654.344000MHz
revision : 2.1 (pvr 003a 0201)

processor : 2
cpu : POWER5 (gr)
clock : 1654.344000MHz
revision : 2.1 (pvr 003a 0201)

processor : 3
cpu : POWER5 (gr)
clock : 1654.344000MHz
revision : 2.1 (pvr 003a 0201)

timebase : 207050000
platform : pSeries
machine : CHRP IBM,9117-570

# lsmod
Module                  Size  Used by
autofs4                52601  2
hidp                   59969  2
rfcomm                 93689  0
l2cap                  68313  10 hidp,rfcomm
bluetooth             113173  5 hidp,rfcomm,l2cap
sunrpc                317257  1
ipv6                  494273  30
xfrm_nalgo             29653  1 ipv6
crypto_api             30905  1 xfrm_nalgo
dm_multipath           49945  0
scsi_dh                29981  1 dm_multipath
snd_powermac           97457  0
snd_seq_dummy          23189  0
snd_seq_oss            75441  0
snd_seq_midi_event     28009  1 snd_seq_oss
snd_seq               107193  5 snd_seq_dummy,snd_seq_oss,snd_seq_midi_event
snd_seq_device         29405  3 snd_seq_dummy,snd_seq_oss,snd_seq
snd_pcm_oss            84001  0
snd_mixer_oss          43297  1 snd_pcm_oss
snd_pcm               144501  2 snd_powermac,snd_pcm_oss
snd_page_alloc         32401  1 snd_pcm
snd_timer              53025  2 snd_seq,snd_pcm
snd                   117457  8 snd_powermac,snd_seq_oss,snd_seq,snd_seq_device,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer
soundcore              28601  1 snd
i2c_core               47409  1 snd_powermac
parport_pc             63913  0
lp                     36425  0
parport                75197  2 parport_pc,lp
ibmveth                45721  0
sg                     69609  0
dm_snapshot            46313  0
dm_zero                20017  0
dm_mirror              46481  0
dm_log                 35469  1 dm_mirror
dm_mod                119889  10 dm_multipath,dm_snapshot,dm_zero,dm_mirror,dm_log
ibmvscsic              51921  3
sd_mod                 48985  5
scsi_mod              242121  4 scsi_dh,sg,ibmvscsic,sd_mod
ext3                  210033  2
jbd                   110825  1 ext3
uhci_hcd               57241  0
ohci_hcd               55085  0
ehci_hcd               69897  0

The machine in question is reserved by me at the moment. Feel free to grab it.
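The two firmware levels reported so far (SF225_096 on the box where kdump works, SF240_358 on this one) can also be read and compared from a script. This is a rough Python sketch; the helper names are made up for illustration, and treating the part before the underscore as a "release" and the part after it as a "build" is an assumption about the string format, not something stated in this bug.

```python
import os

# Device-tree property quoted in this bug report.
FW_PROP = "/proc/device-tree/openprom/ibm,fw-vernum_encoded"

def read_fw_level(path=FW_PROP):
    """Return the firmware level string, e.g. 'SF240_358'."""
    with open(path, "rb") as f:
        # Device-tree string properties are NUL-terminated.
        return f.read().rstrip(b"\0\n").decode("ascii", "replace")

def split_fw_level(level):
    """Split 'SF240_358' into ('SF240', '358'). The meaning of the two
    parts (release, build) is an assumption made for this sketch."""
    release, _, build = level.partition("_")
    return release, build

def same_release(a, b):
    """True when two levels share the part before the underscore."""
    return split_fw_level(a)[0] == split_fw_level(b)[0]

if os.path.exists(FW_PROP):
    # Only meaningful on a machine that exposes this property.
    print(read_fw_level())

print(same_release("SF225_096", "SF240_358"))
```

For the two levels quoted here the comparison reports that they differ before the underscore, which is one reason to retest on a box whose firmware is closer to SF240_358.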
In reply to previous comment:
> # SysRq : Trigger a crashdump
> Sending IPI to other cpus...
> cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
> pc: 0000000000000200
> lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
> sp: c000000002593e70

So this looks like a different call trace than the one reported previously. The earlier call trace showed that the kdump kernel started booting and then crashed.

........
Linux version 2.6.18-90.el5kdump (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 15 18:43:25 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
    sp: c000000002583bb0
.......

So from the latest trace it seems that the kdump kernel did not boot (at least going by the messages). Since I am still not able to recreate this on a local box, can you please add some instrumentation to the kexec/kdump-specific code of the first kernel and see if that gives any more information? Also, please enable the early debugging kernel config option.
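The comparison made above (which function the link register points into, and whether the capture kernel printed its banner before the machine check) can be pulled out of a console log mechanically. The following Python sketch is only an illustration: the regular expression is fitted to the traces pasted in this bug, and using the "Starting Linux PPC64" line as the sign that the capture kernel began booting is a heuristic assumed here.

```python
import re

# Matches the machine-check header and the following "lr:" line as they
# appear in the console output pasted in this bug.
TRACE_RE = re.compile(
    r"cpu (0x[0-9a-f]+): Vector: ([0-9a-f]+) \(([^)]*)\)"      # cpu, vector, name
    r".*?lr: [0-9a-f]+: \.?(\w+)\+(0x[0-9a-f]+)/0x[0-9a-f]+",  # symbol, offset
    re.S,
)

def summarize(log):
    """Return (banner_seen, traces) for a console log.

    banner_seen: True if the capture kernel printed its start banner.
    traces: list of (cpu, vector, name, lr_symbol, lr_offset) tuples.
    """
    banner_seen = "Starting Linux PPC64" in log
    traces = [m.groups() for m in TRACE_RE.finditer(log)]
    return banner_seen, traces

early = """Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Apr 15 18:43:25 EDT 2008
Machine check in kernel mode.
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
"""

late = """SysRq : Trigger a crashdump
Sending IPI to other cpus...
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d78: .of_find_node_by_path+0x30/0xac
"""

for name, log in (("early", early), ("late", late)):
    print(name, summarize(log))
```

Run on the two excerpts, the earlier log shows the banner with the link register in rtas_progress, while the later one shows no banner and the link register in of_find_node_by_path.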
Cia, were you able to gather more information on this ?
(In reply to comment #20)
> Since i am still not able to recreate this on a local box , can you please add
> some instrumentation in the kexec/kdump specific code of first kernel and see
> if that gives any more information.

What debug code do you want me to add there?

> Also enable early debugging kernel config option.

Can you let me know which option you want me to add to the second kernel? As I mentioned before, I have tried:

earlyprintk=serial,ttyS0,115200
earlyprintk=serial,hvc0

Unfortunately, they did not give me any obvious indication of the underlying problem. In addition, since the problematic machine has the SF240_358 version of the firmware, maybe you could try updating the firmware to see if that lets you reproduce the problem?
I tried kdump on another machine [ 9117 - p570 ] with a similar firmware level, SF240, and kdump worked just fine. I was able to save a vmcore file without any problems. I tried with the 2.6.18-120.el5 kernel.

Cia, can you try to recreate this on a different machine?

Also, can you make sure CONFIG_PPC_EARLY_DEBUG is enabled in both the first and the second kernel? That should print some extra messages during boot.

Another thing you could try is to compile a kdump kernel with the following change:

#undef DEBUG =======>> #define DEBUG

in the files

arch/powerpc/kernel/prom.c
arch/powerpc/kernel/prom_init.c

This should print a few more debug messages during kdump boot. That way we will know whether the kdump kernel has started booting or not.
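The source change described above is a one-line edit per file, and it could be scripted so that it is applied the same way to every test build. This is a hedged Python sketch; `enable_debug` is a hypothetical helper, and the macro name is a parameter because the exact name in each file should be checked in the tree being built (a later comment in this bug mentions DEBUG for prom.c and DEBUG_PROM for prom_init.c).

```python
import re

def enable_debug(source, macro="DEBUG"):
    """Turn '#undef MACRO' lines into '#define MACRO'.

    Returns (new_source, number_of_lines_changed). Only an exact macro
    name is matched, so macro='DEBUG' leaves '#undef DEBUG_PROM' alone.
    """
    pattern = re.compile(
        r"^([ \t]*)#[ \t]*undef[ \t]+%s[ \t]*$" % re.escape(macro), re.M
    )
    return pattern.subn(lambda m: m.group(1) + "#define " + macro, source)

sample = "#undef DEBUG\n\nstatic int x;\n"
patched, count = enable_debug(sample)
print(count)
print(patched, end="")
```

To apply it to a real tree, read each file, call `enable_debug` with the macro that file actually uses, and write the result back only when the returned count is 1; a count of 0 means the expected `#undef` line was not found and the file should be inspected by hand.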
(In reply to comment #23)
> I tried kdump on another machine [ 9117 - p570 ] with similar firmware level
> SF240 and kdump worked just fine. I was able to save vmcore file without any
> problems. I tried with 2.6.18-120.el5 level of kernel.
>
> Cia, can you try to recreate this on a different machine ?
>

I have tried on a different machine,

MODEL IBM,9117-570
VENDOR IBM,0210D815C
PROCESSORS 8
MEMORY 8192
CPUSPEED 2200

but have not seen any problem there.

> Also can you make sure CONFIG_PPC_EARLY_DEBUG is enabled in both first and
> second kernel ? That should print some extra messages during boot.
>
> Another thing you could try is to compile a kdump kernel and make the following
> change
>
> #undef DEBUG =======>> #define DEBUG
>
> in files
>
> arch/powerpc/kernel/prom.c
> arch/powerpc/kernel/prom_init.c
>
> This should print few more debug messages during kdump boot. That way we will
> know if the kdump kernel has started booting or not.

OK, I'll try to enable the debug options and re-test as soon as possible.
Cia, did you get a chance to add some debug code and recreate this bug ?
Not yet. I put in a reservation for the machine, but have had no luck after nearly a week.
I have compiled a new kernel (say 2.6.18-126.el5.ppcdebug) with CONFIG_PPC_EARLY_DEBUG, with DEBUG (prom.c) and DEBUG_PROM (prom_init.c). I have tried the following combinations of normal and kdump kernels.

------------------------------
kernel-2.6.18-126.el5.ppcdebug
kernel-kdump-2.6.18-126.el5.ppcdebug

Resulted in the kdump kernel hanging without further messages:

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...

------------------------------
kernel-2.6.18-126.el5.ppcdebug
kernel-kdump-2.6.18-92.el5

Resulted in a kdump kernel reset:

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Tue Apr 29 13:56:48 EDT 2008
-----------------------------------------------------
ppc64_pft_size = 0x17
physicalMemorySize = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0xffff
physical_start = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-92.el5kdump (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:56:48 EDT 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002583930]
    pc: 0000000000000200
    lr: c00000000201dfa0: .rtas_progress+0x54/0x3e0
    sp: c000000002583bb0
   msr: 8000000000001000
  current = 0xc000000002464430
  paca    = 0xc000000002464f00
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

--------------------
kernel-2.6.18-92.el5
kernel-kdump-2.6.18-126.el5.ppcdebug

Resulted in a kdump kernel reset with either

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
Partition configured for 4 cpus.
Starting Linux PPC64 #1 SMP Wed Dec 10 23:44:53 EST 2008
-----------------------------------------------------
ppc64_pft_size = 0x17
physicalMemorySize = 0x12000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0xffff
physical_start = 0x2000000
-----------------------------------------------------
Linux version 2.6.18-126.el5.ppcdebugandnoxwkdump (mockbuild.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Wed Dec 10 23:44:53 EST 2008
Machine check in kernel mode.
Caused by (from SRR1=8000000000001000): Transfer error ack signal
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593930]
    pc: 0000000000000200
    lr: c00000000201e368: .rtas_progress+0x54/0x3e0
    sp: c000000002593bb0
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

or

ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
Sending IPI to other cpus...
cpu 0x1: Vector: 200 (Machine Check) at [c000000002593bf0]
    pc: 0000000000000200
    lr: c000000002024d7c: .of_find_node_by_path+0x30/0xac
    sp: c000000002593e70
   msr: 8000000000001000
  current = 0xc0000000024c8930
  paca    = 0xc0000000024c9400
    pid   = 0, comm = swapper
WARNING: exception is not recoverable, can't continue

--------------------
kernel-2.6.18-92.el5
kernel-kdump-2.6.18-92.el5

This is the only combination that works.
> ------------------------------
> kernel-2.6.18-126.el5.ppcdebug
> kernel-kdump-2.6.18-126.el5.ppcdebug
> Result in kdump kernel hung without further message,
>
> ibm-l4b-lp1.test.redhat.com login: SysRq : Trigger a crashdump
> Sending IPI to other cpus...
>
> ------------------------------

Hm, so this does not provide any clue about the problem. I think we will need to add extra printks in prom.c and prom_init.c to find the root cause.

Ameet, can you help Cia debug this issue?

Also, can you upload the device tree (the contents of /proc/device-tree) from that machine? I can then compare it with the device tree on the system where I have successfully tested kdump.
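One way to do the device-tree comparison proposed above, once both trees have been copied into ordinary directories, is to read every property as raw bytes and list what is missing or different. The following is a minimal Python sketch; the function names are invented for illustration and it assumes each tree is available as a plain directory copy of /proc/device-tree.

```python
import os

def read_tree(root):
    """Map each property path (relative to root) to its raw bytes."""
    props = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                with open(full, "rb") as f:
                    props[rel] = f.read()
            except OSError:
                props[rel] = None  # property exists but could not be read
    return props

def diff_trees(a, b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of paths."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed

# Tiny in-memory example; real use would be
# diff_trees(read_tree(dir_failing), read_tree(dir_working)).
failing = {"openprom/ibm,fw-vernum_encoded": b"SF240_358\0", "extra": b"\0"}
working = {"openprom/ibm,fw-vernum_encoded": b"SF225_096\0"}
print(diff_trees(failing, working))
```

Many properties (serial numbers, partition names, addresses) will differ between any two machines, so the output is only a starting list; nodes that exist on one box and not the other are probably the more interesting part.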
> Ameet can you help Cia debug this issue ?

Yes, I am taking a look.
Created attachment 327340 [details] Contents of /proc/device-tree/ Attached as per the request in comment #37.
I have tried this a couple of times now, but my custom kernel with debug params and printks just seems to sit here:

[root@ibm-l4b-lp1 ~]# echo "c" > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Sending IPI to other cpus...

I have given this third attempt 1.5 hours to display something. I will see what it says overnight, because I am not sure what else to do...
Created attachment 327615 [details] Debug patch
Ameet, here is a debug patch I have created for this issue. It just adds some printks to the code. Please compile both kernels (the first kernel as well as the kdump kernel) with this patch. Please make sure you have enabled the CONFIG_PPC_EARLY_DEBUG option. Also make sure the earlyprintk= command line option is added to the kdump boot. Let's hope this gives us some information about the problem.
In reply to previous comment:
> Also make sure earlyprintk= command line option is added to kdump boot.

The option should be console=<serial console related option>. I don't know if the earlyprintk option is supported on ppc64. It is probably only supported on ia32/x86_64.
Updating PM score.
This machine is currently unavailable due to an installation failure. I'll close this bug for now and re-open it when I see another occurrence.
*** Bug 592231 has been marked as a duplicate of this bug. ***
Looks like Han has access to this system again and can reproduce this failure, as per bz 592231
Ameet, what's the status on this bz?
CAI Qian,

Does this fail on the latest nightly for RHEL 5.6?

-Steve
(In reply to comment #51)
> CAI Qian,
>
> Does this failed on the latest nightly for RHEL 5.6?
>
> -Steve

Yeah, it failed on snapshot3. I have to exclude all ibm-l4b-lp* from our kdump testing.
Kdump tests passed with RHEL5.7-20110409.3 on ibm-l4b-lp1.rhts.eng.rdu.redhat.com.