Bug 241212

Summary: kdump fails on LS20/LS21
Product: Red Hat Enterprise MRG
Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel
Assignee: Guy Streeter <streeter>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Version: 1.0
CC: alan, anderson, nhorman, nobody
Keywords: Reopened
Hardware: All
OS: Linux
Fixed In Version: RHEL5.1
Doc Type: Bug Fix
Last Closed: 2007-10-22 15:29:31 UTC
Bug Depends On: 239398
Attachments (description, flags):
  Kernel boot log (none)
  fix_kdump_panic_k8_edac.patch (none)
  dmesg_kdump_kernel_edac_debug_on (none)
  llm38_edac_addr_printk_2.log (none)
  Pass reset_devices=1 parameter to kdump kernel (none)

Description IBM Bug Proxy 2007-05-24 13:24:03 UTC
LTC Owner is: jstultz.com
LTC Originator is: ankigarg.com


Problem description:
Kdump kernel panics while loading EDAC modules

Provide output from "uname -a", if possible:
Linux llm39.in.ibm.com 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64
x86_64 x86_64 GNU/Linux


Hardware Environment
    Machine type (p650, x235, SF2, etc.): LS20/LS21
    Cpu type (Power4, Power5, IA-64, etc.):Dual Core AMD Opteron(tm) Processor 275

Is this reproducible? Yes.
service kdump start
echo c > /proc/sysrq-trigger

---
I posted the attached patch to the EDAC mailing list. The maintainer, Doug
Thompson, has agreed to the fix and will pick it up in the next release of
the k8_edac module. The mailing list discussion is at:

http://sourceforge.net/mailarchive/forum.php?thread_name=20070424100935.GA3039%40in.ibm.com&forum_name=bluesmoke-devel

Comment 1 IBM Bug Proxy 2007-05-24 13:24:03 UTC
Created attachment 155341
Kernel boot log

Comment 2 IBM Bug Proxy 2007-05-24 13:25:25 UTC
Created attachment 155342
fix_kdump_panic_k8_edac.patch

Fix for kdump panic due to k8_edac modules.

Comment 4 Clark Williams 2007-05-25 17:29:48 UTC
Hmmmm. My tree doesn't have a drivers/edac/k8_edac.c...

Am I missing another patch to create k8_edac.c?

Clark


Comment 5 IBM Bug Proxy 2007-05-25 18:30:26 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mannthey.com




------- Additional Comments From mannthey.com (prefers email at kmannth.com)  2007-05-25 14:29 EDT -------
Clark,
  The k8_edac driver is in RHEL5 only; it is not in mainline.  That code
currently exists only in the EDAC cvs tree. I have had some issues with the
driver that are on my todo list to fix. 

I will be asking Red Hat to pick up the fixed k8_edac driver as part of our
userspace ECC error detection work within the next 2 weeks. 

Comment 6 IBM Bug Proxy 2007-06-02 10:20:40 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-02 06:19 EDT -------
Clark,

Sorry for not making things clear. Here are the details:

On RHEL5, the kernel is relocatable, so the same kernel is also used as the
kdump kernel and no extra kernel needs to be shipped. For -rt, however, support
for a relocatable kernel is not yet in, which would require shipping an extra
rpm for the kdump kernel. To avoid the extra effort of building and maintaining
an additional kernel rpm, from both Red Hat's and our perspective, it would be
nice to use the RHEL5 kernel itself as the kdump kernel for -rt, which its
relocatable kernel support makes possible. But the RHEL5 kernel panics on
LS20/21 due to the EDAC drivers, which, as Keith mentioned, are not in mainline
but are present in RHEL5. With this patch, the RHEL5 kernel would work fine as
the kdump kernel with -rt as the first kernel. Also, this would fix RHEL5 on
LS20/21. 

Comment 7 Alan Cox 2007-06-05 09:23:15 UTC
None of this answers the question *WHY* did the system log a corrupt processor
context, that according to the docs I have is a serious hardware failure event
and it would be nice to know why such an event is "just lying around" when the
kdump kernel is loaded ?


Comment 8 IBM Bug Proxy 2007-06-05 09:40:46 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 05:38 EDT -------
There is some description in RedHat Issue Tracker No. 119116. The following is
from the findings by  Chandru from the IS team.

The k8_edac module checks for memory context of the processor. In the kdump
kernel, the kernel tries to access memory outside of its own (while copying
/proc/vmcore, it accesses the memory of the first kernel, which is outside of
what it is allowed). This is reported by the EDAC modules as corrupt memory
context and panics.

Hope that helps. 

Comment 9 Alan Cox 2007-06-05 09:52:20 UTC
This doesn't make any sense. It's at odds with the documentation and at odds
with tested behaviour of fault handling on those processors.

Accessing the memory of the first kernel shouldn't be causing the CPU to log an
unrecoverable CPU error, and if it does the kernel MCE traps ought to be
panicking when it does this. What exactly is being touched when this occurs - are
you erroneously copying I/O mappings and thus confusing hardware (which needs to
be fixed properly) or what ?

What occurs if you check the bit every time you copy a page of the old
kernel - what bus address range is triggering the fault and what is located there ?


Comment 10 IBM Bug Proxy 2007-06-05 10:15:29 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-05 06:09 EDT -------
Chandru, could you please provide some details on the copying of /proc/vmcore?

Comment 11 IBM Bug Proxy 2007-06-06 09:05:21 UTC
----- Additional Comments From chandru.s.com  2007-06-06 05:03 EDT -------
From earlier investigation, edac was in a loop within the following code 
running every poll_msec time interval. 
-----------
EDAC DEBUG: do_edac_check()
EDAC DEBUG: check_mc_devices()
EDAC DEBUG: k8_check()
EDAC DEBUG: k8_check()
EDAC DEBUG: do_pci_parity_check()
-----------

Once we attempted to copy /proc/vmcore (via 'cp /proc/vmcore <destination>'),
we would succeed in copying only a partial vmcore, of a different size on
each run (most likely depending on when the poll interval expired). The above
loop would then detect an error condition during the next polling cycle, log
the 'GART TLB' and 'processor context corrupt' errors, and call panic. We
probably need to find a way to check the condition (regs->nbsh & BIT(25))
every time a page of the old kernel is copied. 
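
The test mentioned above, (regs->nbsh & BIT(25)), can be tried outside the
kernel on values taken from the logs. What follows is a minimal Python sketch,
not driver code. It assumes only what this report states, namely that bit 25
of the Northbridge Status High value flags "processor context corrupt". The
helper names bit() and pcc_set() are made up for illustration, and the sample
values are the nbsh values quoted in the EDAC messages elsewhere in this
report.

```python
# Illustrative only: mirrors the kernel-side test (regs->nbsh & BIT(25)).
PCC_BIT = 25  # per this thread: bit 25 of Northbridge Status High


def bit(n):
    """Return a mask with only bit n set, like the kernel's BIT(n) macro."""
    return 1 << n


def pcc_set(nbsh):
    """True if the 'processor context corrupt' bit is set in nbsh."""
    return bool(nbsh & bit(PCC_BIT))


# nbsh values copied from the NorthBridge ERROR lines in this report.
for nbsh in (0xa6000001, 0xa6000002):
    print(hex(nbsh), "PCC set" if pcc_set(nbsh) else "PCC clear")
```

Both logged values have the bit set, which matches the "processor context
corrupt" message that accompanies them.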

Comment 13 Alan Cox 2007-06-06 15:35:33 UTC
Or wait twice the poll time after each page during the copy so you know about
which page was hit. My guess is still that there are I/O space mappings or ACPI
mappings which are being copied and causing the CPU faults. If so these need to
be fixed not the edac code.
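
The "wait twice the poll time after each page" idea could be sketched in
userspace roughly as follows. This is a hypothetical helper, not a tested
procedure: the function name copy_throttled, the 4 KiB page size, and the
assumption that the caller knows the driver's poll interval in milliseconds
are all illustrative.

```python
import time

PAGE_SIZE = 4096  # assumption: 4 KiB pages on x86_64


def copy_throttled(src_path, dst_path, poll_msec, page_size=PAGE_SIZE):
    """Copy src to dst one page at a time.

    The file offset is printed before each read, and the loop sleeps for
    2 * poll_msec after each page, so that the last offset printed before
    a panic points at the read that preceded the EDAC poll.
    Returns the number of bytes copied.
    """
    delay = 2 * poll_msec / 1000.0
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            print("reading offset 0x%x" % offset, flush=True)
            page = src.read(page_size)
            if not page:
                break
            dst.write(page)
            dst.flush()
            offset += len(page)
            time.sleep(delay)
    return offset
```

It would be called with /proc/vmcore as the source and the configured poll
interval as poll_msec. Note that the offsets printed are offsets into the
vmcore ELF file; they still have to be mapped to physical addresses through
the ELF program headers.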


Comment 14 IBM Bug Proxy 2007-06-06 18:10:39 UTC
----- Additional Comments From mannthey.com (prefers email at kmannth.com)  2007-06-06 14:04 EDT -------
Looking into this, I have the following thoughts.  

1.  We do need to track down why the error is happening.  The MCE Processor
framework thinks something is wrong. 

2.  Changing this error to a printk is the right thing to do.  As stated on the
mailing list, the error being raised is a Processor Context Corrupt;
polling this bit may not make a lot of sense, and a panic is too heavy handed.
Eric Biederman explains the situation best. 

"
Re: [PATCH] Fix to make k8_edac kdump aware
From: <ebiederm@xm...> - 2007-05-07 06:24
Doug Thompson <norsk5@ya...> writes:

> In another email from Eric, he did point out that the PCC error really
> does NOT have much information on what it really means. I am leaning to
> just PULL the check of the PCC bit, and thus pull the logging and the
> panic as well.
>
> AMD's BKDG does not give much information on it really.


PCC is extremely well defined. Processor Context Corrupt means that
the machine check handler does not have enough information to resume
the instruction stream, the exception interrupted. It is part of the
generic machine check infrastructure.

However if you are not in a machine check handler PCC is much less
meaningful. The notion that you can't return to the interrupted
exception stream doesn't mean much when you haven't interrupted
an exception stream. All you know is that the error is BAD.

So the ambiguity of PCC comes from the fact that we are polling.

Doug does that make sense?

Eric 

"

This change from a panic to a printk has been made in the current
edac/bluesmoke cvs tree. 

Comment 15 IBM Bug Proxy 2007-06-07 01:30:37 UTC
----- Additional Comments From mannthey.com (prefers email at kmannth.com)  2007-06-06 21:26 EDT -------
Hmm, I was able to dd /proc/vmcore without any issue in the kexec kernel
context.   I was booted to a shell after the panic and did the following. 

root:/> dd if=/proc/vmcore of=/dev/null bs=512
16532583+1 records in
16532583+1 records out
root:/> cat /proc/version 
Linux version 2.6.18-8.el5 (brewbuilder.redhat.com) 

I have tried a few other things but have not been able to recreate the EDAC
panic.  I am on an LS21.  

I tried moving the kexec kernel hole to 128M@16M but kexec fails to load the
kernel.  How can I recreate this issue? 

Comment 16 IBM Bug Proxy 2007-06-07 10:30:18 UTC
------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-07 06:28 EDT -------
(In reply to comment #20)
> 
> I tried moving the kexec kernel hole to 128M@16M but kexec fails to load the
> kernel.  How can I recreate this issue? 

Keith,

To recreate, instead of dropping into a shell, enable copying of the vmcore to a
particular location; for example, you could uncomment the 'path /var/crash/'
option in /etc/kdump.conf. Moreover, we would want this option to work, as in
this case the user would not need to do anything extra to store the dump. 

I tried the above on a LS21. 

Comment 17 IBM Bug Proxy 2007-06-07 23:45:37 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|block                       |high




------- Additional Comments From dvhltc.com  2007-06-07 19:39 EDT -------
Dropping Severity to high as IBM has a workaround for internal use while it's
being worked. 

Comment 18 IBM Bug Proxy 2007-06-07 23:46:43 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P2




------- Additional Comments From dvhltc.com  2007-06-07 19:43 EDT -------
Dropping prio to P2 as we have a workaround for now. 

Comment 20 Guy Streeter 2007-06-19 18:46:25 UTC
Is the attached patch, changing the panic to a printk, the only thing needed here?

Comment 21 IBM Bug Proxy 2007-06-20 11:51:00 UTC
------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-20 07:46 EDT -------
(In reply to comment #29)
>
> ------- Additional Comments From streeter  2007-06-19 14:46 EST -------
> Is the attached patch, changing the panic to a printk, the only thing needed here?
>
I have tested it out and this seems to be fixing the issue for me. I have tried
it on a LS20 and a LS21. 

Comment 22 Guy Streeter 2007-06-20 20:02:25 UTC
I don't understand this bug report at all. It appears to be a bug in something
that is not in our RT kernel. What do you expect us to do about it?

Comment 23 IBM Bug Proxy 2007-06-20 22:45:40 UTC
----- Additional Comments From mannthey.com (prefers email at kmannth.com)  2007-06-20 18:39 EDT -------
I believe this bug was filed to keep track of the issue more than anything else
as we were without a kdump solution for RHEL5-RT. 

The fix was just to change the bug to a printk and we believe 2.6.18-23.el5 has
the patch we need. 

We have tested kdump and the EDAC error has been fixed (or masked if you will)
so it should be safe to close this bug. There are still outstanding issues of
why the error was raised in the first place but we have been a little scattered
about understanding the root cause. 

Comment 24 Alan Cox 2007-06-21 11:39:56 UTC
The patch you presented is utterly bogus. You've completely failed to do the
necessary trivial debugging to understand why it occurs and the results could
be really messy in future if the bug is (as I suspect) that you are writing out
MMIO mapped pages from the old kernel.


You've not fixed a bug. The CPU is still reporting you did something terrible
and undefined and unsafe. You've papered over it and prayed. That kind of
horrible hack doesn't belong in an enterprise grade Linux product.

Guy: Please ensure this horrible hack doesn't get into the kernel. 


Comment 25 IBM Bug Proxy 2007-06-21 14:29:39 UTC
------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-21 08:01 EDT -------
(In reply to comment #35)
> 
> We have tested kdump and the EDAC error has been fixed (or masked if you will)
> so it should be safe to close this bug. 

On most of the machines I tested, the EDAC issue is still seen. So, the patch is
required. But yes, we could close this particular bug, as the same issue is now
being actively tracked in 33374 for inclusion of the patch into plain RHEL5.

Keith, I have pinged Red Hat to point me to the sources of the -23 kernel so
that I can verify whether the patch is in, as this kernel fails to boot even as
the first kernel on many machines. 

Comment 26 IBM Bug Proxy 2007-06-25 12:26:30 UTC
----- Additional Comments From smaneesh.com (prefers email at maneesh.com)  2007-06-25 08:22 EDT -------
As per the latest email exchanges, it seems that RH is now working around this
issue by using "reset_devices" flag. IIUC, the plan is to check this flag in
edac driver while kdump boot and ignore the error.

Meanwhile, I tried the edac driver code from upstream
(http://sourceforge.net/projects/bluesmoke/), edac-2007-may-2 release, on the
RHEL5-RT kernel, and I could copy the kdump without any error message or panic.
I am looking at the diffs, but a large restructuring seems to have been done.
Currently trying to narrow it down to the minimum changes. 

It would be helpful if some edac expert could also look at these changes. 

Comment 27 IBM Bug Proxy 2007-06-25 12:35:46 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-25 08:34 EDT -------
Ok, I updated my findings in the wrong bug.

So as was suggested, tried kdump with iommu=off and swiotlb=force parameters
independently set and also together. But the issue persisted. 

Comment 28 Alan Cox 2007-06-25 12:39:26 UTC
Upstream contains the erroneous panic/printk change


Comment 29 Alan Cox 2007-06-25 12:40:19 UTC
From the l/k list it is now looking like the issue is not programs referencing
the GART but the fact that GART mappings are in use by existing drivers during
the kexec/kdump and the kdump kernel then invalidates the mappings on them.


Comment 30 IBM Bug Proxy 2007-06-25 12:40:40 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-25 08:34 EDT -------
Attaching edac debug messages obtained from the second kernel, before panic is
triggered. 

Comment 31 IBM Bug Proxy 2007-06-25 12:41:09 UTC
Created attachment 157749
dmesg_kdump_kernel_edac_debug_on

Comment 32 IBM Bug Proxy 2007-06-25 12:41:14 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-25 08:35 EDT -------
 
edac_debug_messages_from_second_kernel 

Comment 33 IBM Bug Proxy 2007-06-25 12:45:33 UTC
------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-25 08:39 EDT -------
(In reply to comment #38)
> As per the latest email exchanges, it seems that RH is now working around this
> issue by using "reset_devices" flag. IIUC, the plan is to check this flag in
> edac driver while kdump boot and ignore the error.

But this approach has not been ACKed by the community and is indeed a wrong
usage of the reset_devices flag. In fact, by converting the panic to a printk,
at least we would know that something went amiss, but by completely avoiding
it we would miss that piece of information.

Also, I would like to point out that the AMD documentation says that if the 25th
bit of the Northbridge Status High register is 1, then the processor context
_might_ be corrupted. So, it is possible that it is a false alarm. 

Investigating further to obtain some more information from the EDAC code. 

Comment 34 IBM Bug Proxy 2007-06-25 12:55:32 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2                          |P1




------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-25 08:50 EDT -------
bumping prio as the code freeze is on 27 June. 

Comment 35 IBM Bug Proxy 2007-06-25 16:20:32 UTC
------- Additional Comments From smaneesh.com (prefers email at maneesh.com)  2007-06-25 12:14 EDT -------
(In reply to comment #43)
> ----- Additional Comments From alan  2007-06-25 08:39 EST -------
> Upstream contains the erroneous panic/printk change
> 
> 
> -- 

I have verified the source code and also tested the "edac-2007-may-2" release
from the sourceforge link I mentioned, and the panic/printk change is _not_ yet
there. 

Comment 36 IBM Bug Proxy 2007-06-25 17:05:52 UTC
----- Additional Comments From mannthey.com (prefers email at kmannth.com)  2007-06-25 13:03 EDT -------
Maneesh,
  The panic/printk change was in cvs (last week).  May 2 is too old.  Please see
the current cvs tree and the thread from comment #4. 

Comment 37 IBM Bug Proxy 2007-06-25 18:05:19 UTC
------- Additional Comments From smaneesh.com (prefers email at maneesh.com)  2007-06-25 14:03 EDT -------
(In reply to comment #47)
> Maneesh,
>   The panic/printk change was in cvs (last week).  May 2 is too old.  Please see
> the current cvs tree and the thread from comment #4. 

Keith, agreed, but the May 2 release is more recent than the RHEL5 level and
does work well. I picked up the edac-2007-may-2 release, as this is the last
stable release for edac from the sourceforge site. 

Comment 38 IBM Bug Proxy 2007-06-26 17:30:50 UTC
----- Additional Comments From smaneesh.com (prefers email at maneesh.com)  2007-06-26 13:27 EDT -------
I could narrow it down to one patch, driver-edac-add-nmi.patch from the
edac-2007-may-2 release, which made the difference. But close analysis of the
code revealed that the error condition was bypassed there also. It made the
main error checking loop conditional, in the edac_kernel_thread() routine.

                if(edac_assert_error_check_and_clear())
                        do_edac_check();

And the code for enabling the assert is missing for K8 chipset. So, the assert
never gets fired. IOW, no point backporting upstream edac code.

Ankita, any more information from your debugging? 

Comment 39 IBM Bug Proxy 2007-06-26 18:20:30 UTC
------- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-26 14:15 EDT -------
(In reply to comment #49)
> I could narrow it down to one patch, driver-edac-add-nmi.patch from
> edac-2007-may-2 release which made the difference. But close analysis of the
> code revealed that the error condition was bypassed there also. It has made main
> error checking loop as conditional, in edac_kernel_thread() routine.
> 
>                 if(edac_assert_error_check_and_clear())
>                         do_edac_check();
> 
> And the code for enabling the assert is missing for K8 chipset. So, the assert
> never gets fired. IOW, no point backporting upstream edac code.
> 
Yeah, so here also we are trying to bypass the checking!

> Ankita, any more information from your debugging?
> 
I tried to print the addresses that the vmcore copy was trying to access. At
the point the EDAC error showed up, the addresses were very much within the
System RAM range. 

Comment 40 IBM Bug Proxy 2007-06-26 18:30:31 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-26 14:28 EDT -------
Came across a few relevant commandline options which I need to try next. 

Comment 41 IBM Bug Proxy 2007-06-27 11:16:27 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-27 07:14 EDT -------
Since the panic shows up only at the time of reading the vmcore file, trying to
get more information on the addresses accessed. Just by printing the address
might miss the offending one. Using other aids to do so. 

Comment 42 IBM Bug Proxy 2007-06-27 13:45:39 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-27 09:41 EDT -------
Ok, so here is some debug data that I managed to collect. At the time of copying
the vmcore file, as each page is accessed, I call the do_edac_check routine to
perform the status check and also save the value of the page addr into a global
variable. When the panic situation is reported, I print the global addr value. 

Also pasted is the output of /proc/iomem & /proc/meminfo from the first kernel.

[root@llm38 ~]# cat ~ankita/latest_iomem 
00000000-0009d3ff : System RAM
0009d400-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c8fff : Video ROM
000c9000-000ca5ff : Adapter ROM
000f0000-000fffff : System ROM
00100000-edfcddbf : System RAM
  00200000-0045a997 : Kernel code
  0045a998-0059052f : Kernel data
  01000000-08ffffff : Crash kernel
edfcddc0-edfcffff : ACPI Tables
edfd0000-edffffff : reserved
ee000000-efffffff : PCI Bus #02
  effa0000-effbffff : 0000:02:02.0
  effc0000-effdffff : 0000:02:02.0
  effe0000-effeffff : 0000:02:01.1
    effe0000-effeffff : tg3
  efff0000-efffffff : 0000:02:01.0
    efff0000-efffffff : tg3
f0000000-fcffffff : PCI Bus #01
  f0000000-f7ffffff : 0000:01:04.0
  f8000000-f801ffff : 0000:01:04.0
fd000000-feafffff : PCI Bus #01
  feae0000-feaeffff : 0000:01:04.0
  feafe000-feafefff : 0000:01:00.1
    feafe000-feafefff : ohci_hcd
  feaff000-feafffff : 0000:01:00.0
    feaff000-feafffff : ohci_hcd
feb00000-febfffff : PCI Bus #02
  feb00000-febfffff : 0000:02:02.0
fec00000-ffffffff : reserved
100000000-151ffffff : System RAM

[root@llm38 ~]# cat ~ankita/latest_meminfo 
MemTotal:      4950964 kB
MemFree:       4693984 kB
Buffers:         15168 kB
Cached:         169424 kB
SwapCached:          0 kB
Active:          62376 kB
Inactive:       157244 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4950964 kB
LowFree:       4693984 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB
Dirty:             168 kB
Writeback:           0 kB
AnonPages:       35008 kB
Mapped:           9120 kB
Slab:            17104 kB
PageTables:       3516 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   4515724 kB
Committed_AS:    76636 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      1360 kB
VmallocChunk: 34359736783 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB


Data regarding the page addresses is attached. It indicates 32 contiguous page
accesses. 

Comment 43 IBM Bug Proxy 2007-06-27 13:50:27 UTC
Created attachment 158007
llm38_edac_addr_printk_2.log

Comment 44 IBM Bug Proxy 2007-06-27 13:50:44 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-27 09:44 EDT -------
 
addresses of vmcore page access that result in edac error

The format is:
NorthBridge ERROR: mci(0xffff810008bf4000) node(1) ErrAddr(0x00000000-37c90008)
nbsh(0xa6000002) nbsl(0x0005001b)
EDAC k8 MC1: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC1: extended error code: GART error
MC1: processor context corruptthe addr value is : 201337056
						  ^^^^^^^^^ is the page
address. 
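
For anyone post-processing such logs, the fields of the NorthBridge ERROR line
can be pulled out with a regular expression. This is a sketch under stated
assumptions: the pattern is written against the single sample line above, and
treating ErrAddr(0xHHHHHHHH-LLLLLLLL) as the high and low 32-bit halves of one
address is a guess, not something this report confirms. The function name
parse_nb_error is hypothetical.

```python
import re

# Sample line copied from this comment (joined onto one line).
LINE = ("NorthBridge ERROR: mci(0xffff810008bf4000) node(1) "
        "ErrAddr(0x00000000-37c90008) nbsh(0xa6000002) nbsl(0x0005001b)")

PATTERN = re.compile(
    r"node\((?P<node>\d+)\)\s+"
    r"ErrAddr\(0x(?P<hi>[0-9a-fA-F]+)-(?P<lo>[0-9a-fA-F]+)\)\s+"
    r"nbsh\((?P<nbsh>0x[0-9a-fA-F]+)\)\s+"
    r"nbsl\((?P<nbsl>0x[0-9a-fA-F]+)\)"
)


def parse_nb_error(line):
    """Return the numeric fields of a NorthBridge ERROR line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "node": int(m.group("node")),
        # Assumption: the two halves are the high and low 32 bits.
        "err_addr": (int(m.group("hi"), 16) << 32) | int(m.group("lo"), 16),
        "nbsh": int(m.group("nbsh"), 16),
        "nbsl": int(m.group("nbsl"), 16),
    }


fields = parse_nb_error(LINE)
print({name: hex(value) for name, value in fields.items()})
```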

Comment 45 IBM Bug Proxy 2007-06-28 05:10:43 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-06-28 01:08 EDT -------
From the log messages, the addresses being accessed while reading vmcore are
within the System RAM areas from the first kernel. 

Comment 46 IBM Bug Proxy 2007-07-02 05:35:30 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-07-02 01:31 EDT -------
Here is the thread that was started on the EDAC mailing list to get some
information on the possible causes for the processor context corruption in
the kdump context.


--- Ankita Garg <ankita.com> wrote:

> On Wed, Jun 27, 2007 at 10:39:40AM -0700, Doug Thompson wrote:
> 
> Hi,
> 
> > Yes, we had a conversation this issue.
> > The panic for a MCC error has been removed/changed to a warning instead.
> >
> 
> Thanks Doug for your response. Yes, I am aware that the panic call has been
> changed to a warning message in the k8 module when checking for the 25th bit.

ok, great

> But at this
> time we want to debug this further from kdump perspective, to try and
> see if it is really doing something that it should not, e.g. trying
> to map non-RAM pages (IOMMU pages, etc). For this, it would help
> if someone could point out some example scenarios that might lead to the
> hardware setting the 25th bit of the Northbridge Status High register,
> indicating the context corruption. 

like other "bits", there is NOT much doc on its semantics, as you probably know.

While at Linux Networx, we would get some PCC events during system burnin, but
it was infrequent and we had no known mechanism to trigger this event.

Are you at OLS this week? I attended Vivek's paper presentation this morning, friday
If you are or one of your guys, we can meet.
Yet Eric B knows just as much as well, but we can discuss options

doug t

> 
> This could provide us a good starting point.
> 
> > With Kdump, we think the changing of the kernels (via kexec)
> > sets/alters/mods the bit. 
> > so we concluded to remove that panic check from the k8 module.
> > 
> > If you are building from source, go and comment out that panic call when the
> > bit 25 is checked.
> > 
> > doug t
> > 
> > 
> > --- Ankita Garg <ankita.com> wrote:
> > 
> > > Hi all,
> > > 
> > > When trying to copy the vmcore file while in the kdump kernel, I hit the
> > > following panic:
> > > 
> > > NorthBridge ERROR: mci(0xffff8100086cf000) node(0)
> > > ErrAddr(0x00000000-37f00030) nbsh(0xa6000001) nbsl(0x0005001b)
> > > EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache
> > > level(generic)
> > > EDAC k8 MC0: extended error code: GART error
> > > Kernel panic - not syncing: MC0: processor context corrupt
> > > 
> > > AMD documentation mentions that the 25th bit of the Northbridge Status
> > > High Register indicates a probable processor context corruption. 
> > > 
> > > Could someone please provide some information on the possible reasons of
> > > processor context getting corrupt ? in general or in the kdump scenario?
> > > 

> -- 
> Regards,
> Ankita Garg (ankita.com)
> Linux Technology Center
> IBM India Systems & Technology Labs, 
> Bangalore, India   
> 

Comment 47 Tim Burke 2007-07-02 19:52:01 UTC
The patch in bug #237950 comment #38 appears to be what we will be putting into
RHEL5.1.  This is also being worked upstream (primarily by IBM), so a better
approach may come from that.

Currently this is closed/notabug, which is wrong, so I am reopening. 

Comment 48 IBM Bug Proxy 2007-07-03 18:15:19 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-07-03 14:09 EDT -------
Trying the approach of shutting down GART when doing kexec (patch for this has
been posted on LKML: http://lkml.org/lkml/2007/6/25/242). For kdump kernel, the
shutdown action needs to be performed in the second kernel. Also, the second
kernel uses swiotlb to use software iommu. 

Comment 49 IBM Bug Proxy 2007-07-31 14:06:25 UTC
----- Additional Comments From ankigarg.com (prefers email at ankita.com)  2007-07-31 10:02 EDT -------
The fix is in the RHEL5 U1 kernel. Verified that the kdump kernel no longer
panics on our hardware. 

Comment 50 IBM Bug Proxy 2007-08-20 12:05:48 UTC
------- Comment From ankigarg.com 2007-08-20 08:00 EDT-------
Will verify with an LS21 and confirm. After which we could close this bug.

Comment 51 Ankita Garg 2007-10-05 05:03:33 UTC
Verified that the patch in bug #237950 comment #38 is in RHEL5.1 kernel. But the
error still persists as this patch requires the kdump kernel to be passed the
'reset_devices=1' parameter. Found that RHEL5.1 kexec-tools rpm does not pass
this parameter in /etc/sysconfig/kdump file. This needs to be fixed before we
can close this bug. Sample patch attached.

Comment 52 IBM Bug Proxy 2007-10-05 05:05:45 UTC
------- Comment From ankigarg.com 2007-10-05 01:04 EDT-------
Updated in RH bugzilla:

Comment #51 From Ankita Garg (ankita.com) on 2007-10-05 01:03 EST

Comment 53 IBM Bug Proxy 2007-10-05 05:05:48 UTC
Created attachment 217061
Pass reset_devices=1 parameter to kdump kernel

Comment 54 Guy Streeter 2007-10-05 14:15:28 UTC
Is there a reason to universally set reset_devices, or should it simply be
documented as necessary for some controllers?

Comment 55 IBM Bug Proxy 2007-10-08 11:21:42 UTC
------- Comment From ankigarg.com 2007-10-08 07:17 EDT-------
It has been agreed in mainline to use the reset_devices flag in the kdump kernel
to signal to the various drivers that they are running in the kdump context and
should reset their devices accordingly. Currently, the aacraid driver is using
this flag (in current mainline) and more drivers are expected to use it in
future. Besides, for the EDAC driver, we need this flag to be passed to the
kdump kernel.

Comment 56 Dave Anderson 2007-10-08 14:58:07 UTC
I believe Neil Horman will be adding the reset_devices argument
to the command line argument in /etc/sysconfig/kdump by default,
but I note that it doesn't appear to be there in the version
to be released with the RHEL5.1 kexec-tools user-package errata.
Perhaps it's queued for RHEL5.2?

But I've added Neil to the cc: list for his take on the matter.

Comment 57 Neil Horman 2007-10-08 15:13:10 UTC
I do have it queued for 5.2, yes.  If you need to do it in the interim, you can
use the KDUMP_COMMANDLINE_APPEND variable in /etc/sysconfig/kdump to get it in
place.
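
As an illustration of that interim step, the following Python sketch adds the
flag to KDUMP_COMMANDLINE_APPEND in the text of a sysconfig file. It is a
hypothetical helper, not part of kexec-tools: the function name
ensure_reset_devices is invented, and the sample file content (including the
existing "irqpoll maxcpus=1" value) is made up for the example. It works on a
string and does not touch /etc/sysconfig/kdump itself.

```python
import re

# Matches a double-quoted KDUMP_COMMANDLINE_APPEND assignment at line start.
APPEND_RE = re.compile(r'^(KDUMP_COMMANDLINE_APPEND=")([^"]*)(")', re.MULTILINE)


def ensure_reset_devices(sysconfig_text, flag="reset_devices"):
    """Return sysconfig_text with flag present in KDUMP_COMMANDLINE_APPEND.

    An existing 'reset_devices' or 'reset_devices=...' entry is left alone.
    """
    m = APPEND_RE.search(sysconfig_text)
    if m is None:
        # Variable not set: append a new assignment on its own line.
        needs_newline = sysconfig_text and not sysconfig_text.endswith("\n")
        sep = "\n" if needs_newline else ""
        return sysconfig_text + sep + 'KDUMP_COMMANDLINE_APPEND="%s"\n' % flag
    args = m.group(2).split()
    if any(a == flag or a.startswith(flag + "=") for a in args):
        return sysconfig_text
    args.append(flag)
    return sysconfig_text[:m.start(2)] + " ".join(args) + sysconfig_text[m.end(2):]


# Made-up sample content, for illustration only.
sample = 'KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1"\n'
print(ensure_reset_devices(sample), end="")
```

This thread refers to the parameter both as reset_devices and as
reset_devices=1; the flag argument lets the caller pick either spelling.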

Comment 58 IBM Bug Proxy 2007-10-22 08:15:40 UTC
------- Comment From ankigarg.com 2007-10-22 04:12 EDT-------
Neil,

On our LS20/LS21, we are currently editing the /etc/sysconfig/kdump file to pass
the flag. But wouldn't we need to document that this flag needs to be passed on
x86_64 systems?

Thanks,
Ankita

Comment 59 Guy Streeter 2007-10-22 15:28:54 UTC
This has been added to the online Release Notes and kdump/kexec HowTo.

Comment 60 IBM Bug Proxy 2008-02-11 11:08:47 UTC
------- Comment From ankigarg.com 2008-02-11 06:06 EDT-------
Sripathi, I am not sure if we can close this bug yet. In order to resolve this
issue, besides the kernel patch, our build system was modified to pass certain
kernel parameters to the kdump kernel. Will need to confirm if we do so from the
RHEL5RT setup.

Comment 61 IBM Bug Proxy 2008-02-11 12:00:44 UTC
------- Comment From ankigarg.com 2008-02-11 06:56 EDT-------
From comment #69, looks like it has been documented in the release notes of
kexec/kdump. So the user will need to manually edit the /etc/sysconfig/kdump
file to pass the reset_devices parameter to the kdump kernel.

The only other thing to do is to try kdump on the latest RHEL5RT src and verify
that kdump is working fine on the LS21. Will test and confirm.