Bug 223901 - [RHEL5] kernel: MCA occurs during ext3 stress test on ia64
Summary: [RHEL5] kernel: MCA occurs during ext3 stress test on ia64
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: ia64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Assignee: Luming Yu
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 227613 228988 243319 246139 296411 391501 420521 422431 422441 445799
 
Reported: 2007-01-22 23:07 UTC by Kiyoshi Ueda
Modified: 2018-10-19 21:12 UTC
17 users

Fixed In Version: 2.6.18-85.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-04-14 08:09:39 UTC
Target Upstream Version:
Embargoed:


Attachments
Debug patch (8.17 KB, patch)
2007-04-03 00:31 UTC, Kiyoshi Ueda

Description Kiyoshi Ueda 2007-01-22 23:07:53 UTC
Description of problem:
An MCA occurs on IA64 during high-load I/O testing.


Version-Release number of selected component:
kernel-2.6.18-4.el5


How reproducible:
3 times in 3 trials.
It takes a long time (at least 12 hours).


Steps to Reproduce:
 1. Create ext3 partitions for I/O testing.
 2. Mount the partitions.
 3. Run file system stress test on them in parallel for a long time
    (60 hours in my case).
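
For illustration, a minimal parallel stress driver along these lines could
look like the sketch below (hypothetical; the actual test case used for
this bug is a more complex suite):
-----------------------------------------------------------------------
/*
 * Hypothetical sketch of a parallel fs stress driver -- one writer
 * process per mount point, each looping over create/write/fsync/unlink.
 * Usage: ./fsstress <mountpoint> [<mountpoint> ...]
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void stress(const char *dir)
{
    char path[4096], buf[8192];
    memset(buf, 0xa5, sizeof(buf));

    for (unsigned long i = 0; ; i++) {
        snprintf(path, sizeof(path), "%s/stress.%d.%lu",
                 dir, getpid(), i % 128);
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }
        for (int j = 0; j < 64; j++)    /* ~512 KB per file */
            if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); exit(1); }
        fsync(fd);                      /* force journal/metadata traffic */
        close(fd);
        if (i % 128 == 127)
            unlink(path);               /* keep disk usage bounded */
    }
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <mountpoint>...\n", argv[0]);
        return 1;
    }
    for (int i = 1; i < argc; i++)      /* one writer per mount point */
        if (fork() == 0)
            stress(argv[i]);
    while (wait(NULL) > 0)              /* children run until killed */
        ;
    return 0;
}
-----------------------------------------------------------------------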


Actual results:
MCA and system reboot occur.


Expected results:
MCA should not occur.


Additional info:
The problem is observed on 2 different machines.
The problem hasn't been observed with the upstream 2.6.19 kernel,
so it's not likely a hardware problem.
As for RHEL5, at least 2.6.18-1.2747.el5 has passed the same test
without problems.

The information from kdump is below.
The MCA occurred when CPU#15 was running in journal_write_metadata_buffer().
-----------------------------------------------------------------------
[root@nec-tx7-2 127.0.0.1-2007-01-20-13:48:20]# crash
/usr/lib/debug/lib/modules/2.6.18-4.el5/vmlinux vmcore

crash 4.0-3.14
Copyright (C) 2002, 2003, 2004, 2005, 2006  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005  Fujitsu Limited
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...

WARNING: active task e0000001000f0000 on cpu 15 not found in PID hash

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-4.el5/vmlinux
    DUMPFILE: vmcore
        CPUS: 16
        DATE: Sat Jan 20 13:45:42 2007
      UPTIME: 19:53:05
LOAD AVERAGE: 26.04, 26.72, 27.15
       TASKS: 366
    NODENAME: nec-tx7-2.lab.boston.redhat.com
     RELEASE: 2.6.18-4.el5
     VERSION: #1 SMP Wed Jan 17 23:03:41 EST 2007
     MACHINE: ia64  (899 Mhz)
      MEMORY: 63.4 GB
       PANIC: (MCA)
         PID: 0
     COMMAND: "MCA 2740"
        TASK: e0000001000f0000  [THREAD_INFO: e0000001000f1040]
         CPU: 15
       STATE: TASK_UNINTERRUPTIBLE (MCA)

crash> bt
PID: 0      TASK: e0000001000f0000  CPU: 15  COMMAND: "MCA 2740"
 #0 [BSP:e0000001000f12d0] machine_kexec at a000000100058ad0
 #1 [BSP:e0000001000f12b8] machine_kdump_on_init at a00000010005dab0
 #2 [BSP:e0000001000f1280] kdump_init_notifier at a00000010005dcd0
 #3 [BSP:e0000001000f1248] notifier_call_chain at a00000010061a570
 #4 [BSP:e0000001000f1218] atomic_notifier_call_chain at a00000010009e6f0
 #5 [BSP:e0000001000f11b8] ia64_mca_handler at a000000100047760
(MCA) INTERRUPTED TASK
PID: 2740   TASK: e000000c078c0000  CPU: 15  COMMAND: "kjournald"
 #0 [BSP:e000000c078c1288] __ia64_leave_kernel at a00000010000c700
  EFRAME: e000000c078c7b20
      B0: a00000020f250e40      CR_IIP: a00000020f250ea0
 CR_IPSR: 00001210085a6010      CR_IFS: 8000000000000794
  AR_PFS: 0000000000000794      AR_RSC: 0000000000000003
 AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
  AR_CCV: 0000000000204000     AR_FPSR: 0009804c8a70433f
  LOADRS: 0000000001200000 AR_BSPSTORE: e000000c078c1168
      B6: a00000020f311140          B7: a000000100299640
      PR: 0000000000009541          R1: a00000020f25c050
      R2: e00000014f289000          R3: e00000014f288000
      R8: 0000000000000007          R9: 0000000000053ca2
     R10: 000000000029e510         R11: c000000000053ca2
     R12: e000000c078c7ce0         R13: e000000c078c0000
     R14: e0000001040a8628         R15: a000000100cd1600
     R16: 000000000024a86e         R17: e000000f5d50bee8
     R18: 5ffffffffff04a1a         R19: 0003ed797ff04a1a
     R20: 5ffc128680000000         R21: 00000007dae34a1a
     R22: 0003ed71a50d0000         R23: 000000081bc154c0
     R24: e000000f5d50bed8         R25: e00000101d7f46b0
     R26: e00000101d7f4580         R27: e000000f5d50bee0
     R28: 0000000000001000         R29: e00000026a61dcc0
     R30: e00000026a61dca0         R31: e000000f230960e8
      F6: 1003e0000000000000000     F7: 1003e00000000000000a0
      F8: 1003e0000000000000060     F9: 1003e0000000000000001
     F10: 1003e00000000001d5b18    F11: 1003e0044b82fa09b5a53
 #1 [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
crash>
-----------------------------------------------------------------------

Comment 1 Eric Sandeen 2007-01-27 05:43:27 UTC
If this looks like a recent regression since 2.6.18-1.2747.el5, I suppose it
might be interesting to try a bisecting search for the update which introduced
it...?

Comment 2 Kiyoshi Ueda 2007-01-29 19:41:58 UTC
According to my current test results below, this problem looks like
a regression introduced in 2.6.18-2.el5, so I added the "Regression" keyword.

    o 2.6.18-1.2961.el5: MCA doesn't occur for over 30 hours
    o 2.6.18-1.3014.el5: MCA doesn't occur for over 130 hours
    o 2.6.18-2.el5     : MCA occurred within 4 hours (2 times out of 2 trials)
    o 2.6.18-4.el5     : MCA occurred within 4 hours (3 times out of 4 trials)

The MCA occurs on a 16-CPU ia64 box but has not occurred on a 2-CPU ia64 box
so far. And the MCA always occurs in ext3 functions (mostly
journal_write_metadata_buffer).

I'm now trying an 8-CPU x86_64 box as well.

Comment 3 RHEL Program Management 2007-01-29 19:45:42 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 4 Eric Sandeen 2007-01-29 21:20:54 UTC
These are the patches that were added between 2.6.18-1.3014.el5 and 2.6.18-2.el5:

+Patch20023: xen-register-pit-handlers-to-the-correct-domain.patch
+Patch20024: xen-quick-fix-for-cannot-allocate-memory.patch
+Patch21215: linux-2.6-misc-fix-vdso-in-core-dumps.patch
+Patch21216: linux-2.6-sata-ahci-support-ahci-class-code.patch
+Patch21217: linux-2.6-sata-support-legacy-ide-mode-of-sb600-sata.patch
+Patch21218: linux-2.6-rng-check-to-see-if-bios-locked-device.patch
+Patch21219: linux-2.6-mm-handle-map-of-memory-without-page-backing.patch

Is there any chance the reporter could try to identify which patch may have
caused the regression?  None of them are specifically related to ext3 or jbd;
perhaps the last patch listed above could be related?

Comment 5 Kiyoshi Ueda 2007-01-29 22:56:30 UTC
I'm trying that, but it may take a little while.

Comment 6 Eric Sandeen 2007-01-29 23:02:01 UTC
Thank you, I know backing out patches is a bit tedious :)  But I'm not sure that
I can reproduce it here...

Comment 7 Kiyoshi Ueda 2007-01-31 17:00:01 UTC
This is an update of my testing results.
According to the current results, the suspect seems to be
Patch21219 (linux-2.6-mm-handle-map-of-memory-without-page-backing.patch),
as esandeen guessed.
  o 2.6.18-2.el5 without Patch21215: the MCA occurs within 4 hours
  o 2.6.18-2.el5 without Patch21218: the MCA occurs within 5 hours
  o 2.6.18-2.el5 without Patch21219: the MCA doesn't occur for 32 hours
  o 2.6.18-2.el5 without Patch20023 (xen fix), Patch20024 (xen fix),
    Patch21216 (sata fix), or Patch21217 (sata fix): not tested,
    because I think these patches are unrelated to this problem.


Comment 8 Eric Sandeen 2007-01-31 18:35:35 UTC
Thank you for that update; narrowing it down to one patch is very helpful.  Are
there any other interesting messages before the MCA?  Can you share the testcase
or describe what it is doing?

Thanks,
-Eric

Comment 9 erikj 2007-01-31 18:44:56 UTC
If I understand things right, patch 21219 /
linux-2.6-mm-handle-map-of-memory-without-page-backing.patch
is in the community:
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f4b81804a2d1ab341a4613089dc31ecce0800ed8

So it's an interesting point that you don't hit it with the latest community
kernel.

For bug 221029, we took that patch in addition to Prarit's intel_rng patch to
fix a large memory x86_64 problem.  However, there is some evidence that
we may just need the intel_rng part and could possibly drop the other
patch that you seem to hit a problem with.

This needs careful testing because it risks breaking x86_64 if we're not
careful.  The original proposal for the do-no-pfn patch (21219,
linux-2.6-mm-handle-map-of-memory-without-page-backing.patch) was in
bugzilla 211854.  It was later pulled because a different Red Hat xen patch
actually fixed our problem back then (at the time, SGI didn't know it was
even in the kernel or pulled).

So this has a bit of a complicated history to it.

I'd be interested to know about the test case, and any ideas you might have
on how to proceed.

I don't quite know why this patch would cause trouble, and it's had quite a bit
of exercise here -- but perhaps not the same type of tests you're running.

We could try the test case on our machines as well.  Let us know if that
would help.  In other words, should we try running the "filesystem stress
tests" -- and where might we find them?  Thanks!

Comment 10 Kiyoshi Ueda 2007-01-31 22:17:22 UTC
Re: comment#8

No interesting messages.  The messages below appear suddenly.
----------------------------------------------------------------
Entered OS MCA handler. PSP=a8000000fff21330 cpu=10 monarch=1
All OS MCA slaves have reached rendezvous
----------------------------------------------------------------

The testcase which I'm using is a little complicated.
So I'm trying to make a simple testcase.

By the way, there seems to be no user of Patch21219 (nopfn) in 2.6.18-2.el5,
so I'm confused about why the problem disappears just by dropping Patch21219.
I'm still investigating.

Comment 11 Peter Zijlstra 2007-02-01 12:49:25 UTC
Puzzling indeed. Esp since it doesn't happen with an upstream kernel.

Reading through 221029 and 211854 makes me worried though: the no_pfn patch
should not do anything on its own, and apparently it seems to affect both the
intel_rng thingy and this. Weird!

Also, if anything would go amiss, I'd expect something in the fault handler to
go bang, not some random ext3 stuff.

What's not clear to me is whether SGI's mspec driver is loaded (although I
suspect not), and if so, does it make use of the nopfn handler or the hack in
vm_normal_page()? (or are there any other 3rd party modules loaded for that matter?)

Would it be possible to pinpoint the exact place in
journal_write_metadata_buffer() where it goes bang? Or is that varying a lot?

Comment 13 Jes Sorensen 2007-02-01 13:06:53 UTC
(In reply to comment #11)
> Puzzling indeed. Esp since it doesn't happen with an upstream kernel.
> 
> Reading through 221029 and 211854 makes me worried though, the no_pfn patch
> should not do anything on its own, and apparently it seems to affect both the
> intel_rng thingy and this. Weird!

Hi Peter,

That's my take too, I really can't see how the nopfn patch can cause this. I
strongly suspect it's something else that's hitting it, and the nopfn path is
just the thing that pushes it over the limit. The only thing I could imagine
would be if someone had a struct vm_operations_struct that was allocated and
populated manually without memset'ing it to zero first. Doing so could possibly
result in something jumping to a garbage address, but the real bug would be the
place allocating it without zeroing it.

> What's not clear to me is whether SGI's mspec driver is loaded (although I
> suspect not), and if so, does it make use of the nopfn handler or the hack in
> vm_normal_page()? (or are there any other 3rd party modules loaded for that
> matter?)

The only current user of nopfn I know of is mspec and it will normally only ever
use the nopfn path if some app tries to use the fetchop space. The common case
is the MPI library, but it can be used manually too. In either case it would
require the tester to actively take action to use it.

Cheers,
Jes
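
As an aside, the unzeroed-ops-table pattern described above would look
roughly like the userspace sketch below (hypothetical; no such allocation
was actually identified in this bug):
-----------------------------------------------------------------------
/*
 * Userspace analogue of the suspected bug: an ops table allocated
 * without zeroing, so an unset callback slot contains heap garbage.
 * Core code that tests "if (ops->nopfn)" would then jump to a junk
 * address instead of skipping the callback.
 */
#include <stdio.h>
#include <stdlib.h>

struct vm_ops {                 /* stand-in for struct vm_operations_struct */
    void (*open)(void);
    void (*close)(void);
    unsigned long (*nopfn)(unsigned long addr);
};

static void my_open(void)  { puts("open");  }
static void my_close(void) { puts("close"); }

int main(void)
{
    /* BUG: malloc() leaves .nopfn uninitialized -- possibly non-NULL. */
    struct vm_ops *bad = malloc(sizeof(*bad));
    bad->open  = my_open;
    bad->close = my_close;
    printf("bad->nopfn  = %p (garbage)\n", (void *)bad->nopfn);

    /* FIX: zero the whole structure first (calloc, memset, or static
     * init), so unset callbacks are NULL and callers skip them safely. */
    struct vm_ops *good = calloc(1, sizeof(*good));
    good->open  = my_open;
    good->close = my_close;
    printf("good->nopfn = %p (NULL)\n", (void *)good->nopfn);

    free(bad);
    free(good);
    return 0;
}
-----------------------------------------------------------------------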


Comment 14 Prarit Bhargava 2007-02-01 13:15:21 UTC
Ueda-san,

Is this ia64 NEC system available here in Westford?

If so, I'd like to know how to access it.

If not, then I have a few questions:  What is the size/memory configuration of
the system?  32p/32G?

Could you send us an lspci output?

If we don't have the system, we might be able to use another one of our bigger
ia64 systems to reproduce the issue.

P.

Comment 15 Prarit Bhargava 2007-02-01 13:21:08 UTC
Ueda-san,

lsmod output would be appreciated too.

P.

Comment 16 Dean Nelson 2007-02-01 14:19:24 UTC
(In reply to comment #13)
> (In reply to comment #11)
> > What's not clear to me is whether SGI's mspec driver is loaded (although I
> > suspect not), and if so, does it make use of the nopfn handler or the hack in
> > vm_normal_page()? (or are there any other 3rd party modules loaded for that
> > matter?)
> 
> The only current user of nopfn I know of is mspec and it will normally only ever
> use the nopfn path if some app tries to use the fetchop space. The common case
> is the MPI library, but it can be used manually too. In either case it would
> require the tester to actively take action to use it.

Actually, only the upstream version of mspec uses nopfn. The version of
mspec that we ship with RHEL5 does not, but it does use the hack in
vm_normal_page().

Comment 17 Kiyoshi Ueda 2007-02-06 18:18:00 UTC
I have sent Prarit the HW information requested in Comment#14 and Comment#15.


Re: Comment#11

I haven't investigated the exact place where the MCA occurs,
because crash's "bt -l" command doesn't work to determine the place,
and the places are not always exactly the same, as shown below.

- 2.6.18-2.el5 without Patch21215
  o [BSP:e0000008067d11e0] journal_write_metadata_buffer at a00000020f250ea0

- 2.6.18-2.el5 without Patch21218
  o [BSP:e000000805b011e0] journal_write_metadata_buffer at a00000020f250ed0
  o [BSP:e000000c02131290] __journal_file_buffer at a00000020f23c5b0

- 2.6.18-2.el5
  o [BSP:e000000806a391e0] journal_write_metadata_buffer at a00000020f250e70
  o [BSP:e000000877be11e0] journal_write_metadata_buffer at a00000020f250e70
  o [BSP:e000000406f811e0] journal_write_metadata_buffer at a00000020f250ea0
  o [BSP:e000000801ea11e0] journal_write_metadata_buffer at a00000020f250ed0

- 2.6.18-4.el5
  o [BSP:e000000801f391e0] journal_write_metadata_buffer at a00000020f250e90
  o [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
  o [BSP:e00000011b8011e0] journal_write_metadata_buffer at a00000020f250ee0
  o [BSP:e000000406409290] __journal_file_buffer at a00000020f23c540

I'm trying 2.6.18-8.el5 now, and no MCA has occurred for 20 hours so far.
It's very strange, since there seems to be no related fix between -4.el5
and -8.el5.  I'll investigate more, but I'm changing the BZ state back to
"ASSIGNED".


Comment 18 Kiyoshi Ueda 2007-02-13 18:52:00 UTC
This is a testing/investigation status update.

  o 2.6.18-1.2961.el5: MCA doesn't occur for over 30 hours
  o 2.6.18-1.3014.el5: MCA doesn't occur for over 130 hours
  o 2.6.18-2.el5     : MCA occurs within 5 hours
  o 2.6.18-3.el5     : MCA occurs within 5 hours
  o 2.6.18-4.el5     : MCA occurs within 5 hours
  o 2.6.18-5.el5     : MCA doesn't occur for over 120 hours
  o 2.6.18-6.el5     : MCA doesn't occur for over 25 hours
  o 2.6.18-7.el5     : MCA doesn't occur for over 25 hours
  o 2.6.18-8.el5     : MCA doesn't occur for over 130 hours

This problem seems to be fixed in -5.el5, though I can't see
any related fix.  I'm trying to identify which patch fixes the MCA.


I also got the exact places where the MCA occurs.
The results show that the MCA sometimes occurs at a 'nop' instruction
or an 'adds' instruction.

- 2.6.18-2.el5 (Rebuild kernel on my local machine)
  o [BSP:e00000011fe191e0] journal_write_metadata_buffer at a00000020f250e90
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 370
    <journal_write_metadata_buffer+2352>:        [MMI]       adds r25=304,r26;;
    R25: 00000000949b004a       R26: e000000404a1bd00
  o [BSP:e0000010030c11e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>:        [MMI]       ld8 r28=[r29];;
    R28: 0000000000001000       R29: e0000008054354e0

- 2.6.18-2.el5 without Patch21215 (Rebuild kernel on my local machine)
  o [BSP:e0000008067d11e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>:        [MMI]       st8 [r24]=r35;;
    R24: e000000c71e39e38

- 2.6.18-2.el5 without Patch21218 (Rebuild kernel on my local machine)
  o [BSP:e000000805b011e0] journal_write_metadata_buffer at a00000020f250ed0
    include/asm/bitops.h: 46
    <journal_write_metadata_buffer+2416>:        [MIB]       nop.m 0x0
  o [BSP:e000000c02131290] __journal_file_buffer at a00000020f23c5b0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/transaction.c: 1951
    <__journal_file_buffer+144>: [MMI]       ld8 r9=[r33];;
    R9: e00000101d28f900

- 2.6.18-2.el5
  o [BSP:e000000806a391e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>:        [MMI]       ld8 r28=[r29];;
    R28: 0000000000001000       R29: e00000101bfe7a60
  o [BSP:e000000406f811e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>:        [MMI]       st8 [r24]=r35;;
    R24: e00000087f78c6f8
  o [BSP:e000000877be11e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>:        [MMI]       ld8 r28=[r29];;
    R28: 0000000000001000       R29: e000000e6d314ee0
  o [BSP:e000000801ea11e0] journal_write_metadata_buffer at a00000020f250ed0
    include/asm/bitops.h: 46
    <journal_write_metadata_buffer+2416>:        [MIB]       nop.m 0x0

- 2.6.18-3.el5
  o [BSP:e000000c078d1260] journal_file_buffer at a00000020f242af0
    include/linux/bit_spinlock.h: 22
    <journal_file_buffer+80>:    [MMI]       mov r18=1048576
    R18: bfffffffffc94b34

- 2.6.18-4.el5
  o [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>:        [MMI]       st8 [r24]=r35;;
    R24: e000000f5d50bed8
  o [BSP:e000000801f391e0] journal_write_metadata_buffer at a00000020f250e90
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 370
    <journal_write_metadata_buffer+2352>:        [MMI]       adds r25=304,r26;;
    R25: e00000101ee41630       R26: e00000101ee41500
  o [BSP:e00000011b8011e0] journal_write_metadata_buffer at a00000020f250ee0
    include/asm/bitops.h: 44
    <journal_write_metadata_buffer+2432>:        [MMI]       ld4.acq r2=[r38];;
    R2: 0000000000004020
  o [BSP:e000000406409290] __journal_file_buffer at a00000020f23c540
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/transaction.c: 1950
    <__journal_file_buffer+32>:  [MIB]       nop.m 0x0


Comment 19 Kiyoshi Ueda 2007-02-20 22:16:31 UTC
The MCA occurred on 2.6.18-8.el5 twice during last weekend's run.
The first one took 19 hours, and the second one took 38 hours.

- 2.6.18-8.el5
  o [BSP:e0000005066810a0] kernel_thread_helper at a0000001000126a0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/arch/ia64/kernel/process.c: 710
  o [BSP:e00000069d8e90a0] kernel_thread_helper at a0000001000126a0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/arch/ia64/kernel/process.c: 710


Comment 21 Luming Yu 2007-04-02 03:27:10 UTC
Any news on this bug?  Is it fixed or still a problem on 2.6.18-9.el5?

Comment 22 Kiyoshi Ueda 2007-04-03 00:29:13 UTC
I confirmed that it occurs on:
    o 2.6.18-10.el5
    o upstream 2.6.19.1
    o upstream 2.6.20

Also, it occurs in an ext2-only environment (with 2.6.18-4.el5).
The IIP at the time points to kernel_thread_helper() according to the dump.
(So ext3 is not a suspect now.)

The cause of the MCA was that the chipset detected a READ access to
physical address 0x1000000000000 (256[TB]) from a CPU.
Because this type of MCA is asynchronous according to Intel's
IPF Manual, the instruction pointed to by the IIP is not necessarily
the instruction which issued the READ.

So, to trap the 256[TB] READ before the MCA, I made the attached debug patch.
It should cause a kernel panic when a strange TLB insertion or physical-mode
memory access is attempted.
But I still get the MCA instead of a panic.
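
Conceptually, the kind of check such a patch adds is sketched below
(userspace illustration with hypothetical names and an example limit;
the real patch is the attachment in comment 23):
-----------------------------------------------------------------------
/*
 * Sketch: validate a physical address before a translation for it is
 * inserted, so an out-of-range access panics (and leaves a dump)
 * before the chipset raises the asynchronous MCA.
 */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical top of implemented physical memory, e.g. a 16[TB] box. */
#define MAX_IMPLEMENTED_PADDR 0x100000000000ULL

/* In the real patch this check would sit in the TLB-insertion and
 * physical-mode access paths. */
static void check_phys_addr(unsigned long long paddr)
{
    if (paddr >= MAX_IMPLEMENTED_PADDR) {
        fprintf(stderr, "bogus physical address 0x%llx\n", paddr);
        abort();        /* kernel code would call panic() here */
    }
}

int main(void)
{
    check_phys_addr(0x7dae34a1aULL);     /* plausible RAM address: passes  */
    check_phys_addr(0x1000000000000ULL); /* the 256[TB] READ: trapped here */
    return 0;
}
-----------------------------------------------------------------------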

Currently, I'm trying to pin down which kernel introduced this bug.
(Trying 2.6.18-1.3014.el5, but no problem for 400 hours so far,
 though on other kernels it occurs within 200 hours even in the worst case.)


Comment 23 Kiyoshi Ueda 2007-04-03 00:31:29 UTC
Created attachment 151496 [details]
Debug patch

Comment 24 Luming Yu 2007-04-09 03:36:27 UTC
What are the test results of the debug patch?

Comment 25 Jun'ichi Nomura (Red Hat) 2007-04-09 03:42:16 UTC
RE: comment#24,

We expected the patch to catch the illegal access,
but it couldn't (i.e., the MCA still occurred).


Comment 26 Luming Yu 2007-04-09 03:52:54 UTC
I'll try to reproduce it on my Tiger4. What do you recommend for the test
cases, kernel version, etc., to reproduce this issue?

Comment 28 Jun'ichi Nomura (Red Hat) 2007-04-09 04:22:38 UTC
So far, the key points seem to be:
  - a lot of CPUs (please see comment #2)
  - a lot of file systems mounted (about 20)
  - parallel fs stress on those mount points
  - kernel version is flexible (2.6.18-8.el5, 2.6.19.1, 2.6.20)
  - reproduction takes time varying from a few hours to a few weeks
  - context switch to/from kernel thread might be related


Comment 29 Luming Yu 2007-04-09 05:23:16 UTC
Can you share the test cases with me?

Comment 30 Jun'ichi Nomura (Red Hat) 2007-04-09 16:51:09 UTC
The test case is sent to Luming.


Comment 32 Luming Yu 2007-05-09 07:04:12 UTC
OK, I guess it is NOT easy to reproduce this problem... so I'd like to analyze
the kdump image. Would you please let me have a look at it?

Comment 34 Eric Sandeen 2007-05-31 18:05:13 UTC
Is it still the thought that the patch identified in comment #7 is the culprit?

Comment 35 Kiyoshi Ueda 2007-06-01 18:53:46 UTC
Some kdump images have been sent to Luming.


Re: Comment#34
I think Patch21219 is probably not the cause.
Current status is below:

The cause of the MCA:
    The chipset detected a 128-byte READ access to
    physical address 0x1000000000000 (256[TB]) from a CPU.
    (Found in the HW log.)

    Analyzing the OS dump is difficult because this type of
    MCA is asynchronous according to Intel's IPF Manual, and
    the instruction pointed to by the IIP is not necessarily
    the instruction which issued the READ.

Results of trials:
    o 2.6.18-1.3014.el5    : doesn't occur so far (over 700 hours)
    o 2.6.18-[2-10].el5    : occur (Some tests took about 200 hours though)
    o upstream 2.6.[19-21] : occur
    o ext2-only environment: occur (ext3 is not a suspect now)
    o with the trap patch  : occur (the patch is in Comment#23)

Because the MCA doesn't occur on 2.6.18-1.3014.el5, Patch21219 could
be a suspect.  But I think the probability is very low, because there
appears to be no user of nopfn() in RHEL5, and I have confirmed with
another trap patch that do_no_pfn() isn't called when the MCA occurs.
(So I believe I could reproduce it on a kernel without Patch21219,
 like 2.6.18-1.3014.el5, with very long testing.)

Currently, I'm thinking about trying new firmware such as the IA64 PAL,
because the READ access is from a CPU but the chipset can't tell
whether it came from the OS or the FW.
Since I confirmed that a newer PAL is available (though I'm not sure
what kind of updates are included), I'll try it.

Comment 36 Kiyoshi Ueda 2007-06-13 13:52:03 UTC
This is a status update for comment#35:
  2.6.21 without Patch21219(nopfn)        : the MCA occurs
  The latest PAL (PAL_A:7.31, PAL_B:7.79) : the MCA occurs

I have started a long-term test (expected to run a few months)
on 2.6.18-1.3014.el5.

Comment 37 Kiyoshi Ueda 2007-07-09 13:38:19 UTC
Status update:
Confirmed that the MCA occurs on 2.6.18-1.3014.el5. (It took 600 hours.)
So Patch21219 (nopfn) is no longer a suspect.

Next plan:
To confirm whether this problem occurs only on RHEL5 (recent kernels),
I will start testing on RHEL4.

Comment 38 Kiyoshi Ueda 2007-10-04 14:30:24 UTC
No MCA has occurred on the RHEL4.5 kernel-2.6.9-55.EL for 1512 hours.

Current testing results:
  o 2.6.9-55.EL           : doesn't occur so far (over 1500 hours)
  o 2.6.18-1.3014.el5    : occur (Took 600 hours. Patch21219 isn't suspect)
  o 2.6.18-[2-10].el5    : occur (Some tests took about 200 hours though)
  o upstream 2.6.[19-21] : occur
  o ext2-only environment: occur (ext3 is not a suspect now)
  o with the trap patch  : occur (the patch is in Comment#23)
  o with the latest PAL  : occur (PAL_A:7.31, PAL_B:7.79)

Next plan:
  o I should try the latest RHEL5 kernel-2.6.18-52.el5 first because
    various changes are included since 2.6.18-10.el5 (1500 hours)
  o If the MCA still occurs on 2.6.18-52.el5, I'll try upstream kernels
    from 2.6.18 back through 2.6.9 to pin down a suspect kernel version


Comment 39 Kiyoshi Ueda 2007-10-17 16:19:57 UTC
The MCA still occurs on 2.6.18-52.el5, after 147 hours of running.

Current testing results:
  o 2.6.9-55.EL           : doesn't occur so far (over 1500 hours)
  o 2.6.18-1.3014.el5    : occur (Took 600 hours. Patch21219 isn't suspect)
  o 2.6.18-[2-52].el5    : occur (Some tests took about 200 hours though)
  o upstream 2.6.[19-21] : occur
  o ext2-only environment: occur (ext3 is not a suspect now)
  o with the trap patch  : occur (the patch is in Comment#23)
  o with the latest PAL  : occur (PAL_A:7.31, PAL_B:7.79)

Next plan:
  o Try upstream kernels from 2.6.18 back through 2.6.9 to pin down
    a suspect kernel version

Comment 40 Luming Yu 2007-11-05 08:19:32 UTC
The consistent updates of the testing results in this bugzilla are really
appreciated. Just curious: would it be possible to capture the processor's
entire TLBs on an MCA? Then we could probably find out whether the memory
access that triggered the MCA came from the processor.

Comment 41 Kiyoshi Ueda 2007-11-05 15:11:59 UTC
The TLB information isn't included in either the chipset log or the MCA log,
so I can't see it.

Comment 43 RHEL Program Management 2007-11-09 14:12:50 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 44 Kiyoshi Ueda 2007-11-09 17:03:19 UTC
Why was this bugzilla closed as WONTFIX?
I think this is a critical bug that needs to be fixed in the future,
so I am reopening this bugzilla.


Comment 45 Kiyoshi Ueda 2007-11-16 19:21:37 UTC
Confirmed that the MCA occurs on upstream 2.6.18 and 2.6.18-53.el5.

Current testing results:
  o 2.6.9-55.EL           : doesn't occur so far (over 1500 hours)
  o 2.6.18-1.3014.el5    : occur (Took 600 hours. Patch21219 isn't suspect)
  o 2.6.18-[2-53].el5    : occur (Some tests took about 200 hours though)
  o upstream 2.6.[18-21] : occur
  o ext2-only environment: occur (ext3 is not a suspect now)
  o with the trap patch  : occur (the patch is in Comment#23)
  o with the latest PAL  : occur (PAL_A:7.31, PAL_B:7.79)

Next plan:
  o Try upstream kernels from 2.6.17 back through 2.6.9 to pin down
    a suspect kernel version.
    But 2.6.[9-17] built in a RHEL5 environment doesn't boot.
    (The same version with the same config built in a RHEL4 environment boots.)
    So I need some more confirmation before starting the 2.6.17 testing.
    (E.g., testing a 2.6.18 built on RHEL5 in a RHEL4 environment,
     and vice versa.)


Comment 46 Jay Turner 2007-12-13 14:20:56 UTC
QE nack for 5.2 based on comment 42.

Comment 47 Luming Yu 2007-12-18 06:44:09 UTC
This will not make RHEL5.2:

- it took quite some time to reproduce this, and only on a specific platform,
- the investigation is still ongoing.

I suggest moving this to RHEL5.3.

-Luming

Comment 48 Kiyoshi Ueda 2007-12-18 16:01:24 UTC
> - it took quite sometime to reproduce this systemon only on specific platform...

Just for the record, this MCA also occurred on another NEC ia64 system,
a Montecito platform, not only on the old McKinley platform.
So I guess this problem is not platform-specific.

Comment 52 Luming Yu 2008-02-14 09:08:42 UTC
To comment#48,

How do you know the MCA on the other NEC Montecito platform is the same?

Comment 53 Kiyoshi Ueda 2008-02-14 15:02:05 UTC
Re: comment#52,

I can see it from the chipset log, which is hardware-specific
binary data.
The MCA on the Montecito platform has happened 4 times in the past,
and each time the chipset detected an out-of-range access from a CPU.

On the McKinley platform, the accessed physical address is
always 0x1000000000000 (256[TB]).
On the other hand, on the Montecito platform, the accessed
physical address is not the same each time:
  - 0x00FFFFFFB4000 (around 16[TB])
  - 0x00FFFFE610000 (around 16[TB])
  - 0x00FFFFFC1C000 (around 16[TB])
  - 0x228455C3D4000 (around 552[TB])
That is a different point from the phenomenon on the McKinley
platform.
But the back-trace from the kernel memory dump was similar to
that on the McKinley platform, so I guess it's the same problem.
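
As a sanity check on those magnitudes, a quick conversion (sketch; 1[TB]
taken here as 2^40 bytes) reproduces the figures quoted above:
-----------------------------------------------------------------------
/* Print each reported physical address in TB (2^40 bytes). */
#include <stdio.h>

int main(void)
{
    const unsigned long long addrs[] = {
        0x1000000000000ULL,     /* McKinley:  2^48 = exactly 256[TB]   */
        0x00FFFFFFB4000ULL,     /* Montecito: just under 2^44 = 16[TB] */
        0x00FFFFE610000ULL,
        0x00FFFFFC1C000ULL,
        0x228455C3D4000ULL,     /* Montecito: about 552[TB]            */
    };
    for (unsigned i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
        printf("0x%013llx = %7.1f TB\n",
               addrs[i], (double)addrs[i] / (1ULL << 40));
    return 0;
}
-----------------------------------------------------------------------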


Comment 54 Luming Yu 2008-03-18 07:55:47 UTC
2.6.18-53.el5 : MCA occurs in less than 10 hrs under the stress test.
2.6.18-85.el5 : passes a 100-hr stress test without MCA.

Sounds like there are some real improvements.

I'm going to kick off a 2000-hr stress test.

Comment 55 Luming Yu 2008-03-27 02:40:21 UTC
Under the fs stress testing workload, no MCA for almost 13 days:

[root@nec-tx7-1 ~]# uptime
 22:19:36 up 12 days, 22:49,  2 users,  load average: 35.16, 40.90, 43.33


Comment 56 Luming Yu 2008-04-02 09:05:08 UTC
Under the fs stress testing workload, no MCA for almost 20 days:

[root@nec-tx7-1 ~]# uptime
 05:04:18 up 19 days,  5:34,  2 users,  load average: 57.74, 48.30, 42.71


Comment 57 Luming Yu 2008-04-07 02:44:12 UTC
[root@nec-tx7-1 ~]# uptime
 22:43:19 up 23 days, 23:13,  2 users,  load average: 40.48, 38.14, 38.56

Comment 58 Luming Yu 2008-04-09 02:51:14 UTC
The box has been running the fs stress test for over 620 hrs, still no MCA.
It is the second-best result compared with those in comment #45.
[root@nec-tx7-1 ~]# uptime
 22:44:41 up 25 days, 23:14,  2 users,  load average: 43.05, 41.77, 41.86

Could NEC run the same test on a different box to confirm the results are consistent?

Comment 59 Kiyoshi Ueda 2008-04-09 15:56:41 UTC
Re: Comment#58

OK.
But currently the other boxes are being used for other testing.
Also, for an internal reason, those boxes will be temporarily unavailable
from late April through early May.
So I will be able to start the test in early May or so.


Comment 60 Luming Yu 2008-04-14 08:09:39 UTC
For an internal reason, the box was manually rebooted after nearly 700 hrs
without seeing an MCA:
[root@nec-tx7-1 ~]# top
top - 13:25:14 up 28 days, 13:55,  2 users,  load average: 41.74, 39.91, 41.31
Tasks: 354 total,   8 running, 346 sleeping,   0 stopped,   0 zombie
Cpu(s): 22.4%us, 42.5%sy,  0.0%ni, 14.9%id, 13.8%wa,  0.0%hi,  6.5%si,  0.0%st
Mem:  66612864k total, 66292416k used,   320448k free,  6642944k buffers
Swap:        0k total,        0k used,        0k free,  1665200k cached


I will schedule another 1000 hrs of testing to verify that it is stable.
For now, I'm pretty sure this problem has disappeared in 2.6.18-85.el5.
With bisection, I could probably identify the patches whose good side effect
cures the problem. But for now, I don't think we need to pursue a kernel
patch for this problem.

Closing this bug as CURRENTRELEASE for now.
If the problem occurs again, please feel free to re-open the bug.



