Description of problem:
MCA occurs on IA64 during I/O high-load testing.

Version-Release number of selected component:
kernel-2.6.18-4.el5

How reproducible:
3 times in 3 trials. It takes a long time (at least 12 hours).

Steps to Reproduce:
1. Create ext3 partitions for I/O testing.
2. Mount the partitions.
3. Run a file system stress test on them in parallel for a long time (60 hours in my case).

Actual results:
MCA and system reboot occur.

Expected results:
MCA should not occur.

Additional info:
The problem is observed on 2 different machines, and it hasn't been observed with the upstream 2.6.19 kernel. So it's not likely a hardware problem. As for RHEL5, at least 2.6.18-1.2747.el5 has passed the same test without problems. The information from kdump is below. The MCA occurred while CPU#15 was running in journal_write_metadata_buffer().

-----------------------------------------------------------------------
[root@nec-tx7-2 127.0.0.1-2007-01-20-13:48:20]# crash /usr/lib/debug/lib/modules/2.6.18-4.el5/vmlinux vmcore

crash 4.0-3.14
Copyright (C) 2002, 2003, 2004, 2005, 2006  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005  Fujitsu Limited
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...

WARNING: active task e0000001000f0000 on cpu 15 not found in PID hash

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-4.el5/vmlinux
    DUMPFILE: vmcore
        CPUS: 16
        DATE: Sat Jan 20 13:45:42 2007
      UPTIME: 19:53:05
LOAD AVERAGE: 26.04, 26.72, 27.15
       TASKS: 366
    NODENAME: nec-tx7-2.lab.boston.redhat.com
     RELEASE: 2.6.18-4.el5
     VERSION: #1 SMP Wed Jan 17 23:03:41 EST 2007
     MACHINE: ia64  (899 Mhz)
      MEMORY: 63.4 GB
       PANIC: (MCA)
         PID: 0
     COMMAND: "MCA 2740"
        TASK: e0000001000f0000  [THREAD_INFO: e0000001000f1040]
         CPU: 15
       STATE: TASK_UNINTERRUPTIBLE (MCA)

crash> bt
PID: 0      TASK: e0000001000f0000  CPU: 15  COMMAND: "MCA 2740"
 #0 [BSP:e0000001000f12d0] machine_kexec at a000000100058ad0
 #1 [BSP:e0000001000f12b8] machine_kdump_on_init at a00000010005dab0
 #2 [BSP:e0000001000f1280] kdump_init_notifier at a00000010005dcd0
 #3 [BSP:e0000001000f1248] notifier_call_chain at a00000010061a570
 #4 [BSP:e0000001000f1218] atomic_notifier_call_chain at a00000010009e6f0
 #5 [BSP:e0000001000f11b8] ia64_mca_handler at a000000100047760

(MCA) INTERRUPTED TASK
PID: 2740   TASK: e000000c078c0000  CPU: 15  COMMAND: "kjournald"
 #0 [BSP:e000000c078c1288] __ia64_leave_kernel at a00000010000c700
  EFRAME: e000000c078c7b20
      B0: a00000020f250e40      CR_IIP: a00000020f250ea0
 CR_IPSR: 00001210085a6010      CR_IFS: 8000000000000794
  AR_PFS: 0000000000000794      AR_RSC: 0000000000000003
 AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
  AR_CCV: 0000000000204000     AR_FPSR: 0009804c8a70433f
  LOADRS: 0000000001200000 AR_BSPSTORE: e000000c078c1168
      B6: a00000020f311140          B7: a000000100299640
      PR: 0000000000009541          R1: a00000020f25c050
      R2: e00000014f289000          R3: e00000014f288000
      R8: 0000000000000007          R9: 0000000000053ca2
     R10: 000000000029e510         R11: c000000000053ca2
     R12: e000000c078c7ce0         R13: e000000c078c0000
     R14: e0000001040a8628         R15: a000000100cd1600
     R16: 000000000024a86e         R17: e000000f5d50bee8
     R18: 5ffffffffff04a1a         R19: 0003ed797ff04a1a
     R20: 5ffc128680000000         R21: 00000007dae34a1a
     R22: 0003ed71a50d0000         R23: 000000081bc154c0
     R24: e000000f5d50bed8         R25: e00000101d7f46b0
     R26: e00000101d7f4580         R27: e000000f5d50bee0
     R28: 0000000000001000         R29: e00000026a61dcc0
     R30: e00000026a61dca0         R31: e000000f230960e8
      F6: 1003e0000000000000000    F7: 1003e00000000000000a0
      F8: 1003e0000000000000060    F9: 1003e0000000000000001
     F10: 1003e00000000001d5b18   F11: 1003e0044b82fa09b5a53
 #1 [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
crash>
-----------------------------------------------------------------------
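For reference, the reproduction steps above can be sketched as a small shell harness. This is a hedged reconstruction: the reporter's actual test case was not published, so the device naming, the mount-point layout, and the use of fsstress (from LTP/xfstests) are all assumptions, not the real script. With DRY_RUN=1 (the default) it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the I/O stress setup described in the report.  Device names,
# mount points, and the fsstress invocation are assumptions; the real
# test case was not published.  DRY_RUN=1 (default) only prints commands.

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# stress_setup N BASE: prepare N ext3 filesystems under BASE and start
# one stress process per mount point (in a real run these would be
# backgrounded with '&' and left running in parallel for many hours).
stress_setup() {
    n=$1 base=$2 i=1
    while [ "$i" -le "$n" ]; do
        run mkfs.ext3 -q "/dev/sdb$i"      # assumed device naming
        run mkdir -p "$base$i"
        run mount "/dev/sdb$i" "$base$i"
        run fsstress -d "$base$i" -p 4 -l 0
        i=$((i + 1))
    done
}

stress_setup "${NPART:-20}" /mnt/stress
```

About 20 filesystems under parallel stress (see the later comments) appears to be the key ingredient, hence the default of 20 partitions.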
If this looks like a recent regression since 2.6.18-1.2747.el5, I suppose it might be interesting to try bisecting to find the update which introduced it...?
According to my current test results below, this problem looks like a regression introduced in 2.6.18-2.el5, so I added the "Regression" keyword.

o 2.6.18-1.2961.el5: MCA doesn't occur for over 30 hours
o 2.6.18-1.3014.el5: MCA doesn't occur for over 130 hours
o 2.6.18-2.el5     : MCA occurred within 4 hours (2 times out of 2 trials)
o 2.6.18-4.el5     : MCA occurred within 4 hours (3 times out of 4 trials)

The MCA occurs on a 16-CPU ia64 box but hasn't occurred on a 2-CPU ia64 box so far, and it always occurs in ext3's functions (mostly journal_write_metadata_buffer). I'm also trying on an 8-CPU x86_64 box now.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
These are the patches that were added between 2.6.18-1.3014.el5 and 2.6.18-2.el5:

+Patch20023: xen-register-pit-handlers-to-the-correct-domain.patch
+Patch20024: xen-quick-fix-for-cannot-allocate-memory.patch
+Patch21215: linux-2.6-misc-fix-vdso-in-core-dumps.patch
+Patch21216: linux-2.6-sata-ahci-support-ahci-class-code.patch
+Patch21217: linux-2.6-sata-support-legacy-ide-mode-of-sb600-sata.patch
+Patch21218: linux-2.6-rng-check-to-see-if-bios-locked-device.patch
+Patch21219: linux-2.6-mm-handle-map-of-memory-without-page-backing.patch

Is there any chance the reporter could try to identify which patch may have caused the regression? None of them is specifically related to ext3 or jbd; perhaps the last patch listed above could be related?
I'm trying that, but it may take a little while.
Thank you, I know backing out patches is a bit tedious :) But I'm not sure that I can reproduce it here...
This is an update of my testing results. According to the current results, the suspect seems to be Patch21219 (linux-2.6-mm-handle-map-of-memory-without-page-backing.patch), as esandeen guessed.

o 2.6.18-2.el5 without Patch21215: the MCA occurs within 4 hours
o 2.6.18-2.el5 without Patch21218: the MCA occurs within 5 hours
o 2.6.18-2.el5 without Patch21219: the MCA doesn't occur for 32 hours

2.6.18-2.el5 without Patch20023 (xen fix), Patch20024 (xen fix), Patch21216 (sata fix) and Patch21217 (sata fix) was not tested, because I think these patches are not related to this problem.
Thank you for that update; narrowing it down to one patch is very helpful. Are there any other interesting messages before the MCA? Can you share the testcase, or describe what it is doing? Thanks, -Eric
If I understand things right, patch 21219 (linux-2.6-mm-handle-map-of-memory-without-page-backing.patch) is in the community kernel:
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f4b81804a2d1ab341a4613089dc31ecce0800ed8
So it's an interesting point that you don't hit it with the latest community kernel.

For bug 221029, we took that patch in addition to Prarit's intel_rng patch to fix a large-memory x86_64 problem. However, there is some evidence that we may just need the intel_rng part and could possibly drop the other patch that you seem to be hitting a problem with. This needs careful testing because it risks breaking x86_64 if we're not careful.

The original proposal for the do_no_pfn patch (linux-2.6-mm-handle-map-of-memory-without-page-backing.patch, 21219) was in bugzilla 211854. It was later pulled because a different Red Hat xen patch actually fixed our problem back then (at the time, SGI didn't even know it was in the kernel or had been pulled). So this has a bit of a complicated history to it.

I'd be interested to know about the test case, and any ideas you might have on how to proceed. I don't quite know why this patch would cause trouble, and it's had quite a bit of exercise here -- but perhaps not the same type of tests you're running. We could try the test case on our machines as well; let us know if that would help. In other words, should we try running the "filesystem stress tests" -- and where might we find them? Thanks!
Re: comment#8
No interesting messages. The messages below appear suddenly.
----------------------------------------------------------------
Entered OS MCA handler. PSP=a8000000fff21330 cpu=10 monarch=1
All OS MCA slaves have reached rendezvous
----------------------------------------------------------------
The testcase I'm using is a little complicated, so I'm trying to make a simple one. By the way, there seems to be no user of Patch21219 (nopfn) in 2.6.18-2.el5, so I'm confused about why the problem stops occurring just by dropping Patch21219. I'm still investigating.
Puzzling indeed, especially since it doesn't happen with an upstream kernel.

Reading through 221029 and 211854 makes me worried though: the no_pfn patch should not do anything on its own, and apparently it seems to affect both the intel_rng thingy and this. Weird! Also, if anything went amiss, I'd expect something in the fault handler to go bang, not some random ext3 stuff.

What's not clear to me is whether SGI's mspec driver is loaded (although I suspect not), and if so, does it make use of the nopfn handler or the hack in vm_normal_page()? (Or are there any other third-party modules loaded, for that matter?)

Would it be possible to pinpoint the exact place in journal_write_metadata_buffer() where it goes bang? Or does that vary a lot?
(In reply to comment #11)
> Puzzling indeed. Esp since it doesn't happen with an upstream kernel.
>
> Reading through 221029 and 211854 makes me worried though, the no_pfn patch
> should not do anything on its own, and apparently it seems to affect both the
> intel_rng thingy and this. Weird!

Hi Peter,

That's my take too; I really can't see how the nopfn patch could cause this. I strongly suspect it's something else that's hitting it, and the nopfn path is just the thing that pushes it over the limit. The only thing I could imagine would be if someone had a struct vm_operations_struct that was allocated and populated manually without memset'ing it to zero first. Doing so could possibly result in something jumping to a garbage address, but the real bug would be the place allocating it without zeroing it.

> What's not clear to me is whether SGI's mspec driver is loaded (although I
> suspect not), and if so, does it make use of the nopfn handler or the hack in
> vm_normal_page()? (or are there any other 3th party modules loaded for that matter?)

The only current user of nopfn I know of is mspec, and it will normally only use the nopfn path if some app tries to use the fetchop space. The common case is the MPI library, but it can be used manually too. In either case it would require the tester to actively take action to use it.

Cheers,
Jes
Ueda-san,

Is this ia64 NEC system available here in Westford? If so, I'd like to know how to access it. If not, then I have a few questions: What is the size/memory configuration of the system? 32p/32G? Could you send us an lspci output?

If we don't have the system, we might be able to use another of our bigger ia64 systems to reproduce the issue.

P.
Ueda-san, lsmod output would be appreciated too. P.
(In reply to comment #13)
> (In reply to comment #11)
> > What's not clear to me is whether SGI's mspec driver is loaded (although I
> > suspect not), and if so, does it make use of the nopfn handler or the hack in
> > vm_normal_page()? (or are there any other 3th party modules loaded for that
> > matter?)
>
> The only current user of nopfn I know of is mspec and it will normally only ever
> use the nopfn path if some app tries to use the fetchop space. The common case
> is the MPI library, but it can be used manually too. In either case it would
> require the tester to actively take action to use it.

Actually, only the upstream version of mspec uses nopfn. The version of mspec that we ship with rhel5 does not, but it does use the hack in vm_normal_page().
I have sent Prarit the requested HW information from comment#14 and comment#15.

Re: comment#11
I haven't investigated the exact place where the MCA occurs, because "bt -l" of the crash command doesn't work well enough to determine the place, and the places are not always exactly the same, as shown below.

- 2.6.18-2.el5 without Patch21215
  o [BSP:e0000008067d11e0] journal_write_metadata_buffer at a00000020f250ea0
- 2.6.18-2.el5 without Patch21218
  o [BSP:e000000805b011e0] journal_write_metadata_buffer at a00000020f250ed0
  o [BSP:e000000c02131290] __journal_file_buffer at a00000020f23c5b0
- 2.6.18-2.el5
  o [BSP:e000000806a391e0] journal_write_metadata_buffer at a00000020f250e70
  o [BSP:e000000877be11e0] journal_write_metadata_buffer at a00000020f250e70
  o [BSP:e000000406f811e0] journal_write_metadata_buffer at a00000020f250ea0
  o [BSP:e000000801ea11e0] journal_write_metadata_buffer at a00000020f250ed0
- 2.6.18-4.el5
  o [BSP:e000000801f391e0] journal_write_metadata_buffer at a00000020f250e90
  o [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
  o [BSP:e00000011b8011e0] journal_write_metadata_buffer at a00000020f250ee0
  o [BSP:e000000406409290] __journal_file_buffer at a00000020f23c540

I'm trying 2.6.18-8.el5 now, and no MCA has occurred for 20 hours so far. It's very strange, since there seems to be no related fix between -4.el5 and -8.el5. I'll investigate more, but I'm changing the BZ state back to "ASSIGNED".
This is a testing/investigation status update.

o 2.6.18-1.2961.el5: MCA doesn't occur for over 30 hours
o 2.6.18-1.3014.el5: MCA doesn't occur for over 130 hours
o 2.6.18-2.el5     : MCA occurs within 5 hours
o 2.6.18-3.el5     : MCA occurs within 5 hours
o 2.6.18-4.el5     : MCA occurs within 5 hours
o 2.6.18-5.el5     : MCA doesn't occur for over 120 hours
o 2.6.18-6.el5     : MCA doesn't occur for over 25 hours
o 2.6.18-7.el5     : MCA doesn't occur for over 25 hours
o 2.6.18-8.el5     : MCA doesn't occur for over 130 hours

This problem seems to be fixed in -5.el5, though I can't see any related fix. I'm trying to identify which patch fixes the MCA.

And I got the exact places where the MCA occurs. The results show that the MCA sometimes occurs at a 'nop' instruction or an 'adds' instruction.

- 2.6.18-2.el5 (kernel rebuilt on my local machine)
  o [BSP:e00000011fe191e0] journal_write_metadata_buffer at a00000020f250e90
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 370
    <journal_write_metadata_buffer+2352>: [MMI] adds r25=304,r26;;
    R25: 00000000949b004a  R26: e000000404a1bd00
  o [BSP:e0000010030c11e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>: [MMI] ld8 r28=[r29];;
    R28: 0000000000001000  R29: e0000008054354e0
- 2.6.18-2.el5 without Patch21215 (kernel rebuilt on my local machine)
  o [BSP:e0000008067d11e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>: [MMI] st8 [r24]=r35;;
    R24: e000000c71e39e38
- 2.6.18-2.el5 without Patch21218 (kernel rebuilt on my local machine)
  o [BSP:e000000805b011e0] journal_write_metadata_buffer at a00000020f250ed0
    include/asm/bitops.h: 46
    <journal_write_metadata_buffer+2416>: [MIB] nop.m 0x0
  o [BSP:e000000c02131290] __journal_file_buffer at a00000020f23c5b0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/transaction.c: 1951
    <__journal_file_buffer+144>: [MMI] ld8 r9=[r33];;
    R9: e00000101d28f900
- 2.6.18-2.el5
  o [BSP:e000000806a391e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>: [MMI] ld8 r28=[r29];;
    R28: 0000000000001000  R29: e00000101bfe7a60
  o [BSP:e000000406f811e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>: [MMI] st8 [r24]=r35;;
    R24: e00000087f78c6f8
  o [BSP:e000000877be11e0] journal_write_metadata_buffer at a00000020f250e70
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 369
    <journal_write_metadata_buffer+2320>: [MMI] ld8 r28=[r29];;
    R28: 0000000000001000  R29: e000000e6d314ee0
  o [BSP:e000000801ea11e0] journal_write_metadata_buffer at a00000020f250ed0
    include/asm/bitops.h: 46
    <journal_write_metadata_buffer+2416>: [MIB] nop.m 0x0
- 2.6.18-3.el5
  o [BSP:e000000c078d1260] journal_file_buffer at a00000020f242af0
    include/linux/bit_spinlock.h: 22
    <journal_file_buffer+80>: [MMI] mov r18=1048576
    R18: bfffffffffc94b34
- 2.6.18-4.el5
  o [BSP:e000000c078c11e0] journal_write_metadata_buffer at a00000020f250ea0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 371
    <journal_write_metadata_buffer+2368>: [MMI] st8 [r24]=r35;;
    R24: e000000f5d50bed8
  o [BSP:e000000801f391e0] journal_write_metadata_buffer at a00000020f250e90
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/journal.c: 370
    <journal_write_metadata_buffer+2352>: [MMI] adds r25=304,r26;;
    R25: e00000101ee41630  R26: e00000101ee41500
  o [BSP:e00000011b8011e0] journal_write_metadata_buffer at a00000020f250ee0
    include/asm/bitops.h: 44
    <journal_write_metadata_buffer+2432>: [MMI] ld4.acq r2=[r38];;
    R2: 0000000000004020
  o [BSP:e000000406409290] __journal_file_buffer at a00000020f23c540
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/fs/jbd/transaction.c: 1950
    <__journal_file_buffer+32>: [MIB] nop.m 0x0
The MCA occurred on 2.6.18-8.el5 twice during the last weekend's run. The first occurrence took 19 hours, and the second took 38 hours.

- 2.6.18-8.el5
  o [BSP:e0000005066810a0] kernel_thread_helper at a0000001000126a0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/arch/ia64/kernel/process.c: 710
  o [BSP:e00000069d8e90a0] kernel_thread_helper at a0000001000126a0
    /usr/src/debug/kernel-2.6.18/linux-2.6.18.ia64/arch/ia64/kernel/process.c: 710
Any news on this bug? Is it fixed or still a problem on 2.6.18-9.el5?
I confirmed that it occurs on:
o 2.6.18-10.el5
o upstream 2.6.19.1
o upstream 2.6.20

It also occurs in an ext2-only environment (with 2.6.18-4.el5); the IIP at that time points to kernel_thread_helper() according to the dump. (So ext3 is not a suspect now.)

The reason for the MCA was that the chipset detected a READ access to physical address 0x1000000000000 (256[TB]) from a CPU. Because this type of MCA is asynchronous according to Intel's IPF manual, the instruction pointed to by the IIP is not necessarily the instruction that issued the READ. So, to trap the 256[TB] READ before the MCA, I made the attached debug patch. It should cause a kernel panic when a strange TLB insertion or physical-mode memory access is attempted. But I still get the MCA instead of the panic.

Currently, I'm trying to pinpoint which kernel introduced this bug. (Trying 2.6.18-1.3014.el5, but no problem for 400 hours so far, though on other kernels it occurs within 200 hours even in the worst case.)
Created attachment 151496 [details] Debug patch
What are the test results with the debug patch?
RE: comment#24
We expected the patch to catch the illegal access, but it didn't. (i.e., the MCA still occurred.)
I'll try to reproduce it on my tiger4. Which test cases, kernel versions, etc. would you recommend to reproduce this issue?
So far, the key points seem to be:
- a lot of CPUs (please see comment #2)
- a lot of file systems mounted (about 20)
- parallel fs stress on those mount points
- kernel version is flexible (2.6.18-8.el5, 2.6.19.1, 2.6.20)
- reproduction time varies from a few hours to a few weeks
- context switches to/from a kernel thread might be related
Can you share the test cases with me?
The test case is sent to Luming.
OK, I guess it is NOT easy to reproduce this problem, so I'd like to analyze the kdump image. Would you please let me have a look at it?
Is it still the thought that the patch identified in comment #7 is the culprit?
Some kdump images have been sent to Luming.

Re: comment#34
I think Patch21219 is probably not the cause. The current status is below.

The reason for the MCA: the chipset detected a READ access of 128 bytes to physical address 0x1000000000000 (256[TB]) from a CPU. (Found in the HW log.) Analysis from the OS dump is difficult because this type of MCA is asynchronous according to Intel's IPF manual, and the instruction pointed to by the IIP is not necessarily the instruction that issued the READ.

Results of trials:
o 2.6.18-1.3014.el5   : doesn't occur so far (over 700 hours)
o 2.6.18-[2-10].el5   : occurs (though some tests took about 200 hours)
o upstream 2.6.[19-21]: occurs
o ext2-only environment: occurs (ext3 is not a suspect now)
o with the trap patch : occurs (the patch is in comment#23)

Because the MCA doesn't occur on 2.6.18-1.3014.el5, Patch21219 could be a suspect. But I think the probability is very low, because there appears to be no user of nopfn() in RHEL5, and I have confirmed with another trap patch that do_no_pfn() isn't called when the MCA occurs. (So I believe I can reproduce it on a kernel without Patch21219, such as 2.6.18-1.3014.el5, given a very long test run.)

Currently, I'm thinking about trying new firmware such as the IA64 PAL, because the READ access is from a CPU but the chipset can't tell whether it is from the OS or the FW. Since I confirmed that a newer PAL is available (though I'm not sure what kind of updates are included), I'll try it.
This is a status update for comment#35:
2.6.21 without Patch21219 (nopfn)       : the MCA occurs
The latest PAL (PAL_A:7.31, PAL_B:7.79) : the MCA occurs

I started a long-term test (about a few months) on 2.6.18-1.3014.el5.
Status update: Confirmed that the MCA occurs on 2.6.18-1.3014.el5. (It took 600 hours.) So Patch21219 (nopfn) is no longer a suspect.

Next plan: To confirm whether this problem occurs only on RHEL5 (recent kernels), I will start testing on RHEL4.
No MCA occurs on the RHEL4.5 kernel-2.6.9-55.EL for 1512 hours.

Current testing results:
o 2.6.9-55.EL          : doesn't occur so far (over 1500 hours)
o 2.6.18-1.3014.el5    : occurs (took 600 hours; Patch21219 isn't a suspect)
o 2.6.18-[2-10].el5    : occurs (though some tests took about 200 hours)
o upstream 2.6.[19-21] : occurs
o ext2-only environment: occurs (ext3 is not a suspect now)
o with the trap patch  : occurs (the patch is in comment#23)
o with the latest PAL  : occurs (PAL_A:7.31, PAL_B:7.79)

Next plan:
o Try the latest RHEL5 kernel-2.6.18-52.el5 first, because various changes have been included since 2.6.18-10.el5 (1500 hours)
o If the MCA still occurs on 2.6.18-52.el5, try upstream kernels from 2.6.18 back through 2.6.9 to point out a suspect kernel version
The MCA still occurs on 2.6.18-52.el5, after 147 hours of running.

Current testing results:
o 2.6.9-55.EL          : doesn't occur so far (over 1500 hours)
o 2.6.18-1.3014.el5    : occurs (took 600 hours; Patch21219 isn't a suspect)
o 2.6.18-[2-52].el5    : occurs (though some tests took about 200 hours)
o upstream 2.6.[19-21] : occurs
o ext2-only environment: occurs (ext3 is not a suspect now)
o with the trap patch  : occurs (the patch is in comment#23)
o with the latest PAL  : occurs (PAL_A:7.31, PAL_B:7.79)

Next plan:
o Try upstream kernels from 2.6.18 back through 2.6.9 to point out a suspect kernel version
Thanks for consistently updating the testing results in the bugzilla. Just curious: would it be possible to capture all of the processor's TLBs on MCA? That might show whether the memory access that triggered the MCA came from the processor.
The TLB information isn't included in either the chipset log or the MCA log, so I can't see it.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Why was this bugzilla closed with WONTFIX? I think this is a critical bug that needs to be fixed in the future, so I'm reopening it.
Confirmed that the MCA occurs on upstream 2.6.18 and on 2.6.18-53.el5.

Current testing results:
o 2.6.9-55.EL          : doesn't occur so far (over 1500 hours)
o 2.6.18-1.3014.el5    : occurs (took 600 hours; Patch21219 isn't a suspect)
o 2.6.18-[2-53].el5    : occurs (though some tests took about 200 hours)
o upstream 2.6.[18-21] : occurs
o ext2-only environment: occurs (ext3 is not a suspect now)
o with the trap patch  : occurs (the patch is in comment#23)
o with the latest PAL  : occurs (PAL_A:7.31, PAL_B:7.79)

Next plan:
o Try upstream kernels from 2.6.17 back through 2.6.9 to point out a suspect kernel version. However, 2.6.[9-17] built in a RHEL5 environment doesn't boot (the same version with the same config built in a RHEL4 environment can boot), so I need some more confirmation before starting the 2.6.17 testing (e.g. testing a 2.6.18 built on RHEL5 in a RHEL4 environment, and vice versa).
QE nack for 5.2 based on comment 42.
This will not make RHEL5.2:
- it took quite some time to reproduce, and only on a specific platform,
- the investigation is still ongoing.
I suggest moving this to RHEL5.3.
-Luming
> - it took quite sometime to reproduce this systemon only on specific platform...
Just for the record, this MCA also occurred on another NEC ia64 system, a Montecito platform, not only on the old McKinley platform. So I guess this problem is not platform specific.
To comment#48: How do you know the MCA on the other NEC Montecito platform is the same?
Re: comment#52
I can see it from the chipset log, which is hardware-specific binary data. The MCA on the Montecito platform has happened 4 times in the past, and each time the chipset detected an out-of-range access from a CPU.

On the McKinley platform, the accessed physical address is always 0x1000000000000 (256[TB]). On the other hand, on the Montecito platform, the accessed physical address is not always the same:
- 0x00FFFFFFB4000 (around 16[TB])
- 0x00FFFFE610000 (around 16[TB])
- 0x00FFFFFC1C000 (around 16[TB])
- 0x228455C3D4000 (around 552[TB])
That's a difference from the phenomenon on the McKinley platform. But the back-trace from the kernel memory dump was similar to that on the McKinley platform, so I guess it's the same problem.
2.6.18-53.el5: the MCA occurs in less than 10 hrs under the stress test.
2.6.18-85.el5: passes a 100 hr stress test without MCA.
It sounds like there have been some real improvements. I'm going to kick off a 2000 hr stress test.
Under the fs stress testing workload, no MCA for almost 13 days.
[root@nec-tx7-1 ~]# uptime
 22:19:36 up 12 days, 22:49, 2 users, load average: 35.16, 40.90, 43.33
Under the fs stress testing workload, no MCA for almost 20 days.
[root@nec-tx7-1 ~]# uptime
 05:04:18 up 19 days, 5:34, 2 users, load average: 57.74, 48.30, 42.71
[root@nec-tx7-1 ~]# uptime
 22:43:19 up 23 days, 23:13, 2 users, load average: 40.48, 38.14, 38.56
The box has been running the fs stress test for over 620 hrs, still with no MCA. It is the second-best score compared with the results in comment# 45.
[root@nec-tx7-1 ~]# uptime
 22:44:41 up 25 days, 23:14, 2 users, load average: 43.05, 41.77, 41.86
Could NEC run the same test on a different box to confirm that the results are consistent?
Re: Comment#58
OK. But other boxes are currently being used for other testing, and for an internal reason those boxes will be temporarily unavailable from late April through early May. So I will be able to start the test in early May or so.
For an internal reason, the box was manually rebooted, without seeing an MCA, after nearly 700 hrs.

[root@nec-tx7-1 ~]# top
top - 13:25:14 up 28 days, 13:55, 2 users, load average: 41.74, 39.91, 41.31
Tasks: 354 total, 8 running, 346 sleeping, 0 stopped, 0 zombie
Cpu(s): 22.4%us, 42.5%sy, 0.0%ni, 14.9%id, 13.8%wa, 0.0%hi, 6.5%si, 0.0%st
Mem: 66612864k total, 66292416k used, 320448k free, 6642944k buffers
Swap: 0k total, 0k used, 0k free, 1665200k cached

I will schedule another 1000 hrs of testing to verify that it is stable. For now, I'm pretty sure this problem has disappeared in 2.6.18-85.el5. With bisection, I could probably identify the patches whose good side effect cured the problem, but for now I don't think we need to pursue a kernel patch for this problem.

Closing this bug as CURRENTRELEASE for now. If the problem occurs again, please feel free to re-open the bug.