Description of problem:
When using an Adaptec 2200s RAID card with an IBM e325 dual Opteron system, building a filesystem on the disk array causes the machine to lock up with:

MCG_STATUS: unrecoverable
Northbridge Machine Check exception b40000000005001b 0
Uncorrectable condition
Unrecoverable condition
Northbridge status b40000000005001b
GART error 11
Lost an northbridge error
NB error address 00000000eff60000
Error uncorrected
Address: 00000000eff60000
MCE at EIP ffffffff8010de3e ESP ffffffff80633fc8
CPU 0: Machine Check Exception: 0000000000000000
Kernel panic: Unable to continue
In idle task - not syncing

I have tried both the stock SMP kernel with driver 1.1.2 and a custom kernel with driver 1.1.4 (from Adaptec's sources), with the same results.

Version-Release number of selected component (if applicable):
IBM e325, dual Opteron, 5 GB memory. Adaptec 2200s card, firmware 4.0-4[6008]. The aacraid driver in the Red Hat-supplied 2.4.21-4.0.1.ELsmp kernel is 1.1.2. Adaptec has driver source for 1.1.4 on their web site, which behaves better than 1.1.2 but still has this problem.

How reproducible:
Readily reproducible.

Steps to Reproduce:
1. Build an external RAID group (mine is 550GB) using the Adaptec BIOS utility. Allow the array initialization to complete.
2. Use fdisk to create a partition spanning the entire device.
3. Run mkfs -t ext3 /dev/sdb1
4. Wait.
5. Note the system crash on the console.

Actual results:
System crashes/hangs.

Expected results:
Mountable filesystem.

Additional info:
My RAID is 9 x 73G Seagate drives split across 2 channels, built as RAID 5.
I had also opened service request 277987 on this. A tech there responded, suggesting that I add "nomce" to the kernel command line. This is effective in eliminating the crash, and the systems appear stable. I am not familiar with the details of the machine check exception, but if this is a valid fix then please close this ticket.
My customer has also reported a similar case since updating the kernel to 2.4.21-20.EL.x86_64 on an Opteron system:

Sep 7 04:02:23 opteron kernel: Northbridge status a60000010005001b
Sep 7 04:02:23 opteron kernel: GART error 11
Sep 7 04:02:23 opteron kernel: Lost an northbridge error
Sep 7 04:02:23 opteron kernel: NB status: unrecoverable
Sep 7 04:02:23 opteron kernel: NB error address 00000000fbf61258
Sep 7 04:02:23 opteron kernel: Error uncorrected
Documentation for the AMD Opteron MCE architecture may be found at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF

This appears to decode as a GART TLB error with a valid cause address of 00000000fbf61258. Given the address (it is very near where I would expect MMIO space to be allocated), I would look in /proc/iomem and see whether the controller in question has memory near this address.
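The /proc/iomem check suggested above can be done by eye, but a small script makes it easier to repeat. The following is only a minimal sketch: it assumes the usual "start-end : name" hexadecimal layout of /proc/iomem, and the function name regions_near, the 1 MB default window, and the way the address is passed in are all illustrative choices, not part of any existing tool.

```python
import re
import sys

# One /proc/iomem entry looks like "  eff60000-eff6ffff : some name"
# (hexadecimal start and end, optional indentation for nested regions).
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def regions_near(iomem_text, address, window=0x100000):
    """Return (start, end, name) for every region that contains `address`
    or lies within `window` bytes of it. `window` is an arbitrary default."""
    hits = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a region line
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if start - window <= address <= end + window:
            hits.append((start, end, match.group(3).strip()))
    return hits


if __name__ == "__main__":
    # Usage: python iomem_lookup.py fbf61258
    # The address is the "NB error address" value from the machine check log.
    target = int(sys.argv[1], 16)
    with open("/proc/iomem") as handle:
        text = handle.read()
    for start, end, name in regions_near(text, target):
        print("%016x-%016x : %s" % (start, end, name))
```

If the output includes a region owned by the RAID controller's driver, that would support the idea that the faulting address falls in or next to the controller's memory-mapped range. An empty result only means that nothing was registered within the chosen window.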
Thanks for the information, David. Shinya, Jeff: please check /proc/iomem to see if the Adaptec 2200s has memory at the address shown in the machine check. Also, please check with Adaptec and make sure you have the latest firmware for that board.
Since we have not received the feedback we requested, we will assume the problem was not reproducible or has been fixed in a later update for this product. Users who have experienced this problem are encouraged to upgrade to the latest update release. If the issue is still reproducible, please visit the Red Hat Global Support Services page on our website for technical support options: https://www.redhat.com/support

If you have a telephone-based support contract, you may contact Red Hat at 1-888-GO-REDHAT for technical support for the problem you are experiencing.