Bug 183232

Summary:

Problems with EDAC module during first boot

Product:

Red Hat Enterprise Linux 4

Reporter:

Linda Wang <lwang>

Component:

kernel

Assignee:

Alan Cox <alan>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Priority:

medium

Version:

4.3

CC:

arozansk, jbaron, notting, ppokorny, rhentosh, tburke

Keywords:

Regression

Hardware:

i686

OS:

Linux

Fixed In Version:

RHBA-2006-0068

Doc Type:

Bug Fix

Last Closed:

2006-03-07 18:08:51 UTC

Attachments:

Description	Flags
lspci-vxxx output (pre and post)	none
edac quirks patch	none
blacklist patch	none
patch: avoid touching dev0:fun1 if it's hidden	none

Description Linda Wang 2006-02-27 17:04:04 UTC

Description of problem:

I am testing the RHEL4 U3 Beta on an Intel EM64T based system.  This is the
x86-64/EM64T version of the distribution.

The install completed successfully, but upon reboot, the system panics during
rc.sysinit around "remounting root" or "No Software RAID found" (from dmraid -ay).

The panic is:

MC0: Uncorrected Error

That's clearly from the new EDAC feature which was added in the release.

I've tried two different motherboard/CPU sets and two completely different sets
of RAM.  None of this hardware has exhibited any problems in the past.  So I'm
fairly certain this is a false positive.

I tried several different ways to disable the "panic_on_ue" behavior on the
kernel command line, but "edac_mc.panic_on_ue=0" didn't work, nor did any of the
others.

Ultimately, I had to boot into rescue mode and kill the edac modules with the
following in /etc/modprobe.conf:

alias e752x_edac /dev/null
alias edac_mc /dev/null

Then I was able to boot and the system appears to be running without problems.

Further, I am able to "insmod edac_mc panic_on_ue=0" and load e752x_edac without
problems.  The e752x_edac module does *not* log any memory errors after I
manually load the modules.

Now that I had the system up, I changed the /etc/modprobe.conf to read:

options edac_mc panic_on_ue=0

and tried rebooting the system.  Now the system boots and runs just fine except
that the log is filling up with the attached error message.  Clearly something
isn't initialized or being read correctly.

But after unloading and reloading the e752x_edac module, everything is fine:

> MC0: Removed device 0 for e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)
> tolm = 20000, remapbase = ffc000, remaplimit = 0
> MC0: Giving out device to e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)



And no further errors are reported.

So it seems that the hotplug loading of e752x_edac in /etc/rc.sysinit (via
kmodule) is causing things to be initialized badly.  Perhaps there is a race
condition of some kind between edac_mc and e752x_edac loading?

What additional information and tests can I run to track down the root of the
problem?

I've searched bugzilla, but I haven't found any bugs *at all* against the
RHEL4U3 Beta.  Perhaps I'm searching the wrong categories?

I'll file a bug in bugzilla if I can get some advice on the right
product/release/component to log it against.

Thanks!
:v)

PS. I've got a system with a different motherboard but the same chipset that
I'll try on next.



Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: UE - no information available: e752x UE log memory write
MC0: UE - no information available: e752x UE log memory write
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: Memory threshold CE
MC0: Memory threshold CE
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
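
A flood like the one above is easier to compare between boots when repeated messages are collapsed into counts, much like the "76 x[...]" notation used later in this report. The following is only a minimal sketch, assuming the kernel messages are available as syslog-style text; `summarize_mc0` is a hypothetical helper name and is not part of any EDAC tooling.

```shell
#!/bin/bash
# summarize_mc0: read log text on stdin, keep only the EDAC "MC0:" lines,
# drop the timestamp/host prefix, and print each distinct message with its count.
summarize_mc0() {
    grep 'MC0:' | sed 's/^.*MC0:/MC0:/' | sort | uniq -c | sort -rn
}

# Example (the log path is an assumption):
#   summarize_mc0 < /var/log/messages
```

Stripping the prefix first is what lets `uniq -c` group lines that differ only in their timestamps.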


Version-Release number of selected component (if applicable):

2.6.9-27.EL kernel

How reproducible:
Install the RC1 release

Steps to Reproduce:
1. install the RC beta release

Comment 1 Linda Wang 2006-02-27 17:05:49 UTC

Additional email exchange and information:

Following up on my previous post...

Philip Pokorny wrote:

Jason Baron wrote:

> On Fri, 24 Feb 2006, Philip Pokorny wrote:
>  
>
>> I am testing the RHEL4 U3 Beta on an Intel EM64T based system.  This is the
>> x86-64/EM64T version of the distribution.
>>   
>
>
> hi Philip,
>
> thanks for testing! what exact kernel version did you encounter this error on?
>  
>
The kernel is 2.6.9-27.ELsmp. I've also seen the problem with the UP kernel.

Linux 2.6.9-27.ELsmp #1 SMP Tue Dec 20 19:21:06 EST 2005 x86_64 x86_64 x86_64
GNU/Linux

I'm following up on this because I have now got some results for another system
with a different motherboard, but the same chipset.

> [root@eng103 ~]# lspci -s 0:0.0
> 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)

This system has 6GB of memory. Initially I didn't think it suffered the same
problem, but after running some reboot and memory stress (gzip/gunzip/md5sum of
an 8G file) tests, I have the following results. NOTE: I have disabled "panic on
UE" in /etc/modprobe.conf on this system.

The syslog lines show when the system was rebooted. The MC0 entries are repeated
instances of the problem. The first two are limited to 76 because this was
during reboot testing where the system automatically rebooted after completing
the init.d scripts. (I added a "shutdown -r now" to /etc/inittab at the end).
Notice the gap between 16:20 and 20:45. That is when I was running the
gzip/gunzip/md5sum tests.

> Feb 24 16:04:25 eng103 syslogd 1.4.1: restart.
> Feb 24 16:13:31 eng103 syslogd 1.4.1: restart.
> Feb 24 16:15:28 eng103 syslogd 1.4.1: restart.
> Feb 24 16:18:02 eng103 syslogd 1.4.1: restart.
> Feb 24 16:20:08 eng103 syslogd 1.4.1: restart.
> Feb 24 20:45:09 eng103 syslogd 1.4.1: restart.
> Feb 24 20:47:20 eng103 syslogd 1.4.1: restart.
> Feb 24 20:49:39 eng103 syslogd 1.4.1: restart.
> 76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]
> Feb 24 20:51:39 eng103 syslogd 1.4.1: restart.
> 76 x[Feb 24 20:52:06 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]
> Feb 24 20:53:39 eng103 syslogd 1.4.1: restart.
> Feb 24 20:55:37 eng103 syslogd 1.4.1: restart.
> Feb 24 20:57:36 eng103 syslogd 1.4.1: restart.
> Feb 24 20:59:34 eng103 syslogd 1.4.1: restart.
> Feb 24 21:01:44 eng103 syslogd 1.4.1: restart.
> 330 x[Feb 24 21:04:11 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]

The last one indicates that I booted the system and bypassed the automatic
reboot. I captured the following from /proc/mc/0:

> Check PCI Parity: 0
> Panic PCI Parity: 0
> Panic UE: 0
> Log UE: 1
> Log CE: 1
> Poll msec: 1000
>
> MC Core: edac_mc Ver: 2.0.0.devel Dec 20 2005
> MC Module: e752x_edac $Revision: 1.3 $
> Memory Controller: E7520
> PCI Bus ID: 0000:00:00.0 (0000:00:00.0)
> EDAC capability: None SECDED S4ECD4ED
> Current EDAC capability: None S4ECD4ED
> Supported Mem Types: Registered-DDR
>
> 0:|:Memory Size: 2048 MiB
> 0:|:Mem Type: Registered-DDR
> 0:|:Dev Type: x4
> 0:|:EDAC Mode: S4ECD4ED
> 0:|:UE: 0
> 0:|:CE: 0
> 0.0::CE: 0
> 0.1::CE: 0
>
> 2:|:Memory Size: 2048 MiB
> 2:|:Mem Type: Registered-DDR
> 2:|:Dev Type: x4
> 2:|:EDAC Mode: S4ECD4ED
> 2:|:UE: 0
> 2:|:CE: 0
> 2.0::CE: 0
> 2.1::CE: 0
>
> 4:|:Memory Size: 2048 MiB
> 4:|:Mem Type: Registered-DDR
> 4:|:Dev Type: x4
> 4:|:EDAC Mode: S4ECD4ED
> 4:|:UE: 0
> 4:|:CE: 0
> 4.0::CE: 0
> 4.1::CE: 0
>
> Total Memory Size: 6144 MiB
> Seconds since reset: 157
> UE No Info: 942
> CE No Info: 314
> Total UE: 942
> Total CE: 314
> Total PCI Parity: 0
>

I hope that helps.

In short, it seems that if it loads and initializes OK, it doesn't fail later. I
saw this on my initial system as well. Once I had disabled "Panic on UE", if
EDAC started complaining, then I could rmmod/insmod the e752x_edac module and
the errors would stop.

Thanks,
:v)

Comment 2 Linda Wang 2006-02-27 17:06:48 UTC

Alan wrote:

On Mon, Feb 27, 2006 at 09:24:41AM -0500, Linda Wang wrote:

>>>> >>76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information 
>>>> >>available: e752x UE log memory write]


That looks real enough


>>> >In short, it seems that if it loads and initializes OK, it doesn't 
>>> >fail later. I saw this on my initial system as well. Once I had 
>>> >disabled "Panic on UE", if EDAC started complaining, then I could 
>>> >rmmod/insmod the e752x_edac module and the errors would stop.
>>> >


The two posted look rather different. The first is a "christmas tree" of
0xFFFFFFFF, the second looks quite real and the memory controller is flagging
specifically UE and CE events.
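
The "christmas tree" case can be told apart from a real error report by checking whether the raw register value is all ones, since every error flag then tests as set at once. This is only a sketch, assuming the raw value has already been obtained as a hex string; `all_ones` is a hypothetical helper and is not taken from the EDAC driver.

```shell
#!/bin/bash
# all_ones VALUE WIDTH_BITS
# Succeeds when every bit of VALUE within WIDTH_BITS is set, which is what a
# read from a missing or hidden PCI function typically returns.
# WIDTH_BITS is expected to be 8, 16 or 32.
all_ones() {
    local value=$(( $1 ))
    local mask=$(( (1 << $2) - 1 ))
    [ $(( value & mask )) -eq "$mask" ]
}

# Example:
#   all_ones 0xFFFFFFFF 32 && echo "device looks absent, not a real error"
```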

Comment 3 Linda Wang 2006-02-27 17:09:18 UTC

Philip Pokorny wrote:

> Sorry for the top post.  Can't do it any other way from my Treo.
>
> Alan, the log entries I posted were only selected lines.  I chose that line
> because it was unique in the "christmas tree" output.  But every one of those UE
> entries was a full christmas tree of messages.
>
> I'll get the full lspci from both systems when I get to the office.
>
> Thanks
>
>
>  -----Original Message-----
> From:   Alan Cox [alan]
> Sent:   Mon Feb 27 06:31:58 2006
> To:     Linda Wang
> Cc:     Philip Pokorny; Jason Baron; Alan Cox; Tim Burke
> Subject:        Re: Problems with EDAC module
>
> On Mon, Feb 27, 2006 at 09:24:41AM -0500, Linda Wang wrote:
> > >>76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information
> > >>available: e752x UE log memory write]
>
> That looks real enough
>
> > >In short, it seems that if it loads and initializes OK, it doesn't
> > >fail later. I saw this on my initial system as well. Once I had
> > >disabled "Panic on UE", if EDAC started complaining, then I could
> > >rmmod/insmod the e752x_edac module and the errors would stop.
> > >
>
> The two posted look rather different. The first is a "christmas tree" of
> 0xFFFFFFFF, the second looks quite real and the memory controller is flagging
> specifically UE and CE events.

Comment 4 Linda Wang 2006-02-27 17:27:17 UTC

One more posting that I missed.  This is the reply from Alan to Philip's 
original posting:


Alan Cox wrote:

> On Fri, Feb 24, 2006 at 03:00:15PM -0500, Linda Wang wrote:
>
>>> Fatal Error PCI Express C1
>>> Fatal Error PCI Express C
>>> Fatal Error PCI Express B1
>>> Fatal Error PCI Express B
>>> Fatal Error PCI Express A1
>>> Fatal Error PCI Express A
>>> Fatal Error DMA Controler
>>> Fatal Error HUB Interface
>>> Fatal Error System Bus
>>> Fatal Error DRAM Controler
>
>
> Something seems to have gone very wrong with the PCI setup as the
> chip is reading back 0xFFFFFFFF if it spewed all of this lot. Looks like
> a serious PCI layer breakage not an EDAC bug. We'd need to know who made
> the chip vanish on us.
>
> My guess is you are looking for another driver which managed to do a
> pci_disable_device() on it
>
> Need to know what is loaded on that box, in what order, and a full lspci -vxxx
>

Comment 5 Linda Wang 2006-02-27 20:13:01 UTC

Linda Wang wrote:

> Hi Philip,
>
> Can you post the full lspci output?  We are curious as to what happened to your
> systems.

Here is the lspci output from the two systems.  You'll notice that the memory
controller is Rev 0C which is the third generation of this chip according to the
documentation on the Intel developers web site:

http://developer.intel.com/design/chipsets/E7520_E7320/documentation.htm#specupdates

Your Rev 09 chips were the first generation.

The lspci output is attached.  Did you want lspci -vv? or some other
combinations of options?

00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)
00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c)
00:07.0 PCI bridge: Intel Corporation E7520 PCI Express Port C1 (rev 0c)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge
01:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge
03:0e.0 RAID bus controller: Areca Technology Corp. ARC-1220 8-Port PCI-Express to SATA RAID Controller
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
09:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8062 PCI-E IPMI Gigabit Ethernet Controller (rev 14)
07:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

Comment 6 Alan Cox 2006-02-27 21:12:48 UTC

I really need lspci -vxxx

Ideally can you give me 

- lspci -vxxx before loading the module
- lspci -vxxx after loading the module

Would also help to know what modules were in the initrd.

Comment 7 Alan Cox 2006-02-27 21:21:26 UTC

Can you add your modprobe.conf as well ?

Comment 12 Jason Baron 2006-02-28 03:07:53 UTC

ok, i've stuck a test kernel at: http://people.redhat.com/~jbaron/bz183232/ Can
you see if this kernel resolves the issue?

Comment 16 Linda Wang 2006-02-28 20:59:32 UTC

Created attachment 125421 [details]
lspci-vxxx output (pre and post)

lspci-vxxx output (pre and post)

Comment 17 Matt Domsch 2006-02-28 21:07:05 UTC

Dell BIOS A04 on PowerEdge 2800, and BIOS A02 on PowerEdge 1425SC, both have
E7520 memory controllers.  BIOS does not set PCI device 0:0.0, config space
register 0xF4, bit 2, itself in either of these systems.  I believe then that
none of the Dell E7520-based systems have BIOS that sets this bit.

Therefore, if the problem can only happen when this bit is set by BIOS, and is
then disabled by the e752x_edac driver, Dell systems would be unaffected.

Dell's systems with C4 stepping E7520 chip will report as (rev 09) in lspci. 
This is because Dell's BIOS reprograms the version field for these devices
because the C4 stepping is "compatible" with the C1-stepping that (rev 09)
specifies.  These chips will not report as (rev 0C) in lspci.

Comment 18 Philip Pokorny 2006-02-28 23:04:59 UTC

The new kernel (Comment #12) does *not* resolve the problem.

I am testing by putting the machine in a "reboot loop" with an entry in
/etc/inittab.  During each boot, it records the lspci -vxxx before and after
loading the module.  If there are "MC0: UE" errors logged in /var/log/messages,
then it saves it to a unique file before the next reboot.
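
The "save the log only when it failed" step of such a reboot loop might look roughly like the sketch below. This is not the script actually used for these tests; `save_if_ue_errors` is a hypothetical helper, and the log path and destination directory in the example are assumptions.

```shell
#!/bin/bash
# save_if_ue_errors LOGFILE DESTDIR
# If LOGFILE contains any "MC0: UE" lines, copy it to DESTDIR under a
# timestamped name so that the next reboot does not overwrite the evidence.
save_if_ue_errors() {
    local log=$1 destdir=$2
    if grep -q 'MC0: UE' "$log"; then
        cp "$log" "$destdir/messages.$(date '+%Y%m%d-%H%M%S')"
    fi
}

# Example (paths are assumptions):
#   save_if_ue_errors /var/log/messages /root/edac-failures
```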

When I first loaded the kernel, I got 13 good reboots after installing the
kernel and then on the next reboot, it failed.  Failed in this context means
that the dmesg and syslog fill with "christmas trees" of all errors set.

Today, I set the machine to reboot whether it failed or not.  In 72 reboots, only
*once* was the correct bit set after the e752x_edac driver loaded.  And that was
the only successful boot.

Is there some other race condition that would cause the write to F4 to be undone?

-----
Could this be worked around by putting this:

   setpci -s 0:0.0 F4.b=20:20

in /etc/rc.modules or an init.d script?
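
For readers unfamiliar with the `value:mask` form used in that setpci command: only the bits selected by the mask are changed and the rest of the register byte is preserved. A minimal sketch of that arithmetic follows, assuming the documented setpci semantics; `apply_masked_write` is a hypothetical helper that only computes the result and never touches hardware.

```shell
#!/bin/bash
# apply_masked_write OLD VALUE MASK
# Print the byte (two hex digits) that results from writing VALUE under MASK
# on top of the current contents OLD: bits outside MASK keep their old state.
apply_masked_write() {
    local old=$(( $1 )) value=$(( $2 )) mask=$(( $3 ))
    printf '%02x\n' $(( ((old & ~mask) | (value & mask)) & 0xff ))
}

# Example: with the register currently at 0x4f, "f4.b=20:20" would leave 0x6f.
#   apply_masked_write 0x4f 0x20 0x20
```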

Comment 19 Linda Wang 2006-02-28 23:18:51 UTC

Cut and pasted from an email from Philip describing (1) his test case
and (2) the drivers he uses:

Philip Pokorny wrote:

> bugzilla wrote:
>
>> I really need lspci -vxxx
>>
>> Ideally can you give me
>> - lspci -vxxx before loading the module
>> - lspci -vxxx after loading the module
>>  
>>
> You know, that's a little hard since the root filesystem is mounted read-only
> at the point that the module gets loaded. (It's loaded by the loop at the top of
> /etc/rc.sysinit driven by kmodule).
>
> To make the minimal intrusion into the init scripts, I'm using this patch to
> collect and save the output until I can write it to a file:
>
> +++ /etc/rc.sysinit     2006-02-27 14:18:59.000000000 -0800
> @@ -142,6 +142,8 @@
>    fi
> fi
>
> +prelspci="$(lspci -vxxx)"
> +
> echo -n $"Initializing hardware... "
>
> ide=""
> @@ -217,6 +219,8 @@
> success
> echo
>
> +postlspci="$(lspci -vxxx)"
> +
> echo "raidautorun /dev/md0" | nash --quiet
>
> # Start the graphical boot, if necessary; /usr may not be mounted yet, so we
> @@ -478,6 +482,10 @@
> [ "$state" != "rw" -a "$READONLY" != "yes" ] && \
>   action $"Remounting root filesystem in read-write mode: " mount -n -o remount,rw /
>
> +# Save lspci output from before
> +echo "$prelspci" > /root/pre-lspci-vxxx.$(date "+%Y%m%d-%H%M%S")
> +echo "$postlspci" > /root/post-lspci-vxxx.$(date "+%Y%m%d-%H%M%S")
> +
> # LVM2 initialization
> if [ -x /sbin/lvm.static -o -x /sbin/multipath.static -o -x /sbin/dmraid ]; then
>     if ! LC_ALL=C fgrep -q "device-mapper" /proc/devices 2>/dev/null ; then
>
>> Would also help to know what modules were in the initrd.
>>  
>>
> Here is the contents of the /lib directory in the initrd:
>
> lib/dm-zero.ko
> lib/libata.ko
> lib/jbd.ko
> lib/dm-mod.ko
> lib/ext3.ko
> lib/dm-mirror.ko
> lib/sd_mod.ko
> lib/scsi_mod.ko
> lib/dm-snapshot.ko
> lib/ata_piix.ko
>
> And modprobe.conf is now the following.  The only thing I added was the
> panic_on_ue=0...
>
> alias eth0 sky2
> alias eth1 sky2
> alias scsi_hostadapter ata_piix
> alias usb-controller ehci-hcd
> alias usb-controller1 uhci-hcd
> options edac_mc panic_on_ue=0
>
> I've got a whole set of lspci -vxxx from before and after loading the module
> for successful boots and for two failures.  I also have the output from a failed
> boot before removing e752x_edac, after removing e752x_edac and then after
> reloading e752x_edac.  I've attached them as a tarball...
>
> Can you grant me access to edit the Bugzilla entry?
>
> :v)

Comment 20 Philip Pokorny 2006-03-01 05:29:45 UTC

Jason, do you have the 32-bit versions of the kernel you created in comment #12?
I know it didn't solve the problem for 64-bit, but I'd like to verify whether
or not it fixes the problem on 32-bit.

Comment 21 Jason Baron 2006-03-01 15:56:32 UTC

Created attachment 125473 [details]
edac quirks patch

i only built an x86_64 kernel. i'll kick off a 32-bit build for you. Also, i'm
attaching the patch that i've added.

Comment 22 Jason Baron 2006-03-01 18:00:07 UTC

ok, x86 kernels at the same place as comment #12

Comment 25 Alan Cox 2006-03-01 19:23:02 UTC

Rev > 0x09 is not affected as the buggy quirk is not applied. Rev 0x09 or
anything reporting that will get IRQ balancing disabled which may harm
performance slightly sometimes but was necessary for the -real- chips of that rev.

Thus whether 0x09 or 0x0C is reported determines what blows up and what doesn't.

Comment 26 Susan Denham 2006-03-01 19:26:59 UTC

Can Penguin please let us know whether the motherboards in the 2 systems
displaying the problem reported here are Intel motherboards or from a
third party (who?).

Thanks,

Comment 27 Philip Pokorny 2006-03-01 20:08:00 UTC

The motherboards in question are designed by our ODM.  They are not Intel, Tyan,
Supermicro or ASUS.

Also, I've done additional reboot testing and it varies quite a bit.  Sometimes
it will "fail" on every reboot.  Other times, it will fail about 50% of the
time.  Most recently it's passed 32 reboots without a failure.

In every case of success, the lspci -vxxx shows that the bit (f4=20:20) is set.
 Every case of failure shows the bit not set.  I've got this in /etc/rc.modules
for now to work around the issue, but even with this, I need 'panic_on_ue=0' or
else it will panic between 'kmodule' and '/etc/rc.modules'.

setpci -s 0:0.0 f4.b=20:20
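
When comparing many saved `lspci -vxxx` dumps, the byte at offset 0xf4 of device 00:00.0 can be pulled out mechanically rather than read by eye. The sketch below assumes the usual lspci hex dump layout (a device header line, then rows such as `f0: 00 00 ...` with 16 bytes each) and bus addresses printed without a PCI domain prefix; `cfg_byte` is a hypothetical helper name.

```shell
#!/bin/bash
# cfg_byte DEVICE OFFSET
# Read "lspci -xxx" style output on stdin and print the config-space byte at
# OFFSET (for example 0xf4) for DEVICE (for example 00:00.0).
cfg_byte() {
    local dev=$1 offset=$(( $2 ))
    local row col
    row=$(printf '%02x:' $(( offset & ~0xf )))   # label of the 16-byte row, e.g. "f0:"
    col=$(( (offset & 0xf) + 2 ))                # awk field holding the byte
    awk -v dev="$dev" -v row="$row" -v col="$col" '
        # A line whose first field looks like bus:slot.func starts a new device.
        $1 ~ /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7]$/ { in_dev = ($1 == dev) }
        in_dev && $1 == row { print $col; exit }
    '
}

# Example (file name is an assumption):
#   byte=$(cfg_byte 00:00.0 0xf4 < post-lspci-vxxx.txt)
#   [ $(( 0x$byte & 0x20 )) -ne 0 ] && echo "bit 0x20 is set"
```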

Comment 28 Jason Baron 2006-03-03 19:25:53 UTC

Created attachment 125617 [details]
blacklist patch

hi Philip,

can you please test this update to /etc/hotplug/blacklist, to verify that it
fixes this issue for you?

thanks.

Comment 30 Philip Pokorny 2006-03-03 20:12:35 UTC

That seems a bit extreme.  Isn't that going to disable EDAC entirely?  Where
else in the startup process would the EDAC drivers get loaded if not by kmodule?

But yes, I'll test this.

Another bit of news, the 32-bit kernels also fail and the quirks patch did not
help the 32-bit kernel either.  But it did solve a kernel NULL pointer oops with
sky2 and bonding.

Comment 31 Jason Baron 2006-03-03 20:21:45 UTC

it is a big hammer yes :). However having the modules blacklisted does not
prevent you from doing an 'insmod' or 'modprobe' on them. curious to see if the
blacklist fixes the issue...
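
To confirm on a test machine that the blacklist update actually took effect, a check along the following lines could be used. It is only a sketch: it assumes the blacklist file lists one module name per line with `#` comments, and `is_blacklisted` is a hypothetical helper rather than part of the hotplug package.

```shell
#!/bin/bash
# is_blacklisted MODULE [FILE]
# Succeeds if MODULE appears as a whole entry in the hotplug blacklist FILE
# (default /etc/hotplug/blacklist), ignoring comments and surrounding blanks.
is_blacklisted() {
    local module=$1 file=${2:-/etc/hotplug/blacklist}
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$file" \
        | grep -qxF "$module"
}

# Example:
#   is_blacklisted e752x_edac && echo "hotplug will not autoload e752x_edac"
```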

Comment 32 Philip Pokorny 2006-03-03 23:48:36 UTC

OK.  Yes that prevents the e752x_edac driver from loading.  I've modified
/etc/rc.modules to now read:

-------- cut here ---------
#!/bin/bash
#
logger PCI_0:0.0_F4 is $(setpci -s 0:0.0 f4.b)
modprobe e752x_edac
logger PCI_0:0.0_F4 is $(setpci -s 0:0.0 f4.b)
setpci -s 0:0.0 f4.b=20:20
-------- cut here ---------

And now I find that in 22 reboots, the e752x_edac module is consistently setting
F4 bit 0x20 when it loads.

Perhaps there is another "quirk" like problem with another driver?  Here is the
output of kmodule:

OTHER hw_random
OTHER pciehp
OTHER pciehp
OTHER pciehp
OTHER pciehp
OTHER e752x_edac
NETWORK sky2
USB ehci-hcd
USB uhci-hcd
USB uhci-hcd
IDE ata_piix

This implies that pciehp is getting loaded just before the e752x_edac driver. 
Perhaps the PCIEHP driver is the cause of the race condition/problem?
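
The list of modules that precede a given driver in the kmodule output can be extracted as shown in this sketch. It assumes, as the comment above does, that the order of the kmodule output reflects the load order; `loaded_before` is a hypothetical helper name.

```shell
#!/bin/bash
# loaded_before MODULE
# Read "CLASS module" pairs (kmodule output) on stdin and print, once each,
# the modules listed before MODULE.
loaded_before() {
    awk -v target="$1" '
        $2 == target { exit }       # stop at the module of interest
        !seen[$2]++  { print $2 }   # print each earlier module only once
    '
}

# Example:
#   kmodule | loaded_before e752x_edac
```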

Comment 33 Jason Baron 2006-03-04 00:13:45 UTC

so just to be clear, the system comes up fine with the workaround from comment
#28, and furthermore, loading the edac module at a later stage appears to work
fine. is that correct?

Comment 34 Philip Pokorny 2006-03-04 00:27:04 UTC

Yes.  That's correct.

Comment 35 Jay Turner 2006-03-04 09:59:35 UTC

Good testing results all around.  Another vendor experiencing problems has
reported successful testing using the rev'd hwdata package (which blacklists the
edac modules) so I think we're on the road to a band-aid here.  Moving this bug
to  ON_QA as testing is positive and this bug is part of the hwdata advisory
scheduled to go out as part of U3.

Comment 36 Jay Turner 2006-03-04 11:28:21 UTC

Moving to PASS_QA.  All is well unless we hear negative testing results from the
partners.

Comment 38 Red Hat Bugzilla 2006-03-07 18:08:52 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0068.html

Comment 39 Amit Bhutani 2006-03-09 19:59:05 UTC

Don't know if it's related yet, but I experienced a kernel panic on reboot with
the -34.EL kernel and the updated hwdata package while trying to regress this
EDAC issue.

Dell has gone ahead and filed a separate bugzilla for this. Refer to bugzilla
#184523.

Comment 42 Alan Cox 2006-04-24 13:04:44 UTC

Root cause is apparently problems with AMI BIOS code. I'd take the opinion that
it's an AMI BIOS flaw, but Intel seem to be trying to avoid blaming anyone while
giving no useful solution to the problem, and effectively seem to be arguing that
we should disable this functionality silently for AMI BIOS users.

(Quote from Intel on the kernel-list)

I'm sorry to have to bring up these issues after a fair amount of good
work, and I don't know how this problem managed to get by for as long as
it has, but there are some issues with the EDAC and the BIOS for managed
computer systems.

Managed computers are systems with automatic ECC logging to a System
Event Log or SEL.  They typically have an out of band Board Management
Controller aka BMC or IPMC that runs out of band WRT the OS payload.

The issues found with the EDAC driver are:
1) The default AMI BIOS behavior on SMI is to check the chipset error
registers (Dev0:Fun1) and re-hide them.
2) If you are lucky enough to have BIOS code that doesn't re-hide
Dev0:Fun1; then when EDAC is loaded there is a race condition between
the platform BIOS and the driver to gain access to these registers. 
3) If the platform BIOS does the ECC logging out of band WRT the payload
OS, there is no good way for the driver to know at load time.  

We discovered these problems when testing with one of the later RHEL4-U3
RC's.  The EDAC driver called panic when the device 0 Function 1 of the
E7520 was re-hidden by the legacy USB SMI that went off between the load
of the EDAC driver and the USB host driver.  Loading the EDAC driver for
many AMI BIOSes is a panic land mine waiting to go off.  Unless the OS
knows that it can trust the BIOS to not re-hide those chipset registers,
using this driver is not a safe thing to do.

Basically if device 0 : function 1 is hidden by the platform at boot
time un-hiding and using the device and function is a risky thing to do,
as there is likely a good reason for it to have been hidden in the first
place.  If the BIOS thinks that it owns some registers then the OS
should not use them without great care.

It is possible that the driver could be modified to check for re-hiding
of the DEV0:FUN1, but this will be racy WRT SMI processing.  At least
it shouldn't panic.  

The driver should never get loaded by default or automatically.  If the
user knows enough about their BIOS to trust that the SMI behavior will
coexist with the driver then it's OK to load; otherwise using this driver
is not a safe thing to do.

I think the best thing to do is to have the driver error out in its init
or probe code if the dev0:fun1 is hidden at boot time.

Comments?

Next steps?

Do you want me to send a patch implementing graceful error handling at
driver init time so it doesn't load if DEV0:FUN1 is hidden?

--mgross
Intel Open Source Technology Center

Comment 43 Philip Pokorny 2006-04-24 18:12:38 UTC

Is there something in particular I should ask of my motherboard and BIOS developers?

Just tell them to contact AMI for a "fix"?

In fact the systems we have with this problem *do* have AMI BIOS and a BMC.

But the comment that the "legacy USB" support can mess with this as well just
scares me.  What else is SMI code doing behind my OS's back?  I wonder if this
is the source of all those "lost clock ticks" when doing ACPI calls as well. 
(see also bug 189052)

Comment 44 Alan Cox 2006-04-24 18:27:43 UTC

Quite possibly, depending how the vendor ACPI is implemented. SMI (and SMM
before it) can be very problematic for time sensitive systems depending upon
implementation.

If your BIOS developers can customize the AMI BIOS they may have the ability to
patch out the code which arbitrarily disables the memory controller on an SMI.
Intel are, as I understand it, talking to BIOS vendors as well as Linux people
now that the problem is understood. It may be worth talking to Intel and finding
out what the actual plans to address the problem are.

Comment 45 Aristeu Rozanski 2006-11-21 15:35:04 UTC

Philip, I'm posting here a patch that solved the problem in BZ#185762, which
appears to be the same problem you're hitting here. Please test and let me know
how it goes.

Comment 46 Aristeu Rozanski 2006-11-21 15:41:08 UTC

Created attachment 141776 [details]
patch: avoid touching dev0:fun1 if it's hidden
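
The idea named in this patch title, and suggested in comment 42, can be approximated from user space before loading the module by hand. The sketch below is not the attached patch; it assumes that a function hidden by the BIOS at boot was never enumerated and therefore has no entry under sysfs, and `pci_function_visible` is a hypothetical helper. The optional second argument exists only so the check can be exercised against a scratch directory.

```shell
#!/bin/bash
# pci_function_visible ADDRESS [SYSFS_ROOT]
# Succeeds if the PCI function ADDRESS (for example 0000:00:00.1) was
# enumerated at boot, that is, if it has a directory under sysfs.
pci_function_visible() {
    local addr=$1 root=${2:-/sys}
    [ -e "$root/bus/pci/devices/$addr" ]
}

# Example: only load the driver when dev0:fun1 was visible at boot.
#   if pci_function_visible 0000:00:00.1; then
#       modprobe e752x_edac
#   else
#       echo "dev0:fun1 is hidden; not loading e752x_edac" >&2
#   fi
```

Note that this only reflects the state at enumeration time; it cannot detect an SMI handler re-hiding the function later, which is the race described in comment 42.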