Bug 183232
Summary: | Problems with EDAC module during first boot | |
---|---|---|---
Product: | Red Hat Enterprise Linux 4 | Reporter: | Linda Wang <lwang>
Component: | kernel | Assignee: | Alan Cox <alan>
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.3 | CC: | arozansk, jbaron, notting, ppokorny, rhentosh, tburke
Target Milestone: | --- | Keywords: | Regression
Target Release: | --- | |
Hardware: | i686 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHBA-2006-0068 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-03-07 18:08:51 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Additional email exchange and information:

Following up on my previous post...

Philip Pokorny wrote:

Jason Baron wrote:
> On Fri, 24 Feb 2006, Philip Pokorny wrote:
>
>> I am testing the RHEL4 U3 Beta on an Intel EM64T based system. This is the
>> x86-64/EM64T version of the distribution.
>
> hi Philip,
>
> thanks for testing! what exact kernel version did you encounter this error on?

The kernel is 2.6.9-27.ELsmp. I've also seen the problem with the UP kernel.

Linux 2.6.9-27.ELsmp #1 SMP Tue Dec 20 19:21:06 EST 2005 x86_64 x86_64 x86_64 GNU/Linux

I'm following up on this because I have now got some results for another system with a different motherboard, but the same chipset.

> [root@eng103 ~]# lspci -s 0:0.0
> 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)

This system has 6GB of memory. Initially I didn't think it suffered the same problem, but after running some reboot and memory stress (gzip/gunzip/md5sum of an 8G file) tests, I have the following results.

NOTE: I have disabled "panic on UE" in /etc/modprobe.conf on this system.

The syslog lines show when the system was rebooted. The MC0 entries are repeated instances of the problem. The first two are limited to 76 because this was during reboot testing where the system automatically rebooted after completing the init.d scripts. (I added a "shutdown -r now" to /etc/inittab at the end). Notice the gap between 16:20 and 20:45. That is when I was running the gzip/gunzip/md5sum tests.

> Feb 24 16:04:25 eng103 syslogd 1.4.1: restart.
> Feb 24 16:13:31 eng103 syslogd 1.4.1: restart.
> Feb 24 16:15:28 eng103 syslogd 1.4.1: restart.
> Feb 24 16:18:02 eng103 syslogd 1.4.1: restart.
> Feb 24 16:20:08 eng103 syslogd 1.4.1: restart.
> Feb 24 20:45:09 eng103 syslogd 1.4.1: restart.
> Feb 24 20:47:20 eng103 syslogd 1.4.1: restart.
> Feb 24 20:49:39 eng103 syslogd 1.4.1: restart.
> 76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]
> Feb 24 20:51:39 eng103 syslogd 1.4.1: restart.
> 76 x[Feb 24 20:52:06 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]
> Feb 24 20:53:39 eng103 syslogd 1.4.1: restart.
> Feb 24 20:55:37 eng103 syslogd 1.4.1: restart.
> Feb 24 20:57:36 eng103 syslogd 1.4.1: restart.
> Feb 24 20:59:34 eng103 syslogd 1.4.1: restart.
> Feb 24 21:01:44 eng103 syslogd 1.4.1: restart.
> 330 x[Feb 24 21:04:11 eng103 kernel: MC0: UE - no information available: e752x UE log memory write]

The last one indicates that I booted the system and bypassed the automatic reboot. I captured the following from /proc/mc/0:

> Check PCI Parity: 0
> Panic PCI Parity: 0
> Panic UE: 0
> Log UE: 1
> Log CE: 1
> Poll msec: 1000
>
> MC Core: edac_mc Ver: 2.0.0.devel Dec 20 2005
> MC Module: e752x_edac $Revision: 1.3 $
> Memory Controller: E7520
> PCI Bus ID: 0000:00:00.0 (0000:00:00.0)
> EDAC capability: None SECDED S4ECD4ED
> Current EDAC capability: None S4ECD4ED
> Supported Mem Types: Registered-DDR
>
> 0:|:Memory Size: 2048 MiB
> 0:|:Mem Type: Registered-DDR
> 0:|:Dev Type: x4
> 0:|:EDAC Mode: S4ECD4ED
> 0:|:UE: 0
> 0:|:CE: 0
> 0.0::CE: 0
> 0.1::CE: 0
>
> 2:|:Memory Size: 2048 MiB
> 2:|:Mem Type: Registered-DDR
> 2:|:Dev Type: x4
> 2:|:EDAC Mode: S4ECD4ED
> 2:|:UE: 0
> 2:|:CE: 0
> 2.0::CE: 0
> 2.1::CE: 0
>
> 4:|:Memory Size: 2048 MiB
> 4:|:Mem Type: Registered-DDR
> 4:|:Dev Type: x4
> 4:|:EDAC Mode: S4ECD4ED
> 4:|:UE: 0
> 4:|:CE: 0
> 4.0::CE: 0
> 4.1::CE: 0
>
> Total Memory Size: 6144 MiB
> Seconds since reset: 157
> UE No Info: 942
> CE No Info: 314
> Total UE: 942
> Total CE: 314
> Total PCI Parity: 0

I hope that helps.

In short, it seems that if it loads and initializes OK, it doesn't fail later. I saw this on my initial system as well. Once I had disabled "Panic on UE", if EDAC started complaining, then I could rmmod/insmod the e752x_edac module and the errors would stop.
Thanks,
:v)

Alan wrote:

On Mon, Feb 27, 2006 at 09:24:41AM -0500, Linda Wang wrote:
>>>> >>76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information
>>>> >>available: e752x UE log memory write]

That looks real enough

>>> >In short, it seems that if it loads and initializes OK, it doesn't
>>> >fail later. I saw this on my initial system as well. Once I had
>>> >disabled "Panic on UE", if EDAC started complaining, then I could
>>> >rmmod/insmod the e752x_edac module and the errors would stop.
>>> >

The two posted look rather different. The first is a "christmas tree" of 0xFFFFFFFF, the second looks quite real and the memory controller is flagging specifically UE and CE events.

Philip Pokorny wrote:
> Sorry for the top post. Can't do it any other way from my Treo.
>
> Alan, the log entries I posted were only selected lines. I chose that line because it was unique in the "christmas tree" output. But every one of those UE entries was a full christmas tree of messages.
>
> I'll get the full lspci from both systems when I get to the office.
>
> Thanks
>
> -----Original Message-----
> From: Alan Cox [alan]
> Sent: Mon Feb 27 06:31:58 2006
> To: Linda Wang
> Cc: Philip Pokorny; Jason Baron; Alan Cox; Tim Burke
> Subject: Re: Problems with EDAC module
>
> On Mon, Feb 27, 2006 at 09:24:41AM -0500, Linda Wang wrote:
> > >>76 x[Feb 24 20:50:06 eng103 kernel: MC0: UE - no information
> > >>available: e752x UE log memory write]
>
> That looks real enough
>
> > >In short, it seems that if it loads and initializes OK, it doesn't
> > >fail later. I saw this on my initial system as well. Once I had
> > >disabled "Panic on UE", if EDAC started complaining, then I could
> > >rmmod/insmod the e752x_edac module and the errors would stop.
> > >
>
> The two posted look rather different. The first is a "christmas tree" of
> 0xFFFFFFFF, the second looks quite real and the memory controller is flagging
> specifically UE and CE events.

One more posting that I missed. This is the reply from Alan to Philip's original posting:

Alan Cox wrote:
> On Fri, Feb 24, 2006 at 03:00:15PM -0500, Linda Wang wrote:
>
>>> Fatal Error PCI Express C1
>>> Fatal Error PCI Express C
>>> Fatal Error PCI Express B1
>>> Fatal Error PCI Express B
>>> Fatal Error PCI Express A1
>>> Fatal Error PCI Express A
>>> Fatal Error DMA Controler
>>> Fatal Error HUB Interface
>>> Fatal Error System Bus
>>> Fatal Error DRAM Controler
>
> Something seems to have gone very wrong with the PCI setup as the
> chip is reading back 0xFFFFFFFF if it spewed all of this lot. Looks like
> a serious PCI layer breakage not an EDAC bug. We'd need to know who made
> the chip vanish on us.
>
> My guess is you are looking for another driver which managed to do a
> pci_disable_device() on it
>
> Need to know what is loaded on that box, in what order, and a full lspci -vxxx

Linda Wang wrote:
> Hi Philip,
>
> Can you post the full lspci output? We are curious as to what happened to your systems.

Here are the LSPCI from the two systems. You'll notice that the memory controller is Rev 0C which is the third generation of this chip according to the documentation on the Intel developers web site:

http://developer.intel.com/design/chipsets/E7520_E7320/documentation.htm#specupdates

Your Rev 09 chips were the first generation.

The lspci output is attached. Did you want lspci -vv? or some other combinations of options?
00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)
00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c)
00:07.0 PCI bridge: Intel Corporation E7520 PCI Express Port C1 (rev 0c)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 80333 Segment-A PCI Express-to-PCI Express Bridge
01:00.2 PCI bridge: Intel Corporation 80333 Segment-B PCI Express-to-PCI Express Bridge
03:0e.0 RAID bus controller: Areca Technology Corp. ARC-1220 8-Port PCI-Express to SATA RAID Controller
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
09:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8062 PCI-E IPMI Gigabit Ethernet Controller (rev 14)
07:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

I really need lspci -vxxx

Ideally can you give me
- lspci -vxxx before loading the module
- lspci -vxxx after loading the module

Would also help to know what modules were in the initrd. Can you add your modprobe.conf as well?
ok, i've stuck a test kernel at: http://people.redhat.com/~jbaron/bz183232/ Can you see if this kernel resolves the issue? Created attachment 125421 [details]
lspci-vxxx output (pre and post)
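For comparing saved dumps such as the pre and post lspci -vxxx output attached here, it can help to pull one config-space byte for one device out of each file and look at the two values side by side. The following is only a sketch: the function name reg_byte is made up for illustration, and it assumes the dump has the usual layout of a device header line (for example "00:00.0 Host bridge: ...") with no PCI domain prefix, followed by hex rows of 16 bytes labelled "00:", "10:", and so on.

```shell
#!/bin/bash
# reg_byte DUMP_FILE DEVICE OFFSET_HEX   (hypothetical helper, not part of pciutils)
# Print the config-space byte at OFFSET_HEX (e.g. f4) for DEVICE (e.g. 00:00.0)
# from a saved "lspci -vxxx" dump. Prints nothing if the device or row is absent.
reg_byte() {
    local file=$1 dev=$2 off=$3
    local row col
    row=$(printf '%02x' $(( 0x$off & 0xf0 )))   # label of the 16-byte row, e.g. f0
    col=$(( (0x$off & 0x0f) + 2 ))              # awk field: $1 is the label, bytes start at $2
    awk -v dev="$dev" -v row="$row:" -v col="$col" '
        # A device header line starts with bus:slot.func; track whether it is ours.
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / { in_dev = ($1 == dev) }
        # Inside our device, print the requested byte from the matching hex row.
        in_dev && $1 == row { print $col; exit }
    ' "$file"
}
```

With two saved files (the names below are placeholders), reg_byte pre-lspci-vxxx 00:00.0 f4 and reg_byte post-lspci-vxxx 00:00.0 f4 would print the byte before and after the module load.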
Dell BIOS A04 on PowerEdge 2800, and BIOS A02 on PowerEdge 1425SC, both have E7520 memory controllers. BIOS does not set PCI device 0:0.0, config space register 0xF4, bit 2, itself in either of these systems. I believe then that none of the Dell E7520-based systems have BIOS that sets this bit. Therefore, if the problem can only happen when this bit is set by BIOS, and is then disabled by the e752x_edac driver, Dell systems would be unaffected.

Dell's systems with the C4 stepping E7520 chip will report as (rev 09) in lspci. This is because Dell's BIOS reprograms the version field for these devices because the C4 stepping is "compatible" with the C1-stepping that (rev 09) specifies. These chips will not report as (rev 0C) in lspci.

The new kernel (Comment #12) does *not* resolve the problem.

I am testing by putting the machine in a "reboot loop" with an entry in /etc/inittab. During each boot, it records the lspci -vxxx before and after loading the module. If there are "MC0: UE" errors logged in /var/log/messages, then it saves it to a unique file before the next reboot.

When I first loaded the kernel, I got 13 good reboots after installing the kernel and then on the next reboot, it failed. Failed in this context means that the dmesg and syslog fill with "christmas trees" of all errors set.

Today, I set the machine to reboot whether it failed or not. In 72 reboots, only *once* was the correct bit set after the e752x_edac driver loaded. And that was the only successful boot.

Is there some other race condition that would cause the write to F4 to be undone?

-----

Could this be worked around by putting this:

setpci -s 0:0.0 F4.b=20:20

in /etc/rc.modules or an init.d script?

Cut and paste of an email from Philip describing 1. his test case, and 2. the drivers he uses:

Philip Pokorny wrote:
> bugzilla wrote:
>
>> I really need lspci -vxxx
>>
>> Ideally can you give me
>> - lspci -vxxx before loading the module
>> - lspci -vxxx after loading the module
>>
> You know, that's a little hard since the root filesystem is mounted read-only at the point that the module gets loaded. (It's loaded by the loop at the top of /etc/rc.sysinit driven by kmodule).
>
> To make the minimal intrusion into the init scripts, I'm using this patch to collect and save the output until I can write it to a file:
>
> +++ /etc/rc.sysinit 2006-02-27 14:18:59.000000000 -0800
> @@ -142,6 +142,8 @@
> fi
> fi
>
> +prelspci="$(lspci -vxxx)"
> +
> echo -n $"Initializing hardware... "
>
> ide=""
> @@ -217,6 +219,8 @@
> success
> echo
>
> +postlspci="$(lspci -vxxx)"
> +
> echo "raidautorun /dev/md0" | nash --quiet
>
> # Start the graphical boot, if necessary; /usr may not be mounted yet, so we
> @@ -478,6 +482,10 @@
> [ "$state" != "rw" -a "$READONLY" != "yes" ] && \
> action $"Remounting root filesystem in read-write mode: " mount -n -o remount,rw /
>
> +# Save lspci output from before
> +echo "$prelspci" > /root/pre-lspci-vxxx.$(date "+%Y%m%d-%H%M%S")
> +echo "$postlspci" > /root/post-lspci-vxxx.$(date "+%Y%m%d-%H%M%S")
> +
> # LVM2 initialization
> if [ -x /sbin/lvm.static -o -x /sbin/multipath.static -o -x /sbin/dmraid ]; then
> if ! LC_ALL=C fgrep -q "device-mapper" /proc/devices 2>/dev/null ; then
>
>> Would also help to know what modules were in the initrd.
>>
> Here is the contents of the /lib directory in the initrd:
>
> lib/dm-zero.ko
> lib/libata.ko
> lib/jbd.ko
> lib/dm-mod.ko
> lib/ext3.ko
> lib/dm-mirror.ko
> lib/sd_mod.ko
> lib/scsi_mod.ko
> lib/dm-snapshot.ko
> lib/ata_piix.ko
>
> And modprobe.conf is now the following. The only thing I added was the panic_on_ue=0...
>
> alias eth0 sky2
> alias eth1 sky2
> alias scsi_hostadapter ata_piix
> alias usb-controller ehci-hcd
> alias usb-controller1 uhci-hcd
> options edac_mc panic_on_ue=0
>
> I've got a whole set of lspci -vxxx from before and after loading the module for successful boots and for two failures. I also have the output from a failed boot before removing e752x_edac, after removing e752x_edac and then after reloading e752x_edac. I've attached them as a tarball...
>
> Can you grant me access to edit the Bugzilla entry?
>
> :v)

Jason, do you have the 32-bit versions of the kernel you created in comment #12? I know it didn't solve the problem for 64-bit, but I'd like to verify whether or not it fixes the problem with 32-bit.

Created attachment 125473 [details]
edac quirks patch
i only built an x86_64 kernel. i'll kick a 32-build for you. Also, i'm
attaching the patch that i've added.
ok, x86 kernels at the same place as comment #12

Rev > 0x09 is not affected as the buggy quirk is not applied. Rev 0x09 or anything reporting that will get IRQ balancing disabled which may harm performance slightly sometimes but was necessary for the -real- chips of that rev.

Thus it matters if 0x09 or 0x0C is reported as to what blows up or doesn't

Can Penguin please let us know whether the motherboards on the 2 systems displaying the problem reported here are Intel motherboards or from a third-party (who?). Thanks,

The motherboards in question are designed by our ODM. They are not Intel, Tyan, Supermicro or ASUS.

Also, I've done additional reboot testing and it varies quite a bit. Sometimes it will "fail" on every reboot. Other times, it will fail about 50% of the time. Most recently it's passed 32 reboots without a failure.

In every case of success, the lspci -vxxx shows that the bit (f4=20:20) is set. Every case of failure shows the bit not set.

I've got this in /etc/rc.modules for now to work around the issue, but even with this, I need 'panic_on_ue=0' or else it will panic between 'kmodule' and '/etc/rc.modules'.

setpci -s 0:0.0 f4.b=20:20

Created attachment 125617 [details]
blacklist patch
hi Phillip,
can you please test this update to /etc/hotplug/blacklist, to verify that it
fixes this issue for you.
thanks.
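A quick way to confirm that a blacklist update has taken effect is to check that the module names are actually listed in the file. The sketch below assumes the blacklist is a plain list of module names, one per line, with # starting a comment; the function name is_blacklisted is made up for illustration, and the exact contents of the attached patch are not reproduced here.

```shell
#!/bin/bash
# is_blacklisted MODULE [FILE]   (hypothetical helper)
# Succeed (exit 0) if MODULE appears on a line of its own in FILE.
# Assumes one module name per line; comments and surrounding blanks are ignored.
is_blacklisted() {
    local module=$1 file=${2:-/etc/hotplug/blacklist}
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$file" |
        grep -qxF -- "$module"      # -x: whole line, -F: literal string
}
```

For example, is_blacklisted e752x_edac && echo listed would print "listed" only if the module name is present as its own entry.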
That seems a bit extreme. Isn't that going to disable EDAC entirely? Where else in the startup process would the EDAC drivers get loaded if not by kmodule?

But yes, I'll test this.

Another bit of news, the 32-bit kernels also fail and the quirks patch did not help the 32-bit kernel either. But it did solve a kernel NULL pointer oops with sky2 and bonding.

it is a big hammer yes :). However having the modules blacklisted does not prevent you from doing an 'insmod' or 'modprobe' on them. curious to see if the blacklist fixes the issue...

OK. Yes that prevents the e752x_edac driver from loading.

I've modified /etc/rc.modules to now read:

-------- cut here ---------
#!/bin/bash
#
logger PCI_0:0.0_F4 is $(setpci -s 0:0.0 f4.b)
modprobe e752x_edac
logger PCI_0:0.0_F4 is $(setpci -s 0:0.0 f4.b)
setpci -s 0:0.0 f4.b=20:20
-------- cut here ---------

And now I find that in 22 reboots, the e752x_edac module is consistently setting F4 bit 0x20 when it loads.

Perhaps there is another "quirk" like problem with another driver? Here is the output of kmodule:

OTHER hw_random
OTHER pciehp
OTHER pciehp
OTHER pciehp
OTHER pciehp
OTHER e752x_edac
NETWORK sky2
USB ehci-hcd
USB uhci-hcd
USB uhci-hcd
IDE ata_piix

This implies that pciehp is getting loaded just before the e752x_edac driver. Perhaps the PCIEHP driver is the cause of the race condition/problem?

so just to be clear, the system comes up fine with the workaround from comment #28, and furthermore, loading the edac module at a later stage appears to work fine. is that correct?

Yes. That's correct.

Good testing results all around. Another vendor experiencing problems has reported successful testing using the rev'd hwdata package (which blacklists the edac modules) so I think we're on the road to a band-aid here.

Moving this bug to ON_QA as testing is positive and this bug is part of the hwdata advisory scheduled to go out as part of U3.

Moving to PASS_QA. All is well unless we hear negative testing results from the partners.
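The 20:20 argument used with setpci in the workaround above is a value:mask pair: only the bits selected by the mask are changed, and the rest of the byte is left alone. The arithmetic can be sketched in plain shell, with no hardware access. The function names apply_mask and bit_is_set are made up for illustration.

```shell
#!/bin/bash
# apply_mask OLD VALUE MASK   (all hex, no 0x prefix; hypothetical helper)
# Print the byte that results from a value:mask write: bits selected by MASK
# are taken from VALUE, all other bits keep their OLD state.
apply_mask() {
    local old=$((0x$1)) value=$((0x$2)) mask=$((0x$3))
    printf '%02x\n' $(( (old & ~mask & 0xff) | (value & mask) ))
}

# bit_is_set BYTE MASK   (hex; hypothetical helper)
# Succeed if every bit in MASK is set in BYTE.
bit_is_set() {
    (( (0x$1 & 0x$2) == 0x$2 ))
}
```

For instance, apply_mask 04 20 20 prints 24: bit 0x20 is turned on and the existing 0x04 bit is preserved. A check such as bit_is_set "$(setpci -s 0:0.0 f4.b)" 20 could then be used to log whether the bit survived a module load, under the assumption that setpci prints the byte as two hex digits.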
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0068.html

Don't know if it's related yet but experienced a kernel panic on reboot with the -34.EL kernel and the updated hwdata package while trying to regress this EDAC issue.

Dell has gone ahead and filed a separate bugzilla for this. Refer bugzilla #184523.

Root cause is apparently problems with AMI BIOS code. I'd take the opinion that it's an AMI BIOS flaw but Intel seem to be trying to avoid blaming anyone while giving no useful solution to the problem and effectively seem to be arguing that we should disable this functionality silently for AMI BIOS users.

(Quote from Intel on the kernel-list)

I'm sorry to have to bring up these issues after a fair amount of good work, and I don't know how this problem managed to get by for as long as it has, but there are some issues with the EDAC and the BIOS for managed computer systems.

Managed computers are systems with automatic ECC logging to a System Event Log or SEL. They typically have an out of band Board Management Controller aka BMC or IPMC that runs out of band WRT the OS payload.

The issues found with the EDAC driver are:

1) The default AMI BIOS behavior on SMI is to check the chipset error registers (Dev0:Fun1) and re-hide them.
2) If you are lucky enough to have BIOS code that doesn't re-hide Dev0:Fun1; then when EDAC is loaded there is a race condition between the platform BIOS and the driver to gain access to these registers.
3) If the platform BIOS does the ECC logging out of band WRT the payload OS, there is no good way for the driver to know at load time.

We discovered these problems when testing with one of the later RHEL4-U3 RC's. The EDAC driver called panic when the device 0 Function 1 of the E7520 was re-hidden by the legacy USB SMI that went off between the load of the EDAC driver and the USB host driver.

Loading the EDAC driver for many AMI BIOSes is a panic land mine waiting to go off. Unless the OS knows that it can trust the BIOS to not re-hide those chipset registers using this driver is not a safe thing to do.

Basically if device 0 : function 1 is hidden by the platform at boot time un-hiding and using the device and function is a risky thing to do, as there is likely a good reason for it to have been hidden in the first place. If the BIOS thinks that it owns some registers then the OS should not use them without great care.

It is possible that the driver could be modified to check for re-hiding of the DEV0:FUN1, but this will be racy WRT SMI processing. At least it shouldn't panic.

The driver should never get loaded by default or automatically. If the user knows enough about their BIOS to trust that the SMI behavior will coexist with the driver then it's OK to load otherwise using this driver is not a safe thing to do.

I think the best thing to do is to have the driver error out in its init or probe code if the dev0:fun1 is hidden at boot time.

Comments? Next steps? Do you want me to send a patch implementing graceful error handling at driver init time so it doesn't load if DEV0:FUN1 is hidden?

--mgross
Intel Open Source Technology Center
(503) 677-4628
(503)-712-6227
ms: JF1-235
2111 NW 25th Ave
Hillsboro, OR 97124

Is there something in particular I should ask of my motherboard and BIOS developers? Just tell them to contact AMI for a "fix"?

In fact the systems we have with this problem *do* have AMI BIOS and a BMC. But the comment that the "legacy USB" support can mess with this as well just scares me. What else is SMI code doing behind my OS's back? I wonder if this is the source of all those "lost clock ticks" when doing ACPI calls as well. (see also bug 189052)

Quite possibly, depending how the vendor ACPI is implemented. SMI (and SMM before it) can be very problematic for time sensitive systems depending upon implementation.

If your BIOS developers can customize the AMI BIOS they may have the ability to patch out the code which arbitrarily disables the memory controller on an SMI. Intel are, as I understand it, talking to BIOS vendors as well as Linux people now that the problem is understood. It may be worth talking to Intel and finding out what the actual plans to address the problem are.

Philip, I'm posting here a patch that solved the problem in BZ#185762, which appears to be the same problem you're hitting here. Please test and let me know how it goes.

Created attachment 141776 [details]
patch: avoid touching dev0:fun1 if it's hidden
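From user space, whether device 0 function 1 is currently visible can be checked by looking for 00:00.1 in an lspci listing; the two short listings posted earlier in this report show no 00:00.1 entry. The sketch below is not the attached kernel patch, whose contents are not reproduced here. It is a small illustration of the same idea, with a made-up function name (device_visible) and the assumption that lspci prints the address as the first field without a domain prefix.

```shell
#!/bin/bash
# device_visible ADDRESS [LISTING_FILE]   (hypothetical helper)
# Succeed if ADDRESS (e.g. 00:00.1) appears as a device in an lspci listing.
# With no LISTING_FILE, the live output of lspci is used instead.
device_visible() {
    local addr=$1 listing=${2:-}
    if [ -n "$listing" ]; then
        cat "$listing"
    else
        lspci
    fi | awk -v a="$addr" '$1 == a { found = 1 } END { exit !found }'
}
```

An init script could, for example, run device_visible 00:00.1 || logger "dev0:fun1 is hidden" before deciding whether to load the EDAC module. Note that such a check is only a snapshot: as the Intel mail above points out, an SMI can re-hide the function at any later time.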
Description of problem:

I am testing the RHEL4 U3 Beta on an Intel EM64T based system. This is the x86-64/EM64T version of the distribution.

The install completed successfully, but upon reboot, the system panics during rc.sysinit around "remounting root" or "No Software RAID found" (from dmraid -ay). The panic is:

MC0: Uncorrected Error

That's clearly from the new EDAC feature which was added in the release.

I've tried two different motherboard/CPU sets and two completely different sets of RAM. None of this hardware has exhibited any problems in the past. So I'm fairly certain this is a false positive.

I tried several different ways to disable the "panic_on_ue" behavior on the kernel command line, but "edac_mc.panic_on_ue=0" didn't work, nor did any of the others.

Ultimately, I had to boot into rescue mode and kill the edac modules with the following in /etc/modprobe.conf:

alias e752x_edac /dev/null
alias edac_mc /dev/null

Then I was able to boot and the system appears to be running without problems. Further, I am able to "insmod edac_mc panic_on_ue=0" and load e752x_edac without problems. The e752x_edac module does *not* log any memory errors after I manually load the modules.

Now that I had the system up, I changed the /etc/modprobe.conf to read:

options edac_mc panic_on_ue=0

and tried rebooting the system. Now the system boots and runs just fine except that the log is filling up with the attached error message. Clearly something isn't initialized or being read correctly. But after unloading and reloading the e752x_edac module, everything is fine:

> MC0: Removed device 0 for e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)
> tolm = 20000, remapbase = ffc000, remaplimit = 0
> MC0: Giving out device to e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)

And no further errors are reported.

So it seems that the hotplug loading of e752x_edac in /etc/rc.sysinit (via kmodule) is causing things to be initialized badly. Perhaps there is a race condition of some kind between edac_mc and e752x_edac loading?

What additional information and tests can I run to track down the root of the problem?

I've searched bugzilla, but I haven't found any bugs *at all* against the RHEL4U3 Beta. Perhaps I'm searching the wrong categories? I'll bugzilla if I can get some advice on the right product/release/component to log against.

Thanks!
:v)

PS. I've got a system with a different motherboard but the same chipset that I'll try on next.

Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: UE - no information available: e752x UE log memory write
MC0: UE - no information available: e752x UE log memory write
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: Memory threshold CE
MC0: Memory threshold CE
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR

Version-Release number of selected component (if applicable):
2.6.9-27.EL kernel

How reproducible:
Install the RC1 release

Steps to Reproduce:
1. install the RC beta release
2.
3.

Actual results:

Expected results:

Additional info: