Bug 848149

Summary: i82975x_edac dereferencing garbage in i82975x_init_csrows
Product: [Fedora] Fedora
Component: kernel
Version: 17
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Thomas Moschny <thomas.moschny>
Assignee: Mauro Carvalho Chehab <mchehab>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: appublic, brm, gansalmon, gerney, hannu.martikka, itamar, jforbes, jonathan, kernel-maint, len.brown, lwang, madhu.chinakonda, mvegh, nathankohagen, nicolas.vieville, rkukura, robert.wilhelm, tom
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-12-20 10:14:18 EST
Attachments:
Description                                                    Flags
screenshot                                                     none
screenshot part2                                               none
Full dmesg output from 2.6.0-1 boot on Dell Workstation 390    none
Output of lspci on Dell Workstation 390                        none
blacklist i82975x_edac module at boot time                     none

Description Thomas Moschny 2012-08-14 14:37:38 EDT
Description of problem:

After a kernel update, the machine no longer boots; it stops somewhere after displaying that it is going to mount /boot. Scrolling up shows a kernel trace with "i82975x_init_one [i82975x_edac]" on top, in "strncpy", for process "udevd".

Version-Release number of selected component (if applicable):
kernel-3.5.1-1.fc17.x86_64

This started with kernel-3.5.0-1.fc17.x86_64.
kernel-3.4.6-2.fc17.x86_64 is the last one that works.

How reproducible:
Always.

Additional info:
This is a Dell Precision WorkStation 390.

lspci:
00:00.0 Host bridge: Intel Corporation 82975X Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 82975X PCI Express Root Port
00:1b.0 Audio device: Intel Corporation N10/ICH 7 Family High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation N10/ICH 7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation N10/ICH 7 Family USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation N10/ICH 7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation N10/ICH7 Family SATA AHCI Controller (rev 01)
00:1f.3 SMBus: Intel Corporation N10/ICH 7 Family SMBus Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express (rev 02)
05:02.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 04)
05:02.1 Input device controller: Creative Labs SB Live! Game Port (rev 01)
Comment 1 Thomas Moschny 2012-08-14 15:31:13 EDT
Created attachment 604426 [details]
screenshot
Comment 2 Thomas Moschny 2012-08-14 15:32:26 EDT
Created attachment 604427 [details]
screenshot part2
Comment 3 Dave Jones 2012-08-14 17:46:45 EDT
that Code: line decoded is..

   0:	0f b6 16             	movzbl (%rsi),%edx
   3:	80 fa 01             	cmp    $0x1,%dl
   6:	88 11                	mov    %dl,(%rcx)
   8:	48 83 de ff          	sbb    $0xffffffffffffffff,%rsi
   c:	48 83 c1 01          	add    $0x1,%rcx
  10:	4c 39 c1             	cmp    %r8,%rcx
  13:	75 eb                	jne    0x0


But somehow %rsi is a crazy value (0x5f706f5f63616465)
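
The value in %rsi looks less random when its bytes are read as text. A minimal Python sketch, assuming only that x86 stores the least significant byte first; the helper name `pointer_as_ascii` is made up for illustration:

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Render a register value as the ASCII text its bytes would spell in memory.

    x86 is little-endian, so the least significant byte comes first.
    Non-printable bytes are shown as '.'.
    """
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


if __name__ == "__main__":
    # %rsi from the 64-bit trace quoted in this report
    print(pointer_as_ascii(0x5F706F5F63616465))
```

The bytes spell `edac_op_`, which suggests that the source pointer handed to strncpy was read from string data rather than from a valid pointer slot.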

The rip is in strncpy. The only call to that function in this driver is here..

    strncpy(csrow->channels[chan]->dimm->label,
            labels[(index >> 1) + (chan * 2)],
            EDAC_MC_LABEL_LEN);

in i82975x_init_csrows

csrow->channels[chan]->dimm->label looks like it hasn't been initialised correctly, but I can't see exactly what's wrong yet.
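
One way a source pointer can end up holding string bytes is an out-of-range index into a small pointer table. The Python sketch below only illustrates the index expression from the strncpy call. The table size of four and the csrow/channel counts are assumptions chosen for the example, not values taken from the driver source, and the function names are hypothetical.

```python
def label_index(index: int, chan: int) -> int:
    """Index expression from the strncpy call: (index >> 1) + (chan * 2)."""
    return (index >> 1) + (chan * 2)


def out_of_range_pairs(n_csrows: int, n_chans: int, n_labels: int):
    """Return every (csrow index, channel) pair whose label index is past the table end."""
    return [
        (index, chan)
        for index in range(n_csrows)
        for chan in range(n_chans)
        if label_index(index, chan) >= n_labels
    ]


if __name__ == "__main__":
    # Hypothetical sizes for illustration: 4 labels, 8 csrows, 2 channels.
    for index, chan in out_of_range_pairs(n_csrows=8, n_chans=2, n_labels=4):
        print(f"csrow {index}, channel {chan} -> labels[{label_index(index, chan)}]")
```

With these assumed sizes, four (csrow, channel) pairs index past the end of the table. In C such a read does not raise an error; it typically returns whatever bytes happen to follow the array.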
Comment 4 Thomas Moschny 2012-08-17 08:12:50 EDT
If there is anything I can do (test, etc...) to get this fixed, let me know.
Comment 5 Eric Gerney 2012-08-20 13:39:04 EDT
I'm having the exact same issue on the same hardware, a Dell Precision WorkStation 390.  The last kernel that booted without issue was kernel-PAE-3.4.6-2.fc17.i686.  Everything since then has been a chore to boot.

Not only is the boot issue the same, but so is the stack trace in the i82975x_edac module.  I tried booting with rd.udev.debug, but that didn't yield anything useful to me.

To boot my system, I've done the following with varying success:

- Specify VG on the kernel command line:
   rd.lvm.lv=vg_ejg/lv_home

- Comment out everything but root and boot in /etc/fstab, boot to single user mode, manually mount those entries, then run 'systemctl default'

- Attempt to blacklist various kernel modules via the command line:
   rd.blacklist=scsi_wait_scan rd.blacklist=i82975x_edac

Bottom line, if I just keep trying, eventually I'll get a usable system.

Also, MagicSysRq has been quite useful.

Willing to provide whatever information is needed.  Thanks!
Comment 6 Hannu Martikka 2012-08-22 10:20:51 EDT
I have the same problem on a Dell 390.

Booting with F16 kernel (3.4.9-1.fc16.x86_64) works.
Comment 7 Eric Gerney 2012-08-30 23:24:36 EDT
Issue persists with kernel-PAE-3.5.2-3.fc17.i686.
Comment 8 Eric Gerney 2012-09-06 11:58:11 EDT
(In reply to comment #7)
Boot issue continues with kernel-PAE-3.5.3-1.fc17.i686.

I updated the 390 BIOS to ver 2.6.0 and toggled Multiple CPU Cores with no luck.  My system has 8GB: four 2GB DIMMs with ECC.

To boot cleanly to the default target, I just renamed /lib/modules/3.5.3-1.fc17.i686.PAE/kernel/drivers/edac/i82975x_edac.ko to i82975x_edac.ko.bak.  This stopped udevd from horking, which appears to be why the system doesn't boot.

Diffing linux-3.4/drivers/edac/i82975x_edac.c (last known working) with the linux-3.5 version shows some differences, but I'm not experienced enough to know what I'm looking at.
Comment 9 Len Brown 2012-09-06 14:03:09 EDT
*** Bug 851525 has been marked as a duplicate of this bug. ***
Comment 10 Len Brown 2012-09-06 14:40:13 EDT
I have an Intel D975XBX with 6GB of ECC memory.

When ECC is enabled in BIOS setup, it fails in i82975x_edac as above.
However, when ECC is disabled in BIOS SETUP, FC17 is able to boot.
Comment 11 mvegh 2012-09-12 05:35:54 EDT
I have the same issue with an HP xw4400 (4GB RAM). i82975x_edac is crashing on both F17.x86 and F17.x86_64.

Both the 32-bit and the 64-bit versions boot normally with ECC disabled.

Kernel dump of the 32 bit version:

[    6.703904] Oops: 0000 [#1] SMP 
[    6.703911] Modules linked in: i82975x_edac(+) mfd_core edac_core ppdev hp_wmi coretemp kvm_intel sparse_keymap rfkill kvm snd_hda_codec_realtek parport_pc parport tg3 snd_hda_intel snd_hda_codec microcode serio_raw snd_hwdep snd_pcm snd_page_alloc snd_timer snd soundcore uinput nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
[    6.703980] 
[    6.703984] Pid: 388, comm: udevd Not tainted 3.5.3-1.fc17.i686.PAE #1 Hewlett-Packard HP xw4400 Workstation/0A68h
[    6.703997] EIP: 0060:[<c067947d>] EFLAGS: 00010206 CPU: 0
[    6.704003] EIP is at strncpy+0x1d/0x40
[    6.704008] EAX: f24cea40 EBX: f24cea40 ECX: 0000001e EDX: 63616465
[    6.704015] ESI: 63616465 EDI: f24cea40 EBP: f23e7d44 ESP: f23e7d38
[    6.704021]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    6.704026] CR0: 80050033 CR2: 63616465 CR3: 36fe5000 CR4: 000007f0
[    6.704032] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    6.704038] DR6: ffff0ff0 DR7: 00000400
[    6.704043] Process udevd (pid: 388, ti=f23e6000 task=f6f425b0 task.ti=f23e6000)
[    6.704048] Stack:
[    6.704052]  f24cea40 00000003 f24ce140 f23e7db8 f7f31582 00000004 f24ce038 f7e3a000
[    6.704068]  f24ce000 f2c3a000 00000000 00000000 f24ce034 00000000 00000000 00040000
[    6.704083]  00040000 00000004 fed14000 00000246 f23e7db8 00000286 f2c3a060 00000003
[    6.704099] Call Trace:
[    6.704106]  [<f7f31582>] i82975x_init_one+0x22c/0x2db [i82975x_edac]
[    6.704116]  [<c074f601>] ? pm_runtime_autosuspend_expiration+0x1/0x90
[    6.704124]  [<c0694427>] pci_device_probe+0x87/0x110
[    6.704132]  [<c05ac4f7>] ? sysfs_create_link+0x17/0x20
[    6.704139]  [<c07469d0>] driver_probe_device+0x80/0x350
[    6.704145]  [<c069436e>] ? pci_match_device+0x9e/0xb0
[    6.704151]  [<c0746d31>] __driver_attach+0x91/0xa0
[    6.704157]  [<c0746ca0>] ? driver_probe_device+0x350/0x350
[    6.704163]  [<c0744ee2>] bus_for_each_dev+0x42/0x80
[    6.704169]  [<c074633e>] driver_attach+0x1e/0x20
[    6.704175]  [<c0746ca0>] ? driver_probe_device+0x350/0x350
[    6.704181]  [<c07460a7>] bus_add_driver+0x1a7/0x2b0
[    6.704188]  [<c066fecd>] ? kset_find_obj+0x2d/0x60
[    6.704194]  [<c06944b0>] ? pci_device_probe+0x110/0x110
[    6.704200]  [<c06944b0>] ? pci_device_probe+0x110/0x110
[    6.704206]  [<c074732a>] driver_register+0x6a/0x140
[    6.704212]  [<f7f36000>] ? 0xf7f35fff
[    6.704218]  [<c06934e2>] __pci_register_driver+0x42/0xb0
[    6.704224]  [<f7f36000>] ? 0xf7f35fff
[    6.704229]  [<f7f3602b>] i82975x_init+0x2b/0x1000 [i82975x_edac]
[    6.704238]  [<c0403112>] do_one_initcall+0x112/0x160
[    6.704245]  [<c04a1432>] ? set_section_ro_nx+0x62/0x80
[    6.704252]  [<c04a3e04>] sys_init_module+0x1004/0x1d80
[    6.704259]  [<c0526f66>] ? do_mmap_pgoff+0x1e6/0x2d0
[    6.704269]  [<c096851f>] sysenter_do_call+0x12/0x28
[    6.704273] Code: 8b 7d fc 89 ec 5d c3 8d b4 26 00 00 00 00 55 89 e5 83 ec 0c 89 5d f4 89 75 f8 89 7d fc 66 66 66 66 90 89 c3 89 d6 89 c7 49 78 08 <ac> aa 84 c0 75 f7 f3 aa 89 d8 8b 75 f8 8b 5d f4 8b 7d fc 89 ec 
[    6.704362] EIP: [<c067947d>] strncpy+0x1d/0x40 SS:ESP 0068:f23e7d38
[    6.704371] CR2: 0000000063616465
[    6.705693] ---[ end trace 94f1a7de70cd4f55 ]---
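The register dump above can be cross-checked mechanically. The following Python sketch parses a few lines copied from this oops (timestamps removed) and compares the faulting address in CR2 with the string source register ESI; the regular expression and the function name are illustrative, not part of any kernel tooling.

```python
import re

# A few lines copied from the 32-bit oops above (timestamps removed).
OOPS_EXCERPT = """\
EIP is at strncpy+0x1d/0x40
EAX: f24cea40 EBX: f24cea40 ECX: 0000001e EDX: 63616465
ESI: 63616465 EDI: f24cea40 EBP: f23e7d44 ESP: f23e7d38
CR0: 80050033 CR2: 63616465 CR3: 36fe5000 CR4: 000007f0
"""

# Matches pairs such as "ESI: 63616465" or "CR2: 63616465".
REGISTER = re.compile(r"\b([A-Z]{2,3}\d?): ([0-9a-f]{8})\b")


def parse_registers(text: str) -> dict:
    """Collect 'NAME: hex' pairs from oops text into a dict of integers."""
    return {name: int(value, 16) for name, value in REGISTER.findall(text)}


if __name__ == "__main__":
    regs = parse_registers(OOPS_EXCERPT)
    print("fault address equals ESI:", regs["CR2"] == regs["ESI"])
    print("fault address as text:", regs["CR2"].to_bytes(4, "little").decode("ascii"))
```

Here CR2 and ESI hold the same value, and its bytes spell `edac`, consistent with the `edac_op_` pattern seen in %rsi on the 64-bit traces.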
Comment 12 Andrew 2012-09-17 18:02:02 EDT
I have the exact same issue as Eric Gerney.  I also have a Dell Workstation 390.  Renaming i82975x_edac.ko allows me to boot.
Comment 13 Eric Gerney 2012-10-01 09:59:13 EDT
Recent kernel updates 3.5.4-1 and 3.5.4-2 exhibit the same behavior (32-bit), probably because i82975x_edac.c hasn't changed.  Apparently there are module updates in the 3.6.x series; I don't know whether they address this issue.

Again, renaming i82975x_edac.ko allows the system to boot.
Comment 14 Josh Boyer 2012-10-01 10:19:49 EDT
Mauro, any ideas on this one?  Seems to only happen if ECC is enabled for the DIMMs.
Comment 15 Mauro Carvalho Chehab 2012-10-01 10:44:01 EDT
(In reply to comment #14)
> Mauro, any ideas on this one?  Seems to only happen if ECC is enabled for
> the DIMMs.

When ECC is disabled, the module won't load.

There are indeed some fixes that got merged in 3.6-rc6, correcting some issues with
memory allocation on device module register/unregister.

We should test whether they fix the reported issue. My suggestion is to test an F18 kernel or a 3.6 kernel there. If that fixes it, all we need to do is backport the existing patches, assuming we're not currently planning to release a 3.6 kernel for F17.
Comment 16 Mauro Carvalho Chehab 2012-10-01 10:46:31 EDT
(In reply to comment #15)
> (In reply to comment #14)
> > Mauro, any ideas on this one?  Seems to only happen if ECC is enabled for
> > the DIMMs.
> 
> When ECC is disabled, the module won't load.
> 
> There are indeed some fixes that got merged at 3.6-rc6, correcting some
> issues at
> memory allocation on device module register/unregister.

Actually, it was merged the day before 3.6-rc7, if I'm not mistaken.

Anyway, a 3.6-rc7 or 3.6 kernel should contain those fixes.
Comment 17 Andrew 2012-10-02 13:13:50 EDT
I built and installed 3.6.0-1 from the F18 SRPM on Koji (onto my F17 system).  The i82975x_edac module is still blowing up, but now the kernel pauses for ~10 seconds, recovers, and continues booting.  So the core issue with the i82975x_edac module is not fixed, but the kernel handles the failure more gracefully.

Dump info (full dmesg output attached):


[    7.955498] general protection fault: 0000 [#1] SMP
[    7.955510] Modules linked in: snd_hda_codec_idt(F) i82975x_edac(F+) dcdbas(F) snd_hda_intel(F+) snd_hda_codec(F) i2c_core(F) edac_core(F) ppdev(F) tg3(F) snd_hwdep(F) snd_pcm(F) snd_page_alloc(F) snd_timer(F) snd(F) soundcore(F) parport_pc(F) parport(F) raid1(F)
[    7.955514] CPU 0
[    7.955514] Pid: 386, comm: udevd Tainted: GF            3.6.0-1.fc17.x86_64 #1 Dell Inc.                 Precision WorkStation 390    /0DN075
[    7.955524] RIP: 0010:[<ffffffff812e2108>]  [<ffffffff812e2108>] strncpy+0x18/0x30
[    7.955525] RSP: 0018:ffff880037315b68  EFLAGS: 00010202
[    7.955526] RAX: ffff880037158688 RBX: ffff880037373c00 RCX: ffff880037158688
[    7.955528] RDX: 000000000000001f RSI: 5f706f5f63616465 RDI: ffff880037158688
[    7.955529] RBP: ffff880037315b68 R08: ffff8800371586a7 R09: 000000000000fffe
[    7.955530] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
[    7.955531] R13: 0000000000000000 R14: ffff880037158400 R15: ffff88007b90a370
[    7.955533] FS:  00007f669b66e840(0000) GS:ffff88007fa00000(0000) knlGS:0000000000000000
[    7.955534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.955535] CR2: 00007f669d92c048 CR3: 0000000036ef1000 CR4: 00000000000007f0
[    7.955536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.955538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    7.955539] Process udevd (pid: 386, threadinfo ffff880037314000, task ffff880036c45c40)
[    7.955540] Stack:
[    7.955543]  ffff880037315c18 ffffffffa00196b8 0004000036c45c40 ffff88007b90a378
[    7.955545]  ffffc90000354000 ffff88007b90a000 ffff88007c130000 0000000000000000
[    7.955547]  0000000000000000 0004000000000000 feda00003718d690 0000000000000246
[    7.955548] Call Trace:
[    7.955554]  [<ffffffffa00196b8>] i82975x_init_one+0x2e6/0x3e6 [i82975x_edac]
[    7.955558]  [<ffffffff81308539>] local_pci_probe+0x79/0x100
[    7.955561]  [<ffffffff813086e1>] pci_device_probe+0x121/0x130
[    7.955565]  [<ffffffff813c4fab>] driver_probe_device+0x8b/0x390
[    7.955568]  [<ffffffff813c535b>] __driver_attach+0xab/0xb0
[    7.955571]  [<ffffffff813c52b0>] ? driver_probe_device+0x390/0x390
[    7.955573]  [<ffffffff813c3045>] bus_for_each_dev+0x55/0x90
[    7.955576]  [<ffffffff813c492e>] driver_attach+0x1e/0x20
[    7.955578]  [<ffffffff813c4560>] bus_add_driver+0x1a0/0x290
[    7.955580]  [<ffffffffa003d000>] ? 0xffffffffa003cfff
[    7.955582]  [<ffffffffa003d000>] ? 0xffffffffa003cfff
[    7.955584]  [<ffffffff813c5a27>] driver_register+0x77/0x170
[    7.955586]  [<ffffffffa003d000>] ? 0xffffffffa003cfff
[    7.955588]  [<ffffffff8130735e>] __pci_register_driver+0x5e/0xe0
[    7.955590]  [<ffffffffa003d000>] ? 0xffffffffa003cfff
[    7.955593]  [<ffffffffa003d035>] i82975x_init+0x35/0x1000 [i82975x_edac]
[    7.955597]  [<ffffffff8100212a>] do_one_initcall+0x12a/0x180
[    7.955602]  [<ffffffff810bd0f6>] sys_init_module+0x126/0x2230
[    7.955604]  [<ffffffff812f9300>] ? ddebug_proc_open+0xd0/0xd0
[    7.955609]  [<ffffffff816280a9>] system_call_fastpath+0x16/0x1b
[    7.955629] Code: 84 c9 75 ef 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8 4c 8d 04 17 48 89 f9 48 89 e5 74 1a 0f 1f 44 00 00 <0f> b6 16 80 fa 01 88 11 48 83 de ff 48 83 c1 01 4c 39 c1 75 eb
[    7.955631] RIP  [<ffffffff812e2108>] strncpy+0x18/0x30
[    7.955632]  RSP <ffff880037315b68>
[    7.955652] ---[ end trace 46cbbac85ac355e9 ]---
Comment 18 Andrew 2012-10-02 13:16:22 EDT
Created attachment 620398 [details]
Full dmesg output from 2.6.0-1 boot on Dell Workstation 390
Comment 19 Mauro Carvalho Chehab 2012-10-02 13:28:10 EDT
(In reply to comment #18)
> Created attachment 620398 [details]
> Full dmesg output from 2.6.0-1 boot on Dell Workstation 390

Please post the output for:
# lspci -vnn

I'll see if I can find a similar machine to that one.
Comment 20 Andrew 2012-10-02 13:32:34 EDT
Created attachment 620410 [details]
Output of lspci on Dell Workstation 390

This was run under the 3.5.4 kernel with the i82975x module disabled - I assume that doesn't matter.
Comment 21 nicolas.vieville 2012-10-03 03:56:15 EDT
Hello,

I found this bug report while digging through the Web to understand why my 10 HP XW4400 stations (not mine, my employer's ;) ) were each booting with random problems such as:
- boot OK but no sound card detected ;
- boot OK but no keyboard detected (only dead keys work, e.g. the CTRL key; very
  annoying) ;
- boot process stops and never reaches the end as described above (similar
  dmesg messages).

All these stations booted correctly after I reinstalled and used kernel-3.4.6-2.fc17.x86_64.rpm yesterday... before I found this thread today. 

After reading your bug report and these threads:
- https://bbs.archlinux.org/viewtopic.php?id=148033
- https://bugzilla.kernel.org/show_bug.cgi?id=47171
it seems that the i82975x_edac kernel module is what causes these symptoms.
I'm not at work today, but I think it would be worth trying to blacklist this device driver at boot time (maybe by adding a specific file such as /etc/modprobe.d/blacklist_edac_fc17.conf) while waiting for a fix.

I'll provide feedback about that suggestion once back at work in the next few days.

Hope this helps to catch this not_so_easy_to_identify bug!

Cordially,


-- 
NVieville
Comment 22 nicolas.vieville 2012-10-04 13:33:44 EDT
Created attachment 621751 [details]
blacklist i82975x_edac module at boot time

Hello,

As promised yesterday, I confirm that copying blacklist_edac_fc17.conf (attached) into the /etc/modprobe.d/ directory on HP XW4400 workstations running F-17 x86_64 makes them more stable at boot time and usable (no more keyboard, sound card or general freeze problems at boot time for the moment).

This can be used temporarily while waiting for the module to be corrected, but I can't certify that this file would do the trick on other/all platforms.
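
The attachment itself is not reproduced in this report. As a rough sketch of what such a file could contain, the Python helper below renders `blacklist` lines in modprobe.d syntax. The helper names are hypothetical, the file contents are an assumption, and the output goes to a temporary directory rather than /etc/modprobe.d so that the example runs unprivileged.

```python
import tempfile
from pathlib import Path


def render_blacklist(modules):
    """Return modprobe.d text with one 'blacklist <module>' line per module."""
    return "".join(f"blacklist {name}\n" for name in modules)


def write_blacklist(directory, filename, modules):
    """Write the rendered blacklist file and return its path."""
    path = Path(directory) / filename
    path.write_text(render_blacklist(modules))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # File name taken from comment 21; the contents are an assumption.
        path = write_blacklist(tmp, "blacklist_edac_fc17.conf", ["i82975x_edac"])
        print(path.read_text(), end="")
```

Copying the resulting file into /etc/modprobe.d/ requires root. A `blacklist` entry only stops alias-based automatic loading; the module can still be loaded explicitly by name.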

Cordially,


-- 
NVieville
Comment 23 Andrew 2012-10-10 16:59:46 EDT
Just tried this on kernel-3.6.1-1.fc17.x86_64, with the same issue.

[    7.704359] EDAC MC: Ver: 3.0.0
[    7.729537] general protection fault: 0000 [#1] SMP 
[    7.729769] Modules linked in: i82975x_edac(+) edac_core dcdbas parport tg3 raid1
[    7.730257] CPU 0 
[    7.730257] Pid: 395, comm: udevd Not tainted 3.6.1-1.fc17.x86_64 #1 Dell Inc.                 Precision WorkStation 390    /0DN075
[    7.730257] RIP: 0010:[<ffffffff812dffe8>]  [<ffffffff812dffe8>] strncpy+0x18/0x30
[    7.730257] RSP: 0018:ffff88007bb43b68  EFLAGS: 00010202
[    7.730257] RAX: ffff8800371e9a88 RBX: ffff8800371dd800 RCX: ffff8800371e9a88
[    7.730257] RDX: 000000000000001f RSI: 5f706f5f63616465 RDI: ffff8800371e9a88
[    7.730257] RBP: ffff88007bb43b68 R08: ffff8800371e9aa7 R09: 000000000000fffe
[    7.730257] R10: 0000000000000000 R11: ffffffff81161048 R12: 0000000000000003
[    7.730257] R13: 0000000000000000 R14: ffff8800371e9800 R15: ffff880078c07370
[    7.730257] FS:  00007f79fffaf840(0000) GS:ffff88007fa00000(0000) knlGS:0000000000000000
[    7.730257] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.730257] CR2: 00007f79ffe7a000 CR3: 0000000079390000 CR4: 00000000000007f0
[    7.730257] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.730257] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    7.730257] Process udevd (pid: 395, threadinfo ffff88007bb42000, task ffff8800372c5c40)
[    7.730257] Stack:
[    7.730257]  ffff88007bb43c18 ffffffffa00346b8 00040000372c5c40 ffff880078c07378
[    7.730257]  ffffc90000354000 ffff880078c07000 ffff88007c130000 0000000000000000
[    7.730257]  0000000000000000 0004000000000000 feda000079c25d20 0000000000000246
[    7.730257] Call Trace:
[    7.730257]  [<ffffffffa00346b8>] i82975x_init_one+0x2e6/0x3e6 [i82975x_edac]
[    7.730257]  [<ffffffff81302599>] local_pci_probe+0x79/0x100
[    7.730257]  [<ffffffff81302741>] pci_device_probe+0x121/0x130
[    7.730257]  [<ffffffff813bee4b>] driver_probe_device+0x8b/0x390
[    7.730257]  [<ffffffff813bf1fb>] __driver_attach+0xab/0xb0
[    7.730257]  [<ffffffff813bf150>] ? driver_probe_device+0x390/0x390
[    7.730257]  [<ffffffff813bcee5>] bus_for_each_dev+0x55/0x90
[    7.730257]  [<ffffffff813be7ce>] driver_attach+0x1e/0x20
[    7.730257]  [<ffffffff813be400>] bus_add_driver+0x1a0/0x290
[    7.730257]  [<ffffffffa000a000>] ? 0xffffffffa0009fff
[    7.730257]  [<ffffffffa000a000>] ? 0xffffffffa0009fff
[    7.730257]  [<ffffffff813bf8c7>] driver_register+0x77/0x170
[    7.730257]  [<ffffffffa000a000>] ? 0xffffffffa0009fff
[    7.730257]  [<ffffffff813013be>] __pci_register_driver+0x5e/0xe0
[    7.730257]  [<ffffffffa000a000>] ? 0xffffffffa0009fff
[    7.730257]  [<ffffffffa000a035>] i82975x_init+0x35/0x1000 [i82975x_edac]
[    7.730257]  [<ffffffff8100212a>] do_one_initcall+0x12a/0x180
[    7.730257]  [<ffffffff810be086>] sys_init_module+0x10f6/0x20b0
[    7.730257]  [<ffffffff812f71f0>] ? ddebug_proc_open+0xd0/0xd0
[    7.730257]  [<ffffffff816226e9>] system_call_fastpath+0x16/0x1b
[    7.730257] Code: 84 c9 75 ef 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8 4c 8d 04 17 48 89 f9 48 89 e5 74 1a 0f 1f 44 00 00 <0f> b6 16 80 fa 01 88 11 48 83 de ff 48 83 c1 01 4c 39 c1 75 eb 
[    7.730257] RIP  [<ffffffff812dffe8>] strncpy+0x18/0x30
[    7.730257]  RSP <ffff88007bb43b68>
[    7.740648] ---[ end trace b1ab65861aac905d ]---
Comment 24 Josh Boyer 2012-10-15 06:55:55 EDT
*** Bug 866423 has been marked as a duplicate of this bug. ***
Comment 25 nicolas.vieville 2012-10-16 11:01:50 EDT
Hello,

To complete comment #22 with further feedback: I had to blacklist the i7core_edac and edac_core modules on recent HP Z400 stations to avoid random issues such as the keyboard not responding once booted, or glxgears with the NVidia drivers completely freezing the system (I know that Fedora doesn't provide any support for such a driver, but judging from recent bugzilla entries, maybe some issues with recent kernels are related to this one).

None of these symptoms occur when the edac modules mentioned above are blacklisted on these stations with F-17 x86_64.

While waiting for a fix, blacklisting these modules is OK, but it would be better to avoid it, as these stations (Z400) are equipped with ECC memory modules.

Thanks for your investigations on this issue.

Cordially,


-- 
NVieville
Comment 26 Andrew 2012-10-16 12:53:04 EDT
I installed 3.6.2-3.fc17.x86_64 from the Koji site, which indicates it includes "Fix i82975x_edac OOPS".  This appears to have resolved my issue - my workstation 390 now boots normally with i82975x_edac enabled.  Sweet.
Comment 27 Mauro Carvalho Chehab 2012-10-16 12:57:38 EDT
> To complete comment #22, and feedback, I had to blacklist i7core_edac

Huh? This seems to be a completely different bug. Please open a separate bugzilla for that one. Btw, a fix for i7core_edac was posted today to the linux-edac ML and to LKML:
    https://lkml.org/lkml/2012/10/16/277
Comment 28 Mauro Carvalho Chehab 2012-10-16 13:00:26 EDT
(In reply to comment #26)
> I installed 3.6.2-3.fc17.x86_64 from the Koji site, which indicates it
> includes "Fix i82975x_edac OOPS".  This appears to have resolved my issue -
> my workstation 390 now boots normally with i82975x_edac enabled.  Sweet.

Good to know. The fix there will prevent the OOPS, but there are lots of odd things in the i82975x driver, related to the way it detects memory and reports errors.

I actually need to rewrite most of the driver in order to properly fix the issues there. I'll post the final patches here once they are ready and tested.
Comment 29 nicolas.vieville 2012-10-16 13:21:35 EDT
(In reply to comment #27)
> > To complete comment #22, and feedback, I had to blacklist i7core_edac
> 
> Huh? This seems to be a completely different bug. Please open a separate
> bugzilla for that one. 

I agree with you. But the randomness and the type of the symptoms were so close to those described here that my first reaction was to blacklist the edac modules for that platform, thinking that edac was "buggy" with the new kernel independently of the platform used. Sorry for that; as you say, I was wrong.

> Btw, a fix for i7core_edac was post today at
> linux-edac ML and at LKML:
>     https://lkml.org/lkml/2012/10/16/277

Thanks for pointing this out.

I'll wait for the fix to arrive in Fedora.


-- 
NVieville
Comment 30 Andrew 2012-10-16 13:44:13 EDT
>> I'll post here the final patches when they got ready and after some tests.

Thanks, Mauro.  My affected system is a non-production test server, so please let me know if I can be of any help - pulling info/debug, testing patches, etc.  Happy to help.
Comment 31 Mauro Carvalho Chehab 2012-10-16 22:13:51 EDT
(In reply to comment #28)
> I'm actually needing to rewrite most of the driver, in order to properly fix
> the issues there. I'll post here the final patches when they got ready and
> after some tests.

Ok, driver seems ready. I updated it on my upstream tree at:
    http://git.kernel.org/?p=linux/kernel/git/mchehab/linux-edac.git;a=summary

Koji build:
    http://koji.fedoraproject.org/koji/taskinfo?taskID=4597366
    (I just started the job - it will take some time for it to finish)

I tested on a Dell Precision n390, and it looks fine:

Memory layout, as detected by the driver:
          +-----------------------------------+
          |                mc0                |
          |  csrow0   |  csrow1   |  csrow2   |
----------+-----------------------------------+
channel1: |   512 MB  |     0 MB  |   512 MB  |
channel0: |   512 MB  |     0 MB  |   512 MB  |
----------+-----------------------------------+

(that matches the machine memory size of 2GB)

After creating the file /etc/modprobe.d/edac.conf with the following content:
    options edac_core edac_debug_level=3

and booting the debug kernel, the driver properly detects the memory:

[   20.935332] EDAC MC: Ver: 3.0.0
[   20.939474] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[   20.954466] EDAC DEBUG: i82975x_init: 
[   20.954570] EDAC DEBUG: i82975x_init_one: 
[   20.954580] EDAC DEBUG: i82975x_probe1: 
[   20.954632] EDAC i82975x: MCHBAR real = feda0000, remapped = ffffc90004cd6000
[   20.961873] EDAC i82975x: DRAM0 Rank Boundary Address: Channel A: 0x00000010; Channel B: 0x00000010
[   20.970894] EDAC i82975x: DRAM1 Rank Boundary Address: Channel A: 0x00000020; Channel B: 0x00000020
[   20.979914] EDAC i82975x: DRAM Controller mode Channel A: = 0x40204a06 (ECC enabled); Channel B: 0x40204a06 (ECC enabled)
[   20.990836] EDAC i82975x: Bank Architecture Channel A: 0x00000000, Channel B: 0x00000000
[   20.998901] EDAC i82975x: DRAM Timings :      ChA    ChB
[   21.004208] EDAC i82975x:   RAS Active Min =  15      15
[   21.009505] EDAC i82975x:   CAS latency    =   5       5
[   21.014802] EDAC i82975x:   RAS to CAS     =   5       5
[   21.020095] EDAC i82975x:   RAS precharge  =   5       5
[   21.025391] EDAC DEBUG: edac_mc_alloc: allocating 2104 bytes for mci data (8 ranks, 8 csrows/channels)
[   21.031621] EDAC DEBUG: i82975x_probe1: init mci
[   21.031625] EDAC DEBUG: i82975x_probe1: init pvt
[   21.031628] EDAC DEBUG: i82975x_init_csrows: DIMM A0: from page 0x00000000 to 0x0003fffe (size: 0x00020000 pages)
[   21.031630] EDAC DEBUG: i82975x_init_csrows: DIMM A1: from page 0x00040000 to 0x0007fffe (size: 0x00020000 pages)
[   21.031633] EDAC DEBUG: i82975x_init_csrows: DIMM B0: from page 0x00000000 to 0x0003fffe (size: 0x00020000 pages)
[   21.031635] EDAC DEBUG: i82975x_init_csrows: DIMM B1: from page 0x00040000 to 0x0007fffe (size: 0x00020000 pages)
[   21.031650] EDAC DEBUG: edac_mc_add_mc: 
[   21.031652] EDAC DEBUG: edac_mc_dump_mci: 	mci = ffff88006a39c520
[   21.031653] EDAC DEBUG: edac_mc_dump_mci: 	mci->mtype_cap = 800
[   21.031655] EDAC DEBUG: edac_mc_dump_mci: 	mci->edac_ctl_cap = 22
[   21.031656] EDAC DEBUG: edac_mc_dump_mci: 	mci->edac_cap = 22
[   21.031658] EDAC DEBUG: edac_mc_dump_mci: 	mci->nr_csrows = 4, csrows = ffff880079641c20
[   21.031660] EDAC DEBUG: edac_mc_dump_mci: 	mci->nr_dimms = 8, dimms = ffff8800719e5260
[   21.031661] EDAC DEBUG: edac_mc_dump_mci: 	dev = ffff88007ac211e0
[   21.031663] EDAC DEBUG: edac_mc_dump_mci: 	mod_name:ctl_name = i82975x_edac:i82975x
[   21.031665] EDAC DEBUG: edac_mc_dump_mci: 	pvt_info = ffff88006a39cd00
[   21.031782] EDAC DEBUG: find_mci_by_dev: 
[   21.031818] EDAC DEBUG: edac_create_sysfs_mci_device: creating bus mc0
[   21.032510] EDAC DEBUG: edac_create_sysfs_mci_device: creating device mc0
[   21.039169] EDAC DEBUG: edac_create_sysfs_mci_device: creating dimm0, located at csrow 0 channel 0 
[   21.058319] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device rank0
[   21.058323] EDAC DEBUG: edac_create_sysfs_mci_device: creating dimm1, located at csrow 0 channel 1 
[   21.058937] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device rank1
[   21.058939] EDAC DEBUG: edac_create_sysfs_mci_device: creating dimm4, located at csrow 2 channel 0 
[   21.059575] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device rank4
[   21.059577] EDAC DEBUG: edac_create_sysfs_mci_device: creating dimm5, located at csrow 2 channel 1 
[   21.060203] EDAC DEBUG: edac_create_dimm_object: creating rank/dimm device rank5
[   21.060242] EDAC DEBUG: edac_create_csrow_object: creating (virtual) csrow node csrow0
[   21.083611] EDAC DEBUG: edac_create_csrow_object: creating (virtual) csrow node csrow1
[   21.104752] EDAC DEBUG: edac_mc_workq_setup: 
[   21.104796] EDAC MC0: Giving out device to 'i82975x_edac' 'i82975x': DEV 0000:00:00.0
[   21.112896] EDAC DEBUG: i82975x_probe1: success

The above seems to match the memory configuration on this system.
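
The sizes in the debug output can be checked with a little arithmetic. This Python sketch converts the per-DIMM page counts printed by i82975x_init_csrows into MiB, assuming the usual 4 KiB x86 page size; the names in it are illustrative only.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB x86 page size

# Per-DIMM sizes in pages, as printed by i82975x_init_csrows in the log above.
DIMM_PAGES = {
    "A0": 0x00020000,
    "A1": 0x00020000,
    "B0": 0x00020000,
    "B1": 0x00020000,
}


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a page count to mebibytes."""
    return pages * page_size // (1024 * 1024)


if __name__ == "__main__":
    for name, pages in DIMM_PAGES.items():
        print(f"DIMM {name}: {pages_to_mib(pages)} MiB")
    total = sum(pages_to_mib(pages) for pages in DIMM_PAGES.values())
    print(f"total: {total} MiB")
```

Each DIMM comes out at 512 MiB and the total at 2048 MiB, in line with the four 512 MB cells in the layout table and the 2GB machine memory size.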

As I don't have any hardware error generator here, I couldn't test the error decoding logic.
Comment 33 Fedora Update System 2012-10-17 07:49:27 EDT
kernel-3.6.2-4.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.6.2-4.fc17
Comment 34 Thomas Moschny 2012-10-17 08:32:33 EDT
kernel-3.6.2-4.fc17 seems to work for me.
Comment 35 Fedora Update System 2012-10-17 08:49:30 EDT
kernel-3.6.2-2.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.6.2-2.fc18
Comment 36 Fedora Update System 2012-10-17 13:34:07 EDT
Package kernel-3.6.2-2.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.6.2-2.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-16308/kernel-3.6.2-2.fc18
then log in and leave karma (feedback).
Comment 37 Eric Gerney 2012-10-22 11:04:56 EDT
Confirmed, kernel 3.6.2-4.fc17.i686.PAE (32-bit) works as expected.  No need to exclude/blacklist i82975x_edac.ko in order to boot the system.  Many thanks!
Comment 38 Robert Wilhelm 2012-10-31 16:43:17 EDT
I have a similar problem with Fedora 16 on my Dell Precision WorkStation 390.
Will there be a kernel update for Fedora 16, too? 

[   18.162190] Pid: 571, comm: modprobe Not tainted 3.6.2-1.fc16.x86_64 #1 Dell Inc.                 Precision WorkStation 390    /0DN075
[   18.162198] RIP: 0010:[<ffffffff812dc930>]  [<ffffffff812dc930>] strncpy+0x10/0x30
[   18.162209] RSP: 0018:ffff880114059c48  EFLAGS: 00010202
[   18.162213] RAX: ffff880111ae9e90 RBX: ffff880111ae9c00 RCX: ffff880111ae9e90
[   18.162218] RDX: 000000000000001f RSI: 5f706f5f63616465 RDI: ffff880111ae9e90
[   18.162223] RBP: ffff880114059c48 R08: 0000ffffffffff0a R09: 0000000000000000
[   18.162228] R10: 000000000000fffe R11: 0000000000000008 R12: 0000000000000003
[   18.162232] R13: 0000000000000006 R14: ffff880111af5800 R15: ffff8801118f9000
[   18.162238] FS:  00007f8cb8e7a700(0000) GS:ffff88011bc00000(0000) knlGS:0000000000000000
[   18.162243] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.162248] CR2: 00007f8cb8e99000 CR3: 000000011140d000 CR4: 00000000000007f0
[   18.162253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   18.162258] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   18.162263] Process modprobe (pid: 571, threadinfo ffff880114058000, task ffff880110968000)
[   18.162268] Stack:
[   18.162270]  ffff880114059d08 ffffffffa0308695 ffff880114059d28 ffff8801154410e8
[   18.162280]  0004000010968000 ffffc90004fa6000 ffff880115438000 0000000000000000
[   18.162289]  0000000000000000 0004000000000000 ffff880114059cf8 00000003811ff9bf
[   18.162297] Call Trace:
[   18.162303]  [<ffffffffa0308695>] i82975x_init_one+0x2d3/0x3d1 [i82975x_edac]
[   18.162312]  [<ffffffff813c8c52>] ? __pm_runtime_set_status+0x142/0x220
[   18.162319]  [<ffffffff812fd2a9>] local_pci_probe+0x79/0x100
[   18.162324]  [<ffffffff812febf9>] pci_device_probe+0x109/0x130
[   18.162330]  [<ffffffff813bd161>] driver_probe_device+0x91/0x3b0
[   18.162336]  [<ffffffff813bd52b>] __driver_attach+0xab/0xb0
[   18.162341]  [<ffffffff813bd480>] ? driver_probe_device+0x3b0/0x3b0
[   18.162347]  [<ffffffff813bb356>] bus_for_each_dev+0x56/0x90
[   18.162352]  [<ffffffff813bcbee>] driver_attach+0x1e/0x20
[   18.162357]  [<ffffffff813bc6f0>] bus_add_driver+0x1a0/0x2c0
[   18.162362]  [<ffffffffa0006000>] ? 0xffffffffa0005fff
[   18.162367]  [<ffffffff813bda7a>] driver_register+0x7a/0x160
[   18.162372]  [<ffffffffa0006000>] ? 0xffffffffa0005fff
[   18.162377]  [<ffffffff812fe8d6>] __pci_register_driver+0x56/0xd0
[   18.162383]  [<ffffffffa0006033>] i82975x_init+0x33/0x1000 [i82975x_edac]
[   18.162390]  [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
[   18.162397]  [<ffffffff810bfd3e>] sys_init_module+0xbe/0x230
[   18.162404]  [<ffffffff816211a9>] system_call_fastpath+0x16/0x1b
[   18.162408] Code: 0c 10 48 83 c2 01 84 c9 75 f1 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8 48 89 e5 74 1c 48 89 f9 90 <0f> b6 3e 40 80 ff 01 40 88 39 48 83 de ff 48 83 c1 01 48 83 ea 
[   18.162483] RIP  [<ffffffff812dc930>] strncpy+0x10/0x30
[   18.162489]  RSP <ffff880114059c48>
[   18.164476] ---[ end trace 47cf241bd866fb9b ]---
[   18.164690] udevd[551]: '/sbin/modprobe -bv pci:v00008086d0000277Csv00001028sd000001DEbc06sc00i00' [571] terminated by signal 11 (Segmentation fault)
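The modalias string in the last udevd line identifies the device that triggered the module load. The Python sketch below splits it into its fields, following the kernel's `pci:v…d…sv…sd…bc…sc…i…` modalias layout; the function and field names are chosen for illustration.

```python
import re

# Layout of a PCI modalias: vendor, device, subsystem vendor, subsystem device,
# base class, sub class, programming interface (all upper-case hex).
MODALIAS = re.compile(
    r"^pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<interface>[0-9A-F]{2})$"
)


def parse_pci_modalias(alias: str) -> dict:
    """Split a PCI modalias string into integer fields."""
    match = MODALIAS.match(alias)
    if match is None:
        raise ValueError(f"not a PCI modalias: {alias!r}")
    return {key: int(value, 16) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    fields = parse_pci_modalias(
        "pci:v00008086d0000277Csv00001028sd000001DEbc06sc00i00"
    )
    for key, value in fields.items():
        print(f"{key}: 0x{value:04x}")
```

Vendor 0x8086 is Intel and subsystem vendor 0x1028 is Dell; base class 0x06 with sub class 0x00 is a host bridge, which matches the 82975X Memory Controller Hub at 00:00.0 in the lspci listing from the original description.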
Comment 39 Justin M. Forbes 2012-10-31 23:43:34 EDT
Yes, this is in 3.6.5-2.fc16, which is building now.
Comment 40 Robert Wilhelm 2012-11-01 03:29:59 EDT
3.6.5-2.fc16.x86_64 from koji seems to work for me. Many thanks for the fast response.
Comment 41 Fedora Update System 2012-11-01 09:23:08 EDT
kernel-3.6.5-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.6.5-2.fc16
Comment 42 Fedora Update System 2012-11-06 09:21:46 EST
kernel-3.6.6-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.6.6-1.fc16
Comment 43 Fedora Update System 2012-12-20 10:14:22 EST
kernel-3.6.6-1.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.