Created attachment 354792 [details] dmesg Description of problem: System crashes randomly during resync of software Raid5 array. After reboot, resync restarts and mostly crashes again. Sometimes it finishes (luck??) Version-Release number of selected component (if applicable): mdadm version 0.90 How reproducible: 9 out of 10 times Steps to Reproduce: 1. Crash kernel (or unclean shutdown to get Raid array to resync 2. Reboot 3. wait Actual results: Kernel failure message 1: general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/virtual/block/md0/md/sync_speed CPU 1 Modules linked in: fuse it87 hwmon_vid sunrpc ipv6 nf_conntrack_ftp dm_multipath raid456 async_xor async_memcpy async_tx xor uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec firewire_ohci snd_hwdep k8temp firewire_core hwmon snd_pcm pcspkr snd_timer crc_itu_t serio_raw snd soundcore snd_page_alloc i2c_piix4 usb_storage shpchp sky2 ata_generic pata_acpi pata_atiixp radeon drm i2c_algo_bit i2c_core [last unloaded: freq_table] Pid: 1126, comm: md0_raid5 Not tainted 2.6.29.6-213.fc11.x86_64 #1 Unknow RIP: 0010:[<ffffffffa01334cb>] [<ffffffffa01334cb>] xor_sse_4+0x7a/0x32a [xor] RSP: 0018:ffff88007083bae0 EFLAGS: 00010212 RAX: ffff88007106d000 RBX: 0000000000000000 RCX: 0000000000000010 RDX: 6dff88007106e000 RSI: ffff88007106f000 RDI: 0000000000001000 RBP: ffff88007083bb20 R08: ffff88007106c000 R09: 0000000000000100 R10: ffff88007083bae0 R11: 000000008005003b R12: 0000000000000003 R13: 0000000000000003 R14: ffff88007083bca8 R15: 0000000000000004 FS: 00007f42f3a706f0(0000) GS:ffff880077801d00(0000) knlGS:00000000f71ffb70 CS: 0010 DS: 0018 ES: 0018 CR0: 0000000080050033 CR2: 000000000042c510 CR3: 0000000068dd5000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process md0_raid5 (pid: 1126, threadinfo ffff88007083a000, task ffff8800712f5c00) Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffff88007083bb30 ffffffffa0133bc0 ffff88007083bbd0 ffffffffa017d348 Call Trace: [<ffffffffa0133bc0>] xor_blocks+0x57/0x63 [xor] [<ffffffffa017d348>] async_xor+0x2e9/0x30e [async_xor] [<ffffffff811b3a58>] ? kobject_release+0x0/0x66 [<ffffffff81011f67>] ? restore_args+0x0/0x30 [<ffffffffa017d4da>] async_xor_zero_sum+0x16d/0x1ff [async_xor] [<ffffffffa01feb27>] ? release_stripe+0x40/0x49 [raid456] [<ffffffffa01febff>] ? ops_complete_check+0x0/0x1c [raid456] [<ffffffffa020110c>] ops_run_check+0xa6/0xc7 [raid456] [<ffffffffa0202af3>] handle_stripe+0x19c6/0x1a66 [raid456] [<ffffffff81039742>] ? __wake_up+0x44/0x4d [<ffffffffa0203ddc>] raid5d+0x40c/0x459 [raid456] [<ffffffff811b3a58>] ? kobject_release+0x0/0x66 [<ffffffff812d65ae>] md_thread+0xf1/0x10f [<ffffffff8105c8d7>] ? autoremove_wake_function+0x0/0x39 [<ffffffff812d64bd>] ? md_thread+0x0/0x10f [<ffffffff8105c541>] kthread+0x4d/0x78 [<ffffffff8101264a>] child_rip+0xa/0x20 [<ffffffff81011f67>] ? restore_args+0x0/0x30 [<ffffffff8105c4f4>] ? kthread+0x0/0x78 [<ffffffff81012640>] ? child_rip+0x0/0x20 Code: 0f 18 82 00 01 00 00 0f 18 82 20 01 00 00 0f 28 06 0f 28 4e 10 0f 28 56 20 0f 28 5e 30 0f 18 80 00 01 00 00 0f 18 80 20 01 00 00 <0f> 57 02 0f 57 4a 10 0f 57 52 20 0f 57 5a 30 41 0f 18 80 00 01 RIP [<ffffffffa01334cb>] xor_sse_4+0x7a/0x32a [xor] RSP <ffff88007083bae0> ---[ end trace ae2e91c45cd73027 ]--- Expected results: Completed resync Additional info: Tried with several kernels. acpi=off and or noapic. same result
This is most likely a hardware error. A raid resync is both a CPU intensive and memory intensive operation. It can easily expose marginal memory. The fact that the check sometimes completes but most times does not, but also fails at random locations, is a strong indication that this is related to memory or CPU and not the code (that and the fact that it works properly on all my test systems 100% of the time also lends to that conclusion).
What are the best ways to test the hardware cause? Currently running memory test (fedora dvd) for one memory bank at a time. (How long to run?). Memory is cheap to buy, so fingers crossed this produces an error. If need to replace, better of with 533Mhz CL4 or 667Mhz CL5?? What is the best way to test CPU? I did notice that during resync only one core of the CPU was demanded... (IRQ22) [root@linuxserver ~]# cat /proc/interrupts CPU0 CPU1 0: 132 1 IO-APIC-edge timer 1: 0 2 IO-APIC-edge i8042 4: 0 2 IO-APIC-edge 8: 0 1 IO-APIC-edge rtc0 9: 0 9 IO-APIC-fasteoi acpi 14: 5 4707 IO-APIC-edge pata_atiixp 15: 0 0 IO-APIC-edge pata_atiixp 16: 0 291 IO-APIC-fasteoi ohci_hcd:usb2, HDA Intel 17: 161 144 IO-APIC-fasteoi ohci_hcd:usb3, ohci_hcd:usb5 18: 0 1 IO-APIC-fasteoi ohci_hcd:usb4, ohci_hcd:usb6, radeon 19: 4725 125 IO-APIC-fasteoi ehci_hcd:usb1, HDA Intel 20: 0 5 IO-APIC-fasteoi firewire_ohci 22: 3 1117082 IO-APIC-fasteoi ahci 25: 376 213 PCI-MSI-edge eth0 NMI: 0 0 Non-maskable interrupts LOC: 153546 132361 Local timer interrupts RES: 221445 19051 Rescheduling interrupts CAL: 212 57 Function call interrupts TLB: 377 214 TLB shootdowns TRM: 0 0 Thermal event interrupts THR: 0 0 Threshold APIC interrupts SPU: 0 0 Spurious interrupts ERR: 0 MIS: 0 [root@linuxserver ~]#
I can't really say which memory speed is better, I haven't tested it. But, I would assume 667MHz would do better even with a slightly higher CL latency. As far as only one core handling the ahci interrupt, that doesn't mean only that core did the work. The md resync thread is not tied to the ahci interrupt and can run on either core regardless of the interrupt routing. That thread runs in kernel thread context, not interrupt context. As for testing, in order to test ram as best possible, I've had a memory test script on my web page for probably 10 years now. Got to http://people.redhat.com/dledford and click on the memory tester link on the left. That script will exercise your ram and show you if there are any errors.
Nothing new in a couple months, closing this bug out.