513297 – md_resync fails into softlock

Bug 513297 - md_resync fails into softlock

Summary: md_resync fails into softlock

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mdadm
Sub Component:
Version:	11
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-07-22 21:53 UTC by Jacco
Modified:	2009-09-15 18:27 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-09-15 18:27:58 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
dmesg (38.28 KB, text/plain) 2009-07-22 21:53 UTC, Jacco	no flags	Details
View All

Description Jacco 2009-07-22 21:53:07 UTC

Created attachment 354792 [details]
dmesg

Description of problem:

System crashes randomly during resync of software Raid5 array. After reboot, resync restarts and mostly crashes again. Sometimes it finishes (luck??)

Version-Release number of selected component (if applicable):

mdadm version 0.90

How reproducible:

9 out of 10 times

Steps to Reproduce:
1. Crash kernel (or unclean shutdown to get Raid array to resync
2. Reboot
3. wait
  
Actual results:

Kernel failure message 1:
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/md0/md/sync_speed
CPU 1 
Modules linked in: fuse it87 hwmon_vid sunrpc ipv6 nf_conntrack_ftp dm_multipath raid456 async_xor async_memcpy async_tx xor uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec firewire_ohci snd_hwdep k8temp firewire_core hwmon snd_pcm pcspkr snd_timer crc_itu_t serio_raw snd soundcore snd_page_alloc i2c_piix4 usb_storage shpchp sky2 ata_generic pata_acpi pata_atiixp radeon drm i2c_algo_bit i2c_core [last unloaded: freq_table]
Pid: 1126, comm: md0_raid5 Not tainted 2.6.29.6-213.fc11.x86_64 #1 Unknow
RIP: 0010:[<ffffffffa01334cb>]  [<ffffffffa01334cb>] xor_sse_4+0x7a/0x32a [xor]
RSP: 0018:ffff88007083bae0  EFLAGS: 00010212
RAX: ffff88007106d000 RBX: 0000000000000000 RCX: 0000000000000010
RDX: 6dff88007106e000 RSI: ffff88007106f000 RDI: 0000000000001000
RBP: ffff88007083bb20 R08: ffff88007106c000 R09: 0000000000000100
R10: ffff88007083bae0 R11: 000000008005003b R12: 0000000000000003
R13: 0000000000000003 R14: ffff88007083bca8 R15: 0000000000000004
FS:  00007f42f3a706f0(0000) GS:ffff880077801d00(0000) knlGS:00000000f71ffb70
CS:  0010 DS: 0018 ES: 0018 CR0: 0000000080050033
CR2: 000000000042c510 CR3: 0000000068dd5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md0_raid5 (pid: 1126, threadinfo ffff88007083a000, task ffff8800712f5c00)
Stack:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
ffff88007083bb30 ffffffffa0133bc0 ffff88007083bbd0 ffffffffa017d348
Call Trace:
[<ffffffffa0133bc0>] xor_blocks+0x57/0x63 [xor]
[<ffffffffa017d348>] async_xor+0x2e9/0x30e [async_xor]
[<ffffffff811b3a58>] ? kobject_release+0x0/0x66
[<ffffffff81011f67>] ? restore_args+0x0/0x30
[<ffffffffa017d4da>] async_xor_zero_sum+0x16d/0x1ff [async_xor]
[<ffffffffa01feb27>] ? release_stripe+0x40/0x49 [raid456]
[<ffffffffa01febff>] ? ops_complete_check+0x0/0x1c [raid456]
[<ffffffffa020110c>] ops_run_check+0xa6/0xc7 [raid456]
[<ffffffffa0202af3>] handle_stripe+0x19c6/0x1a66 [raid456]
[<ffffffff81039742>] ? __wake_up+0x44/0x4d
[<ffffffffa0203ddc>] raid5d+0x40c/0x459 [raid456]
[<ffffffff811b3a58>] ? kobject_release+0x0/0x66
[<ffffffff812d65ae>] md_thread+0xf1/0x10f
[<ffffffff8105c8d7>] ? autoremove_wake_function+0x0/0x39
[<ffffffff812d64bd>] ? md_thread+0x0/0x10f
[<ffffffff8105c541>] kthread+0x4d/0x78
[<ffffffff8101264a>] child_rip+0xa/0x20
[<ffffffff81011f67>] ? restore_args+0x0/0x30
[<ffffffff8105c4f4>] ? kthread+0x0/0x78
[<ffffffff81012640>] ? child_rip+0x0/0x20
Code: 0f 18 82 00 01 00 00 0f 18 82 20 01 00 00 0f 28 06 0f 28 4e 10 0f 28 56 20 0f 28 5e 30 0f 18 80 00 01 00 00 0f 18 80 20 01 00 00 <0f> 57 02 0f 57 4a 10 0f 57 52 20 0f 57 5a 30 41 0f 18 80 00 01 
RIP  [<ffffffffa01334cb>] xor_sse_4+0x7a/0x32a [xor]
RSP <ffff88007083bae0>
---[ end trace ae2e91c45cd73027 ]---

Expected results:

Completed resync

Additional info:

Tried with several kernels. acpi=off and or noapic. same result

Comment 1 Doug Ledford 2009-07-25 14:37:02 UTC

This is most likely a hardware error.  A raid resync is both a CPU intensive and memory intensive operation.  It can easily expose marginal memory.  The fact that the check sometimes completes but most times does not, but also fails at random locations, is a strong indication that this is related to memory or CPU and not the code (that and the fact that it works properly on all my test systems 100% of the time also lends to that conclusion).

Comment 2 Jacco 2009-07-26 00:27:23 UTC

What are the best ways to test the hardware cause? Currently running memory test (fedora dvd) for one memory bank at a time. (How long to run?). Memory is cheap to buy, so fingers crossed this produces an error. If need to replace, better of with 533Mhz CL4 or 667Mhz CL5??

What is the best way to test CPU? I did notice that during resync only one core of the CPU was demanded... (IRQ22)

[root@linuxserver ~]# cat /proc/interrupts
           CPU0       CPU1
  0:        132          1   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  4:          0          2   IO-APIC-edge
  8:          0          1   IO-APIC-edge      rtc0
  9:          0          9   IO-APIC-fasteoi   acpi
 14:          5       4707   IO-APIC-edge      pata_atiixp
 15:          0          0   IO-APIC-edge      pata_atiixp
 16:          0        291   IO-APIC-fasteoi   ohci_hcd:usb2, HDA Intel
 17:        161        144   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb5
 18:          0          1   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb6, radeon
 19:       4725        125   IO-APIC-fasteoi   ehci_hcd:usb1, HDA Intel
 20:          0          5   IO-APIC-fasteoi   firewire_ohci
 22:          3    1117082   IO-APIC-fasteoi   ahci
 25:        376        213   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:     153546     132361   Local timer interrupts
RES:     221445      19051   Rescheduling interrupts
CAL:        212         57   Function call interrupts
TLB:        377        214   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
[root@linuxserver ~]#

Comment 3 Doug Ledford 2009-07-27 13:48:11 UTC

I can't really say which memory speed is better, I haven't tested it.  But, I would assume 667MHz would do better even with a slightly higher CL latency.

As far as only one core handling the ahci interrupt, that doesn't mean only that core did the work.  The md resync thread is not tied to the ahci interrupt and can run on either core regardless of the interrupt routing.  That thread runs in kernel thread context, not interrupt context.

As for testing, in order to test ram as best possible, I've had a memory test script on my web page for probably 10 years now.  Got to http://people.redhat.com/dledford and click on the memory tester link on the left.  That script will exercise your ram and show you if there are any errors.

Comment 4 Doug Ledford 2009-09-15 18:27:58 UTC

Nothing new in a couple months, closing this bug out.

Note You need to log in before you can comment on or make changes to this bug.