Bug 575402 - mdraid check causes data read corruption, massive CPU load, 100+ "async" processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: i686
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 575897
Depends On:
Blocks:
 
Reported: 2010-03-20 14:44 UTC by Brian
Modified: 2010-08-05 18:26 UTC
CC List: 16 users

Fixed In Version: 2.6.32.12-114.fc12
Clone Of:
Environment:
Last Closed: 2010-08-05 18:26:25 UTC
Type: ---
Embargoed:


Attachments: none

Description Brian 2010-03-20 14:44:05 UTC
Description of problem:
Nagios alerts are being tripped by 100+ processes named "async". It appears to be similar to what is being discussed here:
   http://lists.mandriva.com/kernel-discuss/2009-12/msg00004.php

Version-Release number of selected component (if applicable):
kernel-2.6.32.9-70.fc12.i686

How reproducible:
I am unable to identify the exact trigger for this.

Steps to Reproduce:
1.  Create a RAID5 array (an illustrative mdadm command is sketched below)
2.  Configure the RAID5 space for use with BackupPC, MythTV, Samba, Apache, etc.
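
For illustration, creating a comparable array might look something like this (a rough sketch only; the device names, member count, and volume group name are assumptions, not the exact setup from this report):

    # build a 5-member RAID5 out of one partition per disk (illustrative names)
    mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[abcde]1
    # then layer LVM or a filesystem on top, for example:
    pvcreate /dev/md0 && vgcreate vg_data /dev/md0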
  
Actual results:
The "async" processes don't seem to take up a bunch of CPU by the time I get to the machine and run top .

Expected results:
100+ processes not popping up at random times.

Additional info:
CONFIG_MULTICORE_RAID456 in the kernel config may be the cause.  Can anyone confirm?  (This is a lazy bug report, I know.)
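
If anyone wants to check whether the option is enabled in the kernel they're running, something like this should do it (assuming the stock Fedora config file under /boot):

    grep MULTICORE_RAID456 /boot/config-$(uname -r)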

Comment 1 Brian 2010-03-21 18:30:16 UTC
/etc/cron.weekly/99-raid-check not only launches these "async" processes, but the check takes an extraordinary amount of time and the load averages are way out of control, over 40.
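
For reference, the weekly cron job essentially just kicks off an md scrub through sysfs; a simplified sketch of what it does (not the script's exact contents) is:

    # ask the md layer to start a consistency check on each array
    for md in /sys/block/md*/md; do
        echo check > "$md/sync_action"
    done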

It got to the point that Samba was sending out corrupted data to Windows clients.  The clients weren't writing over SMB, so I don't know whether the corruption occurs during writes too, but this is out of control.

A reboot stopped the check and restored sanity.

Comment 2 Brian 2010-03-21 18:59:37 UTC
Additionally, btrfs reports checksum failures during this RAID check.

These occurrences are extensive, but here are two as an example:

Mar 21 11:29:45 hq1 kernel: btrfs: dm-3 checksum verify failed on 39846834176 wanted 474382DF found 4E0AFD5E level 0
Mar 21 11:29:45 hq1 kernel: btrfs: dm-3 checksum verify failed on 39846834176 wanted 474382DF found 4E0AFD5E level 0

Comment 3 Devlin Null 2010-03-22 11:03:41 UTC
Ditto for me with the same kernel and RAID5 configuration.

md7 : active raid5 sdc9[0] sdb9[2] sda9[1]
      1216650368 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [===================>.]  check = 98.3% (598230400/608325184) finish=32.6min speed=5148K/sec

# uptime
 10:57:26 up 18:51,  2 users,  load average: 48.80, 31.72, 31.50

# ps auxwww | grep async | wc -l
235

This has only started occurring recently, after I last updated the kernel. I also removed "maxcpus=1" from the kernel parameters, so this could well be related to MULTICORE_RAID456 as Brian suggests.

Comment 4 Devlin Null 2010-03-22 13:13:12 UTC
Doesn't seem to improve even after specifying 'maxcpus=1' and rebooting:

$ uptime
 13:06:16 up  1:48,  2 users,  load average: 102.20, 102.18, 58.90
$ !ps
ps auxwww | grep sync
root        10  0.0  0.0      0     0 ?        S    11:17   0:04 [async/mgr]
root        12  0.0  0.0      0     0 ?        S    11:17   0:00 [sync_supers]
root     29358  0.0  0.0      0     0 ?        S    12:58   0:00 [md7_resync]
root     29360  0.0  0.0      0     0 ?        S    12:58   0:00 [md4_resync]
root     29361  0.0  0.0      0     0 ?        S    12:58   0:00 [md3_resync]
root     29362  0.0  0.0      0     0 ?        S    12:58   0:00 [md2_resync]
root     29363  0.2  0.0      0     0 ?        D    12:58   0:01 [md1_resync]
root     29364  0.0  0.0      0     0 ?        S    12:58   0:00 [md0_resync]
root     30079  0.3  0.0      0     0 ?        S    13:00   0:01 [async/0]
root     30080  0.2  0.0      0     0 ?        S    13:00   0:01 [async/1]
root     30081  0.2  0.0      0     0 ?        S    13:00   0:00 [async/2]
root     30082  0.2  0.0      0     0 ?        S    13:00   0:00 [async/3]
root     30083  0.2  0.0      0     0 ?        S    13:00   0:00 [async/4]
root     30084  0.2  0.0      0     0 ?        S    13:00   0:00 [async/5]
...
root     30311  0.1  0.0      0     0 ?        S    13:00   0:00 [async/232]
root     30312  0.2  0.0      0     0 ?        S    13:00   0:00 [async/233]
root     30313  1.3  0.0      0     0 ?        S    13:00   0:05 [async/234]

This occurred while unrar'ing a large file.

I have reverted to 2.6.31.12-174.2.22.fc12.i686 and all is ok again.

Comment 5 Brian 2010-03-28 23:04:43 UTC
I'm writing to confirm write corruption of data that wasn't even on the array.  I too reverted to 2.6.31.12-174.2.22.fc12.i686.

I don't know whether anyone with any situational authority has picked up on this yet, but it's a BFD.

Comment 6 Devlin Null 2010-04-06 11:00:22 UTC
It looks like it's just us two then Brian!

Is there anything common to our respective setups that might help to narrow this one down?

I am running a Shuttle SA76G2 with an Athlon 64 X2 4400+ processor and 3 x WDC WD6400AADS Caviar Green disks in a software RAID-5 configuration. I run Samba, Apache HTTPD, Sendmail, Mailman.

I was wondering if it might be worth upgrading to 2.6.32.10-90.fc12 but can't see anything immediately obvious in the ChangeLog that addresses this issue.

Comment 7 Brian 2010-04-06 14:48:22 UTC
The only significant thing I see is that we're both running the 32-bit distribution.

This is an nForce2 board, Asus I think, with an AMD Athlon(tm) XP 3200+ and a 5-disk RAID5. The disks are:
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST3500320AS
Serial Number:    9QM2B8HN
Firmware Version: SD1A

I'm running all kinds of services with LVM on top of the RAID5.

I can't understand why this thread isn't flooded with "dittos".  This system is completely unstable with the Fedora build of the 2.6.32 kernel.

I tried setting CONFIG_MULTICORE_RAID456 to "n" and rebuilding the kernel RPM, but wasn't successful.  I'll try again another day.
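
In case it helps anyone else attempting the rebuild, the general approach should look roughly like this (a sketch assuming the stock Fedora 12 kernel SRPM layout and ~/rpmbuild as the build tree; the exact config-* fragment holding the option may differ):

    yumdownloader --source kernel
    rpm -ivh kernel-2.6.32.*.fc12.src.rpm
    cd ~/rpmbuild/SOURCES
    # flip CONFIG_MULTICORE_RAID456=y to "not set" in the config fragment
    sed -i 's/^CONFIG_MULTICORE_RAID456=y/# CONFIG_MULTICORE_RAID456 is not set/' config-generic
    cd ../SPECS && rpmbuild -bb --target i686 kernel.spec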

Comment 8 Chuck Ebbert 2010-04-07 05:17:43 UTC
*** Bug 575897 has been marked as a duplicate of this bug. ***

Comment 9 Chuck Ebbert 2010-04-07 06:16:36 UTC
Disabled CONFIG_MULTICORE_RAID456 in kernel-2.6.32.11-101.fc12

Comment 10 Jeremy Sanders 2010-04-12 10:02:00 UTC
Would it be possible to get a build of this kernel so we can try it out? I can't see it in koji: http://koji.fedoraproject.org/koji/packageinfo?packageID=8

Comment 11 jfceklosky 2010-04-16 01:50:35 UTC
I have another instance of the multicore failure.  I think this option should not be turned on until it is debugged.

I am running the latest Fedora kernel on FC12.  I get about 100 "async" processes during a resync, cp, or rsync to the RAID device.  Sometimes it locks the machine up and the load is sky-high on the box.

I am running this on an Intel SS4200-E with 4 SATA drives in a RAID5 setup.

I think this problem started in 2.6.32, when the MULTICORE option was added and Fedora decided to turn it on.

I am rebuilding the latest Fedora kernel with the option off and will retest on the same hardware.

Comment 12 jfceklosky 2010-04-16 10:17:20 UTC
I retested with the option removed from the Fedora kernel; the "async" processes are gone and the problem has disappeared.

Please remove this option from the compiled Fedora kernels.

We also need to get this lock-up information to the developer.

Comment 13 Fedora Update System 2010-04-28 04:35:36 UTC
kernel-2.6.32.12-114.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.12-114.fc12

Comment 14 Devlin Null 2010-05-12 11:29:34 UTC
I updated to kernel-2.6.32.12-114.fc12 and it performs well for me.

I have completed a full mdraid check on my software RAID and there was no sign of the multiple async issue.

Disabling CONFIG_MULTICORE_RAID456 would indeed appear to have fixed this issue. Thanks Chuck.

Comment 15 Fedora Update System 2010-05-17 05:50:07 UTC
kernel-2.6.32.12-115.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.12-115.fc12

Comment 16 Fedora Update System 2010-05-18 21:58:55 UTC
kernel-2.6.32.12-115.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 17 Doug Ledford 2010-06-08 16:24:28 UTC
*** Bug 596490 has been marked as a duplicate of this bug. ***

Comment 18 Brian 2010-06-13 23:50:44 UTC
Has anyone else noticed any corruption of the RAID array as a result of this?  Is destroying and rebuilding the array required after being affected?

I haven't gotten any notices from the md modules, but no matter what filesystem I use, the BackupPC volume, which sits on top of LVM, on top of my RAID5, always has corruption issues.  No one else seems to have these corruption issues with BackupPC though, and again, I've used btrfs, ext4, and XFS.
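
For reference, one way to sanity-check the array itself is another md scrub plus the mismatch counter (a minimal sketch assuming the standard md sysfs interface; md0 is just an example device name):

    echo check > /sys/block/md0/md/sync_action   # start a consistency check
    cat /proc/mdstat                             # watch its progress
    cat /sys/block/md0/md/mismatch_cnt           # non-zero afterwards means inconsistent stripes were found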

To add to the difficulty in diagnosing my problem, the 5-disk RAID5 is made of the Seagate ST3500320AS gems that caused Seagate the biggest PR disaster I can recall for a hard drive manufacturer.  I'd updated all of the firmware before constructing this array, though.  Still, some have been skeptical of the update.

BackupPC doesn't seem to be the only issue, though.  Some Photoshop files on the drive report JPEG corruption inside the file, though they still seem to open fine.  Other files are now failing to copy, and I just don't trust the array anymore.

Can anyone come up with an explanation of how this might have happened and suggest a process to regain stability?  At this point, I'm leaning most heavily towards ZFS-FUSE, taking the hit on speed as long as it can both help me diagnose a possible drive issue and still provide reliability.

Comment 19 jbielans 2010-07-06 14:35:29 UTC
For your information, bug 596490 still exists. It came up when tests for Red Hat Beta 2 x86_64 were executed.

Comment 20 Michal Schmidt 2010-08-05 12:03:27 UTC
CONFIG_MULTICORE_RAID456 is still enabled in master, f14 and f13. Please disable it on all branches.

Comment 21 Chuck Ebbert 2010-08-05 18:26:25 UTC
(In reply to comment #20)
> CONFIG_MULTICORE_RAID456 is still enabled in master, f14 and f13. Please
> disable it on all branches.    

Done.

