Description of problem:
Nagios is being tripped because of 100+ processes named "async". It appears to be similar to what is being discussed here:
http://lists.mandriva.com/kernel-discuss/2009-12/msg00004.php

Version-Release number of selected component (if applicable):
kernel-2.6.32.9-70.fc12.i686

How reproducible:
I am unable to identify the exact trigger for this.

Steps to Reproduce:
1. Create a RAID5 array (see the mdadm sketch at the end of this report)
2. Configure the RAID5 space for use with BackupPC, MythTV, Samba, Apache, etc.
3.

Actual results:
The "async" processes don't seem to be using much CPU by the time I get to the machine and run top.

Expected results:
100+ processes not popping up at random times.

Additional info:
CONFIG_MULTICORE_RAID456 in the kernel config may be the cause. Can anyone confirm? (This is a lazy bug report, I know.)
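A minimal sketch of step 1 above, with hypothetical device names (the actual partitions on my machine differ):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1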
The /etc/cron.weekly/99-raid-check job not only launches these async processes, but the check takes an extraordinary amount of time and the load goes way out of control, over 40. It got to the point that Samba was sending out corrupted data to Windows clients. The clients weren't writing to smb, so I don't know whether the corruption occurs during writes too, but this is out of control. A reboot stopped the check and restored sanity.
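For reference, the weekly cron job effectively drives an md scrub through sysfs, so the check can be started or cancelled by hand without rebooting (md0 below is just a stand-in for the actual array device):

echo check > /sys/block/md0/md/sync_action   # what the weekly raid-check job triggers
echo idle  > /sys/block/md0/md/sync_action   # aborts a running check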
Additionally, btrfs reports checksum failures during this raid check. These occurrences are extensive, but here are two as an example:

Mar 21 11:29:45 hq1 kernel: btrfs: dm-3 checksum verify failed on 39846834176 wanted 474382DF found 4E0AFD5E level 0
Mar 21 11:29:45 hq1 kernel: btrfs: dm-3 checksum verify failed on 39846834176 wanted 474382DF found 4E0AFD5E level 0
Ditto for me with the same kernel and RAID5 configuration.

md7 : active raid5 sdc9[0] sdb9[2] sda9[1]
      1216650368 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [===================>.]  check = 98.3% (598230400/608325184) finish=32.6min speed=5148K/sec

# uptime
 10:57:26 up 18:51,  2 users,  load average: 48.80, 31.72, 31.50
# ps auxwww | grep async | wc -l
235

This has only occurred very recently, after I last updated the kernel. I also removed "maxcpus=1" from the kernel parameters, so this could well be related to MULTICORE_RAID456, as Brian suggests.
Doesn't seem to improve even after specifying 'maxcpus=1' and rebooting:

$ uptime
 13:06:16 up  1:48,  2 users,  load average: 102.20, 102.18, 58.90
$ !ps
ps auxwww | grep sync
root        10  0.0  0.0      0     0 ?        S    11:17   0:04 [async/mgr]
root        12  0.0  0.0      0     0 ?        S    11:17   0:00 [sync_supers]
root     29358  0.0  0.0      0     0 ?        S    12:58   0:00 [md7_resync]
root     29360  0.0  0.0      0     0 ?        S    12:58   0:00 [md4_resync]
root     29361  0.0  0.0      0     0 ?        S    12:58   0:00 [md3_resync]
root     29362  0.0  0.0      0     0 ?        S    12:58   0:00 [md2_resync]
root     29363  0.2  0.0      0     0 ?        D    12:58   0:01 [md1_resync]
root     29364  0.0  0.0      0     0 ?        S    12:58   0:00 [md0_resync]
root     30079  0.3  0.0      0     0 ?        S    13:00   0:01 [async/0]
root     30080  0.2  0.0      0     0 ?        S    13:00   0:01 [async/1]
root     30081  0.2  0.0      0     0 ?        S    13:00   0:00 [async/2]
root     30082  0.2  0.0      0     0 ?        S    13:00   0:00 [async/3]
root     30083  0.2  0.0      0     0 ?        S    13:00   0:00 [async/4]
root     30084  0.2  0.0      0     0 ?        S    13:00   0:00 [async/5]
...
root     30311  0.1  0.0      0     0 ?        S    13:00   0:00 [async/232]
root     30312  0.2  0.0      0     0 ?        S    13:00   0:00 [async/233]
root     30313  1.3  0.0      0     0 ?        S    13:00   0:05 [async/234]

This occurred while unrar'ing a large file. I have reverted to 2.6.31.12-174.2.22.fc12.i686 and all is OK again.
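For anyone else trying the same workaround: the parameter goes on the kernel line in /boot/grub/grub.conf (GRUB legacy, as shipped with F12). The entry below is only illustrative; the kernel version, root= value and other options will differ on your machine:

title Fedora (2.6.32.9-70.fc12.i686)
        root (hd0,0)
        kernel /vmlinuz-2.6.32.9-70.fc12.i686 ro root=/dev/VolGroup00/LogVol00 maxcpus=1
        initrd /initramfs-2.6.32.9-70.fc12.i686.img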
I'm writing to confirm data write corruption, affecting data that wasn't even on the array. I too reverted to 2.6.31.12-174.2.22.fc12.i686. I don't know if anyone with any situational authority has picked up on this yet or not, but it's a BFD.
It looks like it's just us two then, Brian! Is there anything common to our respective setups that might help to narrow this one down?

I am running a Shuttle SA76G2 with an Athlon 64 X2 4400+ processor and 3 x WDC WD6400AADS Caviar Green disks in a software RAID-5 configuration. I run Samba, Apache HTTPD, Sendmail, and Mailman.

I was wondering if it might be worth upgrading to 2.6.32.10-90.fc12, but I can't see anything immediately obvious in the ChangeLog that addresses this issue.
The only thing significant I see is that we're both running the 32-bit distributions. This is an NForce2 board (Asus, I think), AMD Athlon(tm) XP 3200+, 5-disk RAID5. The disks are:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST3500320AS
Serial Number:    9QM2B8HN
Firmware Version: SD1A

I'm running all kinds of services with LVM on top of the RAID5. I can't understand why this thread isn't flooded with "dittos". This system is completely unstable with the Fedora build of the 2.6.32 kernel. I tried setting CONFIG_MULTICORE_RAID456 to no and rebuilding the kernel rpm, but wasn't successful. I'll try again another day.
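(The drive details above are the information section reported by smartmontools, i.e. something like the following, with a hypothetical device name:)

smartctl -i /dev/sda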
*** Bug 575897 has been marked as a duplicate of this bug. ***
Disabled CONFIG_MULTICORE_RAID456 in kernel-2.6.32.11-101.fc12
Would it be possible to get a build of this kernel so we can try it out? I can't see it in koji: http://koji.fedoraproject.org/koji/packageinfo?packageID=8
I have another instance of the multicore failure. I think this option should not be turned on until this is debugged. I am running the latest Fedora kernel on F12. I will get about 100 async processes going during a resync, cp, or rsync to the raid device. Sometimes it locks the machine up, and the load is sky high on the box. I am running this on an Intel SS4200-E with 4 SATA drives in a RAID5 setup.

I think this problem started in 2.6.32, when this MULTICORE option was added and Fedora decided to turn it on. I am rebuilding the latest Fedora kernel with the option off and going to retest on this same hardware again; a rough outline of the rebuild is below.
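Roughly what I'm doing for the rebuild. Treat this as a sketch: it assumes ~/rpmbuild as the build tree and that the option lives in the config-generic fragment, which may differ between kernel package revisions:

yumdownloader --source kernel
rpm -ivh kernel-*.src.rpm
# turn the option off in the common config fragment before building
sed -i 's/^CONFIG_MULTICORE_RAID456=y/# CONFIG_MULTICORE_RAID456 is not set/' \
    ~/rpmbuild/SOURCES/config-generic
# rebuild the binary packages for this architecture
rpmbuild -bb --target=i686 ~/rpmbuild/SPECS/kernel.spec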
I retested with the option removed from the Fedora kernel: the async processes are gone and the problems have disappeared. Please remove this option from the compiled Fedora kernels. We also need to get this lock-up information to the developer.
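For anyone wanting to confirm what their running kernel was built with, the config file installed by the Fedora kernel package can be checked directly (assuming the stock /boot/config-* files are present):

grep MULTICORE_RAID456 /boot/config-$(uname -r)
# with the option off, this prints: # CONFIG_MULTICORE_RAID456 is not set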
kernel-2.6.32.12-114.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/kernel-2.6.32.12-114.fc12
I updated to kernel-2.6.32.12-114.fc12 and it performs well for me. I have completed a full mdraid check on my software RAID and there was no sign of the multiple async issue. Disabling CONFIG_MULTICORE_RAID456 would indeed appear to have fixed this issue. Thanks Chuck.
kernel-2.6.32.12-115.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/kernel-2.6.32.12-115.fc12
kernel-2.6.32.12-115.fc12 has been pushed to the Fedora 12 stable repository. If problems still persist, please make note of it in this bug report.
*** Bug 596490 has been marked as a duplicate of this bug. ***
Has anyone else noticed corruption of the RAID array as a result of this? Is destroying and rebuilding required after being affected by it?

I haven't gotten any notices from the md modules, but no matter what filesystem I use for a BackupPC volume, which sits on top of LVM, on top of my RAID5, I always have corruption issues. No one else seems to have these corruption issues with BackupPC, though, and again, I've used btrfs, ext4, and xfs.

To add to the difficulty of diagnosing my problem, the 5-drive RAID5 is made of the Seagate ST3500320AS gems that caused Seagate the biggest PR disaster I can recall for a hard drive manufacturer. I'd updated all of the firmware before constructing this array, though. Still, some have been skeptical of the update.

BackupPC doesn't seem to be the only issue, though. Some Photoshop files on the drive have complained of JPEG corruption inside the file, though they still seem to open fine. Other files are now failing to copy, and I just don't trust the array anymore.

Can anyone come up with a reasoning as to how this might have happened and suggest a process to attain stability once again? At this point, I'm leaning most heavily towards ZFS-Fuse, to just take the hit on speed as long as it can both help me diagnose a possible drive issue and still provide reliability.
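One way to at least sanity-check the array itself, independently of the filesystems on top, is to run a manual scrub and inspect the mismatch count afterwards (md0 below stands in for whatever your array device is):

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                       # wait for the check to finish
cat /sys/block/md0/md/mismatch_cnt     # non-zero means parity and data disagreed somewhere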
For your information, bug 596490 still exists. It came up when tests for the Red Hat Beta2 x86_64 build were executed.
CONFIG_MULTICORE_RAID456 is still enabled in master, f14 and f13. Please disable it on all branches.
(In reply to comment #20)
> CONFIG_MULTICORE_RAID456 is still enabled in master, f14 and f13. Please
> disable it on all branches.

Done.