Description of problem:

I have seen "mismatch_cnt is not 0" warnings in the past, but that has always
been with RAID 1 arrays, and with relatively small numbers in
/sys/block/md*/md/mismatch_cnt

After updating to Fedora 18, I get this message from all updated systems that
have RAID 6 arrays, and with _huge_ values of mismatch_cnt, like this:

System A:
=========

# mdadm -q --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 243269632 (232.00 GiB 249.11 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 02:27:28 2013
          State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : atlas.denx.de:0  (local to host atlas.denx.de)
           UUID : da015f96:138b37bf:d5ef71dc:8970ab15
         Events : 8

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       97        3      active sync   /dev/sdg1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1
       6       8      145        6      active sync   /dev/sdj1
       7       8      161        7      active sync   /dev/sdk1

# cat /sys/block/md0/md/mismatch_cnt
362732152

System H:
=========

# mdadm -q --detail /dev/md6
/dev/md6:
        Version : 1.2
  Creation Time : Thu May 10 13:12:22 2012
     Raid Level : raid6
     Array Size : 976789696 (931.54 GiB 1000.23 GB)
  Used Dev Size : 244197424 (232.88 GiB 250.06 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 01:51:55 2013
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : hercules.denx.de:6  (local to host hercules.denx.de)
           UUID : a3364d52:c21c7ed6:fa604fed:ec01723e
         Events : 116

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       64        2      active sync   /dev/sde
       3       8       80        3      active sync   /dev/sdf
       4       8       96        4      active sync   /dev/sdg
       5       8      112        5      active sync   /dev/sdh

# cat /sys/block/md6/md/mismatch_cnt
10657736

System X:
=========

# mdadm -q --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Sat Apr  7 00:17:28 2012
     Raid Level : raid6
     Array Size : 1465184544 (1397.31 GiB 1500.35 GB)
  Used Dev Size : 244197424 (232.88 GiB 250.06 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 19:49:08 2013
          State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : xpert.denx.de:2  (local to host xpert.denx.de)
           UUID : c7ee02c7:5ef42476:a4b34c5a:0ef93715
         Events : 185

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       64        2      active sync   /dev/sde
       3       8       80        3      active sync   /dev/sdf
       4       8       96        4      active sync   /dev/sdg
       5       8      112        5      active sync   /dev/sdh
       6       8      128        6      active sync   /dev/sdi
       7       8      144        7      active sync   /dev/sdj

# cat /sys/block/md2/md/mismatch_cnt
88624952

Note: except for the huge values of mismatch_cnt, I see no other indications
of errors on the disk drives, the RAID arrays, or the file systems on top of
these.

Version-Release number of selected component (if applicable):
mdadm-3.2.6-7.fc18.x86_64

How reproducible:
All updated systems with RAID 6 arrays show this state now; it was detected
during the usual "raid-check" run at the weekend.

Steps to Reproduce:
1. Set up a system with a RAID 6 array.
2. Let it run.
3. After a few days, run /usr/sbin/raid-check

Actual results:
Huge values of mismatch_cnt.

Expected results:
mismatch_cnt=0

Additional info:
Note: after rebooting these systems, I see mismatch_cnt=0 once more.
More information:

- re-running a "check" operation on the array causes a huge mismatch_cnt again
- in all cases I use ext4 (in two cases under LVM) on the RAID arrays
- so far I have not seen any corruption of actual data; running e2fsck -f on
  the unmounted file systems shows no issues either
- this happens with different HBAs (one system uses an LSI Logic / Symbios
  Logic SAS1068E PCI-Express Fusion-MPT SAS controller; two use the Marvell
  Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller)
- one system now also shows a non-zero mismatch count on a RAID 1 array
- all systems had been running without any problems for many months (even
  years) before; the problem appeared with the update to Fedora 18
- the systems use relatively old disks (one system 8 x Seagate NL35
  ST3250623NS; one 6 x Seagate Barracuda ES.2 ST3250310NS; one 8 x Maxtor
  MaXLine Plus II 7Y250M0). AFAICT none of these are AF disks; the sector
  size is 512 bytes on all of them.
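For reference, re-running a "check" by hand comes down to writing to the array's sync_action file in sysfs, which is essentially what /usr/sbin/raid-check does per array. A minimal sketch, assuming the array's sysfs md directory (e.g. /sys/block/md0/md) is passed in; the helper name raid_check_count is hypothetical, and the wait step is only indicated in a comment:

```shell
# Hypothetical helper: kick off a read-only "check" scrub on one md array
# and print the resulting mismatch counter. md_dir is the array's sysfs
# directory, e.g. /sys/block/md0/md (run as root on a real system).
raid_check_count() {
    md_dir=$1
    echo check > "$md_dir/sync_action"   # start the read-only scrub
    # (in practice, wait here until sync_action reads "idle" again
    #  before trusting the counter)
    cat "$md_dir/mismatch_cnt"           # 0 on a consistent array
}
```

Usage would be e.g. `raid_check_count /sys/block/md0/md`; unlike "repair", this pass only counts inconsistent stripes and writes nothing back to the disks.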
See also thread "Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18" on Linux RAID mailing list: http://thread.gmane.org/gmane.linux.raid/41443
I've installed a mainline v3.8-rc5 kernel on the affected systems now. A
"check" operation showed no more problems, but "raid6test" still reported a
large number of errors like these:

...
P(4) wrong at 10291
Q(5) wrong at 10291
Error detected at 10291: disk slot unknown
P(3) wrong at 10292
Q(4) wrong at 10292
Error detected at 10292: disk slot unknown
P(2) wrong at 10293
Q(3) wrong at 10293
Error detected at 10293: disk slot unknown
...

After running a "repair" on the array, neither "check" nor "raid6test"
reported any further issues. I'll continue to watch this for a while, but I
think I will not "update" to a Fedora kernel for some time...
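The "repair" run mentioned above differs from "check" only in the keyword written to sync_action: instead of merely counting inconsistent stripes, the kernel recomputes the P/Q parity from the data blocks and writes it back. A minimal sketch, assuming the sysfs md directory is passed in; the helper name raid_repair is hypothetical:

```shell
# Hypothetical helper: start a "repair" pass on one md array. Unlike
# "check", this rewrites P/Q parity, so mismatches found afterwards by
# "check" or raid6test should be gone. md_dir is the array's sysfs
# directory, e.g. /sys/block/md0/md (run as root on a real system).
raid_repair() {
    md_dir=$1
    echo repair > "$md_dir/sync_action"  # recompute and rewrite parity
}
```

Note that on RAID 6, repair regenerates parity from the data; it does not use the parity to decide which data block was wrong.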
So we have here an issue which might have the potential for major data corruption, and for a full 3 weeks after reporting NOTHING happens? Nothing at all? This is ... surprising.
The upstream discussion on this issue went silent by the end of January.

Does this mean the problem was resolved?

Thanks,
Jes
(In reply to comment #6)
> The upstream discussion on this issue went silent by the end of January.
>
> Does this mean the problem was resolved?

If so, then no information about such a fix has ever been disclosed. I have
been avoiding Fedora kernels ever since, and run pristine mainline code
instead.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go
through and several of them have gone stale. Due to this, we are doing a mass
bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel
update (or newer) and let us know if your issue has been resolved or if it is
still present with the newer kernel.

If you have moved on to Fedora 19 and are still experiencing this issue,
please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go
through and several of them have gone stale.

It has been over a month since we asked you to test the 3.11 kernel updates
and let us know if your issue has been resolved or is still a problem. When
this happened, the bug was set to needinfo. Because needinfo is still set, we
assume either this is no longer a problem, or you cannot provide additional
information to help us resolve the issue. As a result, we are closing this
bug with insufficient data.

If this is still a problem, we apologize; feel free to reopen the bug and
provide more information so that we can work towards a resolution.

If you experience different issues, please open a new bug report for those.