Bug 904831 - "WARNING: mismatch_cnt is not 0" on all RAID 6 arrays
Summary: "WARNING: mismatch_cnt is not 0" on all RAID 6 arrays
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jes Sorensen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-01-27 19:20 UTC by Wolfgang Denk
Modified: 2017-08-07 12:05 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-27 16:06:30 UTC
Type: Bug
Embargoed:
wd: needinfo-


Attachments: (none)

Description Wolfgang Denk 2013-01-27 19:20:33 UTC
Description of problem:

I have seen "mismatch_cnt is not 0" warnings in the past, but that has
always been with RAID 1 arrays, and with relatively small numbers on
/sys/block/md*/md/mismatch_cnt

After updating to Fedora 18, I get this message from all updated
systems that have RAID 6 arrays, and with _huge_ numbers of
mismatch_cnt, like that:

System A:
=========

# mdadm -q --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Jan 14 14:20:34 2013
     Raid Level : raid6
     Array Size : 1459617792 (1392.00 GiB 1494.65 GB)
  Used Dev Size : 243269632 (232.00 GiB 249.11 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 02:27:28 2013
          State : clean 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : atlas.denx.de:0  (local to host atlas.denx.de)
           UUID : da015f96:138b37bf:d5ef71dc:8970ab15
         Events : 8

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
       3       8       97        3      active sync   /dev/sdg1
       4       8      113        4      active sync   /dev/sdh1
       5       8      129        5      active sync   /dev/sdi1
       6       8      145        6      active sync   /dev/sdj1
       7       8      161        7      active sync   /dev/sdk1
# cat /sys/block/md0/md/mismatch_cnt
362732152


System H:
=========

# mdadm -q --detail /dev/md6
/dev/md6:
        Version : 1.2
  Creation Time : Thu May 10 13:12:22 2012
     Raid Level : raid6
     Array Size : 976789696 (931.54 GiB 1000.23 GB)
  Used Dev Size : 244197424 (232.88 GiB 250.06 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 01:51:55 2013
          State : clean 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : hercules.denx.de:6  (local to host hercules.denx.de)
           UUID : a3364d52:c21c7ed6:fa604fed:ec01723e
         Events : 116

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       64        2      active sync   /dev/sde
       3       8       80        3      active sync   /dev/sdf
       4       8       96        4      active sync   /dev/sdg
       5       8      112        5      active sync   /dev/sdh
# cat /sys/block/md6/md/mismatch_cnt
10657736


System X:
=========

# mdadm -q --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Sat Apr  7 00:17:28 2012
     Raid Level : raid6
     Array Size : 1465184544 (1397.31 GiB 1500.35 GB)
  Used Dev Size : 244197424 (232.88 GiB 250.06 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Sun Jan 27 19:49:08 2013
          State : clean 
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 16K

           Name : xpert.denx.de:2  (local to host xpert.denx.de)
           UUID : c7ee02c7:5ef42476:a4b34c5a:0ef93715
         Events : 185

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       64        2      active sync   /dev/sde
       3       8       80        3      active sync   /dev/sdf
       4       8       96        4      active sync   /dev/sdg
       5       8      112        5      active sync   /dev/sdh
       6       8      128        6      active sync   /dev/sdi
       7       8      144        7      active sync   /dev/sdj
# cat /sys/block/md2/md/mismatch_cnt
88624952

Note: apart from the huge values of mismatch_cnt, I see no other
indication of errors on the disk drives, the RAID arrays, or the file
systems on top of them.


Version-Release number of selected component (if applicable):

mdadm-3.2.6-7.fc18.x86_64

How reproducible:

All updated systems with RAID 6 arrays now show this state; it was
detected during the usual weekend "raid-check" run.


Steps to Reproduce:
1. Set up a system with a RAID 6 array.
2. Let it run.
3. After a few days, run  /usr/sbin/raid-check
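
For reference, raid-check essentially writes "check" into each array's
sync_action file; the same check can be started by hand (md0 is used
here only as an example):

# echo check > /sys/block/md0/md/sync_action   # start a "check" pass
# cat /proc/mdstat                             # watch progress until the check finishes
# cat /sys/block/md0/md/mismatch_cnt           # read the resulting mismatch count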
  
Actual results:

Huge values of mismatch_cnt.

Expected results:

mismatch_cnt=0

Additional info:

Comment 1 Wolfgang Denk 2013-01-27 19:43:05 UTC
Note: after rebooting these systems, I see mismatch_cnt=0 once more.

Comment 2 Wolfgang Denk 2013-01-28 11:27:30 UTC
More information:
- re-running a "check" operation on the array will cause a huge
  mismatch_cnt again
- in all cases I use ext4 (in two cases under LVM) on the RAID arrays
- so far I have not seen any corruption of actual data; running e2fsck -f
  on the unmounted file systems shows no issues either (commands sketched
  after this list)
- this happens with different HBAs (one system uses a LSI Logic /
  Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS controller, two
  use Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X
  Controller)
- one system now also shows a non-zero mismatch count on a RAID 1
  array
- all systems had been running without any problems for many months
  (even years) before; the problem appeared with the update to Fedora 18
- the systems use relatively old disks (one system 8 x Seagate NL35
  ST3250623NS; one 6 x Seagate Barracuda ES.2 ST3250310NS; one 8 x
  Maxtor MaXLine Plus II 7Y250M0). AFAICT none of these are AF disks;
  the sector size is 512 bytes on all of them.
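
The e2fsck runs mentioned above were along these lines (the device path
is only a placeholder; on two of the systems the ext4 file systems sit
on LVM volumes on top of the arrays):

# umount /dev/md0             # the file system must not be mounted
# e2fsck -f /dev/md0          # force a full check even if the fs is marked clean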

Comment 3 Wolfgang Denk 2013-01-29 06:34:31 UTC
See also thread "Huge values of mismatch_cnt on RAID 6 arrays under Fedora 18"
on Linux RAID mailing list:
http://thread.gmane.org/gmane.linux.raid/41443

Comment 4 Wolfgang Denk 2013-01-31 12:04:31 UTC
I've installed a mainline v3.8-rc5 kernel now on the affected
systems. A "check" operation showed no more problems, but "raid6test"
still reported a large number of errors like these:

...
P(4) wrong at 10291
Q(5) wrong at 10291
Error detected at 10291: disk slot unknown
P(3) wrong at 10292
Q(4) wrong at 10292
Error detected at 10292: disk slot unknown
P(2) wrong at 10293
Q(3) wrong at 10293
Error detected at 10293: disk slot unknown
...

After running a "repair" on the array, both "check" and "raid6test"
would not report any further issues.
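
For completeness, the repair was triggered through the md sync_action
interface (md0 again only as an example), followed by a fresh check:

# echo repair > /sys/block/md0/md/sync_action   # rewrite parity where it does not match the data
# cat /proc/mdstat                              # wait for the repair to finish
# echo check > /sys/block/md0/md/sync_action    # re-run the consistency check
# cat /sys/block/md0/md/mismatch_cnt            # now reads 0 again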

I'll continue to watch this for a while, but I think I will not
"update" to a Fedora kernel for some time...

Comment 5 Wolfgang Denk 2013-02-16 21:13:53 UTC
So we have an issue here with the potential for major data corruption, and for a full 3 weeks after reporting NOTHING happens?
Nothing at all?  This is ... surprising.

Comment 6 Jes Sorensen 2013-05-02 09:42:18 UTC
The upstream discussion on this issue went all silent by the end of January.

Does this mean the problem was resolved?

Thanks,
Jes

Comment 7 Wolfgang Denk 2013-05-02 11:03:15 UTC
(In reply to comment #6)
> The upstream discussion on this issue went all silent by the end of January.
> 
> Does this mean the problem was resolved?

If so, then no information about such a fix has ever been disclosed.
I have been avoiding Fedora kernels since then, and instead run pristine
mainline code.

Comment 8 Justin M. Forbes 2013-10-18 21:02:19 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through, and several of them have gone stale.  Because of this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.

Comment 9 Justin M. Forbes 2013-11-27 16:06:30 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through, and several of them have gone stale.

It has been over a month since we asked you to test the 3.11 kernel updates and let us know if your issue has been resolved or is still a problem. When this happened, the bug was set to needinfo.  Because the needinfo flag is still set, we assume either that this is no longer a problem or that you cannot provide additional information to help us resolve the issue.  As a result we are closing this bug with insufficient data. If this is still a problem, we apologize; feel free to reopen the bug and provide more information so that we can work towards a resolution.

If you experience different issues, please open a new bug report for those.

