Bug 1090423 - Data integrity issue on rebuilding RAID 6 with 100MB resync speed
Summary: Data integrity issue on rebuilding RAID 6 with 100MB resync speed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.5
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jes Sorensen
QA Contact: XiaoNi
URL:
Whiteboard:
Depends On:
Blocks: 1130905 1131395 1131409 1131410
 
Reported: 2014-04-23 09:56 UTC by Manibalan
Modified: 2018-12-09 17:45 UTC (History)
CC List: 11 users

Fixed In Version: kernel-2.6.32-498.el6
Doc Type: Bug Fix
Doc Text:
Previously, recovery of a double-degraded RAID6 array could, under certain circumstances, result in data corruption. This could happen because the md driver was using an optimization that is safe to use only for single-degraded arrays. This update ensures that this optimization is skipped during the recovery of double-degraded RAID6 arrays.
Clone Of:
: 1130905 1131395 1131409 1131410 (view as bug list)
Environment:
Last Closed: 2014-10-14 06:05:30 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
storage integrity checker - CAVEAT THIS DESTROYS ALL DATA ON THE NAMED DEVICE (3.61 KB, text/x-csrc)
2014-07-02 06:07 UTC, Mark Goodwin
no flags Details
Attachment for comment 5 - aiodit got the mis-compare (328.88 KB, application/octet-stream)
2014-07-26 03:52 UTC, Manibalan
no flags Details
Attachment for comment 10 (126.23 KB, application/octet-stream)
2014-07-29 14:54 UTC, Manibalan
no flags Details
Attachment for comment 13 (46.42 KB, application/octet-stream)
2014-07-30 10:35 UTC, Manibalan
no flags Details
Attachment for comment 21 (13.70 KB, text/plain)
2014-07-31 06:18 UTC, Manibalan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2014:1392 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2014-10-14 01:28:44 UTC

Description Manibalan 2014-04-23 09:56:58 UTC
Steps to Reproduce:
# Steps to reproduce the issue
1. Create a RAID6 array with 4 or more drives.
2. Set the resync speed max and min to 100000 (100 MB/s).
3. Run a script to simulate drive failures (a sketch of this loop follows the note below). The script does the following:
 a. mdadm: set two random drives on the md array faulty.
 b. mdadm: remove those drives.
 c. mdadm: add one drive and wait for the rebuild to complete.
 d. Then insert the next drive.
 e. Wait until the rebuild completes on the second drive and the md becomes optimal.
 f. Continue the disk-removal cycle again.
4. Configure an iSCSI target for “/dev/md0”.
5. Connect to the target from a Windows server using the Microsoft iSCSI initiator.
6. Format it with NTFS.
7. Run the Dit32 IO tool. (Dit32 is a data-verification tool; it continuously writes and reads back data to verify integrity.)

Note:
	This issue occurs only with a high resync speed
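
	A minimal sketch of the failure-simulation loop described in step 3 (illustrative only; /dev/md0 and the member partitions below are taken from this report, but the actual script used may differ):

	#!/bin/bash
	# Illustrative sketch: repeatedly fail two random members of the array,
	# then re-add them one at a time, waiting for each rebuild to finish.
	MD=/dev/md0
	DRIVES="/dev/sdb6 /dev/sdc6 /dev/sdd6 /dev/sde6 /dev/sdf6 /dev/sdg6 /dev/sdh6 /dev/sdi6"

	wait_rebuild() {
	    sleep 10    # give the rebuild a moment to show up in /proc/mdstat
	    while grep -Eq 'recovery|resync' /proc/mdstat; do
	        sleep 10
	    done
	}

	while true; do
	    picked=$(echo $DRIVES | tr ' ' '\n' | shuf -n 2)
	    for d in $picked; do
	        mdadm $MD --fail $d
	        mdadm $MD --remove $d
	    done
	    for d in $picked; do
	        mdadm $MD --add $d
	        wait_rebuild    # wait for this drive's rebuild before adding the next
	    done
	done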


Additional info:

	# Below is the log from the Dit32 tool:

		* The test was started on Tue Apr 22 18:55:42 2014
		* The mis-compare happened on Tue Apr 22 20:19:19 2014 (1 hr and 24 min after starting the test)
		* The failure happened in the verification cycle, while reading the file "G:\dit\s7\d55\s7d55f62"
			->Expected Data: 07 37 3e 01 0240 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
			->Read Data: 07 37 3e 01 0340 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
		-------------------------------------------------------------------------------------------------------	
			Tue Apr 22 18:55:42 2014 INFO:  DITNT application started
			Tue Apr 22 18:57:44 2014 INFO:  Test started on Drive G:
				 Dir Sets=8, Dirs per Set=64, Files per Dir=64
				 File Size=512KB
				 Read Only=N, Debug Stamp=Y, Verify During Copy=Y
				 Build I/O Size range=1 to 128 sectors
				 Copy Read I/O Size range=1 to 128 sectors
				 Copy Write I/O Size range=1 to 128 sectors
				 Verify I/O Size range=1 to 128 sectors
			Tue Apr 22 20:19:19 2014 ERROR: Miscompare Found: File "G:\dit\s7\d55\s7d55f62", offset=00048008
				 Expected Data: 07 37 3e 01 0240 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
					 Read Data: 07 37 3e 01 0340 (dirSet, dirNo, fileNo, elementNo, sectorOffset)
				 Read Request: offset=00043000, size=00008600
		--------------------------------------------------------------------------------------------------------

	# Below is the md state when the mis-compare hit

		The mis-compare hit at 20:19:19; at that time md was rebuilding the first drive (possibly near the end of the first drive's rebuild, before the second drive was added).
		---------------------------------------------------------------------------------------------------------
		Tue Apr 22 20:03:10 IST 2014 :  After Rebuild Started
		Tue Apr 22 20:03:10 IST 2014 : /dev/md0:
				Version : 1.2
		  Creation Time : Tue Apr 22 18:49:47 2014
			 Raid Level : raid6
			 Array Size : 50331648 (48.00 GiB 51.54 GB)
		  Used Dev Size : 8388608 (8.00 GiB 8.59 GB)
		   Raid Devices : 8
		  Total Devices : 7
			Persistence : Superblock is persistent

			Update Time : Tue Apr 22 20:03:00 2014
				  State : active, degraded, recovering
		 Active Devices : 6
		Working Devices : 7
		 Failed Devices : 0
		  Spare Devices : 1

				 Layout : left-symmetric
			 Chunk Size : 64K

		 Rebuild Status : 0% complete

				   Name : initiator:0
				   UUID : 2a4ec82e:e9832053:06df1fdb:59ff3565
				 Events : 462

			Number   Major   Minor   RaidDevice State
			   0       8       22        0      active sync   /dev/sdb6
			   8       8       38        1      active sync   /dev/sdc6
			   2       8       54        2      active sync   /dev/sdd6
			   3       8       70        3      active sync   /dev/sde6
			   4       8       86        4      active sync   /dev/sdf6
			   9       8      134        5      spare rebuilding   /dev/sdi6
			  10       8      118        6      active sync   /dev/sdh6
			   7       0        0        7      removed
		Tue Apr 22 20:19:41 IST 2014 :  After Rebuild completed on First drive
		Tue Apr 22 20:19:41 IST 2014 : /dev/md0:
				Version : 1.2
		  Creation Time : Tue Apr 22 18:49:47 2014
			 Raid Level : raid6
			 Array Size : 50331648 (48.00 GiB 51.54 GB)
		  Used Dev Size : 8388608 (8.00 GiB 8.59 GB)
		   Raid Devices : 8
		  Total Devices : 7
			Persistence : Superblock is persistent

			Update Time : Tue Apr 22 20:19:41 2014
				  State : clean, degraded
		 Active Devices : 7
		Working Devices : 7
		 Failed Devices : 0
		  Spare Devices : 0

				 Layout : left-symmetric
			 Chunk Size : 64K

				   Name : initiator:0
				   UUID : 2a4ec82e:e9832053:06df1fdb:59ff3565
				 Events : 496

			Number   Major   Minor   RaidDevice State
			   0       8       22        0      active sync   /dev/sdb6
			   8       8       38        1      active sync   /dev/sdc6
			   2       8       54        2      active sync   /dev/sdd6
			   3       8       70        3      active sync   /dev/sde6
			   4       8       86        4      active sync   /dev/sdf6
			   9       8      134        5      active sync   /dev/sdi6
			  10       8      118        6      active sync   /dev/sdh6
			   7       0        0        7      removed
		Tue Apr 22 20:19:41 IST 2014 : Iteration : 3 : Continue adding disk 2
		Tue Apr 22 20:19:56 IST 2014 :  After adding the second drive [ /dev/sdg6 ]
		---------------------------------------------------------------------------------------------------------

More information about the system:
------------------------------------------------------------------------------------------------------------------
[root@root ~]# uname -a
Linux root 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@root ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)

[root@root ~]# lsscsi
[0:0:0:0]    disk    SEAGATE  ST2000NM0001     0002  /dev/sda
[0:0:1:0]    enclosu LSI CORP SAS2X36          0417  -
[0:0:2:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdb
[0:0:3:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdc
[0:0:4:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdd
[0:0:5:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sde
[0:0:6:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdf
[0:0:7:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdg
[0:0:8:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdh
[0:0:9:0]    disk    SEAGATE  ST3600057SS      0008  /dev/sdi

[root@root ~]# rpm -qa | grep iscsi
iscsi-initiator-utils-6.2.0.873-10.el6.x86_64

[root@root ~]# parted -l
Model: SEAGATE ST2000NM0001 (scsi)
Disk /dev/sda: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  525MB   524MB   primary  ext4         boot
 2      525MB   2000GB  2000GB  primary               lvm


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdb: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB  ext3                  boot
 2      4296MB  6443MB  2147MB                        raid
 3      6443MB  7517MB  1074MB                        raid
 4      7517MB  7651MB  134MB                         raid
 5      7651MB  7653MB  1049kB                        raid
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdc: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB  ext3                  boot
 2      4296MB  6443MB  2147MB                        raid
 3      6443MB  7517MB  1074MB                        raid
 4      7517MB  7651MB  134MB                         raid
 5      7651MB  7653MB  1049kB                        raid
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdd: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB  ext3                  boot
 2      4296MB  6443MB  2147MB                        raid
 3      6443MB  7517MB  1074MB                        raid
 4      7517MB  7651MB  134MB                         raid
 5      7651MB  7653MB  1049kB                        raid
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sde: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB               primary
 2      4296MB  6443MB  2147MB               primary
 3      6443MB  7517MB  1074MB               primary
 4      7517MB  7651MB  134MB                primary
 5      7651MB  7653MB  1049kB               primary
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdf: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB               primary
 2      4296MB  6443MB  2147MB               primary
 3      6443MB  7517MB  1074MB               primary
 4      7517MB  7651MB  134MB                primary
 5      7651MB  7653MB  1049kB               primary
 6      7654MB  500GB   492GB                primary


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_root-lv_home: 1940GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End     Size    File system  Flags
 1      0.00B  1940GB  1940GB  ext4


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_root-lv_swap: 6275MB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End     Size    File system     Flags
 1      0.00B  6275MB  6275MB  linux-swap(v1)


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_root-lv_root: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End     Size    File system  Flags
 1      0.00B  53.7GB  53.7GB  ext4


Model: Unknown (unknown)
Disk /dev/md0: 51.5GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  51.5GB  51.5GB  primary  ntfs


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdg: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB               primary
 2      4296MB  6443MB  2147MB               primary
 3      6443MB  7517MB  1074MB               primary
 4      7517MB  7651MB  134MB                primary
 5      7651MB  7653MB  1049kB               primary
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdh: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB               primary
 2      4296MB  6443MB  2147MB               primary
 3      6443MB  7517MB  1074MB               primary
 4      7517MB  7651MB  134MB                primary
 5      7651MB  7653MB  1049kB               primary
 6      7654MB  500GB   492GB                primary


Model: SEAGATE ST3600057SS (scsi)
Disk /dev/sdi: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4296MB  4295MB               primary
 2      4296MB  6443MB  2147MB               primary
 3      6443MB  7517MB  1074MB               primary
 4      7517MB  7651MB  134MB                primary
 5      7651MB  7653MB  1049kB               primary
 6      7654MB  500GB   492GB                primary

[root@root ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>
[root@root ~]#  mdadm -C /dev/md0 -c 64 -l 6 -f -n 8 -e 1.2 -z 8G --homehost=initiator /dev/sdb6 /dev/sdc6 /dev/sdd6 /dev/sde6 /dev/sdf6 /dev/sdg6 /dev/sdh6 /dev/sdi6
mdadm: /dev/sdb6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdc6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdd6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sde6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdf6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdg6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdh6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: /dev/sdi6 appears to be part of a raid array:
    level=raid6 devices=8 ctime=Tue Apr 22 16:07:31 2014
mdadm: largest drive (/dev/sdb6) exceeds size (8388608K) by more than 1%
Continue creating array? yes
mdadm: array /dev/md0 started.

[root@root ~]# /etc/init.d/tgtd start
Starting SCSI target daemon:                               [  OK  ]
[root@root ~]# tgtadm --mode target --op show
Target 1: iqn.2011-07.self.server2:tgtr
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET     00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: null
            Backing store path: None
            Backing store flags:
        LUN: 1
            Type: disk
            SCSI ID: IET     00010001
            SCSI SN: beaf11
            Size: 51540 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: rdwr
            Backing store path: /dev/md0
            Backing store flags:
    Account information:
    ACL information:
        ALL
------------------------------------------------------------------------------------------------------------------

Comment 2 Dan Williams 2014-05-16 17:38:57 UTC
Can you add lsmod and lspci output to this?

Comment 3 Manibalan 2014-06-25 05:39:23 UTC
(In reply to Dan Williams from comment #2)
> Can you add lsmod and lspci output to this?

[root@root ~]# lsmod
Module                  Size  Used by
iptable_mangle          3349  0
iptable_nat             6158  0
nf_nat                 22759  1 iptable_nat
raid456                76573  1
async_raid6_recov       6555  1 raid456
async_pq                4775  2 raid456,async_raid6_recov
raid6_pq               90821  2 async_raid6_recov,async_pq
async_xor               3691  3 raid456,async_raid6_recov,async_pq
xor                     8738  1 async_xor
async_memcpy            2236  2 raid456,async_raid6_recov
async_tx                3027  5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy
autofs4                26513  3
cpufreq_ondemand       10544  16
acpi_cpufreq            7763  1
freq_table              4936  2 cpufreq_ondemand,acpi_cpufreq
mperf                   1557  1 acpi_cpufreq
ipt_REJECT              2351  0
nf_conntrack_ipv4       9506  3 iptable_nat,nf_nat
nf_defrag_ipv4          1483  1 nf_conntrack_ipv4
iptable_filter          2793  0
ip_tables              17831  3 iptable_mangle,iptable_nat,iptable_filter
ip6t_REJECT             4628  2
nf_conntrack_ipv6       8748  2
nf_defrag_ipv6         11182  1 nf_conntrack_ipv6
xt_state                1492  2
nf_conntrack           79758  5 iptable_nat,nf_nat,nf_conntrack_ipv4,nf_conntrack_ipv6,xt_state
ip6table_filter         2889  1
ip6_tables             18732  1 ip6table_filter
ipv6                  317340  77 ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6
microcode             112685  0
iTCO_wdt                7115  0
iTCO_vendor_support     3056  1 iTCO_wdt
sg                     29350  0
ses                     6475  0
enclosure               8438  1 ses
ixgbe                 243968  0
mdio                    4769  1 ixgbe
igb                   197536  0
i2c_algo_bit            5935  1 igb
ptp                     9614  2 ixgbe,igb
pps_core               11458  1 ptp
serio_raw               4594  0
i2c_i801               11359  0
i2c_core               31084  3 igb,i2c_algo_bit,i2c_i801
lpc_ich                12803  0
mfd_core                1895  1 lpc_ich
ioatdma                58130  24
dca                     7101  3 ixgbe,igb,ioatdma
i7core_edac            17948  0
edac_core              46581  1 i7core_edac
shpchp                 32778  0
ext4                  374902  3
jbd2                   93427  1 ext4
mbcache                 8193  1 ext4
sd_mod                 39069  11
crc_t10dif              1541  1 sd_mod
pata_acpi               3701  0
ata_generic             3837  0
ata_piix               24601  0
mpt2sas               186848  6
scsi_transport_sas     35652  1 mpt2sas
raid_class              4516  1 mpt2sas
dm_mirror              14384  0
dm_region_hash         12085  1 dm_mirror
dm_log                  9930  2 dm_mirror,dm_region_hash
dm_mod                 84209  11 dm_mirror,dm_log
[root@root ~]# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:0e.0 Host bridge: Intel Corporation Device 341c (rev 22)
00:0e.1 Host bridge: Intel Corporation Device 341d (rev 22)
00:0e.2 Host bridge: Intel Corporation Device 341e (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller
00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
02:00.0 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01)
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)
06:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
[root@root ~]#

Comment 4 Mark Goodwin 2014-07-02 06:07:18 UTC
Created attachment 913989 [details]
storage integrity checker - CAVEAT THIS DESTROYS ALL DATA ON THE NAMED DEVICE

I have attached 'integrity.c' to this bug. This program creates a number of parallel threads, each of which writes random data to a different offset of the specified device or file, and then verifies what was written. If the verification succeeds, then each thread will repeat this forever. NOTE THAT IT OVERWRITES ALL DATA ON THE NAMED DEVICE, including metadata and partition tables - use with extreme caution. There is a large comment at the start of the file explaining its usage. Compile: cc -o integrity integrity.c -lpthread

I have been running this program with an 8 disk s/w raid6 60GB device, deliberately failing two drives and then adding them back in, causing a re-sync. I used 19 threads, each testing 3GB at sequential offsets in the 60GB raid6 device. I've run this for days and it has not failed once. Note that I do not have any iSCSI remote devices or Microsoft filesystems in this test environment - it is just testing the RHEL6 raid6 driver and the underlying scsi devices. I have also tried various re-sync speeds, as per the repro description.
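
For reference, a rough shell approximation of that approach (this is NOT the attached integrity.c, just an illustrative sketch, and like the C tool it destroys all data on the target device): each background job owns one region of the device, repeatedly writes a unique pattern there, and reads it back for comparison.

	#!/bin/bash
	# Illustrative sketch only - destroys all data on $DEV.
	DEV=/dev/md0
	NJOBS=19
	REGION_MB=256                 # region per job (the C tool used 3GB per thread)

	for i in $(seq 0 $((NJOBS - 1))); do
	    (
	        pat=/tmp/pat.$i
	        dd if=/dev/urandom of=$pat bs=1M count=$REGION_MB 2>/dev/null
	        while :; do
	            # write this job's pattern to its own region, then read it back and compare
	            dd if=$pat of=$DEV bs=1M seek=$((i * REGION_MB)) oflag=direct conv=notrunc 2>/dev/null
	            dd if=$DEV bs=1M skip=$((i * REGION_MB)) count=$REGION_MB iflag=direct 2>/dev/null \
	                | cmp - $pat || { echo "job $i: MISCOMPARE at $(date)"; exit 1; }
	        done
	    ) &
	done
	wait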

Comment 5 Manibalan 2014-07-26 03:50:48 UTC
Hi,

We are able to reproduce the data mis-compare issue with a Linux I/O tool (aiodit), attached for your use.

Steps for running AIO on RHEL.

1.	Delete all volumes/containers/LDs.
3.	Create a RAID 6 LD with mdadm, using the following command.
	Eg: “mdadm -C /dev/md0 -c 64 -l 6 -f -n 4 -e 1.2 -z 10G /dev/sdb /dev/sdc /dev/sdd /dev/sde”
4.	Change the resync speed max and min to 100000 (100 MB/s).
		echo "100000" > /sys/block/md0/md/sync_speed_min
		echo "100000" > /sys/block/md0/md/sync_speed_max
5.	mkdir /aiotest
6.	cd /aiotest
7.	Copy “aiodit, do_aio.sh, test.sh and RollingHotSpareTwoDriveFailure-noitx.sh” into the /aiotest folder (all these files are attached to this ticket).

Before going to the next step, ensure the LD is optimal and the time format is GMT (helpful for analysis).
9.	Run “sh do_aio.sh”
	- This script runs 4 instances of "aiodit"; each does random I/O within the given start and end offsets (in GB).
	- In each pass, it writes the configured load, then reads it back and verifies it.
10.	Run “sh RollingHotSpareTwoDriveFailure-noitx.sh”
	- This script simulates a 2-drive failure in the RAID6 array: it randomly removes 2 drives, adds them back one by one, waits until the rebuild completes, and repeats.
	- The script works on md0; this can be changed.

Miscompare Info:

One of the aiodit instances, "./aiodit -d /dev/md0 -b 0 -e 2 -t 1 -i 27 -p 100000 -s -r -x test.sh > aio1.log &", got a data mis-compare.
It does random I/O on md0 from 0 GB to 2 GB, and the maximum I/O size is 27 sectors.

Test started at: Wed, 23 Jul 2014 18:33:42 GMT

	./aiodit: args </dev/md0> begin<0> end<4193792> test-size<2097152> io-size<27> num-pass<100000> mode<r+s> time-stamp<1406140422:936989> disk<md0>
	pid 14561 doing random write cycle for pass 0 on disk /dev/md0 [1572864]

Got miscompare on : Fri, 25 Jul 2014 16:25:20 GMT
	[root@root aio]# tail -f aio1.log
		Firing READ IO lba: 1539743, size: 9 sectors iocount 18001 lba_count 234339 g_io_pending 208 time<1406305461:672849>
		Firing READ IO lba: 306286, size: 3 sectors iocount 21001 lba_count 273372 g_io_pending 136 time<1406305488:252550>
		Firing READ IO lba: 3131038, size: 25 sectors iocount 24001 lba_count 312139 g_io_pending 192 time<1406305515:489132>
		Miscompare: @42930 IO[lba 42930 + i 0 nsectors 10] TIME[DATA 1406140422:936989 LASTWRITE 1406305292:215631 CURRENT 1406305520:859925]
		EXPCTED [SIG:ffffffff21544944 LBA:42930 DISK:md0 PASS:52 STS:1406140422:936989] != GOT [SIG:ffffffff21544944 LBA:42802 DISK:md0 PASS:52 STS:1406140422:936989]

	LBA : 42930 = 42930 * 512 = 21980160 = hex[14F6400]
	
	hexdump of MD0:
		014f6400  44 49 54 21 ff ff ff ff  32 a7 00 00 00 00 00 00  |DIT!....2.......|
		014f6410  6d 64 30 00 00 00 00 00  34 00 00 00 00 00 00 00  |md0.....4.......|

	So,
	Sig  : 44 49 54 21 ff ff ff ff  - OK
	LBA: expected 42930 = hex[a7b2], but got hex[a732] = 42802 - MISCOMPARE
		
And the mis-compare happened during the second drive's rebuild.
	[root@root aio]# tail -f /volumes/RAIDCONF/raid_conf_info.txt
		Fri Jul 25 21:57:29 IST 2014 :  Iteration 2: After Rebuild Started
		Fri Jul 25 21:57:29 IST 2014 : /dev/md0:
				Version : 1.2
		  Creation Time : Wed Jul 23 20:55:54 2014
			 Raid Level : raid6
			 Array Size : 20971520 (20.00 GiB 21.47 GB)
		  Used Dev Size : 10485760 (10.00 GiB 10.74 GB)
		   Raid Devices : 4
		  Total Devices : 4
			Persistence : Superblock is persistent

			Update Time : Fri Jul 25 21:57:26 2014
				  State : clean, degraded, recovering
		 Active Devices : 3
		Working Devices : 4
		 Failed Devices : 0
		  Spare Devices : 1

				 Layout : left-symmetric
			 Chunk Size : 64K

		 Rebuild Status : 8% complete

				   Name : initiator:0
				   UUID : 6639e64c:4bbfaf35:d018922f:2a03e693
				 Events : 630

			Number   Major   Minor   RaidDevice State
			   4       8       22        0      active sync   /dev/sdb6
			   1       8       38        1      active sync   /dev/sdc6
			   5       8       70        2      active sync   /dev/sde6
			   6       8       54        3      spare rebuilding   /dev/sdd6

		
All the logs and scripts used in this setup have been attached.


Thanks,
Manibalan.


Additional Information:

[root@root aio]# ./aiodit --help
Usage: ./aiodit -d </dev/bdev> -b <begin-GB> -e <end-GB> -t <test-size-GB> -i <io-max-size-sectors> -p <passes> ...[-x <exit-script.sh> -s <seq> &|-r <rand> -h <help>]

System Info:
	root@KernelCentOS64 aio-RHEL]# uname -a
	Linux KernelCentOS64 2.6.32-431.5.1.el6.x86_64.debug #1 SMP Wed Feb 12 01:14:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

[root@root /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid6 sdd6[6] sde6[5] sdb6[4] sdc6[1]
      20971520 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

[root@root /]# uname -a
	Linux root 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@root /]# cat /proc/mdstat
	Personalities : [raid6] [raid5] [raid4] [raid1]
	md0 : active raid6 sdd6[6] sde6[5] sdb6[4] sdc6[1]
		  20971520 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

	unused devices: <none>
	
[root@root /]# lsscsi
	[0:0:0:0]    disk    SEAGATE  ST500NM0001      0002  /dev/sda
	[0:0:1:0]    disk    SEAGATE  ST500NM0001      0002  /dev/sdb
	[0:0:2:0]    disk    SEAGATE  ST500NM0001      0002  /dev/sdc
	[0:0:3:0]    disk    SEAGATE  ST500NM0001      0002  /dev/sdd
	[0:0:4:0]    disk    SEAGATE  ST500NM0001      0002  /dev/sde
	[0:0:5:0]    enclosu LSI CORP SAS2X36          0724  -
	
[root@root /]# lspci
	00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
	00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
	00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
	00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22)
	00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
	00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22)
	00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
	00:0e.0 Host bridge: Intel Corporation Device 341c (rev 22)
	00:0e.1 Host bridge: Intel Corporation Device 341d (rev 22)
	00:0e.2 Host bridge: Intel Corporation Device 341e (rev 22)
	00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
	00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22)
	00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
	00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
	00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22)
	00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
	00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
	00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
	00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
	00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
	00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
	00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
	00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
	00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
	00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
	00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller
	00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1
	00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
	01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
	01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
	02:00.0 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01)
	02:00.1 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01)
	03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
	06:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
	06:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02)
	07:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

Comment 6 Manibalan 2014-07-26 03:52:22 UTC
Created attachment 921122 [details]
Attachment for comment 5 - aiodit got the mis-compare

Comment 7 NeilBrown 2014-07-29 05:17:58 UTC
Hi,
the mis-compare shown in comment 5:

	LBA: 42930 = hex[a7b2] Expected, but got is hex[a732] = 42802 - MISCOMPARE

is a single bit error.  '32' instead of 'b2'.

Alternatively, it is a difference of 128 sectors, which is 64K, which is the chunk size.  So it could be returning data from the wrong chunk.

It tried to read from 42930, which is in chunk 335, but got data that was presumably written to 42802 in chunk 334 (or possibly the writes got mixed up instead).
Each stripe has 2 data chunks, an even-numbered one and the following odd-numbered one, so these two chunks are on the same stripe - stripe 167.

From the code:

	pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
	qd_idx = (pd_idx + 1) % raid_disks;
	*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;

stripe2 == 167, raid_disks == 4, so

 pd_idx = 4 - 1 - (167%4) = 4 - 1 - 3 == 0
 qd_idx = (0 + 1) % 4 == 1
 dd_idx = 2, or 3

The two data blocks are on the devices which were failed.
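
As a quick cross-check of that arithmetic, a hypothetical shell calculation for this 4-disk array with 128-sector (64K) chunks:

	# Hypothetical cross-check of the layout arithmetic above (4 disks, 64K chunks).
	lba=42930
	raid_disks=4
	sectors_per_chunk=128
	data_disks=$((raid_disks - 2))

	chunk=$((lba / sectors_per_chunk))                  # 335
	stripe=$((chunk / data_disks))                      # 167
	dpos=$((chunk % data_disks))                        # which data chunk within the stripe
	pd_idx=$((raid_disks - 1 - stripe % raid_disks))    # 0
	qd_idx=$(((pd_idx + 1) % raid_disks))               # 1
	dd_idx=$(((pd_idx + 2 + dpos) % raid_disks))        # 3 (chunk 334 gives 2)
	echo "chunk=$chunk stripe=$stripe pd_idx=$pd_idx qd_idx=$qd_idx dd_idx=$dd_idx"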

The error was detected at Fri Jul 25 16:25:20 GMT 2014

That is about 2 minutes (109 seconds) before:

Fri Jul 25 21:57:09 IST 2014 :	Iteration 2: After Rebuild completed on First drive

(aren't half-hour time zones fun!!!)

which is (based on the script) between 5 and 10 seconds after the rebuild actually completed.  The update time was recorded as 

    Update Time : Fri Jul 25 21:57:03 2014

so I suspect that is when the rebuild of the first device actually finished.

$ TZ=GMT date -d 'Fri Jul 25 21:57:03 IST 2014'
Fri Jul 25 16:27:03 GMT 2014


So it looks to me like the error happened while the first drive was rebuilding,
about 103 seconds before it finished in fact.

The error was detected in stripe 167, and there are 10485760/64 == 163840
stripes on the device so the error wasn't anywhere near where recovery was happening.


It seems that a read from a block on a faulty device received data from a device that was being recovered, but the recovery was happening much later in the array.

In 2.6.32 a read from a recovering device is never attempted - raid6 will only read from fully in-sync devices.  So it looks like it tried to recover one data block from the two parity blocks, but got the other data block.
So somehow ops_run_compute6_2 is producing a bad result, but only when lots of other 2-data recovery operations are ongoing at the same time.

That seems like enough of a hint to find the problem.

Do you have logs from other runs so I can see how robust the patterns are?

Comment 8 NeilBrown 2014-07-29 06:31:35 UTC
Also, could you please report the 'xor:' and 'raid6:' and surrounding lines that are produced in the kernel messages at boot and report which xor/raid6 code is being used.
Thanks.

Comment 9 Manibalan 2014-07-29 10:59:43 UTC
(In reply to NeilBrown from comment #8)
> Also, could you please report the 'xor:' and 'raid6:' and surrounding lines
> that are produced in the kernel messages at boot and report which xor/raid6
> code is being used.

Jul 29 21:54:34 root kernel: async_tx: api initialized (async)
Jul 29 21:54:34 root kernel: xor: automatically using best checksumming function: generic_sse
Jul 29 21:54:34 root kernel:   generic_sse:  9044.000 MB/sec
Jul 29 21:54:34 root kernel: xor: using function: generic_sse (9044.000 MB/sec)
Jul 29 21:54:34 root kernel: iTCO_vendor_support: vendor-support=0
Jul 29 21:54:34 root kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
Jul 29 21:54:34 root kernel: iTCO_wdt: Found a ICH10 TCO device (Version=2, TCOBASE=0x0860)
Jul 29 21:54:34 root kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
Jul 29 21:54:34 root kernel: raid6: sse2x1    5703 MB/s
Jul 29 21:54:34 root kernel: raid6: sse2x2    6828 MB/s
Jul 29 21:54:34 root kernel: raid6: sse2x4    6492 MB/s
Jul 29 21:54:34 root kernel: raid6: using algorithm sse2x2 (6828 MB/s)
Jul 29 21:54:34 root kernel: raid6: using ssse3x2 recovery algorithm

Following is the /var/log/messages output around those lines:

Jul 29 21:54:34 root kernel: scsi 0:0:5:0: Attached scsi generic sg5 type 13
Jul 29 21:54:34 root kernel: ses 0:0:5:0: Attached Enclosure device
Jul 29 21:54:34 root kernel: microcode: CPU0 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: md: bind<sdb3>
Jul 29 21:54:34 root kernel: md: bind<sdb2>
Jul 29 21:54:34 root kernel: md: bind<sdb6>
Jul 29 21:54:34 root kernel: md: bind<sdb1>
Jul 29 21:54:34 root kernel: md: bind<sdb4>
Jul 29 21:54:34 root kernel: md: bind<sdc2>
Jul 29 21:54:34 root kernel: md: bind<sdc3>
Jul 29 21:54:34 root kernel: md: bind<sdc1>
Jul 29 21:54:34 root kernel: md: bind<sdc4>
Jul 29 21:54:34 root kernel: md: bind<sdc6>
Jul 29 21:54:34 root kernel: md: bind<sdd1>
Jul 29 21:54:34 root kernel: md: bind<sdd2>
Jul 29 21:54:34 root kernel: md: bind<sdd3>
Jul 29 21:54:34 root kernel: md: bind<sdd4>
Jul 29 21:54:34 root kernel: microcode: CPU1 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: md: bind<sdd6>
Jul 29 21:54:34 root kernel: microcode: CPU2 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU3 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU4 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: md: bind<sde6>
Jul 29 21:54:34 root kernel: microcode: CPU5 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU6 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU7 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU8 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU9 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU10 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU11 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU12 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU13 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU14 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: microcode: CPU15 sig=0x206c2, pf=0x1, revision=0x10
Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02
Jul 29 21:54:34 root kernel: Microcode Update Driver: v2.00 <tigran.co.uk>, Peter Oruba
Jul 29 21:54:34 root kernel: async_tx: api initialized (async)
Jul 29 21:54:34 root kernel: xor: automatically using best checksumming function: generic_sse
Jul 29 21:54:34 root kernel:   generic_sse:  9044.000 MB/sec
Jul 29 21:54:34 root kernel: xor: using function: generic_sse (9044.000 MB/sec)
Jul 29 21:54:34 root kernel: iTCO_vendor_support: vendor-support=0
Jul 29 21:54:34 root kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
Jul 29 21:54:34 root kernel: iTCO_wdt: Found a ICH10 TCO device (Version=2, TCOBASE=0x0860)
Jul 29 21:54:34 root kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
Jul 29 21:54:34 root kernel: raid6: sse2x1    5703 MB/s
Jul 29 21:54:34 root kernel: raid6: sse2x2    6828 MB/s
Jul 29 21:54:34 root kernel: raid6: sse2x4    6492 MB/s
Jul 29 21:54:34 root kernel: raid6: using algorithm sse2x2 (6828 MB/s)
Jul 29 21:54:34 root kernel: raid6: using ssse3x2 recovery algorithm
Jul 29 21:54:34 root kernel: md: raid6 personality registered for level 6
Jul 29 21:54:34 root kernel: md: raid5 personality registered for level 5
Jul 29 21:54:34 root kernel: md: raid4 personality registered for level 4
Jul 29 21:54:34 root kernel: bio: create slab <bio-1> at 1
Jul 29 21:54:34 root kernel: md/raid:md125: device sde6 operational as raid disk 3
Jul 29 21:54:34 root kernel: md/raid:md125: device sdd6 operational as raid disk 2
Jul 29 21:54:34 root kernel: md/raid:md125: device sdc6 operational as raid disk 1
Jul 29 21:54:34 root kernel: md/raid:md125: device sdb6 operational as raid disk 0
Jul 29 21:54:34 root kernel: md/raid:md125: allocated 4314kB
Jul 29 21:54:34 root kernel: md/raid:md125: raid level 6 active with 4 out of 4 devices, algorithm 2
Jul 29 21:54:34 root kernel: md125: detected capacity change from 0 to 21474836480
Jul 29 21:54:34 root kernel: md125: unknown partition table
Jul 29 21:54:34 root kernel: md: raid1 personality registered for level 1

> Thanks.

Comment 10 Manibalan 2014-07-29 14:49:22 UTC
(In reply to NeilBrown from comment #7)
> Hi,
> the mis-compare shown in comment 5:
> 
> 	LBA: 42930 = hex[a7b2] Expected, but got is hex[a732] = 42802 - MISCOMPARE
> 
> is a single bit error.  '32' instead of 'b2'.
> 
> Alternatively, it is a difference of 128 sectors which is 64K which is the
> chunk size.  So it could be returning data from the wrong chunk.
> 
> It tried to read from 42930 which is in chunk 335, but got data that was
> presumably written to 42802 in chunk 334 (or possibly the writes got mixed
> up instead).
> Each stripe has 2 data chunks so an even numbered one and the following odd
> numbered one, so these two chunks are on the same stripe - stripe 167
> 
> From the code:
> 
> 	pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks);
> 	qd_idx = (pd_idx + 1) % raid_disks;
> 	*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
> 
> stripe2 == 167, raid_disks == 4, so
> 
>  pd_idx = 4 - 1 - (167%4) = 4 - 1 - 3 == 0
>  qd_idx = (0 + 1) % 4 == 1
>  dd_idx = 2, or 3
> 
> The two data blocks are on the devices which were failed.
> 
> The error was detected at Fri Jul 25 16:25:20 GMT 2014
> 
> That is about 2 minutes (109 seconds) before:
> 
> Fri Jul 25 21:57:09 IST 2014 :	Iteration 2: After Rebuild completed on First
> drive
> 
> (aren't half-hour time zones fun!!!)
> 
> which is (based on the script) between 5 and 10 seconds after the rebuild
> actually completed.  The update time was recorded as 
> 
>     Update Time : Fri Jul 25 21:57:03 2014
> 
> so I suspect that is when the rebuild of the first device actually finished.
> 
> $ TZ=GMT date -d 'Fri Jul 25 21:57:03 IST 2014'
> Fri Jul 25 16:27:03 GMT 2014
> 
> 
> So it looks to me like the error happened while the first drive was
> rebuilding,
> about 103 seconds before it finished in fact.
> 
> The error was detected in stripe 167, and there are 10485760/64 == 163840
> stripes on the device so the error wasn't anywhere near where recovery was
> happening.
> 
> 
> It seems that a read from a block on a faulty device received data from a
> device that was being recovered, but the recovery was happening much later
> in the array.
> 
> In 2.6.32 a read from a recovering device is never attempted - raid6 will
> only read from fully in-sync devices.  So it looks like it tried to recovery
> one data block from the two parity blocks, but got the other data block.
> So somehow ops_run_compute6_2 is producing a bad result, but only when lots
> of other 2-data recovery operations  are ongoing at the same time.
> 
> That seems like enough of a hint to find the problem.
> 
> Do you have logs from other runs so I can see how robust the patterns are?

Hi Neil,

Following are a few more mis-compare patterns we got; I have also attached all the logs for each test (compressed file: aio-RHEL-29-Jul-2014).
-----------------------------------------------------------------------------------
Log folder name: aiotest-R6-20k
Resync speed max and min = 20000

Mon Jul 28 00:05:46 IST 2014 : AIO Started
Mon Jul 28 01:46:06 IST 2014 : AIO Stopped
RHS iteration 2, During the first drive rebuild.

Duration :  1h 40 minutes

Miscompare: @11537481 IO[lba 11537481 + i 0 nsectors 49] TIME[DATA 0:0 LASTWRITE 1406508344:393802 CURRENT 1406508366:134762]
EXPCTED [SIG:ffffffff21544944 LBA:11537481 DISK:md0 PASS:5 STS:1406502346:551921] != GOT [SIG:0 LBA:128 DISK: PASS:0 STS:0:0]

160189200  00 00 00 00 00 00 00 00  80 00 00 00 00 00 00 00  |................|
160189210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

--------------------------------------------------------------------------------
Log folder name: aiotest-R6-30k
Resync speed max and min = 30000

Sun Jul 27 19:38:40 IST 2014 : AIO Started = Sun, 27 Jul 2014 18:38:40 GMT
Sun Jul 27 21:52:22 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 20:52:22 GMT
RHS iteration 2, During the first drive rebuild.

Duration :  2h 10 minutes

Miscompare: @8350080 IO[lba 8350023 + i 57 nsectors 58] TIME[DATA 4116891984:1233141 LASTWRITE 1406494269:897097 CURRENT 1406494342:121414]
EXPCTED [SIG:ffffffff21544944 LBA:8350080 DISK:md0 PASS:4 STS:1406486320:965459] != GOT [SIG:1c1c1c1c63fcdbcc LBA:8502045 DISK:▒▒P PASS:9 STS:4116891984:1233141]

fed30000  cc db fc 63 1c 1c 1c 1c  1d bb 81 00 00 00 00 00  |▒▒▒c.....▒......|
fed30010  b7 ac 50 00 00 00 00 00  09 00 00 00 00 00 00 00  |▒▒P.............|

--------------------------------------------------------------------------------
Log folder name: aiotest-R6-50k
Resync speed max and min = 50000

Sun Jul 27 13:43:25 IST 2014 : AIO Started = Sun, 27 Jul 2014 12:43:25 GMT
Sun Jul 27 18:55:15 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 17:55:15 GMT
RHS iteration 5, During the first drive rebuild.

Duration: 5 h 11 minutes

Miscompare: @9712523 IO[lba 9712523 + i 0 nsectors 93] TIME[DATA 1406465005:979025 LASTWRITE 1406483631:662432 CURRENT 1406483715:389030]
EXPCTED [SIG:ffffffff21544944 LBA:9712523 DISK:md0 PASS:19 STS:1406465005:979025] != GOT [SIG:ffffffff21544944 LBA:9712395 DISK:md0 PASS:19 STS:1406465005:979025]

128671600  44 49 54 21 ff ff ff ff  0b 33 94 00 00 00 00 00  |DIT!▒▒▒▒.3......|
128671610  6d 64 30 00 00 00 00 00  13 00 00 00 00 00 00 00  |md0.............|

LBA:       Expected 9712523 [94338B] , Got 9712395 [94330B]

--------------------------------------------------------------------------
Log folder name: aiotest-R6-100k
Resync speed max and min = 100000

Sun Jul 27 11:13:00 IST 2014 : AIO Started = Sun, 27 Jul 2014 10:13:00 GMT
Sun Jul 27 12:50:35 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 11:50:35 GMT
RHS iteration 1, During the first drive rebuild.

Duration : 1 h 35 minutes

Miscompare: @4273718 IO[lba 4273718 + i 0 nsectors 27] TIME[DATA 1406455980:710048 LASTWRITE 1406461821:511449 CURRENT 1406461835:71978]
EXPCTED [SIG:ffffffff21544944 LBA:4273718 DISK:md0 PASS:3 STS:1406455980:710048] != GOT [SIG:ffffffff21544944 LBA:4273846 DISK:md0 PASS:0 STS:1406455980:710048]

826c6c00  44 49 54 21 ff ff ff ff  b6 36 41 00 00 00 00 00  |DIT!▒▒▒▒▒6A.....|
826c6c10  6d 64 30 00 3f 00 00 00  00 00 00 00 00 00 00 00  |md0.?...........|
826c6c20  ac d0 d4 53 00 00 00 00  a0 d5 0a 00 00 00 00 00  |▒▒▒S....▒▒......|
826c6c30  06 d9 d4 53 00 00 00 00  8b 7e 04 00 00 00 00 00  |.▒▒S.....~......|
826c6c40  00 00 00 00 00 00 00 00  00 b0 00 00 00 00 00 00  |.........▒......|
826c6c50  f0 18 00 00 00 00 00 00  28 76 b5 b4 3f 00 00 00  |▒.......(v▒▒?...|
826c6c60  40 80 00 02 00 00 00 00  40 80 00 02 00 00 00 00  |@.......@.......|
826c6c70  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

LBA:     Expected 4273718 [36 36 41] , Got 4273846 [b6 36 41]
PASS:    Expected 3, got 0

Comment 11 Manibalan 2014-07-29 14:54:09 UTC
Created attachment 922175 [details]
Attachment for comment 10

Few more mis-compare pattern

Comment 12 NeilBrown 2014-07-29 23:17:01 UTC
aiotest-R6-50k
read from 9712523, got data from 9712395, the previous chunk.

There is something odd with the time stamps.
The "Miscompare" line says the CURRENT time is 1406483715
which is (TZ=GMT date -d @1406483715)
   Sun Jul 27 17:55:15 GMT 2014

The modify time on that file (TZ=GMT ls -l aio3.log) is
   Jul 27 13:56
4 hours *before* the time data was written to it.
The last entry in raid_conf_info.txt is
$ TZ=GMT date -d 'Sun Jul 27 18:55:15 IST 2014'
Sun Jul 27 13:25:15 GMT 2014

Is "IST" GMT+6:30 ??

It's confusing, but it looks like device 2 was rebuilding
and device 3 was removed.
Rebuild started 11 minutes before the error and takes about
half an hour, so it was probably 1/3 of the way through.
The read was stripe 37939 out of 163840 stripes so 23%.
The read might have been near the recovery, but cannot be sure.

0 1 were in-sync, 2 was recovering, 3 was missing
Chunks 75878 and 75879 on stripe 37939
pd_idx==0, qd_idx==1, data on 2 and 3.

So similar pattern to earlier except we were reading from
device 3 and got device 2 data rather than the reverse.

aiotest-R6-100k
read from 4273718, got data from 4273846, the next chunk
Recovery was 33 minutes into rebuilding device 0 while device 3 was missing.
Chunks 33388 and 33389 on stripe 16694.
pd_idx = 1, qd_idx = 2, data on 3 and 0
So once again, both data disks are non-sync, and one of them is being recovered.
Unfortunately cannot tell how far through the recovery was.

(if test.sh could collect the "mdadm -D" when the aio stops, that might help a little).

So it seems that when lots of two-blocks-missing stripes are being recovered, reading from a double-data-missing stripe can sometimes return the wrong block.

Can someone point me at the exact raid5 source code that this kernel is using? 
I think you said elsewhere that the bug happens on mainline too, but it's still good to look at the exact code being tested.
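
A hypothetical version of the exit hook suggested above (test.sh is the script passed to aiodit via -x in comment 5; the log path matches the raid_conf_info.txt used there):

	#!/bin/sh
	# Hypothetical test.sh: capture the array state the moment aiodit stops.
	{
	    echo "==== $(TZ=GMT date) : aiodit exited ===="
	    mdadm -D /dev/md0
	    cat /proc/mdstat
	} >> /volumes/RAIDCONF/raid_conf_info.txt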

Comment 13 Narayan Balakrishnan 2014-07-30 10:10:40 UTC
Hi Neil, 
I am Manibalan's colleague. I have a question: since the tool aiodit writes a specific data pattern, is it possible to put in some instrumentation, perhaps in ops_run_compute6_2 or another function, to see whether the data block we are computing/reading is what we expect? Maybe that way we can narrow down the problem.
Also, do you think trying a different xor/raid6 algorithm will help?
We will share the raid5 source code shortly - but it is the standard kernel in RHEL 6.5....
Thanks,
Narayan

Comment 14 Manibalan 2014-07-30 10:32:36 UTC
(In reply to NeilBrown from comment #12)
> aiotest-R6-50k
> read from 9712523, got data from 9712395, the previous chunk.
> 
> There is something odd with the time stamps.
> The "Miscompare" line says the CURRENT time is 1406483715
> which is (TZ=GMT date -d @1406483715)
>    Sun Jul 27 17:55:15 GMT 2014
> 
> The modify time on that file (TZ=GMT ls -l aio3.log) is
>    Jul 27 13:56
> 4 hours *before* the time data was written to it.
> The last entry in raid_conf_info.txt is
> $ TZ=GMT date -d 'Sun Jul 27 18:55:15 IST 2014'
> Sun Jul 27 13:25:15 GMT 2014
> 
> Is "IST" GMT+6:30 ??
> 
> It's confusing, but it looks like device 2 was rebuilding
> and device 3 was removed.
> Rebuild started 11 minutes before the error and takes about
> half an hour, so it was probably 1/3 of the way through.
> The read was stripe 37939 out of 163840 stripes so 23%.
> The read might have been near the recovery, but cannot be sure.
> 
> 0 1 were in-sync, 2 was recovering, 3 was missing
> Chunks 75878 and 75879 on stripe 37939
> pd_idx==0, qd_idx==1, data on 2 and 3.
> 
> So similar pattern to earlier except we were reading from
> device 3 and got device 2 data rather than the reverse.
> 
> aiotest-R6-100k
> read from 4273718, got data from 4273846, the next chunk
> Recovery was 33 minutes into rebuilding device 0 while device 3 was missing.
> Chunks 33388 and 33389 on stripe 16694.
> pd_idx = 1 pd_idx=2 data on 3 and 0
> So once again, both data disks are non-sync, and one of them is being
> recovered.
> Unfortunately cannot tell how far through the recovery was.
> 
> (if test.sh could collect the "mdadm -D" when the aio stops, that might help
> a little).
> 
> So it seems that: when lots of to-blocks-missing stripes are being
> recovered, reading from a double-data-missing stripe can sometimes return
> the wrong block.
> 
> Can someone point me at the exact raid5 source code that this kernel is
> using? 
> I think you said elsewhere that the bug happen on mainline too, but it's
> still good to look at the exact code being tested.

Hi Neil,

I have attached the raid5.c and raid5.h files (in the md.6.5.rar file).

Also, I tried to check for the following patch in this raid5.c code. Since most of it has been refactored, I am not able to follow the code well enough to verify whether this patch is present or not. Please clarify this patch for me.

http://board.issociate.de/thread/508107/rhel5-raid6-corruption.html 
----
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..f8cd6ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
 			BUG();
 		case 1:
 			compute_block_1(sh, r6s->failed_num[0], 0);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[0]].flags);
 			break;
 		case 2:
 			compute_block_2(sh, r6s->failed_num[0],
 					r6s->failed_num[1]);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[0]].flags);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[1]].flags);
 			break;
 		default: /* This request should have been failed? */
 			BUG();


diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8cd6ef..83f83cd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2466,8 +2466,6 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
 	if (s->locked == disks)
 		if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
 			atomic_inc(&conf->pending_full_writes);
-	/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
-	set_bit(STRIPE_INSYNC, &sh->state);
 
 	if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 		atomic_dec(&conf->preread_active_stripes);

----

Thanks,
Manibalan.

Comment 15 Manibalan 2014-07-30 10:35:00 UTC
Created attachment 922482 [details]
Attachment for comment 13

contains raid5.c and raid5.h code for linux-2.6.32-431.el6

Comment 16 NeilBrown 2014-07-30 10:39:53 UTC
Hi Narayan.
It's possible that instrumenting the kernel as you suggest might help, but it would be hard to know within the kernel whether the block being read has been written in the current run (and so should have known data), or whether it still holds whatever it was initialised with.
I guess you could write a similar signature to every block at the start so that every computed data block should have some signature.  However, determining when to test would be tricky.  We know we get the wrong data - without a theory as to why, I cannot suggest anywhere worth watching...

I hope a different raid6 algorithm wouldn't make a difference...  It might make a difference if you have xor-capable DMA hardware so the xor and raid6 calculation wasn't being done by the CPU, but I don't think you do.

I realise the code is just standard RHEL, but I don't know much about RHEL and don't know where to look to find that code.  I figured someone here would be able to tell me.

Comment 17 NeilBrown 2014-07-30 11:10:25 UTC
The first patch you quote was wrong.
The second patch you quote, the correct one, is in the raid5.c you gave me.

That raid5.c code is *very* different from upstream 2.6.32.

When reading from a device that is being recovered, that code will read from the device if the recovery has already passed that point - it won't try to reconstruct.

So my analysis in comment #7 isn't quite right.  When reading the block that caused the mis-match, it would have read from all the other devices, performed a single-data-block recovery, and returned the result.  This recovery would have been done with a simple xor.
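
A simplified sketch of that behaviour, using upstream md field names rather than the literal RHEL 6.5 code (so treat the exact names as assumptions): a rebuilding device is treated as in-sync for stripes the recovery has already covered, otherwise the block is reconstructed from the surviving devices.

	/* Sketch only -- not the literal RHEL 6.5 raid5.c.  Decide whether a
	 * member device can service a read for this stripe directly. */
	if (rdev && test_bit(In_sync, &rdev->flags))
		set_bit(R5_Insync, &dev->flags);
	else if (rdev && sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
		/* the rebuild has already passed this stripe, so the on-disk
		 * block is valid even though the device is still recovering */
		set_bit(R5_Insync, &dev->flags);

	if (test_bit(R5_Insync, &dev->flags)) {
		/* read the block straight from the member disk */
		set_bit(R5_LOCKED, &dev->flags);
		set_bit(R5_Wantread, &dev->flags);
	} else {
		/* reconstruct the block from the surviving devices
		 * (a plain xor when only one data block is missing) */
		set_bit(R5_Wantcompute, &dev->flags);
	}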

In comment 5 where it says:
       hexdump of MD0:

presumably that was run after the mis-compare was reported and everything was stopped. This suggests there is bad data on the device; it wasn't made bad while reading.  I guess it must have been corrupted when writing....

Comment 18 NeilBrown 2014-07-31 00:38:21 UTC
Just FYI, I have been running your test script overnight and just got a data corruption reported.  I haven't examined it closely yet, but I will.

Is the source code for "aiodit" publicly available?  I had a look around but couldn't find anything.  It might be helpful.

Comment 19 NeilBrown 2014-07-31 05:46:07 UTC
I've been staring at my test results and getting nowhere.
So I thought I'd write down what I do know so at least I have a record for next time I look.

Failure was reported at 08:57:55 while the "second" disk was rebuilding.
lba=6386506
chunk=6236 (I'm using 512K chunks)
stripe=3118
So layout of this stripe is   D1 P Q D0
D0 was rebuilding and had the error.
D1 was the previous disk to rebuild (while only P and Q were working)
D1 has the correct signature and content.
The errors in D0 were that the lsbs of every 32-bit word were all flipped.
In the P block, the lsb of each word was 1.  Other bits in P were correct.

The timestamp in D1 of when it was written was 06:13:55
The timestamp in D0 (which might be corrupt) was 04:41:36

At 6:13 it was nearly finished rebuilding the P drive while Q drive was missing.
At 4:41 it was rebuilding D1 drive while D0 was missing.

I suspect the corruption happened before or when D1 was written.
As the D0 drive was missing we are depending on P to restore the D0 block.
When the D0 and D1 devices were removed, P and Q were able to restore D1
correctly, but not D0.

I cannot figure out the significance of all the '1' bits in the P block.
Maybe it is an artefact of the Q calculation when something is wrong, but as Q is calculated a byte at a time, that doesn't seem likely.

I guess I'll just wait until it happens again and look for more patterns then.
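
As a side note, the stripe geometry quoted in this and earlier comments can be reproduced with a small user-space sketch, assuming the array uses md's default RAID6 left-symmetric layout (an assumption; "mdadm -D" would confirm the actual layout):

	#include <stdio.h>

	int main(void)
	{
		int raid_disks = 4;			/* 4-device RAID6: 2 data + P + Q */
		long stripes[] = { 3118, 16694 };	/* stripes discussed in this bug */

		for (int k = 0; k < 2; k++) {
			long stripe = stripes[k];
			int pd = raid_disks - 1 - (int)(stripe % raid_disks);	/* P device */
			int qd = (pd + 1) % raid_disks;				/* Q device */
			int d0 = (pd + 2) % raid_disks;				/* first data block */
			int d1 = (pd + 3) % raid_disks;				/* second data block */

			printf("stripe %ld: pd_idx=%d qd_idx=%d data on %d and %d\n",
			       stripe, pd, qd, d0, d1);
		}
		return 0;
	}

For stripe 3118 this prints pd_idx=1, qd_idx=2 with data on devices 3 and 0, i.e. the D1 P Q D0 layout described above.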

Comment 20 Narayan Balakrishnan 2014-07-31 05:57:53 UTC
Neil,
aiodit is a tool we wrote. We can share the source with you.
If it helps, we had done an experiment on our systems some time back where we ran the same script with just one drive failure, i.e. in a RAID 6 we would remove one disk, add it back later, and let it rebuild. That test passed without any problems.
So, even though the miscompare was found when the second drive was rebuilding, I feel the corruption is happening much earlier, during the first drive rebuild itself.

Comment 21 Manibalan 2014-07-31 06:16:49 UTC
(In reply to NeilBrown from comment #18)
> Just FYI I have been running you test script over night and just got a data
> corruption reported.  I haven't examined closely yet, but I will.
> 
> If the source code for "aiodit" publicly available?  I had a look around but
> couldn't find anything.  It might be helpful.

Hi Neil,
The aiodit.c file has been attached for your reference.
Thanks,
Manibalan.

Comment 22 Manibalan 2014-07-31 06:18:57 UTC
Created attachment 922826 [details]
Attachment for comment 21

aiodit.c

Comment 23 Narayan Balakrishnan 2014-07-31 07:28:49 UTC
Neil,
Here is one analysis. If you look, the expected and actual LBAs were in the same stripe as well. But the disk corresponding to the LBA whose data we expect is the one that is rebuilding, while the LBA we got the data from is on an active disk. We checked the history of drive removals in that test, and disk 1 and disk 2 were never removed in the same pass in any of the cycles.

Is it possible that we had two outstanding IOs to the same stripe that was getting rebuilt, and that resulted in corruption? Also, what would Q be if I were to get the same data from D2 as from D1 (not sure if D1 is also corrupted, as the system is not in the same state) -- I suspect Q should be all 0s then...

We can try dumping the entire IO pattern for the pass where we hit the corruption, to see whether the two IOs were written simultaneously... I am sure they were written in the same pass, as the pass numbers match...

Miscompare: @9712523 IO[lba 9712523 + i 0 nsectors 93] TIME[DATA 1406465005:979025 LASTWRITE 1406483631:662432 CURRENT 1406483715:389030]
EXPCTED [SIG:ffffffff21544944 LBA:9712523 DISK:md0 PASS:19 STS:1406465005:979025] != GOT [SIG:ffffffff21544944 LBA:9712395 DISK:md0 PASS:19 STS:1406465005:979025]

Expected:
LBA:9712523 chunk: 75879 stripe: 37939

Got:
LBA:9712395 chunk: 75878 stripe: 37939

pd_idx = 4 - 1 - (37939%4) = 3
qd_idx = (3 + 1) % 4 = 0
dd_idx = 1, 2


    Number   Major   Minor   RaidDevice State
q      4       8      134        0     active sync   /dev/sdi6
d1     5       8      150       1      active sync   /dev/sdj6
d2     6       8      182       2      spare rebuilding   /dev/sdl6
p      3       0        0        3     removed
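
On the question above about what Q would look like if two data blocks held identical content: a quick sketch using md's RAID6 conventions (generator g = 2 over GF(2^8) with the 0x11d polynomial, Q as the sum of g^i * D_i) suggests the identical blocks cancel in P but not in Q, so Q would generally not be all zeros. The index assignments here are illustrative only, not taken from the array in question:

	#include <stdio.h>

	/* GF(2^8) multiply with the RAID6 polynomial 0x11d */
	static unsigned char gf_mul(unsigned char a, unsigned char b)
	{
		unsigned char p = 0;

		while (b) {
			if (b & 1)
				p ^= a;
			a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
			b >>= 1;
		}
		return p;
	}

	int main(void)
	{
		unsigned char d = 0x44;				/* same byte in both data blocks */
		unsigned char g1 = 2, g2 = gf_mul(2, 2);	/* g^1 and g^2 */
		unsigned char P = d ^ d;			/* contributions cancel: 0 */
		unsigned char Q = gf_mul(g1, d) ^ gf_mul(g2, d);/* does not cancel */

		printf("P contribution = 0x%02x, Q contribution = 0x%02x\n", P, Q);
		return 0;
	}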

Comment 24 NeilBrown 2014-08-11 22:24:16 UTC
I think I've found the problem.  I ran a test last night, but I had the fix wrong and it still failed :-(  So I'm now starting another run.
If you could test as well, that would be great.

The problem occurs when the array is doubly degraded and a stripe_head is being recovered and written to at the same time.
If this stripe_head has 2 data blocks missing (i.e. both P and Q are intact), and one of the data blocks is being written while the device that stores the other data block is being recovered, then that recovered block can sometimes not be written, i.e. the recovery misses some blocks.

The problem is caused by:

				if (!test_bit(R5_Insync, &dev->flags) ||
				    ((i == sh->pd_idx || i == sh->qd_idx)  &&
				     s.failed == 0))
					set_bit(STRIPE_INSYNC, &sh->state);

in handle_stripe().
This says "if we are writing to a device that was not Insync, then the stripe must now be fully uptodate.  This is true for RAID5 and for singly-degraded RAID6, but not for doubly degraded RAID6.
Just before that test, insert

				if (s.failed > 1)
					continue;

This should make sure the full recovery happens properly.
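
In context, the check with the suggested guard in place would look roughly like this (a sketch assembled from the snippets quoted in this comment; the surrounding handle_stripe() code is omitted):

				/* inside handle_stripe(), in the branch that handles a
				 * write to a device which is not in-sync */
				if (s.failed > 1)
					/* doubly-degraded RAID6: the stripe is not
					 * necessarily in-sync after this write, so let
					 * the recovery rewrite the other missing block */
					continue;
				if (!test_bit(R5_Insync, &dev->flags) ||
				    ((i == sh->pd_idx || i == sh->qd_idx)  &&
				     s.failed == 0))
					set_bit(STRIPE_INSYNC, &sh->state);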

If you can test that fix I would really appreciate it.

Thanks.

Comment 25 Narayan Balakrishnan 2014-08-12 09:53:11 UTC
Great find! Thanks for being at it.
Yes, we have started running it in a few systems here. We will get back to you with the results.

Comment 26 Jes Sorensen 2014-08-12 14:55:14 UTC
(In reply to NeilBrown from comment #16)
> I realise the code is just standard RHEL, but I don't know much about RHEL
> and don't know where to look to find that code.  I figured someone here
> would be able to tell me.

Neil,

Sorry, I still need to catch up here. Should you need it in the future, the RHEL6
source RPMs can be found here:
ftp://ftp.redhat.com/redhat/linux/enterprise/6Server/en/os/SRPMS/

RHEL7 should be here:
https://git.centos.org/project/rpms

I am not aware of there being git access to RHEL6 unfortunately.

Jes

Comment 27 NeilBrown 2014-08-15 05:32:53 UTC
(In reply to Narayan Balakrishnan from comment #25)
> Great find! Thanks for being at it.
> Yes, we have started running it in a few systems here. We will get back to
> you with the results.

Any success to report?  I'd like to publish the patch soon, and I'd like it to include a "Tested-by:" from you.
(My own testing has been quite positive).

Comment 28 Manibalan 2014-08-18 06:45:45 UTC
(In reply to NeilBrown from comment #27)
> (In reply to Narayan Balakrishnan from comment #25)
> > Great find! Thanks for being at it.
> > Yes, we have started running it in a few systems here. We will get back to
> > you with the results.
> 
> Any success to report?  I'd like to publish the patch soon, and I'd like it
> to include a "Tested-by:" from you.
> (My own testing has been quite positive).

Hi Neil,
We have completed the testing. We performed it on 10+ setups for more than 5 days, and all the setups are holding good.
This fix is working fine.

Once again, thanks for the fix. Great finding.

Regards,
Manibalan.

Comment 29 NeilBrown 2014-08-18 07:05:49 UTC
Thanks for the confirmation, and for the original report.
The fix will be going upstream shortly.

Comment 33 Rafael Aquini 2014-08-25 19:39:26 UTC
Patch(es) available on kernel-2.6.32-498.el6

Comment 37 errata-xmlrpc 2014-10-14 06:05:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html

