Bug 1090423
Summary: Data integrity issue on rebuilding RAID 6 with 100MB resync speed
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.5
Hardware: x86_64
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Manibalan <manibalan.void>
Assignee: Jes Sorensen <Jes.Sorensen>
QA Contact: XiaoNi <xni>
CC: bdonahue, dhoward, dledford, manibalan.void, mgoodwin, msvoboda, nmurray, pasik, salmy, tlavigne, xni
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Fixed In Version: kernel-2.6.32-498.el6
Doc Type: Bug Fix
Doc Text:
Previously, recovery of a double-degraded RAID6 array could, under certain circumstances, result in data corruption. This could happen because the md driver was using an optimization that is safe to use only for single-degraded arrays. This update ensures that this optimization is skipped during the recovery of double-degraded RAID6 arrays.
Clones: 1130905, 1131395, 1131409, 1131410 (view as bug list)
Bug Blocks: 1130905, 1131395, 1131409, 1131410
Type: Bug
Last Closed: 2014-10-14 06:05:30 UTC
Description
Manibalan
2014-04-23 09:56:58 UTC
Can you add lsmod and lspci output to this? (In reply to Dan Williams from comment #2) > Can you add lsmod and lspci output to this? [root@root ~]# lsmod Module Size Used by iptable_mangle 3349 0 iptable_nat 6158 0 nf_nat 22759 1 iptable_nat raid456 76573 1 async_raid6_recov 6555 1 raid456 async_pq 4775 2 raid456,async_raid6_recov raid6_pq 90821 2 async_raid6_recov,async_pq async_xor 3691 3 raid456,async_raid6_recov,async_pq xor 8738 1 async_xor async_memcpy 2236 2 raid456,async_raid6_recov async_tx 3027 5 raid456,async_raid6_recov,async_pq,async_xor,async_memcpy autofs4 26513 3 cpufreq_ondemand 10544 16 acpi_cpufreq 7763 1 freq_table 4936 2 cpufreq_ondemand,acpi_cpufreq mperf 1557 1 acpi_cpufreq ipt_REJECT 2351 0 nf_conntrack_ipv4 9506 3 iptable_nat,nf_nat nf_defrag_ipv4 1483 1 nf_conntrack_ipv4 iptable_filter 2793 0 ip_tables 17831 3 iptable_mangle,iptable_nat,iptable_filter ip6t_REJECT 4628 2 nf_conntrack_ipv6 8748 2 nf_defrag_ipv6 11182 1 nf_conntrack_ipv6 xt_state 1492 2 nf_conntrack 79758 5 iptable_nat,nf_nat,nf_conntrack_ipv4,nf_conntrack_ipv6,xt_state ip6table_filter 2889 1 ip6_tables 18732 1 ip6table_filter ipv6 317340 77 ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6 microcode 112685 0 iTCO_wdt 7115 0 iTCO_vendor_support 3056 1 iTCO_wdt sg 29350 0 ses 6475 0 enclosure 8438 1 ses ixgbe 243968 0 mdio 4769 1 ixgbe igb 197536 0 i2c_algo_bit 5935 1 igb ptp 9614 2 ixgbe,igb pps_core 11458 1 ptp serio_raw 4594 0 i2c_i801 11359 0 i2c_core 31084 3 igb,i2c_algo_bit,i2c_i801 lpc_ich 12803 0 mfd_core 1895 1 lpc_ich ioatdma 58130 24 dca 7101 3 ixgbe,igb,ioatdma i7core_edac 17948 0 edac_core 46581 1 i7core_edac shpchp 32778 0 ext4 374902 3 jbd2 93427 1 ext4 mbcache 8193 1 ext4 sd_mod 39069 11 crc_t10dif 1541 1 sd_mod pata_acpi 3701 0 ata_generic 3837 0 ata_piix 24601 0 mpt2sas 186848 6 scsi_transport_sas 35652 1 mpt2sas raid_class 4516 1 mpt2sas dm_mirror 14384 0 dm_region_hash 12085 1 dm_mirror dm_log 9930 2 dm_mirror,dm_region_hash dm_mod 84209 11 dm_mirror,dm_log 
[root@root ~]# lspci 00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22) 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) 00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22) 00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22) 00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22) 00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22) 00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22) 00:0e.0 Host bridge: Intel Corporation Device 341c (rev 22) 00:0e.1 Host bridge: Intel Corporation Device 341d (rev 22) 00:0e.2 Host bridge: Intel Corporation Device 341e (rev 22) 00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22) 00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22) 00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22) 00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22) 00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22) 00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset 
QuickData Technology Device (rev 22) 00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) 00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller 00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1 00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller 01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 02:00.0 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01) 02:00.1 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01) 03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02) 06:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 06:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) 07:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. 
MGA G200eW WPCM450 (rev 0a) [root@root ~]# Created attachment 913989 [details]
storage integrity checker - CAVEAT: THIS DESTROYS ALL DATA ON THE NAMED DEVICE
I have attached 'integrity.c' to this bug. This program creates a number of parallel threads, each of which writes random data to a different offset of the specified device or file, and then verifies what was written. If the verification succeeds, each thread repeats this forever. NOTE THAT IT OVERWRITES ALL DATA ON THE NAMED DEVICE, including metadata and partition tables - use with extreme caution. There is a large comment at the start of the file explaining its usage. Compile: cc -o integrity integrity.c -lpthread
I have been running this program against an 8-disk software raid6 60GB device, deliberately failing two drives and then adding them back in to cause a re-sync. I used 19 threads, each testing 3GB at sequential offsets in the 60GB raid6 device. I've run this for days and it has not failed once. Note that I do not have any iSCSI remote devices or Microsoft filesystems in this test environment - it is just testing the RHEL6 raid6 driver and the underlying scsi devices. I have also tried various re-sync speeds, as per the repro description.
Hi, we are able to reproduce the data-miscompare issue with a Linux AIO tool (aiodit), attached for your use.

Steps for running AIO on RHEL:

1. Delete all volumes/containers/LDs.
2. Create a RAID 6 LD with mdadm, e.g.:
   mdadm -C /dev/md0 -c 64 -l 6 -f -n 4 -e 1.2 -z 10G /dev/sdb /dev/sdc /dev/sdd /dev/sde
3. Set the resync speed max and min to 100K:
   echo "100000" > /sys/block/md0/md/sync_speed_min
   echo "100000" > /sys/block/md0/md/sync_speed_max
4. mkdir /aiotest
5. cd /aiotest
6. Copy aiodit, do_aio.sh, test.sh and RollingHotSpareTwoDriveFailure-noitx.sh into the /aiotest folder (all of these files are attached to this ticket). Before going to the next step, ensure the LD is optimal and the time format is GMT (helpful for analysis).
7. Run "sh do_aio.sh" - this script runs 4 instances of aiodit; each does random I/O within the given start and end offsets (in GB). In each pass it writes the configured load, then reads it back and verifies it.
8. Run "sh RollingHotSpareTwoDriveFailure-noitx.sh" - this script simulates a 2-drive failure in the RAID6 array: it randomly removes 2 drives, adds them back one by one, waits until the rebuild completes, and repeats. The script works on md0 (configurable).

Miscompare info:

One of the aiodit instances, "./aiodit -d /dev/md0 -b 0 -e 2 -t 1 -i 27 -p 100000 -s -r -x test.sh > aio1.log &", got a data miscompare. It does random I/O on md0 from 0GB to 2GB, with a maximum I/O size of 27 sectors.
Test started: Wed, 23 Jul 2014 18:33:42 GMT

./aiodit: args </dev/md0> begin<0> end<4193792> test-size<2097152> io-size<27> num-pass<100000> mode<r+s> time-stamp<1406140422:936989> disk<md0> pid 14561
doing random write cycle for pass 0 on disk /dev/md0 [1572864]

Miscompare hit: Fri, 25 Jul 2014 16:25:20 GMT

[root@root aio]# tail -f aio1.log
Firing READ IO lba: 1539743, size: 9 sectors iocount 18001 lba_count 234339 g_io_pending 208 time<1406305461:672849>
Firing READ IO lba: 306286, size: 3 sectors iocount 21001 lba_count 273372 g_io_pending 136 time<1406305488:252550>
Firing READ IO lba: 3131038, size: 25 sectors iocount 24001 lba_count 312139 g_io_pending 192 time<1406305515:489132>
Miscompare: @42930 IO[lba 42930 + i 0 nsectors 10] TIME[DATA 1406140422:936989 LASTWRITE 1406305292:215631 CURRENT 1406305520:859925] EXPCTED [SIG:ffffffff21544944 LBA:42930 DISK:md0 PASS:52 STS:1406140422:936989] != GOT [SIG:ffffffff21544944 LBA:42802 DISK:md0 PASS:52 STS:1406140422:936989]

LBA 42930 * 512 = 21980160 = hex[14F6400]

hexdump of md0:
014f6400 44 49 54 21 ff ff ff ff 32 a7 00 00 00 00 00 00 |DIT!....2.......|
014f6410 6d 64 30 00 00 00 00 00 34 00 00 00 00 00 00 00 |md0.....4.......|

So, Sig: 44 49 54 21 ff ff ff ff - OK
LBA: expected 42930 = hex[a7b2], but got hex[a732] = 42802 - MISCOMPARE

The miscompare happened during the second drive rebuild.
[root@root aio]# tail -f /volumes/RAIDCONF/raid_conf_info.txt
Fri Jul 25 21:57:29 IST 2014 : Iteration 2: After Rebuild Started
Fri Jul 25 21:57:29 IST 2014 :
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 23 20:55:54 2014
     Raid Level : raid6
     Array Size : 20971520 (20.00 GiB 21.47 GB)
  Used Dev Size : 10485760 (10.00 GiB 10.74 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
    Update Time : Fri Jul 25 21:57:26 2014
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 64K
 Rebuild Status : 8% complete
           Name : initiator:0
           UUID : 6639e64c:4bbfaf35:d018922f:2a03e693
         Events : 630

    Number   Major   Minor   RaidDevice State
       4       8       22        0      active sync   /dev/sdb6
       1       8       38        1      active sync   /dev/sdc6
       5       8       70        2      active sync   /dev/sde6
       6       8       54        3      spare rebuilding   /dev/sdd6

All the logs and scripts used in this setup have been attached.
Thanks, Manibalan.

Additional Information:
[root@root aio]# ./aiodit --help
Usage: ./aiodit -d </dev/bdev> -b <begin-GB> -e <end-GB> -t <test-size-GB> -i <io-max-size-sectors> -p <passes> ...[-x <exit-script.sh> -s <seq> &|-r <rand> -h <help>]

System Info:
root@KernelCentOS64 aio-RHEL]# uname -a
Linux KernelCentOS64 2.6.32-431.5.1.el6.x86_64.debug #1 SMP Wed Feb 12 01:14:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@root /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid6 sdd6[6] sde6[5] sdb6[4] sdc6[1]
      20971520 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
[root@root /]# uname -a
Linux root 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@root /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid6 sdd6[6] sde6[5] sdb6[4] sdc6[1]
      20971520 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
[root@root /]# lsscsi
[0:0:0:0] disk SEAGATE ST500NM0001 0002 /dev/sda
[0:0:1:0] disk SEAGATE
ST500NM0001 0002 /dev/sdb [0:0:2:0] disk SEAGATE ST500NM0001 0002 /dev/sdc [0:0:3:0] disk SEAGATE ST500NM0001 0002 /dev/sdd [0:0:4:0] disk SEAGATE ST500NM0001 0002 /dev/sde [0:0:5:0] enclosu LSI CORP SAS2X36 0724 - [root@root /]# lspci 00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22) 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) 00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22) 00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 (rev 22) 00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22) 00:08.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 8 (rev 22) 00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22) 00:0e.0 Host bridge: Intel Corporation Device 341c (rev 22) 00:0e.1 Host bridge: Intel Corporation Device 341d (rev 22) 00:0e.2 Host bridge: Intel Corporation Device 341e (rev 22) 00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22) 00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22) 00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22) 00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22) 00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22) 00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.4 System peripheral: Intel Corporation 5520/5500/X58 
Chipset QuickData Technology Device (rev 22) 00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22) 00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4 00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5 00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6 00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2 00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1 00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2 00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3 00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) 00:1f.0 ISA bridge: Intel Corporation 82801JIB (ICH10) LPC Interface Controller 00:1f.2 IDE interface: Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller #1 00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller 01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 02:00.0 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01) 02:00.1 Ethernet controller: Intel Corporation 10 Gigabit BR KX4 Dual Port Network Connection (rev 01) 03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03) 06:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network Connection (rev 02) 06:00.1 Ethernet controller: 
Intel Corporation 82575EB Gigabit Network Connection (rev 02) 07:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a) Created attachment 921122 [details] Attachement for comment 5 - aiodit got the mis-compare Hi, the mis-compare shown in comment 5: LBA: 42930 = hex[a7b2] Expected, but got is hex[a732] = 42802 - MISCOMPARE is a single bit error. '32' instead of 'b2'. Alternatively, it is a difference of 128 sectors which is 64K which is the chunk size. So it could be returning data from the wrong chunk. It tried to read from 42930 which is in chunk 335, but got data that was presumably written to 42802 in chunk 334 (or possibly the writes got mixed up instead). Each stripe has 2 data chunks so an even numbered one and the following odd numbered one, so these two chunks are on the same stripe - stripe 167 From the code: pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks); qd_idx = (pd_idx + 1) % raid_disks; *dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks; stripe2 == 167, raid_disks == 4, so pd_idx = 4 - 1 - (167%4) = 4 - 1 - 3 == 0 qd_idx = (0 + 1) % 4 == 1 dd_idx = 2, or 3 The two data blocks are on the devices which were failed. The error was detected at Fri Jul 25 16:25:20 GMT 2014 That is about 2 minutes (109 seconds) before: Fri Jul 25 21:57:09 IST 2014 : Iteration 2: After Rebuild completed on First drive (aren't half-hour time zones fun!!!) which is (based on the script) between 5 and 10 seconds after the rebuild actually completed. The update time was recorded as Update Time : Fri Jul 25 21:57:03 2014 so I suspect that is when the rebuild of the first device actually finished. $ TZ=GMT date -d 'Fri Jul 25 21:57:03 IST 2014' Fri Jul 25 16:27:03 GMT 2014 So it looks to me like the error happened while the first drive was rebuilding, about 103 seconds before it finished in fact. 
The error was detected in stripe 167, and there are 10485760/64 == 163840 stripes on the device so the error wasn't anywhere near where recovery was happening. It seems that a read from a block on a faulty device received data from a device that was being recovered, but the recovery was happening much later in the array. In 2.6.32 a read from a recovering device is never attempted - raid6 will only read from fully in-sync devices. So it looks like it tried to recovery one data block from the two parity blocks, but got the other data block. So somehow ops_run_compute6_2 is producing a bad result, but only when lots of other 2-data recovery operations are ongoing at the same time. That seems like enough of a hint to find the problem. Do you have logs from other runs so I can see how robust the patterns are? Also, could you please report the 'xor:' and 'raid6:' and surrounding lines that are produced in the kernel messages at boot and report which xor/raid6 code is being used. Thanks. (In reply to NeilBrown from comment #8) > Also, could you please report the 'xor:' and 'raid6:' and surrounding lines > that are produced in the kernel messages at boot and report which xor/raid6 > code is being used. Jul 29 21:54:34 root kernel: async_tx: api initialized (async) Jul 29 21:54:34 root kernel: xor: automatically using best checksumming function: generic_sse Jul 29 21:54:34 root kernel: generic_sse: 9044.000 MB/sec Jul 29 21:54:34 root kernel: xor: using function: generic_sse (9044.000 MB/sec) Jul 29 21:54:34 root kernel: iTCO_vendor_support: vendor-support=0 Jul 29 21:54:34 root kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh Jul 29 21:54:34 root kernel: iTCO_wdt: Found a ICH10 TCO device (Version=2, TCOBASE=0x0860) Jul 29 21:54:34 root kernel: iTCO_wdt: initialized. 
heartbeat=30 sec (nowayout=0) Jul 29 21:54:34 root kernel: raid6: sse2x1 5703 MB/s Jul 29 21:54:34 root kernel: raid6: sse2x2 6828 MB/s Jul 29 21:54:34 root kernel: raid6: sse2x4 6492 MB/s Jul 29 21:54:34 root kernel: raid6: using algorithm sse2x2 (6828 MB/s) Jul 29 21:54:34 root kernel: raid6: using ssse3x2 recovery algorithm following the var/log/message output around that line. Jul 29 21:54:34 root kernel: scsi 0:0:5:0: Attached scsi generic sg5 type 13 Jul 29 21:54:34 root kernel: ses 0:0:5:0: Attached Enclosure device Jul 29 21:54:34 root kernel: microcode: CPU0 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: md: bind<sdb3> Jul 29 21:54:34 root kernel: md: bind<sdb2> Jul 29 21:54:34 root kernel: md: bind<sdb6> Jul 29 21:54:34 root kernel: md: bind<sdb1> Jul 29 21:54:34 root kernel: md: bind<sdb4> Jul 29 21:54:34 root kernel: md: bind<sdc2> Jul 29 21:54:34 root kernel: md: bind<sdc3> Jul 29 21:54:34 root kernel: md: bind<sdc1> Jul 29 21:54:34 root kernel: md: bind<sdc4> Jul 29 21:54:34 root kernel: md: bind<sdc6> Jul 29 21:54:34 root kernel: md: bind<sdd1> Jul 29 21:54:34 root kernel: md: bind<sdd2> Jul 29 21:54:34 root kernel: md: bind<sdd3> Jul 29 21:54:34 root kernel: md: bind<sdd4> Jul 29 21:54:34 root kernel: microcode: CPU1 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: md: bind<sdd6> Jul 29 21:54:34 root kernel: microcode: CPU2 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU3 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU4 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform 
microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: md: bind<sde6> Jul 29 21:54:34 root kernel: microcode: CPU5 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU6 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU7 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU8 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU9 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU10 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU11 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU12 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU13 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU14 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 root kernel: microcode: CPU15 sig=0x206c2, pf=0x1, revision=0x10 Jul 29 21:54:34 root kernel: platform microcode: firmware: requesting intel-ucode/06-2c-02 Jul 29 21:54:34 
root kernel: Microcode Update Driver: v2.00 <tigran.co.uk>, Peter Oruba Jul 29 21:54:34 root kernel: async_tx: api initialized (async) Jul 29 21:54:34 root kernel: xor: automatically using best checksumming function: generic_sse Jul 29 21:54:34 root kernel: generic_sse: 9044.000 MB/sec Jul 29 21:54:34 root kernel: xor: using function: generic_sse (9044.000 MB/sec) Jul 29 21:54:34 root kernel: iTCO_vendor_support: vendor-support=0 Jul 29 21:54:34 root kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh Jul 29 21:54:34 root kernel: iTCO_wdt: Found a ICH10 TCO device (Version=2, TCOBASE=0x0860) Jul 29 21:54:34 root kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0) Jul 29 21:54:34 root kernel: raid6: sse2x1 5703 MB/s Jul 29 21:54:34 root kernel: raid6: sse2x2 6828 MB/s Jul 29 21:54:34 root kernel: raid6: sse2x4 6492 MB/s Jul 29 21:54:34 root kernel: raid6: using algorithm sse2x2 (6828 MB/s) Jul 29 21:54:34 root kernel: raid6: using ssse3x2 recovery algorithm Jul 29 21:54:34 root kernel: md: raid6 personality registered for level 6 Jul 29 21:54:34 root kernel: md: raid5 personality registered for level 5 Jul 29 21:54:34 root kernel: md: raid4 personality registered for level 4 Jul 29 21:54:34 root kernel: bio: create slab <bio-1> at 1 Jul 29 21:54:34 root kernel: md/raid:md125: device sde6 operational as raid disk 3 Jul 29 21:54:34 root kernel: md/raid:md125: device sdd6 operational as raid disk 2 Jul 29 21:54:34 root kernel: md/raid:md125: device sdc6 operational as raid disk 1 Jul 29 21:54:34 root kernel: md/raid:md125: device sdb6 operational as raid disk 0 Jul 29 21:54:34 root kernel: md/raid:md125: allocated 4314kB Jul 29 21:54:34 root kernel: md/raid:md125: raid level 6 active with 4 out of 4 devices, algorithm 2 Jul 29 21:54:34 root kernel: md125: detected capacity change from 0 to 21474836480 Jul 29 21:54:34 root kernel: md125: unknown partition table Jul 29 21:54:34 root kernel: md: raid1 personality registered for level 1 > Thanks. 
(In reply to NeilBrown from comment #7) > Hi, > the mis-compare shown in comment 5: > > LBA: 42930 = hex[a7b2] Expected, but got is hex[a732] = 42802 - MISCOMPARE > > is a single bit error. '32' instead of 'b2'. > > Alternatively, it is a difference of 128 sectors which is 64K which is the > chunk size. So it could be returning data from the wrong chunk. > > It tried to read from 42930 which is in chunk 335, but got data that was > presumably written to 42802 in chunk 334 (or possibly the writes got mixed > up instead). > Each stripe has 2 data chunks so an even numbered one and the following odd > numbered one, so these two chunks are on the same stripe - stripe 167 > > From the code: > > pd_idx = raid_disks - 1 - sector_div(stripe2, raid_disks); > qd_idx = (pd_idx + 1) % raid_disks; > *dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks; > > stripe2 == 167, raid_disks == 4, so > > pd_idx = 4 - 1 - (167%4) = 4 - 1 - 3 == 0 > qd_idx = (0 + 1) % 4 == 1 > dd_idx = 2, or 3 > > The two data blocks are on the devices which were failed. > > The error was detected at Fri Jul 25 16:25:20 GMT 2014 > > That is about 2 minutes (109 seconds) before: > > Fri Jul 25 21:57:09 IST 2014 : Iteration 2: After Rebuild completed on First > drive > > (aren't half-hour time zones fun!!!) > > which is (based on the script) between 5 and 10 seconds after the rebuild > actually completed. The update time was recorded as > > Update Time : Fri Jul 25 21:57:03 2014 > > so I suspect that is when the rebuild of the first device actually finished. > > $ TZ=GMT date -d 'Fri Jul 25 21:57:03 IST 2014' > Fri Jul 25 16:27:03 GMT 2014 > > > So it looks to me like the error happened while the first drive was > rebuilding, > about 103 seconds before it finished in fact. > > The error was detected in stripe 167, and there are 10485760/64 == 163840 > stripes on the device so the error wasn't anywhere near where recovery was > happening. 
> > > It seems that a read from a block on a faulty device received data from a > device that was being recovered, but the recovery was happening much later > in the array. > > In 2.6.32 a read from a recovering device is never attempted - raid6 will > only read from fully in-sync devices. So it looks like it tried to recovery > one data block from the two parity blocks, but got the other data block. > So somehow ops_run_compute6_2 is producing a bad result, but only when lots > of other 2-data recovery operations are ongoing at the same time. > > That seems like enough of a hint to find the problem. > > Do you have logs from other runs so I can see how robust the patterns are? Hi Neil, Following are few more mis-compare patterns we got, also I have attached all the logs for each test (compressed file : aio-RHEL-29-Jul-2014).. ----------------------------------------------------------------------------------- Log folder name: aiotest-R6-20k Resync speed max and min = 20000 Mon Jul 28 00:05:46 IST 2014 : AIO Started Mon Jul 28 01:46:06 IST 2014 : AIO Stopped RHS iteration 2, During the first drive rebuild. Duration : 1h 40 minutes Miscompare: @11537481 IO[lba 11537481 + i 0 nsectors 49] TIME[DATA 0:0 LASTWRITE 1406508344:393802 CURRENT 1406508366:134762] EXPCTED [SIG:ffffffff21544944 LBA:11537481 DISK:md0 PASS:5 STS:1406502346:551921] != GOT [SIG:0 LBA:128 DISK: PASS:0 STS:0:0] 160189200 00 00 00 00 00 00 00 00 80 00 00 00 00 00 00 00 |................| 160189210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| -------------------------------------------------------------------------------- Log folder name: aiotest-R6-30k Resync speed max and min = 30000 Sun Jul 27 19:38:40 IST 2014 : AIO Started = Sun, 27 Jul 2014 18:38:40 GMT Sun Jul 27 21:52:22 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 20:52:22 GMT RHS iteration 2, During the first drive rebuild. 
Duration : 2h 10 minutes Miscompare: @8350080 IO[lba 8350023 + i 57 nsectors 58] TIME[DATA 4116891984:1233141 LASTWRITE 1406494269:897097 CURRENT 1406494342:121414] EXPCTED [SIG:ffffffff21544944 LBA:8350080 DISK:md0 PASS:4 STS:1406486320:965459] != GOT [SIG:1c1c1c1c63fcdbcc LBA:8502045 DISK:▒▒P PASS:9 STS:4116891984:1233141] fed30000 cc db fc 63 1c 1c 1c 1c 1d bb 81 00 00 00 00 00 |▒▒▒c.....▒......| fed30010 b7 ac 50 00 00 00 00 00 09 00 00 00 00 00 00 00 |▒▒P.............| -------------------------------------------------------------------------------- Log folder name: aiotest-R6-50k Resync speed max and min = 50000 Sun Jul 27 13:43:25 IST 2014 : AIO Started = Sun, 27 Jul 2014 12:43:25 GMT Sun Jul 27 18:55:15 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 17:55:15 GMT RHS iteration 5, During the first drive rebuild. Duration: 5 h 11 minutes Miscompare: @9712523 IO[lba 9712523 + i 0 nsectors 93] TIME[DATA 1406465005:979025 LASTWRITE 1406483631:662432 CURRENT 1406483715:389030] EXPCTED [SIG:ffffffff21544944 LBA:9712523 DISK:md0 PASS:19 STS:1406465005:979025] != GOT [SIG:ffffffff21544944 LBA:9712395 DISK:md0 PASS:19 STS:1406465005:979025] 128671600 44 49 54 21 ff ff ff ff 0b 33 94 00 00 00 00 00 |DIT!▒▒▒▒.3......| 128671610 6d 64 30 00 00 00 00 00 13 00 00 00 00 00 00 00 |md0.............| LBA: Expected 9712523 [94338B] , Got 9712395 [94330B] -------------------------------------------------------------------------- Log folder name: aiotest-R6-100k Resync speed max and min = 100000 Sun Jul 27 11:13:00 IST 2014 : AIO Started = Sun, 27 Jul 2014 10:13:00 GMT Sun Jul 27 12:50:35 IST 2014 : AIO Stopped = Sun, 27 Jul 2014 11:50:35 GMT RHS iteration 1, During the first drive rebuild. 
Duration : 1 h 35 minutes Miscompare: @4273718 IO[lba 4273718 + i 0 nsectors 27] TIME[DATA 1406455980:710048 LASTWRITE 1406461821:511449 CURRENT 1406461835:71978] EXPCTED [SIG:ffffffff21544944 LBA:4273718 DISK:md0 PASS:3 STS:1406455980:710048] != GOT [SIG:ffffffff21544944 LBA:4273846 DISK:md0 PASS:0 STS:1406455980:710048] 826c6c00 44 49 54 21 ff ff ff ff b6 36 41 00 00 00 00 00 |DIT!▒▒▒▒▒6A.....| 826c6c10 6d 64 30 00 3f 00 00 00 00 00 00 00 00 00 00 00 |md0.?...........| 826c6c20 ac d0 d4 53 00 00 00 00 a0 d5 0a 00 00 00 00 00 |▒▒▒S....▒▒......| 826c6c30 06 d9 d4 53 00 00 00 00 8b 7e 04 00 00 00 00 00 |.▒▒S.....~......| 826c6c40 00 00 00 00 00 00 00 00 00 b0 00 00 00 00 00 00 |.........▒......| 826c6c50 f0 18 00 00 00 00 00 00 28 76 b5 b4 3f 00 00 00 |▒.......(v▒▒?...| 826c6c60 40 80 00 02 00 00 00 00 40 80 00 02 00 00 00 00 |@.......@.......| 826c6c70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| LBA: Expected 4273718 [36 36 41] , Got 4273846 [b6 36 41] PASS: Expected 3, got 0 Created attachment 922175 [details] Attachment for comment 10 Few more mis-compare pattern aiotest-R6-50k read from 9712523, got data from 9712395, the previous chunk. There is something odd with the time stamps. The "Miscompare" line says the CURRENT time is 1406483715 which is (TZ=GMT date -d @1406483715) Sun Jul 27 17:55:15 GMT 2014 The modify time on that file (TZ=GMT ls -l aio3.log) is Jul 27 13:56 4 hours *before* the time data was written to it. The last entry in raid_conf_info.txt is $ TZ=GMT date -d 'Sun Jul 27 18:55:15 IST 2014' Sun Jul 27 13:25:15 GMT 2014 Is "IST" GMT+6:30 ?? It's confusing, but it looks like device 2 was rebuilding and device 3 was removed. Rebuild started 11 minutes before the error and takes about half an hour, so it was probably 1/3 of the way through. The read was stripe 37939 out of 163840 stripes so 23%. The read might have been near the recovery, but cannot be sure. 
0 1 were in-sync, 2 was recovering, 3 was missing
Chunks 75878 and 75879 on stripe 37939
pd_idx==0, qd_idx==1, data on 2 and 3.

So similar pattern to earlier except we were reading from device 3 and got device 2 data rather than the reverse.

aiotest-R6-100k
read from 4273718, got data from 4273846, the next chunk.
Recovery was 33 minutes into rebuilding device 0 while device 3 was missing.
Chunks 33388 and 33389 on stripe 16694.
pd_idx = 1, qd_idx = 2, data on 3 and 0.
So once again, both data disks are out of sync, and one of them is being recovered.
Unfortunately cannot tell how far through the recovery was.
(If test.sh could collect the "mdadm -D" output when the aio stops, that might help a little.)

So it seems that: when lots of two-blocks-missing stripes are being recovered, reading from a double-data-missing stripe can sometimes return the wrong block.

Can someone point me at the exact raid5 source code that this kernel is using? I think you said elsewhere that the bug happens on mainline too, but it's still good to look at the exact code being tested.

Hi Neil,
I am Manibalan's colleague. I have a question: since the tool aiodit writes a specific data pattern, is it possible to put in some instrumentation, perhaps in ops_run_compute6_2 or another function, to see if the data block we are computing/reading is what we expect? Maybe that way we can narrow down the problem. Also, do you think trying a different xor/raid6 algorithm will help?
We will share the raid5 source code shortly - but it is the standard kernel in RHEL 6.5.
Thanks,
Narayan

(In reply to NeilBrown from comment #12)
> aiotest-R6-50k
> read from 9712523, got data from 9712395, the previous chunk.
>
> There is something odd with the time stamps.
> The "Miscompare" line says the CURRENT time is 1406483715
> which is (TZ=GMT date -d @1406483715)
> Sun Jul 27 17:55:15 GMT 2014
>
> The modify time on that file (TZ=GMT ls -l aio3.log) is
> Jul 27 13:56
> 4 hours *before* the time data was written to it.
> The last entry in raid_conf_info.txt is
> $ TZ=GMT date -d 'Sun Jul 27 18:55:15 IST 2014'
> Sun Jul 27 13:25:15 GMT 2014
>
> Is "IST" GMT+6:30 ??
>
> It's confusing, but it looks like device 2 was rebuilding
> and device 3 was removed.
> Rebuild started 11 minutes before the error and takes about
> half an hour, so it was probably 1/3 of the way through.
> The read was stripe 37939 out of 163840 stripes so 23%.
> The read might have been near the recovery, but cannot be sure.
>
> 0 1 were in-sync, 2 was recovering, 3 was missing
> Chunks 75878 and 75879 on stripe 37939
> pd_idx==0, qd_idx==1, data on 2 and 3.
>
> So similar pattern to earlier except we were reading from
> device 3 and got device 2 data rather than the reverse.
>
> aiotest-R6-100k
> read from 4273718, got data from 4273846, the next chunk
> Recovery was 33 minutes into rebuilding device 0 while device 3 was missing.
> Chunks 33388 and 33389 on stripe 16694.
> pd_idx = 1 pd_idx=2 data on 3 and 0
> So once again, both data disks are non-sync, and one of them is being
> recovered.
> Unfortunately cannot tell how far through the recovery was.
>
> (if test.sh could collect the "mdadm -D" when the aio stops, that might help
> a little).
>
> So it seems that: when lots of to-blocks-missing stripes are being
> recovered, reading from a double-data-missing stripe can sometimes return
> the wrong block.
>
> Can someone point me at the exact raid5 source code that this kernel is
> using?
> I think you said elsewhere that the bug happen on mainline too, but it's
> still good to look at the exact code being tested.

Hi Neil,
I have attached the raid5.c and raid5.h files (in the md.6.5.rar file).
I also tried to check for the following patch in this raid5.c code; since most of it has been refactored, I am not able to follow the code well enough to verify whether this patch is present.
Please clarify whether this patch applies:
http://board.issociate.de/thread/508107/rhel5-raid6-corruption.html

----
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..f8cd6ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2436,10 +2436,16 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
 			BUG();
 		case 1:
 			compute_block_1(sh, r6s->failed_num[0], 0);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[0]].flags);
 			break;
 		case 2:
 			compute_block_2(sh, r6s->failed_num[0],
 					r6s->failed_num[1]);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[0]].flags);
+			set_bit(R5_LOCKED,
+				&sh->dev[r6s->failed_num[1]].flags);
 			break;
 		default: /* This request should have been failed? */
 			BUG();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8cd6ef..83f83cd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2466,8 +2466,6 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
 	if (s->locked == disks)
 		if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
 			atomic_inc(&conf->pending_full_writes);
-	/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
-	set_bit(STRIPE_INSYNC, &sh->state);
 	if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 		atomic_dec(&conf->preread_active_stripes);
----

Thanks,
Manibalan.

Created attachment 922482 [details]
Attachment for comment 13

contains raid5.c and raid5.h code for linux-2.6.32-431.el6

Hi Narayan.
It's possible that instrumenting the kernel as you suggest might help, but it would be hard to know within the kernel whether the block being read has been written in the current run and so should have known data, or if it still has whatever it was initialised with. I guess you could write a similar signature to every block at the start so that every data block computed should have some signature. However, determining when to test would be tricky. We know we get the wrong data - without a theory as to why, I cannot suggest anywhere worth watching...
I hope a different raid6 algorithm wouldn't make a difference... It might make a difference if you have xor-capable DMA hardware so the xor and raid6 calculation wasn't being done by the CPU, but I don't think you do.

I realise the code is just standard RHEL, but I don't know much about RHEL and don't know where to look to find that code. I figured someone here would be able to tell me.

The first patch you quote was wrong. The second patch you quote, the correct one, is in the raid5.c you gave me.

That raid5.c code is *very* different from upstream 2.6.32. When reading from a device that is being recovered, that code will read from the device if the recovery has already passed - it won't try to reconstruct. So my analysis in comment #7 isn't quite right. When reading the block that caused the mis-match it would have read from all the other devices, performed a single-data-block recovery and returned the result. This recovery would have been done with a simple xor.

In comment 5 where it says:
hexdump of MD0:
presumably that was run after the mis-compare was reported and everything was stopped. This suggests there is bad data on the device; it wasn't made bad while reading. I guess it must have been corrupted when writing....

Just FYI I have been running your test script overnight and just got a data corruption reported. I haven't examined it closely yet, but I will.

Is the source code for "aiodit" publicly available? I had a look around but couldn't find anything. It might be helpful.

I've been staring at my test results and getting nowhere. So I thought I'd write down what I do know so at least I have a record for next time I look.

Failure was reported at 08:57:55 while the "second" disk was rebuilding.
lba=6386506 chunk=6236 (I'm using 512K chunks) stripe=3118
So the layout of this stripe is D1 P Q D0
D0 was rebuilding and had the error.
D1 was the previous disk to rebuild (while only P and Q were working).
D1 has the correct signature and content.
The errors in D0 were that the lsb of every 32-bit word was flipped. In the P block, the lsb of each word was 1. Other bits in P were correct.

The timestamp in D1 of when it was written was 06:13:55
The timestamp in D0 (which might be corrupt) was 04:41:36
At 6:13 it was nearly finished rebuilding the P drive while the Q drive was missing.
At 4:41 it was rebuilding the D1 drive while D0 was missing.

I suspect the corruption happened before or when D1 was written. As the D0 drive was missing, we are depending on P to restore the D0 block. When the D0 and D1 devices were removed, P and Q were able to restore D1 correctly, but not D0.

I cannot figure out the significance of all the '1' bits in the P block. Maybe it is an artefact of the Q calculation when something is wrong, but as Q is calculated a byte at a time, that doesn't seem likely.

I guess I'll just wait until it happens again and look for more patterns then.

Neil,
aiodit is a tool we wrote. We can share the source with you.
If it helps, we had done an experiment in our systems some time back where we ran the same script with just one drive failure, i.e. in a RAID 6 we would remove one disk, add it back later, and let it rebuild. That test passed without any problems. So, even though the miscompare was found while the second drive was rebuilding, the corruption, I feel, is happening much earlier, at the first drive rebuild itself.

(In reply to NeilBrown from comment #18)
> Just FYI I have been running you test script over night and just got a data
> corruption reported. I haven't examined closely yet, but I will.
>
> If the source code for "aiodit" publicly available? I had a look around but
> couldn't find anything. It might be helpful.

Hi Neil,
The aiodit.c file has been attached for your reference.
Thanks,
Manibalan.

Created attachment 922826 [details]
Attachment for comment 21

aiodit.c

Neil,
Here is one analysis. Here, as you can see, the expected and actual LBAs were in the same stripe as well.
But the disk corresponding to the lba whose data we expect is the one that is rebuilding. The lba that we got the data from is on an active disk. We checked the history of the drive removals in that test, and we did not see disk 1 and disk 2 removed in the same pass in any of the cycles.

Is it possible that we had two outstanding IOs to the same stripe that was getting rebuilt, and that this resulted in corruption?

Also, what would Q be if I have to get the same data as D1 from D2 (not sure if D1 is also corrupted, as the system is not in the same state) -- I suspect Q should be 0s then...

We can try dumping the entire IO pattern for the pass when we hit the corruption to see whether the two IOs were written simultaneously... I am sure it was written in the same pass, as the pass numbers are matching...

Miscompare: @9712523 IO[lba 9712523 + i 0 nsectors 93] TIME[DATA 1406465005:979025 LASTWRITE 1406483631:662432 CURRENT 1406483715:389030]
EXPCTED [SIG:ffffffff21544944 LBA:9712523 DISK:md0 PASS:19 STS:1406465005:979025] != GOT [SIG:ffffffff21544944 LBA:9712395 DISK:md0 PASS:19 STS:1406465005:979025]

Expected: LBA:9712523 chunk: 75879 stripe: 37939
Got:      LBA:9712395 chunk: 75878 stripe: 37939

pd_idx = 4 - 1 - (37939%4) = 3
qd_idx = (3 + 1) % 4 = 0
dd_idx = 1, 2

    Number   Major   Minor   RaidDevice   State
q   4        8       134     0            active sync        /dev/sdi6
d1  5        8       150     1            active sync        /dev/sdj6
d2  6        8       182     2            spare rebuilding   /dev/sdl6
p   3        0       0       3            removed

I think I've found the problem. I ran a test last night, but I had the fix wrong and it still failed :-( So I'm now starting another run. If you could test as well, that would be great.

The problem occurs when the array is doubly degraded and a stripe_head is being recovered and written to at the same time. If this stripe_head has 2 data blocks missing (i.e. both P and Q are intact), and one of the data blocks is being written while the device that stores the other data block is being recovered, then that recovered block can sometimes not be written, i.e.
the recovery misses some blocks.

The problem is caused by:

	if (!test_bit(R5_Insync, &dev->flags) ||
	    ((i == sh->pd_idx || i == sh->qd_idx) &&
	      s.failed == 0))
		set_bit(STRIPE_INSYNC, &sh->state);

in handle_stripe(). This says "if we are writing to a device that was not Insync, then the stripe must now be fully uptodate". This is true for RAID5 and for singly-degraded RAID6, but not for doubly-degraded RAID6.

Just before that test, insert:

	if (s.failed > 1)
		continue;

This should make sure the full recovery happens properly.

If you can test that fix I would really appreciate it.
Thanks.

Great find! Thanks for staying at it.
Yes, we have started running it on a few systems here. We will get back to you with the results.

(In reply to NeilBrown from comment #16)
> I realise the code is just standard RHEL, but I don't know much about RHEL
> and don't know where to look to find that code. I figured someone here
> would be able to tell me.

Neil,

Sorry, still need to catch up here. Should you need them in the future, the RHEL6 source RPMs can be found here:
ftp://ftp.redhat.com/redhat/linux/enterprise/6Server/en/os/SRPMS/

RHEL7 should be here:
https://git.centos.org/project/rpms

I am not aware of there being git access to RHEL6, unfortunately.

Jes

(In reply to Narayan Balakrishnan from comment #25)
> Great find! Thanks for being at it.
> Yes, we have started running it in a few systems here. We will get back to
> you with the results.

Any success to report? I'd like to publish the patch soon, and I'd like it to include a "Tested-by:" from you.
(My own testing has been quite positive).
Hi Neil,

We are done with the testing. We performed the testing on 10+ setups for more than 5 days, and all the setups are holding good. This fix is working fine.

Once again, thanks for the fix. Great finding.

Regards,
Manibalan.

Thanks for the confirmation, and for the original report. The fix will be going upstream shortly.

Patch(es) available on kernel-2.6.32-498.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html