Bug 1418869

Summary: Non-synced raid10 volumes going through transient failures are unable to drop their 'D' (failed) status
Product: Red Hat Enterprise Linux 6
Component: lvm2
Sub component: Mirroring and RAID (RHEL6)
Version: 6.9
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
Reporter: Corey Marthaler <cmarthal>
Assignee: Heinz Mauelshagen <heinzm>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac
Target Milestone: rc
Type: Bug
Last Closed: 2017-12-06 10:51:48 UTC

Description Corey Marthaler 2017-02-02 23:17:19 UTC
Description of problem:
This is one of the many cases being exercised for bug 1265191.
Non-synced raid10 volumes appear unable to lose their failed 'D' kernel status once refreshed.
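For reference, a minimal sketch of the dead-leg check (not the actual test harness code; the LV name and field position are taken from the dmsetup output shown below, where the seventh whitespace-separated field of the printed raid status line carries the per-image health characters):

  lv="black_bird-non_synced_random_raid10_3legs_1"
  # 'dmsetup status' prints: <name>: <start> <len> raid raid10 <#devs> <health> <sync> <action> <mismatches>
  health=$(dmsetup status | awk -v lv="$lv:" '$1 == lv { print $7 }')
  case "$health" in
      *D*) echo "FAIL: $lv still reports dead leg(s): $health" ;;
      *)   echo "OK: $lv health is $health" ;;
  esac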

================================================================================
Iteration 0.12 started at Thu Feb  2 16:40:53 CST 2017
================================================================================
Scenario kill_random_non_synced_raid10_3legs: Kill random leg of NON synced 3 leg raid10 volume(s)

********* RAID hash info for this scenario *********
* names:              non_synced_random_raid10_3legs_1
* sync:               0
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sdg1 /dev/sdc1 /dev/sda1 /dev/sdb1 /dev/sde1 /dev/sdd1
* spanned legs:       0
* manual repair:      0
* failpv(s):          /dev/sda1
* failnode(s):        host-073
* lvmetad:            0
* raid fault policy:  warn
******************************************************

Creating raids(s) on host-073...
host-073: lvcreate --type raid10 -i 3 -n non_synced_random_raid10_3legs_1 -L 10G black_bird /dev/sdg1:0-3600 /dev/sdc1:0-3600 /dev/sda1:0-3600 /dev/sdb1:0-3600 /dev/sde1:0-3600 /dev/sdd1:0-3600

Current mirror/raid device structure(s):
  LV                                          Attr       LSize   Cpy%Sync Devices
   non_synced_random_raid10_3legs_1            rwi-a-r---  10.01g 0.00     non_synced_random_raid10_3legs_1_rimage_0(0),non_synced_random_raid10_3legs_1_rimage_1(0),non_synced_random_raid10_3legs_1_rimage_2(0),non_synced_random_raid10_3legs_1_rimage_3(0),non_synced_random_raid10_3legs_1_rimage_4(0),non_synced_random_raid10_3legs_1_rimage_5(0)
   [non_synced_random_raid10_3legs_1_rimage_0] Iwi-aor---   3.34g          /dev/sdg1(1)
   [non_synced_random_raid10_3legs_1_rimage_1] Iwi-aor---   3.34g          /dev/sdc1(1)
   [non_synced_random_raid10_3legs_1_rimage_2] Iwi-aor---   3.34g          /dev/sda1(1)
   [non_synced_random_raid10_3legs_1_rimage_3] Iwi-aor---   3.34g          /dev/sdb1(1)
   [non_synced_random_raid10_3legs_1_rimage_4] Iwi-aor---   3.34g          /dev/sde1(1)
   [non_synced_random_raid10_3legs_1_rimage_5] Iwi-aor---   3.34g          /dev/sdd1(1)
   [non_synced_random_raid10_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdg1(0)
   [non_synced_random_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdc1(0)
   [non_synced_random_raid10_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sda1(0)
   [non_synced_random_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdb1(0)
   [non_synced_random_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sde1(0)
   [non_synced_random_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sdd1(0)

Creating ext on top of mirror(s) on host-073...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-073...

PV=/dev/sda1
        non_synced_random_raid10_3legs_1_rimage_2: 2
        non_synced_random_raid10_3legs_1_rmeta_2: 2

Writing verification files (checkit) to mirror(s) on...
        ---- host-073 ----

<start name="host-073_non_synced_random_raid10_3legs_1"  pid="13602" time="Thu Feb  2 16:41:16 2017 -0600" type="cmd" />
Verifying files (checkit) on mirror(s) on...
        ---- host-073 ----

Current sync percent just before failure
        ( 9.02% )

Disabling device sda on host-073

Attempting I/O to cause mirror down conversion(s) on host-073
dd if=/dev/zero of=/mnt/non_synced_random_raid10_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.179561 s, 234 MB/s

HACK TO KILL XDOIO...
<fail name="host-073_non_synced_random_raid10_3legs_1"  pid="13602" time="Thu Feb  2 16:41:37 2017 -0600" type="cmd" duration="21" ec="143" />
ALL STOP!
Unmounting ext and removing mnt point on host-073...

Verifying proper "D"ead kernel status state for failed raid images(s)
black_bird-non_synced_random_raid10_3legs_1: 0 20987904 raid raid10 6 AADAAA 3801088/20987904 resync 0

Reactivating the raids containing transiently failed raid images
lvchange -an black_bird/non_synced_random_raid10_3legs_1
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sda1: read failed after 0 of 1024 at 22545367040: Input/output error
  /dev/sda1: read failed after 0 of 1024 at 22545448960: Input/output error
  /dev/sda1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sda1: read failed after 0 of 1024 at 4096: Input/output error
  Couldn't find device with uuid 3a4nD3-sTJC-G3XT-eU1b-2hFl-i3Ue-kX5296.
  Couldn't find device for segment belonging to black_bird/non_synced_random_raid10_3legs_1_rimage_2 while checking used and assumed devices.


lvchange -ay  black_bird/non_synced_random_raid10_3legs_1
  /dev/sda1: open failed: No such device or address
  Couldn't find device with uuid 3a4nD3-sTJC-G3XT-eU1b-2hFl-i3Ue-kX5296.

Verifying proper kernel table state of failed image(s)
Verifying proper "D"ead kernel status state for failed raid images(s)
black_bird-non_synced_random_raid10_3legs_1: 0 20987904 raid raid10 6 aaDaaa 0/20987904 resync 0

Enabling device sda on host-073
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)
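For context, a rough sketch of how a transient device failure like this can be simulated and undone at the SCSI level (an assumption for illustration only; the harness's actual fail/restore mechanism is not shown in this log):

  # Offline the SCSI device backing /dev/sda to simulate the transient failure
  echo offline > /sys/block/sda/device/state
  # ... writes and status checks happen while the device is gone ...
  # Bring the device back, then let LVM re-read metadata and restore the missing PV
  echo running > /sys/block/sda/device/state
  vgs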

Refreshing raids now that transiently failed raid images should be back
lvchange --refresh black_bird/non_synced_random_raid10_3legs_1
HACK: additional refresh as outlined in bug 1265191#c22: lvchange --refresh black_bird/non_synced_random_raid10_3legs_1
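A condensed sketch of this recovery step (assumed shell equivalents of the commands logged above; the polling loop is an assumption about how the harness waits, based on the sync percentages printed below):

  vg=black_bird
  lv=non_synced_random_raid10_3legs_1
  lvchange --refresh "$vg/$lv"
  lvchange --refresh "$vg/$lv"   # second refresh, per the bug 1265191#c22 workaround
  # Poll Cpy%Sync until the array reports fully synced
  until [ "$(lvs --noheadings -o copy_percent "$vg/$lv" | tr -d ' ')" = "100.00" ]; do
      sleep 10
  done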

Verifying current sanity of lvm after the failure
Verifying proper "a"ctive while resyncing kernel status state for raid image(s)
Waiting until all mirror|raid volumes become fully synced...
   0/1 mirror(s) are fully synced: ( 25.59% )
   0/1 mirror(s) are fully synced: ( 39.42% )
   0/1 mirror(s) are fully synced: ( 54.77% )
   0/1 mirror(s) are fully synced: ( 71.77% )
   0/1 mirror(s) are fully synced: ( 88.51% )
   1/1 mirror(s) are fully synced: ( 100.00% )
Verifying proper "A"ctive kernel status state for raid image(s)
No dead "D" kernel state should be found after --refresh cmds


[root@host-073 ~]# dmsetup status
black_bird-non_synced_random_raid10_3legs_1_rmeta_5: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rmeta_4: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rmeta_3: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rmeta_2: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rmeta_1: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rimage_5: 0 6995968 linear 
black_bird-non_synced_random_raid10_3legs_1_rmeta_0: 0 8192 linear 
black_bird-non_synced_random_raid10_3legs_1_rimage_4: 0 6995968 linear 
black_bird-non_synced_random_raid10_3legs_1_rimage_3: 0 6995968 linear 
black_bird-non_synced_random_raid10_3legs_1_rimage_2: 0 6995968 linear 
### THIS SHOULD BE ALL A's:
black_bird-non_synced_random_raid10_3legs_1: 0 20987904 raid raid10 6 AADAAA 20987904/20987904 idle 0
black_bird-non_synced_random_raid10_3legs_1_rimage_1: 0 6995968 linear 
black_bird-non_synced_random_raid10_3legs_1_rimage_0: 0 6995968 linear 

[root@host-073 ~]# dmsetup table
black_bird-non_synced_random_raid10_3legs_1_rmeta_5: 0 8192 linear 8:49 2048
black_bird-non_synced_random_raid10_3legs_1_rmeta_4: 0 8192 linear 8:65 2048
black_bird-non_synced_random_raid10_3legs_1_rmeta_3: 0 8192 linear 8:17 2048
vg_host073-lv_swap: 0 1671168 linear 252:2 14075904
black_bird-non_synced_random_raid10_3legs_1_rmeta_2: 0 8192 linear 8:1 2048
vg_host073-lv_root: 0 14073856 linear 252:2 2048
black_bird-non_synced_random_raid10_3legs_1_rmeta_1: 0 8192 linear 8:33 2048
black_bird-non_synced_random_raid10_3legs_1_rimage_5: 0 6995968 linear 8:49 10240
black_bird-non_synced_random_raid10_3legs_1_rmeta_0: 0 8192 linear 8:97 2048
black_bird-non_synced_random_raid10_3legs_1_rimage_4: 0 6995968 linear 8:65 10240
black_bird-non_synced_random_raid10_3legs_1_rimage_3: 0 6995968 linear 8:17 10240
black_bird-non_synced_random_raid10_3legs_1_rimage_2: 0 6995968 linear 8:1 10240
black_bird-non_synced_random_raid10_3legs_1: 0 20987904 raid raid10 3 128 region_size 1024 6 253:2 253:3 253:4 253:5 253:7 253:9 253:10 253:11 253:12 253:13 253:14 253:15
black_bird-non_synced_random_raid10_3legs_1_rimage_1: 0 6995968 linear 8:33 10240
black_bird-non_synced_random_raid10_3legs_1_rimage_0: 0 6995968 linear 8:97 10240

[root@host-073 ~]# lvs -a -o +devices
  LV                                          VG         Attr       LSize   Cpy%Sync Devices
  non_synced_random_raid10_3legs_1            black_bird rwi-a-r-r-  10.01g 100.00   non_synced_random_raid10_3legs_1_rimage_0(0),non_synced_random_raid10_3legs_1_rimage_1(0),non_synced_random_raid10_3legs_1_rimage_2(0),non_synced_random_raid10_3legs_1_rimage_3(0),non_synced_random_raid10_3legs_1_rimage_4(0),non_synced_random_raid10_3legs_1_rimage_5(0)
  [non_synced_random_raid10_3legs_1_rimage_0] black_bird iwi-aor---   3.34g          /dev/sdg1(1)
  [non_synced_random_raid10_3legs_1_rimage_1] black_bird iwi-aor---   3.34g          /dev/sdc1(1)
  [non_synced_random_raid10_3legs_1_rimage_2] black_bird iwi-aor-r-   3.34g          /dev/sda1(1)
  [non_synced_random_raid10_3legs_1_rimage_3] black_bird iwi-aor---   3.34g          /dev/sdb1(1)
  [non_synced_random_raid10_3legs_1_rimage_4] black_bird iwi-aor---   3.34g          /dev/sde1(1)
  [non_synced_random_raid10_3legs_1_rimage_5] black_bird iwi-aor---   3.34g          /dev/sdd1(1)
  [non_synced_random_raid10_3legs_1_rmeta_0]  black_bird ewi-aor---   4.00m          /dev/sdg1(0)
  [non_synced_random_raid10_3legs_1_rmeta_1]  black_bird ewi-aor---   4.00m          /dev/sdc1(0)
  [non_synced_random_raid10_3legs_1_rmeta_2]  black_bird ewi-aor-r-   4.00m          /dev/sda1(0)
  [non_synced_random_raid10_3legs_1_rmeta_3]  black_bird ewi-aor---   4.00m          /dev/sdb1(0)
  [non_synced_random_raid10_3legs_1_rmeta_4]  black_bird ewi-aor---   4.00m          /dev/sde1(0)
  [non_synced_random_raid10_3legs_1_rmeta_5]  black_bird ewi-aor---   4.00m          /dev/sdd1(0)




Version-Release number of selected component (if applicable):
2.6.32-688.el6.x86_64

lvm2-2.02.143-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-libs-2.02.143-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-cluster-2.02.143-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
udev-147-2.73.el6_8.2    BUILT: Tue Aug 30 08:17:19 CDT 2016
device-mapper-1.02.117-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-libs-1.02.117-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-1.02.117-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-libs-1.02.117-12.el6    BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-persistent-data-0.6.2-0.1.rc7.el6    BUILT: Tue Mar 22 08:58:09 CDT 2016


How reproducible:
Every time

Comment 2 Jan Kurik 2017-12-06 10:51:48 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/