Bug 1468590 - vg is in limbo "Recovery failed" state after raid failure until 'vgreduce --removemissing' is run
Summary: vg is in limbo "Recovery failed" state after raid failure until 'vgreduce --removemissing' is run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1560739
Blocks:
 
Reported: 2017-07-07 13:14 UTC by Roman Bednář
Modified: 2021-09-03 12:50 UTC
CC List: 9 users

Fixed In Version: lvm2-2.02.186-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-31 20:04:48 UTC
Target Upstream Version:
Embargoed:


Attachments
test results (50.46 KB, text/plain), 2019-10-09 15:26 UTC, Roman Bednář


Links
Red Hat Product Errata RHBA-2020:1129, last updated 2020-03-31 20:05:27 UTC

Description Roman Bednář 2017-07-07 13:14:37 UTC
Volume group recovery fails after a single-leg failure of a non-synced raid10 LV when using lvmlockd and lvmetad (lvmetad is auto-disabled after the repair).
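
A minimal manual sketch of the scenario, distilled from the test output below; the device names, LV name and VG name are taken from this run, while the exact PV list and sizes are illustrative (the harness also creates and mounts an ext filesystem on the LV and enables raid_fault_policy "allocate", omitted here):

# shared VG managed by lvmlockd/sanlock (PV list illustrative)
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdg1 /dev/sdi1 /dev/sdj1
vgcreate --shared black_bird /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdg1 /dev/sdi1 /dev/sdj1
vgchange --lock-start black_bird

# raid10 LV; fail its primary leg before the initial sync completes
lvcreate -aye --type raid10 -i 3 -n non_synced_primary_raid10_3legs_1 -L 10G black_bird
echo offline > /sys/block/sde/device/state

# I/O triggers the dmeventd repair (which disables the lvmetad cache)
dd if=/dev/zero of=/mnt/non_synced_primary_raid10_3legs_1/ddfile count=10 bs=4M

# bring the device back and touch the metadata; this is where VG recovery fails
echo running > /sys/block/sde/device/state
vgs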


./black_bird -L virt-388 -w EXT -F -e kill_primary_non_synced_raid10_3legs

Enabling raid allocate fault policies on: virt-388
================================================================================
Iteration 0.1 started at Fri Jul  7 13:26:31 CEST 2017
================================================================================
Scenario kill_primary_non_synced_raid10_3legs: Kill primary leg of NON synced 3 leg raid10 volume(s)
********* RAID hash info for this scenario *********
* names:              non_synced_primary_raid10_3legs_1
* sync:               0
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sde1 /dev/sdj1 /dev/sdc1 /dev/sdb1 /dev/sdg1 /dev/sdi1
* spanned legs:       0
* manual repair:      0
* no MDA devices:
* failpv(s):          /dev/sde1
* additional snap:    /dev/sdj1
* failnode(s):        virt-388
* lvmetad:            0
* raid fault policy:  allocate
******************************************************
 
Creating raids(s) on virt-388...
virt-388: lvcreate -aye --type raid10 -i 3 -n non_synced_primary_raid10_3legs_1 -L 10G black_bird /dev/sde1:0-3600 /dev/sdj1:0-3600 /dev/sdc1:0-3600 /dev/sdb1:0-3600 /dev/sdg1:0-3600 /dev/sdi1:0-3600
 
Current mirror/raid device structure(s):
  LV                                           Attr       LSize   Cpy%Sync Devices
   [lvmlock]                                    -wi-ao---- 256.00m          /dev/sdd1(0)
   non_synced_primary_raid10_3legs_1            rwi-a-r--- <10.01g 0.00     non_synced_primary_raid10_3legs_1_rimage_0(0),non_synced_primary_raid10_3legs_1_rimage_1(0),non_synced_primary_raid10_3legs_1_rimage_2(0),non_synced_primary_raid10_3legs_1_rimage_3(0),non_synced_primary_raid10_3legs_1_rimage_4(0),non_synced_primary_raid10_3legs_1_rimage_5(0)
   [non_synced_primary_raid10_3legs_1_rimage_0] Iwi-aor---  <3.34g          /dev/sde1(1)
   [non_synced_primary_raid10_3legs_1_rimage_1] Iwi-aor---  <3.34g          /dev/sdj1(1)
   [non_synced_primary_raid10_3legs_1_rimage_2] Iwi-aor---  <3.34g          /dev/sdc1(1)
   [non_synced_primary_raid10_3legs_1_rimage_3] Iwi-aor---  <3.34g          /dev/sdb1(1)
   [non_synced_primary_raid10_3legs_1_rimage_4] Iwi-aor---  <3.34g          /dev/sdg1(1)
   [non_synced_primary_raid10_3legs_1_rimage_5] Iwi-aor---  <3.34g          /dev/sdi1(1)
   [non_synced_primary_raid10_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sde1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdj1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdb1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sdi1(0)
   [lvmlock]                                    -wi-ao---- 256.00m          /dev/sda1(0)
   root                                         -wi-ao----  <6.20g          /dev/vda2(205)
   swap                                         -wi-ao---- 820.00m          /dev/vda2(0)
 
 
* NOTE: not enough available devices for allocation fault policies to fully work *
  
Creating ext on top of mirror(s) on virt-388...

mke2fs 1.42.9 (28-Dec-2013)
Mounting mirrored ext filesystems on virt-388...
 
PV=/dev/sde1
       non_synced_primary_raid10_3legs_1_rimage_0: 1.0
       non_synced_primary_raid10_3legs_1_rmeta_0: 1.0
 
Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
       ---- virt-388 ----
Verifying files (checkit) on mirror(s) on...
       ---- virt-388 ----
 
Name             GrpID RgID ObjType ArID ArStart ArSize  RMrg/s WMrg/s R/s  W/s    RSz/s WSz/s   AvgRqSz QSize Util% AWait RdAWait WrAWait
virt-388_load        0    0 group      0 133.00m 980.00k   0.00   0.00 0.00 245.00     0 980.00k   4.00k  1.01  0.50  4.13    0.00    4.13
Name             GrpID RgID ObjType RgStart RgSize  #Areas ArSize  ProgID
virt-388_load        0    0 group   133.00m 980.00k      1 980.00k dmstats
 
Current sync percent just before failure
       ( 18.50% )
 
Disabling device sde on virt-388
rescan device...
  /dev/sde1: read failed after 0 of 1024 at 42944036864: Input/output error
  /dev/sde1: read failed after 0 of 1024 at 42944143360: Input/output error
  /dev/sde1: read failed after 0 of 1024 at 0: Input/output error
  /dev/sde1: read failed after 0 of 1024 at 4096: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error

Attempting I/O to cause mirror down conversion(s) on virt-388
dd if=/dev/zero of=/mnt/non_synced_primary_raid10_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.205385 s, 204 MB/s
 
Verifying current sanity of lvm after the failure
 
Current mirror/raid device structure(s):
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  LV                                           Attr       LSize   Cpy%Sync Devices
   bb_snap1                                     swi-a-s--- 252.00m          /dev/sdj1(855)
   [lvmlock]                                    -wi-ao---- 256.00m          /dev/sdd1(0)
   non_synced_primary_raid10_3legs_1            owi-aor--- <10.01g 100.00   non_synced_primary_raid10_3legs_1_rimage_0(0),non_synced_primary_raid10_3legs_1_rimage_1(0),non_synced_primary_raid10_3legs_1_rimage_2(0),non_synced_primary_raid10_3legs_1_rimage_3(0),non_synced_primary_raid10_3legs_1_rimage_4(0),non_synced_primary_raid10_3legs_1_rimage_5(0)
   [non_synced_primary_raid10_3legs_1_rimage_0] iwi-aor---  <3.34g          /dev/sdd1(65)
   [non_synced_primary_raid10_3legs_1_rimage_1] iwi-aor---  <3.34g          /dev/sdj1(1)
   [non_synced_primary_raid10_3legs_1_rimage_2] iwi-aor---  <3.34g          /dev/sdc1(1)
   [non_synced_primary_raid10_3legs_1_rimage_3] iwi-aor---  <3.34g          /dev/sdb1(1)
   [non_synced_primary_raid10_3legs_1_rimage_4] iwi-aor---  <3.34g          /dev/sdg1(1)
   [non_synced_primary_raid10_3legs_1_rimage_5] iwi-aor---  <3.34g          /dev/sdi1(1)
   [non_synced_primary_raid10_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdd1(64)
   [non_synced_primary_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdj1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdb1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)
   [non_synced_primary_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sdi1(0)
   [lvmlock]                                    -wi-ao---- 256.00m          /dev/sda1(0)
   root                                         -wi-ao----  <6.20g          /dev/vda2(205)
   swap                                         -wi-ao---- 820.00m          /dev/vda2(0)
 
 
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verifying IMAGE device /dev/sdj1 *IS* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verifying IMAGE device /dev/sdc1 *IS* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verifying IMAGE device /dev/sdb1 *IS* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verifying IMAGE device /dev/sdg1 *IS* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verifying IMAGE device /dev/sdi1 *IS* in the volume(s)
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
Verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of non_synced_primary_raid10_3legs_1_rimage_0 on: virt-388
Checking EXISTENCE and STATE of non_synced_primary_raid10_3legs_1_rmeta_0 on: virt-388
 
Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sdj1 /dev/sdc1 /dev/sdb1 /dev/sdg1 /dev/sdi1
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid enqEiN-09mX-Zx1d-nBrB-sCcj-IWLw-yK3IxP.
ACTUAL LEG ORDER: /dev/sdd1 /dev/sdj1 /dev/sdc1 /dev/sdb1 /dev/sdg1 /dev/sdi1
unknown ne /dev/sdd1
/dev/sdj1 ne /dev/sdj1
/dev/sdc1 ne /dev/sdc1
/dev/sdb1 ne /dev/sdb1
/dev/sdg1 ne /dev/sdg1
/dev/sdi1 ne /dev/sdi1
Verifying files (checkit) on mirror(s) on...
       ---- virt-388 ----
 
Enabling device sde on virt-388
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)
  WARNING: Not using lvmetad because a repair command was run.
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird
Simple vgs cmd failed after bringing sde back online
Possible regression of bug 1412843/1434054

==================================================================
Check device and services are ok:

[root@virt-388 ~]# cat /sys/block/sde/device/state
running

[root@virt-388 ~]# systemctl is-active lvm2-lvmetad lvm2-lvmlockd sanlock
active
active
active

[root@virt-388 ~]# vgs
  WARNING: Not using lvmetad because a repair command was run.
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird
  VG            #PV #LV #SN Attr   VSize  VFree 
  global          1   0   0 wz--ns 39.98g 39.73g
  rhel_virt-388   1   2   0 wz--n- <7.00g     0 



===================================================================
3.10.0-689.el7.x86_64

lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 17:15:46 CEST 2017
cmirror-2.02.171-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017
sanlock-3.5.0-1.el7    BUILT: Wed Apr 26 16:37:30 CEST 2017
sanlock-lib-3.5.0-1.el7    BUILT: Wed Apr 26 16:37:30 CEST 2017
lvm2-lockd-2.02.171-8.el7    BUILT: Wed Jun 28 20:28:58 CEST 2017

Comment 2 David Teigland 2017-07-07 15:58:06 UTC
For a shared VG we have to disable the repairs (writing the VG) that vg_read() usually does. I'm in the middle of a big overhaul of vg_read() at the moment that addresses this problem.
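
For reference, whether a VG is shared (lvmlockd-managed) is visible in the last vg_attr character; a quick check, with the output below assumed to match the "wz--ns" attrs shown for the shared "global" VG elsewhere in this report:

# last vg_attr character: 's' = shared (lvmlockd), 'c' = clustered, '-' = local
vgs -o vg_name,vg_attr black_bird
#   VG         Attr
#   black_bird wz--ns    (assumed output)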

Comment 3 Corey Marthaler 2017-07-07 17:13:59 UTC
I was looking into this as well yesterday. This seems to be a state that the VG enters after a failed device reappears and stays in until a 'vgreduce --removemissing' is run. This affects all raid types, not just raid10.
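
A sketch of the manual recovery path that clears this limbo state, pieced together from the commands later in this comment; re-adding the reappeared PV afterwards is an assumed follow-up step (the log below only notes that /dev/sde needs to be added back to the VG):

# make the VG consistent again by dropping the missing PV references
vgreduce --removemissing --force black_bird

# assumed step: re-add the reappeared device; it may first need re-initialising
# (e.g. pvcreate) since pvscan reports it "might need repairing"
vgextend black_bird /dev/sde1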

# Basic raid1 (non primary device) failure:


host-113: pvcreate /dev/sdb1 /dev/sda1 /dev/sdf1 /dev/sdd1 /dev/sdh1 /dev/sdc1 /dev/sde1
host-113: vgcreate  --shared black_bird /dev/sdb1 /dev/sda1 /dev/sdf1 /dev/sdd1 /dev/sdh1 /dev/sdc1 /dev/sde1
host-113: vgchange --lock-start black_bird
host-114: vgchange --lock-start black_bird
host-115: vgchange --lock-start black_bird

Enabling raid allocate fault policies on: host-115
================================================================================
Iteration 0.1 started at Thu Jul  6 14:27:11 CDT 2017
================================================================================
Scenario kill_random_synced_raid1_3legs: Kill random leg of synced 3 leg raid1 volume(s)
********* RAID hash info for this scenario *********
* names:              synced_random_raid1_3legs_1
* sync:               1
* type:               raid1
* -m |-i value:       3
* leg devices:        /dev/sdf1 /dev/sdd1 /dev/sde1 /dev/sdh1
* spanned legs:       0
* manual repair:      0
* no MDA devices:     
* failpv(s):          /dev/sde1
* additional snap:    /dev/sdf1
* failnode(s):        host-115
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-115...
host-115: lvcreate -aye --type raid1 -m 3 -n synced_random_raid1_3legs_1 -L 500M black_bird /dev/sdf1:0-2400 /dev/sdd1:0-2400 /dev/sde1:0-2400 /dev/sdh1:0-2400

Current mirror/raid device structure(s):
  LV                                     Attr       LSize   Cpy%Sync Devices
   [lvmlock]                              -wi-ao---- 256.00m          /dev/sdb1(0)
   synced_random_raid1_3legs_1            rwi-a-r--- 500.00m 6.26     synced_random_raid1_3legs_1_rimage_0(0),synced_random_raid1_3legs_1_rimage_1(0),synced_random_raid1_3legs_1_rimage_2(0),synced_random_raid1_3legs_1_rimage_3(0)
   [synced_random_raid1_3legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdf1(1)
   [synced_random_raid1_3legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sdd1(1)
   [synced_random_raid1_3legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sde1(1)
   [synced_random_raid1_3legs_1_rimage_3] Iwi-aor--- 500.00m          /dev/sdh1(1)
   [synced_random_raid1_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
   [synced_random_raid1_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdd1(0)
   [synced_random_raid1_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sde1(0)
   [synced_random_raid1_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdh1(0)
   [lvmlock]                              -wi-ao---- 256.00m          /dev/sdg1(0)

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Creating gfs2 on top of mirror(s) on host-115...
mkfs.gfs2 -J 32M -j 1 -p lock_nolock /dev/black_bird/synced_random_raid1_3legs_1 -O
Mounting mirrored gfs2 filesystems on host-115...

PV=/dev/sde1
        synced_random_raid1_3legs_1_rimage_2: 1.0
        synced_random_raid1_3legs_1_rmeta_2: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- host-115 ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- host-115 ----

Disabling device sde on host-115
rescan device...

Attempting I/O to cause mirror down conversion(s) on host-115
dd if=/dev/zero of=/mnt/synced_random_raid1_3legs_1/ddfile count=10 bs=4M

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  WARNING: Not using lvmetad because a repair command was run.
  Couldn't find device with uuid icmWuc-ACno-MeJy-HVOs-XU12-3Wat-0JZ0Pj.
  LV                                     Attr       LSize   Cpy%Sync Devices
  bb_snap1                               swi-a-s--- 252.00m          /dev/sdf1(126)
  [lvmlock]                              -wi-ao---- 256.00m          /dev/sdb1(0)
  synced_random_raid1_3legs_1            owi-aor--- 500.00m 100.00   synced_random_raid1_3legs_1_rimage_0(0),synced_random_raid1_3legs_1_rimage_1(0),synced_random_raid1_3legs_1_rimage_2(0),synced_random_raid1_3legs_1_rimage_3(0)
  [synced_random_raid1_3legs_1_rimage_0] iwi-aor--- 500.00m          /dev/sdf1(1)
  [synced_random_raid1_3legs_1_rimage_1] iwi-aor--- 500.00m          /dev/sdd1(1)
  [synced_random_raid1_3legs_1_rimage_2] iwi-aor--- 500.00m          /dev/sdb1(65)
  [synced_random_raid1_3legs_1_rimage_3] iwi-aor--- 500.00m          /dev/sdh1(1)
  [synced_random_raid1_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_random_raid1_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdd1(0)
  [synced_random_raid1_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdb1(64)
  [synced_random_raid1_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdh1(0)
  [lvmlock]                              -wi-ao---- 256.00m          /dev/sdg1(0)

Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sdf1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdd1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdh1 *IS* in the volume(s)
Verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_random_raid1_3legs_1_rimage_2 on: host-115 
Checking EXISTENCE and STATE of synced_random_raid1_3legs_1_rmeta_2 on: host-115 

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: /dev/sdf1 /dev/sdd1 unknown /dev/sdh1
ACTUAL LEG ORDER: /dev/sdf1 /dev/sdd1 /dev/sdb1 /dev/sdh1

Verifying files (checkit) on mirror(s) on...
        ---- host-115 ----

Enabling device sde on host-115
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)

  WARNING: Not using lvmetad because a repair command was run.
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird

Simple vgs cmd failed after bringing sde back online

# If you comment out this vgs failure in the test, it should continue on and eventually pass...

# Here's where lvm/lvmlockd is in a limbo type state until a "vgreduce --removemissing" is run 

# VG is "gone"
[root@host-115 ~]# lvs -a -o +devices
  WARNING: Not using lvmetad because a repair command was run.
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird
  LV        VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices       
  [lvmlock] global        -wi-ao---- 256.00m                                                     /dev/sdg1(0)  

[root@host-115 ~]# pvscan
  WARNING: Inconsistent metadata found for VG black_bird
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird
  PV /dev/sdg1   VG global          lvm2 [<21.00 GiB / <20.75 GiB free]
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird

Jul  6 14:28:13 host-115 qarshd[6468]: Running cmdline: echo offline > /sys/block/sde/device/state
Jul  6 14:28:13 host-115 qarshd[6472]: Running cmdline: pvscan --cache /dev/sde1
Jul  6 14:28:13 host-115 kernel: sd 7:0:0:1: rejecting I/O to offline device
Jul  6 14:28:13 host-115 kernel: sd 7:0:0:1: rejecting I/O to offline device
Jul  6 14:28:13 host-115 kernel: md: super_written gets error=-5, uptodate=0
Jul  6 14:28:13 host-115 kernel: md/raid1:mdX: Disk failure on dm-9, disabling device.#012md/raid1:mdX: Operation continuing on 3 devices.
Jul  6 14:28:13 host-115 lvm[6137]: WARNING: Device #2 of raid1 array, black_bird-synced_random_raid1_3legs_1-real, has failed.
Jul  6 14:28:13 host-115 lvm[6137]: WARNING: Disabling lvmetad cache for repair command.
Jul  6 14:28:13 host-115 lvm[6137]: WARNING: Not using lvmetad because of repair.
[...]
Jul  6 14:28:13 host-115 lvm[6137]: Couldn't find device with uuid icmWuc-ACno-MeJy-HVOs-XU12-3Wat-0JZ0Pj.
Jul  6 14:28:13 host-115 lvm[6137]: WARNING: Couldn't find all devices for LV black_bird/synced_random_raid1_3legs_1_rimage_2 while checking used and assumed devices.
Jul  6 14:28:13 host-115 lvm[6137]: WARNING: Couldn't find all devices for LV black_bird/synced_random_raid1_3legs_1_rmeta_2 while checking used and assumed devices.
Jul  6 14:28:13 host-115 kernel: device-mapper: raid: Device 2 specified for rebuild; clearing superblock
Jul  6 14:28:13 host-115 kernel: md/raid1:mdX: active with 3 out of 4 mirrors
Jul  6 14:28:13 host-115 kernel: md: recovery of RAID array mdX


### Repair does eventually finish successfully, just like any normal raid failure.

Jul  6 14:28:14 host-115 kernel: md/raid1:mdX: active with 3 out of 4 mirrors
Jul  6 14:28:14 host-115 kernel: md: mdX: recovery interrupted.
Jul  6 14:28:14 host-115 lvm[6137]: Faulty devices in black_bird/synced_random_raid1_3legs_1 successfully replaced.
Jul  6 14:28:14 host-115 kernel: md: recovery of RAID array mdX
Jul  6 14:28:14 host-115 lvm[6137]: raid1 array, black_bird-synced_random_raid1_3legs_1-real, is not in-sync.
Jul  6 14:28:15 host-115 qarshd[6634]: Running cmdline: dd if=/dev/zero of=/mnt/synced_random_raid1_3legs_1/ddfile count=10 bs=4M
Jul  6 14:28:16 host-115 qarshd[6639]: Running cmdline: sync
Jul  6 14:28:20 host-115 kernel: md: mdX: recovery done.
Jul  6 14:28:20 host-115 lvm[6137]: raid1 array, black_bird-synced_random_raid1_3legs_1-real, is now in-sync.


Jul  6 14:31:49 host-115 qarshd[6855]: Running cmdline: echo running > /sys/block/sde/device/state
Jul  6 14:31:49 host-115 qarshd[6859]: Running cmdline: vgs
Jul  6 14:32:04 host-115 crmd[2082]:  notice: High CPU load detected: 1.420000
Jul  6 14:32:34 host-115 crmd[2082]:  notice: High CPU load detected: 1.380000
[...]
Jul  6 14:41:34 host-115 crmd[2082]:  notice: High CPU load detected: 1.180000
Jul  6 14:42:04 host-115 crmd[2082]:  notice: High CPU load detected: 1.110000
Jul  6 14:42:13 host-115 lvmetad[472]: update_metadata ignoring outdated metadata on PV icmWuc-ACno-MeJy-HVOs-XU12-3Wat-0JZ0Pj seqno 7 for 42Gol8-sLb1-xFXL-dZrr-fdbj-8Ypn-LpqjNQ black_bird seqno 9
Jul  6 14:42:13 host-115 lvmetad[472]: PV icmWuc-ACno-MeJy-HVOs-XU12-3Wat-0JZ0Pj has outdated metadata for VG 42Gol8-sLb1-xFXL-dZrr-fdbj-8Ypn-LpqjNQ
Jul  6 14:42:13 host-115 lvmetad[472]: Cannot use VG metadata for black_bird 42Gol8-sLb1-xFXL-dZrr-fdbj-8Ypn-LpqjNQ from PV icmWuc-ACno-MeJy-HVOs-XU12-3Wat-0JZ0Pj on 2113


[root@host-115 ~]# vgs
  WARNING: Missing device /dev/sde1 reappeared, updating metadata for VG black_bird to version 9.
  Recovery of volume group "black_bird" failed.
  Cannot process volume group black_bird
  VG            #PV #LV #SN Attr   VSize   VFree  
  global          1   0   0 wz--ns <21.00g <20.75g
  rhel_host-115   1   2   0 wz--n-  <7.00g      0 

[root@host-115 ~]# vgreduce --removemissing --force black_bird
  Wrote out consistent volume group black_bird.

### Seems to be back to a relatively "normal" state now. /dev/sde needs to be added back to the VG

[root@host-115 ~]# pvscan
  PV /dev/sdb1   VG black_bird      lvm2 [<21.00 GiB / 20.25 GiB free]
  PV /dev/sda1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdd1   VG black_bird      lvm2 [<21.00 GiB / 20.50 GiB free]
  PV /dev/sdh1   VG black_bird      lvm2 [<21.00 GiB / 20.50 GiB free]
  PV /dev/sdc1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdf1   VG black_bird      lvm2 [<21.00 GiB / <20.26 GiB free]
  PV /dev/vda2   VG rhel_host-115   lvm2 [<7.00 GiB / 0    free]
  PV /dev/sdg1   VG global          lvm2 [<21.00 GiB / <20.75 GiB free]
  WARNING: PV /dev/sde1 is marked in use but no VG was found using it.
  WARNING: PV /dev/sde1 might need repairing.
  PV /dev/sde1                      lvm2 [<21.00 GiB]
  Total: 9 [<174.97 GiB] / in use: 8 [<153.97 GiB] / in no VG: 1 [<21.00 GiB]
[root@host-115 ~]# lvs
  LV                          VG         Attr       LSize   Pool Origin                      Data% Cpy%Sync
  bb_snap1                    black_bird swi-a-s--- 252.00m      synced_random_raid1_3legs_1 29.50
  synced_random_raid1_3legs_1 black_bird owi-aor--- 500.00m                                        100.00

Comment 4 Corey Marthaler 2017-07-07 23:09:31 UTC
Another effect of this, or possibly an entirely different bug, is that once the test case passes and you remove the raid LV, you're left with an "unknown" PV that you're unable to get rid of. All the devices sd[abcdefg] are back in the VG.
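
A sketch of the cleanup that leaves the stray PV behind, using the LV and snapshot names from the run above (the exact removal commands are assumed; the test harness performs the cleanup itself):

# remove the snapshot and the repaired raid LV after the test passes
lvremove -f black_bird/bb_snap1
lvremove -f black_bird/synced_random_raid1_3legs_1
# pvscan afterwards still reports an [unknown] PV, as shown below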

[root@host-115 ~]# pvscan
  PV /dev/vda2   VG rhel_host-115   lvm2 [<7.00 GiB / 0    free]
  PV /dev/sdg1   VG global          lvm2 [<21.00 GiB / <20.75 GiB free]
  WARNING: Device for PV po9UOb-IiEU-evit-nEgQ-dpmf-ZECl-zKxQFI not found or rejected by a filter.
  Reading VG black_bird without a lock.
  PV /dev/sdb1   VG black_bird      lvm2 [<21.00 GiB / <20.75 GiB free]
  PV /dev/sda1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdh1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdc1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdd1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sde1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV [unknown]   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  PV /dev/sdf1   VG black_bird      lvm2 [<21.00 GiB / <21.00 GiB free]
  Total: 10 [195.96 GiB] / in use: 10 [195.96 GiB] / in no VG: 0 [0   ]

Comment 5 Heinz Mauelshagen 2019-08-21 21:48:54 UTC
Assumed fixed by dependency 1560739

Comment 7 Roman Bednář 2019-09-18 06:51:21 UTC
Adding QA ack for 7.8. Covered by automated tests, see qa whiteboard.

Comment 9 Roman Bednář 2019-10-09 15:26:05 UTC
Created attachment 1623843 [details]
test results

Verified with latest RPMs. Attaching test result log.

lvm2-2.02.186-2.el7.x86_64
kernel-3.10.0-1100.el7.x86_64

Comment 11 errata-xmlrpc 2020-03-31 20:04:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1129

