Red Hat Bugzilla – Bug 1447097
device mapper keeps missing_0_0 devices listed even after the LV/VG containing the raid is removed
Last modified: 2018-04-10 11:21:33 EDT
================================================================================
Iteration 0.3 started at Tue May 2 14:20:49 CDT 2017
================================================================================

WARNING: Not using lvmetad because a repair command was run.

Scenario kill_three_synced_raid10_3legs: Kill three legs (none of which share the same stripe leg) of synced 3 leg raid10 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_three_raid10_3legs_1
* sync:               1
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sdh1 /dev/sdg1 /dev/sdc1 /dev/sdf1 /dev/sda1 /dev/sde1
* spanned legs:       0
* manual repair:      0
* no MDA devices:
* failpv(s):          /dev/sdh1 /dev/sdc1 /dev/sda1
* failnode(s):        host-073
* lvmetad:            1
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-073...
host-073: lvcreate --type raid10 -i 3 -n synced_three_raid10_3legs_1 -L 500M black_bird /dev/sdh1:0-2400 /dev/sdg1:0-2400 /dev/sdc1:0-2400 /dev/sdf1:0-2400 /dev/sda1:0-2400 /dev/sde1:0-2400
WARNING: Not using lvmetad because a repair command was run.

Current mirror/raid device structure(s):
WARNING: Not using lvmetad because a repair command was run.
  LV                                     Attr       LSize   Cpy%Sync Devices
  synced_three_raid10_3legs_1            rwi-aor--- 504.00m 80.16    synced_three_raid10_3legs_1_rimage_0(0),synced_three_raid10_3legs_1_rimage_1(0),synced_three_raid10_3legs_1_rimage_2(0),synced_three_raid10_3legs_1_rimage_3(0),synced_three_raid10_3legs_1_rimage_4(0),synced_three_raid10_3legs_1_rimage_5(0)
  [synced_three_raid10_3legs_1_rimage_0] Iwi-aor--- 168.00m          /dev/sdh1(1)
  [synced_three_raid10_3legs_1_rimage_1] Iwi-aor--- 168.00m          /dev/sdg1(1)
  [synced_three_raid10_3legs_1_rimage_2] Iwi-aor--- 168.00m          /dev/sdc1(1)
  [synced_three_raid10_3legs_1_rimage_3] Iwi-aor--- 168.00m          /dev/sdf1(1)
  [synced_three_raid10_3legs_1_rimage_4] Iwi-aor--- 168.00m          /dev/sda1(1)
  [synced_three_raid10_3legs_1_rimage_5] Iwi-aor--- 168.00m          /dev/sde1(1)
  [synced_three_raid10_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_three_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdg1(0)
  [synced_three_raid10_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_three_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_three_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_three_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sde1(0)

* NOTE: not enough available devices for allocation fault policies to fully work
* (well technically, since we have 1, some allocation should work)

Waiting until all mirror|raid volumes become fully synced...
1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Creating xfs on top of mirror(s) on host-073...
Mounting mirrored xfs filesystems on host-073...

PV=/dev/sda1
        synced_three_raid10_3legs_1_rimage_4: 1.P
        synced_three_raid10_3legs_1_rmeta_4: 1.P
PV=/dev/sdc1
        synced_three_raid10_3legs_1_rimage_2: 1.P
        synced_three_raid10_3legs_1_rmeta_2: 1.P
PV=/dev/sdh1
        synced_three_raid10_3legs_1_rimage_0: 1.P
        synced_three_raid10_3legs_1_rmeta_0: 1.P

Writing verification files (checkit) to mirror(s) on...
        ---- host-073 ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- host-073 ----

Disabling device sdh on host-073
rescan device...
Disabling device sdc on host-073
rescan device...
Disabling device sda on host-073
rescan device...
Getting recovery check start time from /var/log/messages: May 2 14:22

Attempting I/O to cause mirror down conversion(s) on host-073
dd if=/dev/zero of=/mnt/synced_three_raid10_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.0922284 s, 455 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
WARNING: Not using lvmetad because a repair command was run.
Couldn't find device with uuid 2wqEs7-9v93-0cLU-7emZ-ZSfu-QyLv-dpZ7vN.
Couldn't find device with uuid gMS2fR-Enm5-YCNL-4dHx-TxUO-S77N-d0QIPr.
Couldn't find device with uuid MwRsWQ-uEI5-Q7Os-JPyk-B9oj-C3q4-tANjaS.
  LV                                     Attr       LSize   Cpy%Sync Devices
  synced_three_raid10_3legs_1            rwi-aor-p- 504.00m 100.00   synced_three_raid10_3legs_1_rimage_0(0),synced_three_raid10_3legs_1_rimage_1(0),synced_three_raid10_3legs_1_rimage_2(0),synced_three_raid10_3legs_1_rimage_3(0),synced_three_raid10_3legs_1_rimage_4(0),synced_three_raid10_3legs_1_rimage_5(0)
  [synced_three_raid10_3legs_1_rimage_0] Iwi-aor-p- 168.00m          [unknown](1)
  [synced_three_raid10_3legs_1_rimage_1] iwi-aor--- 168.00m          /dev/sdg1(1)
  [synced_three_raid10_3legs_1_rimage_2] Iwi-aor-p- 168.00m          [unknown](1)
  [synced_three_raid10_3legs_1_rimage_3] iwi-aor--- 168.00m          /dev/sdf1(1)
  [synced_three_raid10_3legs_1_rimage_4] iwi-aor--- 168.00m          /dev/sdb1(1)
  [synced_three_raid10_3legs_1_rimage_5] iwi-aor--- 168.00m          /dev/sde1(1)
  [synced_three_raid10_3legs_1_rmeta_0]  ewi-aor-p-   4.00m          [unknown](0)
  [synced_three_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdg1(0)
  [synced_three_raid10_3legs_1_rmeta_2]  ewi-aor-p-   4.00m          [unknown](0)
  [synced_three_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_three_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdb1(0)
  [synced_three_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sde1(0)

Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdc1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sda1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sdg1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdf1 *IS* in the volume(s)
Verifying IMAGE device /dev/sde1 *IS* in the volume(s)

Verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rimage_4 on: host-073
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rmeta_4 on: host-073
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rimage_2 on: host-073
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rmeta_2 on: host-073
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rimage_0 on: host-073
Checking EXISTENCE and STATE of synced_three_raid10_3legs_1_rmeta_0 on: host-073

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sdg1 unknown /dev/sdf1 unknown /dev/sde1
ACTUAL LEG ORDER:   [unknown] /dev/sdg1 [unknown] /dev/sdf1 /dev/sdb1 /dev/sde1

Verifying files (checkit) on mirror(s) on...
        ---- host-073 ----

Enabling device sdh on host-073
WARNING: Not using lvmetad because a repair command was run.
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)
Simple vgs cmd failed after bringing sdh back online
(Possible regression of bug 1412843/1434054)

Enabling device sdc on host-073
WARNING: Not using lvmetad because a repair command was run.
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)
Simple vgs cmd failed after bringing sdc back online
(Possible regression of bug 1412843/1434054)

Enabling device sda on host-073
WARNING: Not using lvmetad because a repair command was run.
Running vgs to make LVM update metadata version if possible (will restore a-m PVs)

-------------------------------------------------------------------------------
Force a vgreduce to clean up the corrupt additional LV
( vgreduce --removemissing --force black_bird )
-------------------------------------------------------------------------------

Recreating PVs /dev/sdh1 /dev/sdc1 /dev/sda1 and then extending back into black_bird
host-073 pvcreate /dev/sdh1
host-073 pvcreate /dev/sdc1
host-073 pvcreate /dev/sda1
WARNING: Not using lvmetad because a repair command was run.
  Can't initialize physical volume "/dev/sda1" of volume group "black_bird" without -ff
recreation of /dev/sda1 failed, must still be in VG

host-073 vgextend black_bird /dev/sdh1
WARNING: Not using lvmetad because a repair command was run.
host-073 vgextend black_bird /dev/sdc1
WARNING: Not using lvmetad because a repair command was run.
host-073 vgextend black_bird /dev/sda1
WARNING: Not using lvmetad because a repair command was run.
  Physical volume '/dev/sda1' is already in volume group 'black_bird'
  Unable to add physical volume '/dev/sda1' to volume group 'black_bird'
extension of /dev/sda1 back into black_bird failed

Checking for leftover '-missing_0_0' or 'unknown devices'
'-missing' devices still exist (normal for partial allocation scenarios, see BUG 825026)
WARNING: Not using lvmetad because a repair command was run.

Checking for PVs marked as missing (a-m)...
WARNING: Not using lvmetad because a repair command was run.

Verifying files (checkit) on mirror(s) on...
        ---- host-073 ----

Stopping the io load (collie/xdoio) on mirror(s)
Unmounting xfs and removing mnt point on host-073...

Deactivating and removing raid(s)
May 2 14:26:16 host-073 qarshd[7551]: Running cmdline: lvremove -f /dev/black_bird/synced_three_raid10_3legs_1

This appears to be bug 1447097:

[root@host-073 ~]# lvs -a -o +devices
  LV   VG            Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  root rhel_host-073 -wi-ao----  <6.20g                                                   /dev/vda2(205)
  swap rhel_host-073 -wi-ao---- 820.00m                                                   /dev/vda2(0)

[root@host-073 ~]# dmsetup ls
black_bird-synced_three_raid10_3legs_1_rmeta_0-missing_0_0   (253:18)
black_bird-synced_three_raid10_3legs_1_rimage_2-missing_0_0  (253:15)
black_bird-synced_three_raid10_3legs_1_rmeta_2-missing_0_0   (253:16)
rhel_host--073-swap     (253:1)
rhel_host--073-root     (253:0)
black_bird-synced_three_raid10_3legs_1_rimage_0-missing_0_0  (253:17)
"lvchange -an" doesn't deactivate any "*-missing_N_0" transient devices yet.
The "missing" devices should not be appearing like this at all - these situations should be handled in a controlled way via the VG metadata.
(In reply to Alasdair Kergon from comment #4)
> The "missing" devices should not be appearing like this at all - these
> situations should be handled in a controlled way via the VG metadata.

We'll do that as the final solution. For the time being, "lvchange --refresh ..." already has lv_deactivate_any_missing_subdevs(), which we can also call during LV deactivation.
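For reference, the existing code path mentioned above is reachable from the command line with a plain refresh of the (still existing) LV; a sketch using the volume from this report:

  # Refreshing the raid LV reloads its dm tree; this is the path where
  # lv_deactivate_any_missing_subdevs() can drop the transient
  # "-missing_N_0" maps for PVs that are gone.
  lvchange --refresh black_bird/synced_three_raid10_3legs_1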
This does not, however, handle the transitional states that may occur during conversion/resize operations - so it is really only a temporary hack of limited usability, and a proper solution based in the activation code still needs to be found. (The same applies to handling caching on failing devices.)
Upstream commit 94632eb155a0a1814e69b1baa0516323cbbae648
This patch is unfortunately only a somewhat ugly hack. The core issue, which needs a major rewrite, is that the raid code uses 'missing' devices for suspend/resume. That logic is unsupported by the lvm2 code, so 'missing' devices are leaked during the various 'lvconvert' operations that may happen on an 'active' LV that was partially activated. This patch only covers the most obvious case, where the raid was not converted and someone simply ran 'lvchange --refresh' on a raid with missing devices. We need a proper design for dealing with missing devices on 'raid' volumes, one in which we stop misusing partial activation (the only supported next state after partial activation is deactivation). Supporting 'missing' devices for suspend/resume would require a rather major rework of dm tree handling, so that path is not really an option going forward (there is also a further major divergence between the data in the dm table and the data committed on disk).
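Put differently, the only sequence the current design supports is partial activation followed directly by deactivation, with no lvconvert in between. A sketch of the supported flow (volume name taken from this report):

  # Supported: activate the degraded raid LV in partial mode, recover
  # what is needed, then deactivate before attempting any conversion.
  lvchange -ay --activationmode partial black_bird/synced_three_raid10_3legs_1
  # ... copy data off the degraded LV ...
  lvchange -an black_bird/synced_three_raid10_3legs_1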
We may try to redesign the handling of missing devices. In practice we need just a single 'missing' device per VG to stand in for all missing pieces. That would make it easy to always include such a device in the deptree and to ensure no devices are leaked when the real devices reappear. Also note that at the moment there are several bugs in the misuse of --partial activation, and several pieces of code need bugfixing.
Created attachment 1394860 [details]
test_log

Verified. Attaching log with test results.

3.10.0-847.el7.x86_64

lvm2-2.02.177-2.el7                          BUILT: Wed Feb  7 17:39:26 CET 2018
lvm2-libs-2.02.177-2.el7                     BUILT: Wed Feb  7 17:39:26 CET 2018
lvm2-cluster-2.02.177-2.el7                  BUILT: Wed Feb  7 17:39:26 CET 2018
cmirror-2.02.177-2.el7                       BUILT: Wed Feb  7 17:39:26 CET 2018
device-mapper-1.02.146-2.el7                 BUILT: Wed Feb  7 17:39:26 CET 2018
device-mapper-libs-1.02.146-2.el7            BUILT: Wed Feb  7 17:39:26 CET 2018
device-mapper-event-1.02.146-2.el7           BUILT: Wed Feb  7 17:39:26 CET 2018
device-mapper-event-libs-1.02.146-2.el7      BUILT: Wed Feb  7 17:39:26 CET 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 12:07:18 CET 2017
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0853