Bug 1025322
Summary: device mapper keeps missing_0_0 devices listed even after the LV/VG containing raid is removed

Product: Red Hat Enterprise Linux 6
Reporter: Nenad Peric <nperic>
Component: lvm2
Assignee: Heinz Mauelshagen <heinzm>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 6.5
CC: agk, cmarthal, dwysocha, heinzm, jbrassow, jcastillo, msnitzer, prajnoha, prockai, thornber, tlavigne, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: lvm2-2.02.143-12.el6
Doc Type: If docs needed, set a value
Clones: 1447097 (view as bug list)
Bug Blocks: 1447097
Type: Bug
Last Closed: 2017-03-21 12:01:48 UTC
Description
Nenad Peric
2013-10-31 13:38:20 UTC
Additional info:

All of this is with lvmetad running. After a reboot, this is what happens as well:

[root@virt-008 ~]# vgs
  VG         #PV #LV #SN Attr   VSize  VFree
  black_bird   7   0   0 wz--n- 69.95g 69.95g
  vg_virt008   1   2   0 wz--n-  7.51g      0

It seems some PVs held the info regarding the removed VG... there are no LVs, however; just the VG is back after the reboot.

[root@virt-008 ~]# dmsetup status
vg_virt008-lv_swap: 0 1671168 linear
vg_virt008-lv_root: 0 14073856 linear

[root@virt-008 ~]# dmsetup ls
vg_virt008-lv_swap      (253:1)
vg_virt008-lv_root      (253:0)

[root@virt-008 ~]# pvs
  PV         VG         Fmt  Attr PSize PFree
  /dev/sda1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdb1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdd1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdg1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdh1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdi1  black_bird lvm2 a--  9.99g 9.99g
  /dev/sdj1  black_bird lvm2 a--  9.99g 9.99g
  /dev/vda2  vg_virt008 lvm2 a--  7.51g     0

FWIW, I'm hitting this as well. Adding hacks to our tests so they do not trip on this issue until it is solved.

The problem can be isolated to updating an LV that has missing devices in it, which is what happens in vgreduce --removemissing --force. The repair triggered on the RAID10 device has only 1 spare available while 3 devices are missing, so dmeventd replaces 1 of the bad devices but leaves 2 "holes" that still refer to the currently missing PVs (so if the PVs are plugged back in at this point, the array will pick them up). The activation code fills those "holes" with the missing_0_0 error devices, because we want the LV to stay active after the repair. The same would happen if the array were only refreshed instead of repaired.

The problem shows up when vgreduce --removemissing --force permanently removes those missing devices with error segments, in lv_raid_remove_missing. The LV is currently active and points to devices that could not be found, so missing_0_0 devices were substituted in. Now lv_raid_remove_missing changes the metadata of the LV and calls suspend; at this point the missing_0_0 devices are still part of the LV, since the previous version of the metadata is used there. At the time of the following resume, however, the missing_0_0 devices are no longer referenced anywhere: the metadata for the LV no longer refers to any missing devices. The missing_0_0 devices become unhinged, i.e. "top level"; they are resumed, but they should instead be removed during the resume, since the newly active tables no longer reference them.

I am looking at the activation code to see if I can coax it into removing the unhinged devices during resume.

A simplified reproducer:

aux prepare_vg 2
lvcreate --type raid1 -m 1 -n lv -l 1 $vg
aux disable_dev "$dev1"
lvchange --refresh $vg/lv --activationmode partial
dmsetup ls --tree
vgreduce --removemissing --force $vg -vvvv
dmsetup ls --tree
lvchange -an $vg/lv
dmsetup table | not grep $vg

I was slightly wrong there: the missing_0_0 devices are not added to the DM tree based on LVM metadata during the suspend in lv_raid_remove_missing; they are added by _add_dev in libdm-deptree.c, since they show up as deps of the existing rimage/rmeta nodes. So at this point LVM has no idea that any missing_0_0 devices are involved at all.

Created attachment 936052 [details]
debug log for lv_raid_remove_missing
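
As a stopgap while this is unfixed, the stale maps can be inspected and removed by hand with dmsetup. A minimal sketch, assuming the leftover maps follow the <vg>-<lv>_rimage/rmeta_N-missing_0_0 naming seen above, map only to the error target, and are not held open by anything (the device name used below is taken from the transcripts in this report):

# list leftover error devices left behind after the VG/LV removal
dmsetup ls | grep -- '-missing_0_0'

# each stale map should show an "error" target and an open count of 0
dmsetup table black_bird-synced_three_raid10_3legs_1_rmeta_0-missing_0_0
dmsetup info -c -o name,open black_bird-synced_three_raid10_3legs_1_rmeta_0-missing_0_0

# remove all of them once nothing references them anymore
dmsetup ls | awk '/-missing_0_0/ { print $1 }' | xargs -r -n 1 dmsetup remove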
Let's check this with recent versions...

This still exists in the latest 6.8.

Deactivating and removing raid(s)
Cleaning up missing dm devices that are still around...

[root@host-113 ~]# lvs -a -o +devices
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lv_root vg_host113 -wi-ao----   6.71g                                                     /dev/vda2(0)
  lv_swap vg_host113 -wi-ao---- 816.00m                                                     /dev/vda2(1718)

[root@host-113 ~]# dmsetup ls
black_bird-synced_three_raid10_3legs_1_rmeta_0-missing_0_0    (253:18)
black_bird-synced_three_raid10_3legs_1_rimage_2-missing_0_0   (253:15)
black_bird-synced_three_raid10_3legs_1_rmeta_2-missing_0_0    (253:16)
vg_host113-lv_swap      (253:1)
vg_host113-lv_root      (253:0)
black_bird-synced_three_raid10_3legs_1_rimage_0-missing_0_0   (253:17)

2.6.32-610.el6.x86_64
lvm2-2.02.140-3.el6                           BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-libs-2.02.140-3.el6                      BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-cluster-2.02.140-3.el6                   BUILT: Thu Jan 21 05:40:10 CST 2016
udev-147-2.66.el6                             BUILT: Mon Jan 18 02:42:20 CST 2016
device-mapper-1.02.114-3.el6                  BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-libs-1.02.114-3.el6             BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-1.02.114-3.el6            BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-libs-1.02.114-3.el6       BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-persistent-data-0.6.0-2.el6     BUILT: Thu Jan 21 02:40:25 CST 2016
cmirror-2.02.140-3.el6                        BUILT: Thu Jan 21 05:40:10 CST 2016

Upstream commit 95d68f1d0e16 (and the kernel patch "[dm-devel][PATCH] dm raid: fix transient device failure processing", whose backport to RHEL6 is being worked on).

In comment #11, the transient failure processing is mentioned as the fix. Does that mean the original test case used to find this needs to be altered to use the 'lvchange --refresh' cmd? When I run the original case with or without lvmetad running, I still end up with _missing_ dm devices after the LV and VG removal with the latest kernel/lvm2. Below is the only way I saw the _missing_ dm devices disappear. Is this the correct fix sequence to test? If so, are the "Device mismatch detected" warnings expected?

2.6.32-683.el6.x86_64
lvm2-2.02.143-12.el6                          BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-libs-2.02.143-12.el6                     BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-cluster-2.02.143-12.el6                  BUILT: Wed Jan 11 09:35:04 CST 2017
udev-147-2.73.el6_8.2                         BUILT: Tue Aug 30 08:17:19 CDT 2016
device-mapper-1.02.117-12.el6                 BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-libs-1.02.117-12.el6            BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-1.02.117-12.el6           BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-libs-1.02.117-12.el6      BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-persistent-data-0.6.2-0.1.rc7.el6   BUILT: Tue Mar 22 08:58:09 CDT 2016
cmirror-2.02.143-12.el6                       BUILT: Wed Jan 11 09:35:04 CST 2017

### This is after the failure and re-enable of raid 10 image devices

[root@host-081 ~]# service lvm2-lvmetad status
lvmetad (pid 2788) is running...
[root@host-081 ~]# lvs -a -o +devices
  LV                                     VG         Attr       LSize   Cpy%Sync Devices
  synced_three_raid10_3legs_1            black_bird rwi-aor-p- 504.00m 100.00   synced_three_raid10_3legs_1_rimage_0(0),synced_three_raid10_3legs_1_rimage_1(0),synced_three_raid10_3legs_1_rimage_2(0),synced_three_raid10_3legs_1_rimage_3(0),synced_three_raid10_3legs_1_rimage_4(0),synced_three_raid10_3legs_1_rimage_5(0)
  [synced_three_raid10_3legs_1_rimage_0] black_bird iwi-a-r-p- 168.00m          /dev/sda1(1)
  [synced_three_raid10_3legs_1_rimage_1] black_bird iwi-aor--- 168.00m          /dev/sdf1(1)
  [synced_three_raid10_3legs_1_rimage_2] black_bird iwi-a-r-p- 168.00m          /dev/sdc1(1)
  [synced_three_raid10_3legs_1_rimage_3] black_bird iwi-aor--- 168.00m          /dev/sdd1(1)
  [synced_three_raid10_3legs_1_rimage_4] black_bird iwi-aor--- 168.00m          /dev/sdh1(1)
  [synced_three_raid10_3legs_1_rimage_5] black_bird iwi-aor--- 168.00m          /dev/sde1(1)
  [synced_three_raid10_3legs_1_rmeta_0]  black_bird ewi-a-r-p-   4.00m          /dev/sda1(0)
  [synced_three_raid10_3legs_1_rmeta_1]  black_bird ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_three_raid10_3legs_1_rmeta_2]  black_bird ewi-a-r-p-   4.00m          /dev/sdc1(0)
  [synced_three_raid10_3legs_1_rmeta_3]  black_bird ewi-aor---   4.00m          /dev/sdd1(0)
  [synced_three_raid10_3legs_1_rmeta_4]  black_bird ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_three_raid10_3legs_1_rmeta_5]  black_bird ewi-aor---   4.00m          /dev/sde1(0)

[root@host-081 ~]# dmsetup ls
black_bird-synced_three_raid10_3legs_1_rmeta_0-missing_0_0    (253:18)
black_bird-synced_three_raid10_3legs_1_rmeta_2      (253:6)
black_bird-synced_three_raid10_3legs_1_rimage_1     (253:5)
black_bird-synced_three_raid10_3legs_1_rmeta_1      (253:4)
black_bird-synced_three_raid10_3legs_1_rimage_0     (253:3)
black_bird-synced_three_raid10_3legs_1_rmeta_0      (253:2)
black_bird-synced_three_raid10_3legs_1_rimage_2-missing_0_0   (253:15)
black_bird-synced_three_raid10_3legs_1_rmeta_2-missing_0_0    (253:16)
black_bird-synced_three_raid10_3legs_1_rimage_5     (253:13)
black_bird-synced_three_raid10_3legs_1_rmeta_5      (253:12)
black_bird-synced_three_raid10_3legs_1_rimage_4     (253:20)
black_bird-synced_three_raid10_3legs_1_rmeta_4      (253:19)
black_bird-synced_three_raid10_3legs_1_rimage_3     (253:9)
black_bird-synced_three_raid10_3legs_1_rimage_0-missing_0_0   (253:17)
black_bird-synced_three_raid10_3legs_1              (253:14)
black_bird-synced_three_raid10_3legs_1_rmeta_3      (253:8)
black_bird-synced_three_raid10_3legs_1_rimage_2     (253:7)

[root@host-081 ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  black_bird   7   1   0 wz-pn- 146.97g 145.96g
  vg_host081   1   2   0 wz--n-   7.51g       0

[root@host-081 ~]# lvchange --refresh black_bird/synced_three_raid10_3legs_1
  Refusing refresh of partial LV black_bird/synced_three_raid10_3legs_1. Use '--activationmode partial' to override.

[root@host-081 ~]# lvchange --refresh --activationmode partial black_bird/synced_three_raid10_3legs_1
  PARTIAL MODE. Incomplete logical volumes will be processed.

[root@host-081 ~]# lvs -a -o +devices
  WARNING: Device mismatch detected for black_bird/synced_three_raid10_3legs_1_rimage_0 which is accessing /dev/sda1 instead of (null).
  WARNING: Device mismatch detected for black_bird/synced_three_raid10_3legs_1_rmeta_0 which is accessing /dev/sda1 instead of (null).
  WARNING: Device mismatch detected for black_bird/synced_three_raid10_3legs_1_rimage_2 which is accessing /dev/sdc1 instead of (null).
  WARNING: Device mismatch detected for black_bird/synced_three_raid10_3legs_1_rmeta_2 which is accessing /dev/sdc1 instead of (null).
  LV                                     VG         Attr       LSize   Cpy%Sync Devices
  synced_three_raid10_3legs_1            black_bird rwi-aor-p- 504.00m 100.00   synced_three_raid10_3legs_1_rimage_0(0),synced_three_raid10_3legs_1_rimage_1(0),synced_three_raid10_3legs_1_rimage_2(0),synced_three_raid10_3legs_1_rimage_3(0),synced_three_raid10_3legs_1_rimage_4(0),synced_three_raid10_3legs_1_rimage_5(0)
  [synced_three_raid10_3legs_1_rimage_0] black_bird iwi-a-r-p- 168.00m          /dev/sda1(1)
  [synced_three_raid10_3legs_1_rimage_1] black_bird iwi-aor--- 168.00m          /dev/sdf1(1)
  [synced_three_raid10_3legs_1_rimage_2] black_bird iwi-a-r-p- 168.00m          /dev/sdc1(1)
  [synced_three_raid10_3legs_1_rimage_3] black_bird iwi-aor--- 168.00m          /dev/sdd1(1)
  [synced_three_raid10_3legs_1_rimage_4] black_bird iwi-aor--- 168.00m          /dev/sdh1(1)
  [synced_three_raid10_3legs_1_rimage_5] black_bird iwi-aor--- 168.00m          /dev/sde1(1)
  [synced_three_raid10_3legs_1_rmeta_0]  black_bird ewi-a-r-p-   4.00m          /dev/sda1(0)
  [synced_three_raid10_3legs_1_rmeta_1]  black_bird ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_three_raid10_3legs_1_rmeta_2]  black_bird ewi-a-r-p-   4.00m          /dev/sdc1(0)
  [synced_three_raid10_3legs_1_rmeta_3]  black_bird ewi-aor---   4.00m          /dev/sdd1(0)
  [synced_three_raid10_3legs_1_rmeta_4]  black_bird ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_three_raid10_3legs_1_rmeta_5]  black_bird ewi-aor---   4.00m          /dev/sde1(0)

[root@host-081 ~]# dmsetup ls
black_bird-synced_three_raid10_3legs_1_rmeta_2      (253:6)
black_bird-synced_three_raid10_3legs_1_rimage_1     (253:5)
black_bird-synced_three_raid10_3legs_1_rmeta_1      (253:4)
black_bird-synced_three_raid10_3legs_1_rimage_0     (253:3)
black_bird-synced_three_raid10_3legs_1_rmeta_0      (253:2)
black_bird-synced_three_raid10_3legs_1_rimage_5     (253:13)
black_bird-synced_three_raid10_3legs_1_rmeta_5      (253:12)
black_bird-synced_three_raid10_3legs_1_rimage_4     (253:20)
black_bird-synced_three_raid10_3legs_1_rmeta_4      (253:19)
black_bird-synced_three_raid10_3legs_1_rimage_3     (253:9)
black_bird-synced_three_raid10_3legs_1              (253:14)
black_bird-synced_three_raid10_3legs_1_rmeta_3      (253:8)
black_bird-synced_three_raid10_3legs_1_rimage_2     (253:7)

Marking verified in the latest rpms/kernel for non-stacked raid cases. As listed in comment #16, these images can still exist after the volume has been removed in scenarios where the raid is below a thin pool or thin meta device (see bug 1418478).

2.6.32-688.el6.x86_64
lvm2-2.02.143-12.el6                          BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-libs-2.02.143-12.el6                     BUILT: Wed Jan 11 09:35:04 CST 2017
lvm2-cluster-2.02.143-12.el6                  BUILT: Wed Jan 11 09:35:04 CST 2017
udev-147-2.73.el6_8.2                         BUILT: Tue Aug 30 08:17:19 CDT 2016
device-mapper-1.02.117-12.el6                 BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-libs-1.02.117-12.el6            BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-1.02.117-12.el6           BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-event-libs-1.02.117-12.el6      BUILT: Wed Jan 11 09:35:04 CST 2017
device-mapper-persistent-data-0.6.2-0.1.rc7.el6   BUILT: Tue Mar 22 08:58:09 CDT 2016

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0798.html
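
For reference, a rough sketch of the stacked layout mentioned in the verification note above (raid1 images underneath a thin pool), where leftover maps can reportedly still appear per bug 1418478. The LV names, sizes, and the failure/repair steps are illustrative assumptions, not taken from this report:

# build a thin pool whose data and metadata LVs are raid1 volumes
lvcreate --type raid1 -m 1 -L 500M -n pooldata black_bird
lvcreate --type raid1 -m 1 -L 8M   -n poolmeta black_bird
lvconvert -y --thinpool black_bird/pooldata --poolmetadata black_bird/poolmeta
lvcreate --thin -V 100M -n thinvol black_bird/pooldata

# fail a PV under one raid leg, let dmeventd repair/refresh the array,
# then remove the LVs and VG, and check whether any error maps were left behind
dmsetup ls | grep -- '-missing_0_0' || echo "no stale missing_0_0 maps"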