Description of problem:
After a successful device failure and mirror repair, helter_skelter re-enables the failed device(s) and then assumes they are automatically added back into the VG before using them in an upconvert. That assumption appears to have regressed: lvm now notices the devices have returned, but does not allow them to be used again until a pvscan is run. This was the behaviour in early RHEL5, it changed later in RHEL5, and it now appears to have changed again, which makes it difficult to keep tests up to date.
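A minimal sketch of the workaround the test now appears to need. The device, VG, and LV names are the ones from this report; the exact upconvert call and the leg count are assumptions. The run() wrapper only echoes the commands, so the sketch can be inspected without LVM or real devices present:

```shell
# run() echoes its arguments instead of executing them when DRY_RUN is
# set, so this sketch is safe to run without LVM or the devices present.
run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }
DRY_RUN=1

legnum=1   # assumption: number of mirror legs to restore

# After re-enabling the failed devices, a pvscan is now needed before
# lvm will use them again (device/VG/LV names taken from this report):
run pvscan
run lvconvert -m "$legnum" -b helter_skelter/syncd_primary_log_2legs_1 /dev/sdg1 /dev/sdh1
```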
[root@taft-01 ~]# lvscan
WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 8
Missing device /dev/sdh1 reappeared, updating metadata for VG helter_skelter to version 8.
Missing device /dev/sdg1 reappeared, updating metadata for VG helter_skelter to version 8.
ACTIVE '/dev/helter_skelter/syncd_primary_log_2legs_1' [600.00 MiB] inherit
ACTIVE '/dev/vg_taft01/lv_root' [32.30 GiB] inherit
ACTIVE '/dev/vg_taft01/lv_home' [25.62 GiB] inherit
ACTIVE '/dev/vg_taft01/lv_swap' [9.83 GiB] inherit
Version-Release number of selected component (if applicable):
Version 2.02.67+ (custom build by Brassow)
Actually, I am not sure. Treatment of inconsistent metadata should be the same as it was in later RHEL5 -- at least I don't remember changing any of this. We have a rudimentary test for this that has not caught any change in behaviour in the past 12 months, but maybe we need to augment it to cover some further scenarios.
I'll look into reproducing the problem. Corey, can you maybe provide the metadata from the system after the repair but before the inconsistent metadata is corrected? Thanks! (It should be available in the metadata backup directory.)
Posting everything found in /etc/lvm/[backup|cache], before a pvscan.
Created attachment 427002
Created attachment 427004
Created attachment 427006
Created attachment 427007
Corey, what command do you use to upconvert the mirror? I have written the following script:
aux prepare_vg 3
lvcreate -m 1 --ig -L 1 -n 2way $vg $dev1 $dev2 $dev3:0
echo n | lvconvert --repair $vg/2way 2>&1 | tee 2way.out
lvs -a -o +devices | not grep unknown
# the device is linear at this point
lvconvert -m 1 $vg/2way $dev1 $dev2 $dev3:0
check mirror $vg 2way $dev3
and it seems to work as expected with the current CVS tree; the last lvconvert reports:
+ lvconvert -m 1 LVMTEST28808vg/2way /srv/build/lvm2/cvs-upstream/default/test/LVMTEST28808.YJdmEzPsP6/dev/mapper/LVMTEST28808pv1 /srv/build/lvm2/cvs-upstream/default/test/LVMTEST28808.YJdmEzPsP6/dev/mapper/LVMTEST28808pv2 /srv/build/lvm2/cvs-upstream/default/test/LVMTEST28808.YJdmEzPsP6/dev/mapper/LVMTEST28808pv3:0
WARNING: Inconsistent metadata found for VG LVMTEST28808vg - updating to use version 8
Missing device /srv/build/lvm2/cvs-upstream/default/test/LVMTEST28808.YJdmEzPsP6/dev/mapper/LVMTEST28808pv2 reappeared, updating metadata for VG LVMTEST28808vg to version 8.
WARNING: This metadata update is NOT backed up
WARNING: This metadata update is NOT backed up
the volume is mirrored again after the lvconvert, as checked by the last line of the script.
'lvconvert -m $legnum -b $vg/$mirror @pvlist'
The pvlist includes the devices that were just failed and re-enabled, so that appears to be the only difference between our two commands.
Hm, could you get the output (ideally with -vvvv) of the failing lvconvert? I.e. your
lvconvert -m $legnum -b $vg/$mirror @pvlist
(with -vvvv added) on a volume group that has inconsistent metadata after the disabled device returned. Something is going wrong with that command -- perhaps it is failing to get a lock, or some other environment dependency is tripping the code. Also, by any chance, are you running in a cluster, or just locally?
Created attachment 428028 [details]
I see. This shows up in the logs:
#metadata/metadata.c:3626 Cannot change VG helter_skelter while PVs are missing.
#metadata/metadata.c:3627 Consider vgreduce --removemissing.
This means that your VG is incomplete at this point, and you cannot upconvert the mirror without first fixing it. It would be interesting to know how you got into this situation.
What we have:
- /dev/sdg is failed
- a mirror write happens, which trips dmeventd, which
- runs lvconvert --repair --use-policies ... this recovers the mirror
- which in turn does (kind of) vgreduce --removemissing ... this removes /dev/sdg from helter_skelter
If you run vgextend at this point, it should notice that /dev/sdg disappeared from helter_skelter, update the inconsistent metadata, and all should be well, assuming all of the above worked.
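The manual recovery path described above can be sketched as follows. The VG and device names are the ones from this report; the sequence is an illustration of the steps, not the exact commands the test harness runs. As before, run() echoes the commands so the sketch is safe to execute without real devices:

```shell
# Echo instead of executing when DRY_RUN is set, so the sketch can be
# inspected without LVM or the failed device present.
run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }
DRY_RUN=1

# 1. Drop the still-missing PV from the VG metadata (the removemissing
#    step that lvconvert --repair is expected to perform internally):
run vgreduce --removemissing helter_skelter

# 2. Once /dev/sdg1 has physically returned, re-register and re-add it:
run pvscan
run vgextend helter_skelter /dev/sdg1
```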
What I have found is that the second step under dmeventd, i.e. the removal of /dev/sdg, never happens with the current code. This is a regression and the likely cause of this bug. I will shortly submit a patch upstream. I have also corrected our upstream tests so this does not happen again...
Fixed upstream in Version 2.02.70 - 6th July 2010: Restore the removemissing behaviour of lvconvert --repair --use-policies.
Fix verified in the latest build.
lvm2-2.02.72-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
lvm2-libs-2.02.72-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
lvm2-cluster-2.02.72-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
udev-147-2.21.el6 BUILT: Mon Jul 12 04:55:00 CDT 2010
device-mapper-1.02.53-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-libs-1.02.53-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-event-1.02.53-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-event-libs-1.02.53-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
cmirror-2.02.72-3.el6 BUILT: Wed Jul 28 15:39:43 CDT 2010
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.