Bug 702308

Summary: Missing PVs lead to corrupted metadata, and "vgreduce --removemissing --force" is unable to correct the metadata
Product: Red Hat Enterprise Linux 4
Reporter: Dave Wysochanski <dwysocha>
Component: lvm2
Assignee: Milan Broz <mbroz>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4.9
CC: agk, bmr, cmarthal, cww, dwysocha, heinzm, jbrassow, mbroz, mjuricek, mkhusid, prajnoha, prockai, pvrabec, ssaha, thornber, zkabelac
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: lvm2-2.02.42-11.el4
Doc Type: Bug Fix
Doc Text:
Previously, if more than one physical volume was missing from a volume group, the metadata that was written could contain the wrong name for the missing physical volumes, and this was later detected as incorrect metadata for the whole volume group. Once this happened, the volume group could not be repaired or removed, even with commands such as "vgreduce --removemissing --force" or "vgremove --force". For recovery procedures, refer to https://access.redhat.com/kb/docs/DOC-55800. This fix uses the physical volume UUID to reference physical volumes when exporting metadata, which resolves the problem.
Last Closed: 2011-08-18 13:04:36 UTC
Bug Blocks: 727578
Attachments:
  Script to attempt to repro the customer's failure
  This script reproduces the customer failure.
  Current KCS / Kbase article that describes the failure and recovery procedure

Description Dave Wysochanski 2011-05-05 10:24:41 UTC
Created attachment 497030 [details]
lvmdump of customer system

Description of problem:
In some customer setups, when a PV that is part of an LV goes missing, "vgreduce --removemissing --force" does not produce a consistent VG, so the customer is then unable to remove the VG.


Version-Release number of selected component (if applicable):
lvm2-2.02.42-9.el4

How reproducible:
I had a hard time reproducing it, but I'll attach all the info from the customer's system, including an lvmdump and verbose output of the commands.

Steps to Reproduce:
1. create a vg from multiple pvs
2. create at least one lv on the vg
3. remove at least one of the pvs in the lv
4. try using vgreduce, vgreduce --removemissing, and vgreduce --removemissing --force.
  
Actual results:
Unable to use vgreduce to make a consistent vg to remove.
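
For reference, a minimal sketch of the steps above using loop-backed scratch devices (device names, sizes, and the exact failure-inducing sequence here are illustrative; the scripts attached to this bug are the real reproducers):

# dd if=/dev/zero of=/tmp/pv0.img bs=1M count=64
# dd if=/dev/zero of=/tmp/pv1.img bs=1M count=64
# dd if=/dev/zero of=/tmp/pv2.img bs=1M count=64
# losetup /dev/loop0 /tmp/pv0.img
# losetup /dev/loop1 /tmp/pv1.img
# losetup /dev/loop2 /tmp/pv2.img
# pvcreate /dev/loop0 /dev/loop1 /dev/loop2
# vgcreate testvg /dev/loop0 /dev/loop1 /dev/loop2
# lvcreate -n testlv -L 100M testvg
# vgchange -an testvg
# losetup -d /dev/loop1
# losetup -d /dev/loop2
# vgreduce --removemissing testvg
# vgreduce --removemissing --force testvg

Note that more than one PV apparently has to go missing, and metadata has to be rewritten while they are missing, for the failure to appear.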

Comment 1 Dave Wysochanski 2011-05-05 10:26:09 UTC
Created attachment 497031 [details]
output of verbose commands

Archive of the verbose command output, created as follows:
# vgreduce -vvvv --removemissing --force local_3par-dg &> vgreduce_removemissing_force.txt
# vgremove -vvvv local_3par-dg &> vgremove.txt
# pvs -vvvv &> pvs.txt
# tar -cvjf output.tar.bz2 *.txt

Comment 2 Dave Wysochanski 2011-05-05 10:28:45 UTC
Created attachment 497032 [details]
metadata for dm devices, not picked up by lvmdump

lvmdump did not pick up the metadata on these dm devices (they are dm-multipath devices), so we asked the customer to do the following:
# dd if=/dev/dm-0 of=/tmp/dm-0.out bs=1M count=1
# dd if=/dev/dm-1 of=/tmp/dm-1.out bs=1M count=1
# dd if=/dev/dm-2 of=/tmp/dm-2.out bs=1M count=1
# dd if=/dev/dm-3 of=/tmp/dm-3.out bs=1M count=1
# dd if=/dev/dm-4 of=/tmp/dm-4.out bs=1M count=1
# dd if=/dev/dm-5 of=/tmp/dm-5.out bs=1M count=1
# dd if=/dev/dm-6 of=/tmp/dm-6.out bs=1M count=1
# tar -cvjf /tmp/metadata.tar.bz2 /tmp/*.out
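
The LVM text metadata sits near the start of each PV, so the first megabyte is enough to capture it; the readable metadata can then be pulled out of one of these dumps with something like:

# strings /tmp/dm-0.out | less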

Comment 3 Dave Wysochanski 2011-05-05 10:57:41 UTC
Created attachment 497036 [details]
Script to attempt to repro the customer's failure

This script gets somewhat close to the customer's failure.  I created it based on the history of the VG on the customer's system (from the lvmdump archive, by grepping out the command history).  It produces errors similar to what the customer was seeing, but does not reproduce the vgreduce failure.

Comment 4 Zdenek Kabelac 2011-05-05 11:14:58 UTC
Looks very similar to Bug 643538.

Comment 5 Dave Wysochanski 2011-05-05 11:52:09 UTC
Created attachment 497045 [details]
This script reproduces the customer failure.

This script reproduces the customer failure.  After the PVs go missing, one more metadata-writing command has to run to create the corrupt metadata: multiple entries named "pvNN", each carrying the "MISSING" flag.
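
To make the symptom concrete, here is an illustrative fragment (not taken from the customer dump; names, UUIDs, and values are placeholders) of what the broken text metadata looks like, with two different missing PVs both exported under the same "pvNN" name:

physical_volumes {

    pv1 {
        id = "AAAAAA-AAAA-AAAA-AAAA-AAAA-AAAA-AAAAAA"
        device = "unknown device"
        status = ["ALLOCATABLE"]
        flags = ["MISSING"]
        ...
    }

    pv1 {
        id = "BBBBBB-BBBB-BBBB-BBBB-BBBB-BBBB-BBBBBB"
        device = "unknown device"
        status = ["ALLOCATABLE"]
        flags = ["MISSING"]
        ...
    }
}

Because two sections share the same name, the metadata is later rejected as incorrect for the whole VG, which is why even the --force variants of vgreduce and vgremove refuse to operate on it.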

Comment 6 Milan Broz 2011-05-05 11:59:52 UTC
That problem was fixed long ago in newer packages, but unfortunately not in RHEL 4. From the a7cac2463c15c915636e511887f022b8cb63a97e commit log:
    Use PV UUID in hash for device name when exporting metadata.

    Currently code uses pv_dev_name() for hash when getting internal
    "pvX" name.

    This produce corrupted metadata if PVs are missing, pv->dev
    is NULL and all these missing devices returns one name
    (using "unknown device" for all missing devices as hash key).

I see quite a serious problem here: when a simple VG with several PVs loses several of those PVs, the code apparently generates wrong metadata, and that metadata is not parsable, so it can lead to loss of the whole VG.

I think this bug should be fixed in a post-RHEL 4.9 update; setting dev_ack.

Comment 14 Dave Wysochanski 2011-05-06 13:58:38 UTC
Created attachment 497362 [details]
Current KCS / Kbase article that describes the failure and recovery procedure

Since there are no plans (and it may not be possible) to have the LVM tools fix up metadata that is mangled in this way, I've created an article describing the possible recovery procedures.
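
The article is the authoritative reference. As a rough outline (not a substitute for the article), recovery generally revolves around restoring a pre-corruption copy of the metadata from /etc/lvm/archive, for example (the archive file name below is hypothetical):

# vgcfgrestore --list local_3par-dg
# vgcfgrestore -f /etc/lvm/archive/local_3par-dg_00042-1234567890.vg local_3par-dg

If PVs are permanently gone, the missing ones may first need to be recreated with pvcreate --uuid ... --restorefile; the article covers the details and caveats.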

Comment 15 Dave Wysochanski 2011-06-15 15:36:49 UTC
Milan - it looks like there is no 4.10 planned.  Do you want to push this for release through some other mechanism (e.g. an async errata)?

Comment 18 Milan Broz 2011-08-03 17:09:21 UTC
Fixed in lvm2-2.02.42-11.el4.

Comment 26 errata-xmlrpc 2011-08-18 13:04:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1185.html