Bug 702308 - Missing PVs lead to corrupted metadata, and "vgreduce --removemissing --force" is unable to correct the metadata
Summary: Missing PVs lead to corrupted metadata, and "vgreduce --removemissing --force" is unable to correct the metadata
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: lvm2
Version: 4.9
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Milan Broz
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 727578
 
Reported: 2011-05-05 10:24 UTC by Dave Wysochanski
Modified: 2018-11-14 12:50 UTC
CC List: 16 users

Fixed In Version: lvm2-2.02.42-11.el4
Doc Type: Bug Fix
Doc Text:
Previously, if multiple physical volumes were missing from a volume group, the written metadata could contain the wrong name for the missing physical volumes, and this situation was later detected as incorrect metadata for the whole volume group. When this condition occurred, the volume group could not be repaired or removed, even with commands such as "vgreduce --removemissing --force" or "vgremove --force". For recovery procedures, refer to https://access.redhat.com/kb/docs/DOC-55800. This fix enforces the use of the physical volume UUID to reference physical volumes, which resolves the problem.
Clone Of:
: 727578
Environment:
Last Closed: 2011-08-18 13:04:36 UTC
Target Upstream Version:


Attachments
Script to attempt to repro the customer's failure (534 bytes, application/x-shellscript)
2011-05-05 10:57 UTC, Dave Wysochanski
This script reproduces the customer failure. (566 bytes, application/x-shellscript)
2011-05-05 11:52 UTC, Dave Wysochanski
Current KCS / Kbase article that describes the failure and recovery procedure (12.78 KB, application/pdf)
2011-05-06 13:58 UTC, Dave Wysochanski


Links
System ID                                Private   Priority   Status         Summary               Last Updated
Red Hat Knowledge Base (Legacy) 55800    0         None       None           None                  Never
Red Hat Product Errata RHBA-2011:1185    0         normal     SHIPPED_LIVE   lvm2 bug fix update   2011-08-18 13:04:29 UTC

Description Dave Wysochanski 2011-05-05 10:24:41 UTC
Created attachment 497030 [details]
lvmdump of customer system

Description of problem:
In some customer setups, when a PV that is part of an LV goes missing, "vgreduce --removemissing --force" does not make the VG consistent so that it can be removed.  As a result, the customer is unable to remove the VG.


Version-Release number of selected component (if applicable):
lvm2-2.02.42-9.el4

How reproducible:
I had a hard time reproducing it, but I'll attach all the info from the customer's system, including an lvmdump and verbose output of the commands.

Steps to Reproduce:
1. create a vg from multiple pvs
2. create at least one lv on the vg
3. remove at least one of the pvs in the lv
4. try using vgreduce, vgreduce --removemissing, and vgreduce --removemissing --force (see the command sketch below).
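
As an illustration only (device names and sizes below are hypothetical, and the exact command sequence that produced the customer's mangled metadata may differ), the steps above might look like this on a scratch system:

# pvcreate /dev/sdb /dev/sdc /dev/sdd
# vgcreate testvg /dev/sdb /dev/sdc /dev/sdd
# lvcreate -L 2500M -n testlv testvg

The LV size is chosen so it spans more than one PV. One way to simulate a PV going missing is to wipe the LVM label from one of the devices, then attempt the cleanup:

# dd if=/dev/zero of=/dev/sdd bs=512 count=8
# vgreduce testvg /dev/sdd
# vgreduce --removemissing testvg
# vgreduce --removemissing --force testvg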
  
Actual results:
Unable to use vgreduce to make a consistent vg to remove.

Comment 1 Dave Wysochanski 2011-05-05 10:26:09 UTC
Created attachment 497031 [details]
output of verbose commands

archive of the verbose commands, created as follows:
# vgreduce -vvvv --removemissing --force local_3par-dg &> vgreduce_removemissing_force.txt
# vgremove -vvvv local_3par-dg &> vgremove.txt
# pvs -vvvv &> pvs.txt
# tar -cvjf output.tar.bz2 *.txt

Comment 2 Dave Wysochanski 2011-05-05 10:28:45 UTC
Created attachment 497032 [details]
metadata for dm devices, not picked up by lvmdump

lvmdump did not pick up the metadata on these dm devices (they are dm-mp devices), so we asked the customer to do the following:
# dd if=/dev/dm-0 of=/tmp/dm-0.out bs=1M count=1
# dd if=/dev/dm-1 of=/tmp/dm-1.out bs=1M count=1
# dd if=/dev/dm-2 of=/tmp/dm-2.out bs=1M count=1
# dd if=/dev/dm-3 of=/tmp/dm-3.out bs=1M count=1
# dd if=/dev/dm-4 of=/tmp/dm-4.out bs=1M count=1
# dd if=/dev/dm-5 of=/tmp/dm-5.out bs=1M count=1
# dd if=/dev/dm-6 of=/tmp/dm-6.out bs=1M count=1
# tar -cvjf /tmp/metadata.tar.bz2 /tmp/*.out

Comment 3 Dave Wysochanski 2011-05-05 10:57:41 UTC
Created attachment 497036 [details]
Script to attempt to repro the customer's failure

This script gets somewhat close to the customer's failure.  I created this based on the history of the vg on the customer's system (lvmdump archive, grepping out the command history).  It does produce similar errors to what the customer was seeing, but does not reproduce the vgreduce failure.

Comment 4 Zdenek Kabelac 2011-05-05 11:14:58 UTC
Looks very similar to Bug 643538.

Comment 5 Dave Wysochanski 2011-05-05 11:52:09 UTC
Created attachment 497045 [details]
This script reproduces the customer failure.

This script reproduces the customer failure.  After the PVs go missing, one more command must run and write out the corrupt metadata - multiple entries of "pvNN" with the "MISSING" flag.
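
Since LVM may refuse to parse the mangled metadata, one way to look for these duplicated entries is to inspect the raw metadata area directly, building on the dd copies collected in comment 2 (the device name below is only an example; each PV normally carries a text copy of the VG metadata near the start of the device):

# dd if=/dev/dm-0 bs=1M count=1 2>/dev/null | strings | grep -o 'pv[0-9]* {' | sort | uniq -d
# dd if=/dev/dm-0 bs=1M count=1 2>/dev/null | strings | grep -c MISSING

The first pipeline prints any "pvNN" section name that appears more than once; the second gives a rough count of entries carrying the "MISSING" flag (old metadata copies in the circular buffer can inflate it).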

Comment 6 Milan Broz 2011-05-05 11:59:52 UTC
That problem was fixed long ago in newer packages but unfortunately not in RHEL4.
From the a7cac2463c15c915636e511887f022b8cb63a97e commit log:
    Use PV UUID in hash for device name when exporting metadata.

    Currently code uses pv_dev_name() for hash when getting internal
    "pvX" name.

    This produce corrupted metadata if PVs are missing, pv->dev
    is NULL and all these missing devices returns one name
    (using "unknown device" for all missing devices as hash key).

I see quite a serious problem here - when a simple VG with several PVs experiences the failure of several PVs, the code apparently generates wrong metadata, and this metadata is not parsable, so it can lead to the loss of the whole VG.

I think this bug should be fixed in a post-RHEL4.9 update, dev_ack.

Comment 14 Dave Wysochanski 2011-05-06 13:58:38 UTC
Created attachment 497362 [details]
Current KCS / Kbase article that describes the failure and recovery procedure

Since there are no plans (and it may not be possible) to have the LVM tools fix up metadata that is mangled in this way, I've created an article describing the possible recovery procedures.
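
The attached article is the authoritative reference; as a rough sketch only (the archive file name, PV UUID, and device below are placeholders, and the article's exact steps may differ), recovery in this class of failure generally means going back to a known-good copy of the metadata kept under /etc/lvm/archive or /etc/lvm/backup:

# vgcfgrestore --list local_3par-dg
# pvcreate --uuid <PV-UUID> --restorefile /etc/lvm/archive/<archive-file> /dev/<replacement-device>
# vgcfgrestore -f /etc/lvm/archive/<archive-file> local_3par-dg
# vgreduce --removemissing local_3par-dg

The final vgreduce only applies if the missing PVs are gone for good.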

Comment 15 Dave Wysochanski 2011-06-15 15:36:49 UTC
Milan - it looks like there is no 4.10 planned.  Do you want to push this for release through some other mechanism (e.g. an async errata)?

Comment 18 Milan Broz 2011-08-03 17:09:21 UTC
Fixed in lvm2-2.02.42-11.el4.
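
To check whether a given system already carries the fix, query the installed package and compare it against the build named above:

# rpm -q lvm2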

Comment 26 errata-xmlrpc 2011-08-18 13:04:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1185.html

