Bug 537913

Summary: Volume Group doesn't recover if PV disappears and later reappears
Product: Red Hat Enterprise Linux 5
Version: 5.4
Component: lvm2
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Mikuláš Patočka <mpatocka>
Assignee: Petr Rockai <prockai>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cmarthal, dwysocha, heinzm, iannis, ipilcher, jbrassow, mbroz, prockai, vincent
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-01-13 22:39:51 UTC

Description Mikuláš Patočka 2009-11-16 19:05:38 UTC
This bug happens with LVM2 2.02.54.

Assume the following setup:

* LVM is compiled with dmeventd
* we have a mirror with two legs (and possibly a log)
* we have another, linear logical volume; part of it is on the same PV as one mirror leg
* the PV holding the linear volume and the mirror leg fails

Now, dmeventd correctly converts the mirror to a non-mirrored (linear) volume. It marks the failed PV with the MISSING flag in the metadata. When the failed disk is added back, the PV is still marked MISSING and there is no easy way to clear the MISSING flag and recover the volume group.

* Using vgreduce --force would remove the MISSING flag, and the PV could later be re-added with vgextend. However, this action would destroy any logical volumes that are partially on the missing PV, so it is unusable in a production environment.
* Dumping the metadata with vgcfgbackup, clearing the MISSING flag with a text editor, and restoring the metadata with vgcfgrestore works and recovers the volume group, but an administrator facing a non-working server cannot be expected to discover this quickly (see the sketch below).
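
For illustration, a minimal sketch of that manual workaround, assuming a VG named "vg00" and a reappeared PV "/dev/sdc1" (both names are hypothetical, and the exact flags syntax may differ between lvm2 versions):

  # 1. Dump the current VG metadata to a text file.
  vgcfgbackup -f /tmp/vg00.meta vg00

  # 2. Remove the MISSING keyword from the flags line of the affected PV,
  #    e.g. change 'flags = ["MISSING"]' to 'flags = []'.
  sed -i 's/"MISSING"//' /tmp/vg00.meta

  # 3. Write the edited metadata back and re-activate the VG.
  vgcfgrestore -f /tmp/vg00.meta vg00
  vgchange -ay vg00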

The impact of this bug is quite serious: simply unplugging a disk cable and plugging it back in may make the server unbootable. That's why I am suggesting it as a blocker.

I suggest ignoring the MISSING flag when reading the metadata and instead always checking whether the PV is actually present, setting the MISSING flag according to that check.

Comment 2 Petr Rockai 2009-11-18 12:35:30 UTC
Hi, this is intentional, and I think the right fix is to introduce a command like "vgextend --recover" that'd erase the MISSING flag from the PV in question.

This behaviour exists because once a device has gone missing, all bets are off as to what may have changed on it. I cannot currently come up with a convincing example of where this would lead to a catastrophic failure, so it may be that this degree of paranoia was not completely justified. Alasdair and I simply agreed that this is the safest course of action with regard to integrity.

I agree that it should be possible to easily put that device back, though, at least manually if not automatically.

Comment 3 Mikuláš Patočka 2009-11-18 14:09:23 UTC
But then vgextend --recover would have to be executed from the initrd. And it isn't.

Example: the root filesystem is on pv1. There is a mirror with one leg on pv1 and the other on pv2. You unplug and replug pv1 and reset the server. The server won't boot. To make it boot, you have to insert the rescue CD and edit the metadata (and if the admin doesn't know exactly what to do with the metadata, he faces several hours of downtime).

As for your argument that "you may not know what happened to pv1 while it was unplugged":
- It is improbable that pv1 was inserted into another computer (if the admin had intended to use the PV in another computer, he would have removed it cleanly with pvmove and vgreduce before unplugging it).
- If pv1 was inserted into another computer, LVM on that computer would see it as an incomplete VG and would not allow it to be modified, so it cannot be secretly modified. The only allowed modification is to kill the invisible LVs with vgreduce --removemissing --force and convert it into a complete VG, but that happens only at the admin's request. If the admin has not converted pv1 into a complete VG, you can be sure that it has not been modified.

Comment 4 Petr Rockai 2009-11-19 09:09:31 UTC
Well, when PVs are missing and you issue vgchange -a y, all the LVs in the VG that are complete are activated as usual. You just cannot edit the metadata until you fix the missing PV. There is, of course, a problem when the root LV is on a missing volume (which seems to be the problem you are pointing out).
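
To illustrate this behaviour, a hedged sketch (the VG name "vg00" is hypothetical):

  # All complete LVs are activated; LVs with segments on the missing PV stay inactive.
  vgchange -a y vg00
  lvs -o lv_name,lv_attr vg00

  # If data on a partially-available LV must be salvaged, partial activation maps
  # the missing areas to an error target (historically read-only):
  vgchange -a y --partial vg00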

Well, although this may be of some concern, you should probably note that in previous versions of RHEL this root LV would simply have been erased in the scenario outlined, with no easy way back -- you could probably get vgcfgrestore to fix it, if you kept the LVM archive directory outside your root filesystem and the VG in general (which is, let's admit it, not the usual case). So I think this is still an improvement.

You should probably not be yanking out the root filesystem of a running server either way (you cannot do that safely, LVM or not, except perhaps in some very special cases). Providing reasonably easy recovery should be, IMHO, acceptable.

Comment 5 Mikuláš Patočka 2009-11-19 09:45:35 UTC
"in previous versions of RHEL, this root LV would be simply erased in the
scenario outlined"

That's terrible. Why does dmeventd do that? Why can't it just kill the failed mirror leg and not touch anything else?

"You should probably not be yanking the root filesystem of a running server
either way"

But it may happen unintentionally. For example, it once happened to me that on a server-class mainboard with a server-class SCSI controller (MPT), a loose contact on the PCI-X bus caused the card to be lost while the server was running --- removing the card and cleaning the contacts fixed it. Another possibility is overheating: disks automatically turn themselves off when they overheat.

I still remember fixing an unbootable MD-RAID5 array after the server room overheated: two disks in the array turned themselves off, and that stupid kernel driver wrote metadata to the remaining disks declaring the array disintegrated... and after the server room cooled down and all the disks came back up, the system didn't boot. Raid-tools couldn't restore it at all (mdadm maybe could, but it was not installed there; it was new and not yet common at the time). I finally resolved the problem by manually editing the metadata with a disk editor --- but it cost a day of downtime (on a 300-user server) to find out how to edit it.

You definitely must not damage data after temporary failures.

Comment 8 Ludek Smid 2010-03-11 12:21:26 UTC
Since it is too late to address this issue in RHEL 5.5, it has been proposed for RHEL 5.6.  Contact your support representative if you need to escalate this issue.

Comment 10 Alasdair Kergon 2010-07-09 21:54:02 UTC
"When the failed disk is added back
again, the PV is still marked as MISSING and there is no easy way how to clear
the MISSING flag and recover the volume group."

So let's look at the next step of this.

A device disappears.  MISSING_PV gets set.  The device reappears.

What is the process we offer now for clearing the MISSING_PV flags?
- If the metadata had not been changed since the device disappeared.

What if the VG had already been cleaned up to remove the PV, and then the PV reappears and you decide you do want to merge some LVs on it back into the original VG?
- Could it be renamed to appear as a separate VG still containing LVs, cleaned up, then vgmerged?

Does it make any difference if more than one PV is affected?

[Points for discussion. No specific proposals yet.]

Comment 11 Petr Rockai 2010-07-15 08:46:26 UTC
In the first case, nothing happens: if metadata did not change, there is nothing to do. The PV was not flagged and the VG continues undisrupted.

When the PV reappears but is already empty and removed from the VG, it is kicked out of the VG and the VG continues to work properly as well.

The problem arises when the metadata *did* change (presumably due to repairing mirrors) but the PV is not empty (if it is empty and stays in the VG, it can simply be vgreduced away). One option is vgreduce --removemissing --force, which is likely not what you want.
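
As a hedged sketch of the empty-PV case above (device and VG names are hypothetical):

  # Drop missing PVs that hold no allocated extents from the VG.
  vgreduce --removemissing vg00

  # Re-initialise the now-healthy device and add it back as a fresh PV
  # (pvcreate may need -ff if a stale PV label remains on the device).
  pvcreate /dev/sdc1
  vgextend vg00 /dev/sdc1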

A command for pulling the PV back into normal use should be provided for these situations, I would think. My initial thought is "vgextend --restore vg pv". Once the UI is decided, this should be easy to implement.

Comment 12 Petr Rockai 2010-10-12 21:52:06 UTC
I have implemented vgextend --restore, and the patch is pending review. If it is accepted upstream, we will have the following scenarios for transient failures:

- there are no mirrors involving this PV and no manual metadata edits happened while the device was gone: handled transparently, LVM does not do anything
- in all other cases, the PV will be flagged as MISSING in the metadata; mirrored volumes will be automatically repaired, and any non-mirrored LVs using the PV will be made inaccessible

In the latter case, it is enough to run vgextend --restore vg pv to make the non-mirrored LVs accessible again. If a non-mirrored root volume happened to live on a failed PV, this will still prevent a normal boot. There may be further ways to improve that, but they are out of scope for 5.6, in my opinion. I think that vgextend --restore reasonably improves the situation relative to the original bug report.

Comment 13 Petr Rockai 2010-10-13 10:40:08 UTC
Checked in upstream.

Comment 14 Milan Broz 2010-10-15 15:09:04 UTC
vgextend --restoremissing added to lvm2-2.02.74-1.el5.

Comment 16 Corey Marthaler 2010-11-09 22:49:23 UTC
The basic 'vgextend --restoremissing' test case appears to work; marking this verified.

lvm2-2.02.74-2.el5    BUILT: Tue Nov  9 08:03:06 CST 2010
lvm2-cluster-2.02.74-3.el5    BUILT: Tue Nov  9 08:01:59 CST 2010
device-mapper-1.02.55-2.el5    BUILT: Tue Nov  9 06:41:00 CST 2010
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009


[root@taft-01 ~]# vgcreate taft /dev/sd[bcde]1
  Volume group "taft" successfully created

[root@taft-01 ~]# lvcreate -m 1 -n mirror -L 500M taft
  Logical volume "mirror" created

[root@taft-01 ~]# lvcreate -n linear -L 500M taft /dev/sdc1
  Logical volume "linear" created

[root@taft-01 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize   Log         Copy%  Devices
  linear            taft       -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft       mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft       iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft       iwi-ao 500.00M                    /dev/sdc1(0)
  [mirror_mlog]     taft       lwi-ao   4.00M                    /dev/sde1(0)

[root@taft-01 ~]# echo offline > /sys/block/sdc/device/state

[root@taft-01 ~]# pvscan
  /dev/taft/linear: read failed after 0 of 4096 at 524222464: Input/output error
  [...]
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid TnnUdV-JLu3-QwBP-hvbl-bE4p-S8Y3-Wjhx2G.
  PV /dev/sdb1        VG taft            lvm2 [135.66 GB / 135.18 GB free]
  PV unknown device   VG taft            lvm2 [135.66 GB / 134.69 GB free]
  PV /dev/sdd1        VG taft            lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sde1        VG taft            lvm2 [135.66 GB / 135.66 GB free]

[root@taft-01 ~]# dd if=/dev/zero of=/dev/taft/mirror count=2
2+0 records in
2+0 records out
1024 bytes (1.0 kB) copied, 1.59008 seconds, 0.6 kB/s

[root@taft-01 ~]# lvs -a -o +devices
  /dev/taft/linear: read failed after 0 of 4096 at 524222464: Input/output error
  [...]
  /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid TnnUdV-JLu3-QwBP-hvbl-bE4p-S8Y3-Wjhx2G.
  LV                VG         Attr   LSize   Log         Copy%  Devices
  linear            taft       -wi-a- 500.00M                    unknown device(125)
  mirror            taft       mwi-a- 500.00M mirror_mlog 47.20  mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft       Iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft       Iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft       lwi-ao   4.00M                    /dev/sde1(1)

[root@taft-01 ~]# echo running > /sys/block/sdc/device/state

[root@taft-01 ~]# lvs -a -o +devices
  WARNING: Inconsistent metadata found for VG taft - updating to use version 11
  Missing device /dev/sdc1 reappeared, updating metadata for VG taft to version 11.
  Device still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
  LV                VG         Attr   LSize   Log         Copy%  Devices
  linear            taft       -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft       mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft       iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft       iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft       lwi-ao   4.00M                    /dev/sde1(1)

[root@taft-01 ~]# vgextend --restoremissing taft /dev/sdc1
  Volume group "taft" successfully extended

[root@taft-01 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize   Log         Copy%  Devices
  linear            taft       -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft       mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft       iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft       iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft       lwi-ao   4.00M                    /dev/sde1(1)

Comment 18 errata-xmlrpc 2011-01-13 22:39:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0052.html