Bug 537913
| Summary: | Volume Group doesn't recover if PV disappears and later reappears | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Mikuláš Patočka <mpatocka> |
| Component: | lvm2 | Assignee: | Petr Rockai <prockai> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.4 | CC: | agk, cmarthal, dwysocha, heinzm, iannis, ipilcher, jbrassow, mbroz, prockai, vincent |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-13 22:39:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mikuláš Patočka
2009-11-16 19:05:38 UTC
Hi, this is intentional, and I think the right fix is to introduce a command like "vgextend --recover" that would erase the MISSING flag from the PV in question. This behaviour exists because once a device has gone missing, all bets are off as to what may have changed on that device. I cannot currently come up with a convincing example of where this would lead to a catastrophic failure, so it may be that this level of paranoia was not completely justified. We just agreed with Alasdair that this is the safest course of action regarding integrity. I agree that it should be possible to easily put that device back though, at least manually if not automatically.

But then vgextend --recover would have to be executed from the initrd. And it isn't. Example: the root filesystem is on pv1. There is a mirror that has one leg on pv1 and another on pv2. You unplug and replug pv1 and reset the server. The server won't boot. To make the server boot, you have to insert the rescue CD and edit the metadata (and if the admin doesn't know exactly what to do with the metadata, he will have several hours of downtime).

As for your argument "you may not know what happened to pv1 while it was unplugged":
- It is improbable that pv1 was inserted into another computer (if the admin intended to use the PV in another computer, he would remove it cleanly with pvmove and vgreduce before unplugging it).
- If pv1 was inserted into another computer, LVM on that computer sees it as an incomplete VG and won't allow you to modify it. So it can't be secretly modified. The only allowed modification is to kill the invisible LVs with vgreduce --removemissing --force and convert it into a complete VG. But this only happens at the admin's request. If the admin hasn't converted pv1 into a complete VG, you know that it hasn't been modified.

Well, when PVs are missing and you issue vgchange -a y, all the LVs in the VG that are complete will be activated as usual. You just cannot edit the metadata until you fix that missing PV. There is of course a problem when the root LV is on a missing volume (which seems to be the problem you are pointing out). Although this may be of some concern, you should note that in previous versions of RHEL, this root LV would simply be erased in the scenario outlined, without easy ways back: you could probably get vgcfgrestore to fix it, if you kept the LVM archive directory outside of your root filesystem and the VG in general (which is, let's admit it, not the usual case). So I think this is still an improvement. You should probably not be yanking the root filesystem of a running server either way (you cannot do that safely, LVM or not, except maybe in some very special cases). Providing reasonably easy recovery should be, IMHO, acceptable.

"in previous versions of RHEL, this root LV would be simply erased in the scenario outlined"

That's terrible. Why does dmeventd do it? Why can't it just kill the failed mirror leg and not touch anything else?

"You should probably not be yanking the root filesystem of a running server either way"

But it may happen unintentionally. For example, it once happened to me that on a server-class mainboard with a server-class SCSI controller (MPT), a loose contact on the PCI-X bus caused the card to be lost while the server was running; removing the card and cleaning the contacts fixed it. Another possibility is overheating: disks will automatically turn themselves off when they overheat.
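As a side note for readers following the discussion: the MISSING state being debated here lives in the VG metadata itself, so it can be inspected directly. The following is a minimal sketch only, assuming an illustrative VG named vg0 and an arbitrary output path; the exact wording of the metadata dump varies between lvm2 versions.

```sh
# Dump the current VG metadata to a text file and look for the MISSING flag.
# vg0 and the output path are illustrative, not taken from this bug report.
vgcfgbackup -f /tmp/vg0-metadata.txt vg0
grep -n MISSING /tmp/vg0-metadata.txt
# A match such as flags = ["MISSING"] under one of the pv sections means the
# metadata still treats that PV as untrusted, even if the device is back.
```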
I still remember fixing an unbootable MD-RAID5 array when the server room overheated: two disks in the array turned themselves off, and that stupid kernel driver wrote metadata to the remaining disks saying that the array had disintegrated... and after cooling the server room and bringing all the disks back up, the system didn't boot. Raid-tools couldn't restore it at all (mdadm maybe could, but it was not installed there; it was new and not common at that time). I finally resolved the problem by manually editing the metadata with a disk editor, but it cost a day of downtime (for a 300-user server) to find out how to edit it. You definitely must not damage data after temporary failures.

Since it is too late to address this issue in RHEL 5.5, it has been proposed for RHEL 5.6. Contact your support representative if you need to escalate this issue.

"When the failed disk is added back again, the PV is still marked as MISSING and there is no easy way how to clear the MISSING flag and recover the volume group."

So let's look at the next step of this. A device disappears. MISSING_PV gets set. The device reappears. What is the process we offer now for clearing the MISSING_PV flags?
- If the metadata had not been changed since the device disappeared.
- What if the VG had been cleaned up already, to remove the PV, and then the PV reappears and you decide you do want to merge some LVs on it back into the original VG again? Could it be renamed to appear as a separate VG still containing LVs, cleaned up, then vgmerged?
- Does it make any difference if more than one PV is affected?

[Points for discussion. No specific proposals yet.]

In the first case, nothing happens: if the metadata did not change, there is nothing to do. The PV was not flagged and the VG continues undisrupted. When the PV reappears but is already empty and removed from the VG, it is kicked out of the VG and the VG continues to work properly as well. The problem happens when the metadata *did* change (presumably due to repairing mirrors) but the PV is not empty (if it is empty and stays in the VG, it can be vgreduced away). One option is vgreduce --removemissing --force, which is likely not what you want. A command for pulling the PV back into normal use should be provided for these situations, I would think. My initial thought is "vgextend --restore vg pv". Once the UI is decided, this should be easy to implement.

I have implemented vgextend --restore, and the patch is pending review. If that is accepted upstream, we will have these scenarios for transient failures:
- There are no mirrors involving this PV and no manual metadata edits happened while the device was gone: handled transparently, LVM does not do anything.
- In all other cases, the PV will be flagged as MISSING in the metadata; mirrored volumes will be automatically repaired, and any non-mirrored LVs using the PV will be made inaccessible.

In the latter case, it is enough to run vgextend --restore vg pv to make the non-mirrored LVs accessible again. If a non-mirrored root volume happened to live on a failed PV, this will still prevent a normal boot. There may be further ways to improve that, but they are out of scope for 5.6, in my opinion. I think that vgextend --restore reasonably improves the situation relative to the original bug report.

Checked in upstream.

vgextend --restoremissing added to lvm2-2.02.74-1.el5.

The basic 'vgextend --restoremissing' test case appears to work, marking verified.
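To make the recovery path concrete, here is a minimal sketch of the flow just described, using the option name that was eventually merged (--restoremissing rather than the originally proposed --restore). The VG and device names are illustrative; the verification transcript below shows a full session.

```sh
# The device has come back, but the PV stays flagged MISSING because LVs still
# have extents allocated on it (LVM prints a "Device still marked missing"
# warning when it notices the device again).
lvs -a -o +devices vg0

# Clear the MISSING flag and return the PV to normal use.
vgextend --restoremissing vg0 /dev/sdc1

# Non-mirrored LVs on that PV are accessible again; mirrored LVs were already
# repaired onto the remaining PVs by dmeventd while the device was gone.
lvs -a -o +devices vg0
```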
```
lvm2-2.02.74-2.el5           BUILT: Tue Nov  9 08:03:06 CST 2010
lvm2-cluster-2.02.74-3.el5   BUILT: Tue Nov  9 08:01:59 CST 2010
device-mapper-1.02.55-2.el5  BUILT: Tue Nov  9 06:41:00 CST 2010
cmirror-1.1.39-10.el5        BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

[root@taft-01 ~]# vgcreate taft /dev/sd[bcde]1
  Volume group "taft" successfully created
[root@taft-01 ~]# lvcreate -m 1 -n mirror -L 500M taft
  Logical volume "mirror" created
[root@taft-01 ~]# lvcreate -n linear -L 500M taft /dev/sdc1
  Logical volume "linear" created
[root@taft-01 ~]# lvs -a -o +devices
  LV                VG   Attr   LSize   Log         Copy%  Devices
  linear            taft -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft iwi-ao 500.00M                    /dev/sdc1(0)
  [mirror_mlog]     taft lwi-ao   4.00M                    /dev/sde1(0)
[root@taft-01 ~]# echo offline > /sys/block/sdc/device/state
[root@taft-01 ~]# pvscan
  /dev/taft/linear: read failed after 0 of 4096 at 524222464: Input/output error
  [...]
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid TnnUdV-JLu3-QwBP-hvbl-bE4p-S8Y3-Wjhx2G.
  PV /dev/sdb1        VG taft   lvm2 [135.66 GB / 135.18 GB free]
  PV unknown device   VG taft   lvm2 [135.66 GB / 134.69 GB free]
  PV /dev/sdd1        VG taft   lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sde1        VG taft   lvm2 [135.66 GB / 135.66 GB free]
[root@taft-01 ~]# dd if=/dev/zero of=/dev/taft/mirror count=2
  2+0 records in
  2+0 records out
  1024 bytes (1.0 kB) copied, 1.59008 seconds, 0.6 kB/s
[root@taft-01 ~]# lvs -a -o +devices
  /dev/taft/linear: read failed after 0 of 4096 at 524222464: Input/output error
  [...]
  /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid TnnUdV-JLu3-QwBP-hvbl-bE4p-S8Y3-Wjhx2G.
  LV                VG   Attr   LSize   Log         Copy%  Devices
  linear            taft -wi-a- 500.00M                    unknown device(125)
  mirror            taft mwi-a- 500.00M mirror_mlog  47.20 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft Iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft Iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft lwi-ao   4.00M                    /dev/sde1(1)
[root@taft-01 ~]# echo running > /sys/block/sdc/device/state
[root@taft-01 ~]# lvs -a -o +devices
  WARNING: Inconsistent metadata found for VG taft - updating to use version 11
  Missing device /dev/sdc1 reappeared, updating metadata for VG taft to version 11.
  Device still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
  LV                VG   Attr   LSize   Log         Copy%  Devices
  linear            taft -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft lwi-ao   4.00M                    /dev/sde1(1)
[root@taft-01 ~]# vgextend --restoremissing taft /dev/sdc1
  Volume group "taft" successfully extended
[root@taft-01 ~]# lvs -a -o +devices
  LV                VG   Attr   LSize   Log         Copy%  Devices
  linear            taft -wi-a- 500.00M                    /dev/sdc1(125)
  mirror            taft mwi-a- 500.00M mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] taft iwi-ao 500.00M                    /dev/sdb1(0)
  [mirror_mimage_1] taft iwi-ao 500.00M                    /dev/sdd1(0)
  [mirror_mlog]     taft lwi-ao   4.00M                    /dev/sde1(1)
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below.
You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0052.html