Bug 1451459
| Summary: | Attempting to activate a 7.4 cache metadata 2 format volume on a shared storage 7.3 machine results in the VG being deleted | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> |
| lvm2 sub component: | Cache Logical Volumes | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | agk, heinzm, heri, jbrassow, mcsontos, msnitzer, prajnoha, teigland, zkabelac |
| Version: | 7.4 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.171-5.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 21:54:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2017-05-16 16:58:40 UTC
The VG remains intact when lvmetad is not running on the 7.3 machine.
[root@harding-03 ~]# systemctl status lvm2-lvmetad
lvm2-lvmetad.service - LVM2 metadata daemon
Loaded: loaded (/usr/lib/systemd/system/lvm2-lvmetad.service; disabled; vendor preset: enabled)
Active: active (running) since Tue 2017-05-16 10:50:04 CDT; 1h 23min ago
Docs: man:lvmetad(8)
Main PID: 32643 (lvmetad)
CGroup: /system.slice/lvm2-lvmetad.service
32643 /usr/sbin/lvmetad -f
May 16 10:50:04 harding-03.lab.msp.redhat.com systemd[1]: Started LVM2 metadata daemon.
May 16 10:50:04 harding-03.lab.msp.redhat.com systemd[1]: Starting LVM2 metadata daemon...
[root@harding-02 ~]# systemctl status lvm2-lvmetad
lvm2-lvmetad.service - LVM2 metadata daemon
Loaded: loaded (/usr/lib/systemd/system/lvm2-lvmetad.service; disabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2017-05-16 12:13:19 CDT; 15s ago
Docs: man:lvmetad(8)
Main PID: 822 (code=exited, status=0/SUCCESS)
May 16 15:12:13 harding-02.lab.msp.redhat.com systemd[1]: Started LVM2 metadata daemon.
May 16 15:12:13 harding-02.lab.msp.redhat.com systemd[1]: Starting LVM2 metadata daemon...
May 16 12:13:19 harding-02.lab.msp.redhat.com systemd[1]: Stopping LVM2 metadata daemon...
May 16 12:13:19 harding-02.lab.msp.redhat.com lvmetad[822]: Failed to accept connection errno 11.
May 16 12:13:19 harding-02.lab.msp.redhat.com systemd[1]: Stopped LVM2 metadata daemon.
# 7.3 machine (with lvmetad stopped)
[root@harding-02 ~]# pvscan --cache
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
[root@harding-02 ~]# lvs
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Unknown status flag 'METADATA_FORMAT'.
Could not read status flags.
Couldn't read status flags for logical volume VG/pool.
Couldn't read all logical volume names for volume group VG.
Unknown status flag 'METADATA_FORMAT'.
Could not read status flags.
Couldn't read status flags for logical volume VG/pool.
Couldn't read all logical volume names for volume group VG.
Unknown status flag 'METADATA_FORMAT'.
Could not read status flags.
Couldn't read status flags for logical volume VG/pool.
Couldn't read all logical volume names for volume group VG.
Unknown status flag 'METADATA_FORMAT'.
Could not read status flags.
Couldn't read status flags for logical volume VG/pool.
Couldn't read all logical volume names for volume group VG.
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
home rhel_harding-02 -wi-ao---- 200.52g
root rhel_harding-02 -wi-ao---- 50.00g
swap rhel_harding-02 -wi-ao---- 27.95g
# Back to 7.4 machine
[root@harding-03 ~]# pvscan --cache
[root@harding-03 ~]# lvchange -ay VG/origin
[root@harding-03 ~]# lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
[lvol0_pmspare] VG ewi------- 12.00m /dev/mapper/mpatha1(5)
origin VG Cwi-a-C--- 20.00m [pool] [origin_corig] 0.00 0.39 0.00 origin_corig(0)
[origin_corig] VG owi-aoC--- 20.00m /dev/mapper/mpatha1(0)
[pool] VG Cwi---C--- 12.00m 0.00 0.39 0.00 pool_cdata(0)
[pool_cdata] VG Cwi-ao---- 12.00m /dev/mapper/mpathb1(0)
[pool_cmeta] VG ewi-ao---- 12.00m /dev/mapper/mpathb1(3)
The METADATA_FORMAT LV flag is added to the VG metadata so that older versions of lvm will not try to use the LV with the incompatible cache metadata format. Unfortunately, this is not a good way of handling backward compatibility, because old versions fail to parse the VG metadata and return NULL. When using lvmetad, there is then no VG metadata to send to lvmetad: pvscan --cache sends the PV info to lvmetad, but no VG metadata follows it. In lvmetad, this looks like a PV that exists with no VG using it. The recent "pv in use" flag does its job and tells lvm that the PV really is used by a VG, even though there's no VG metadata available for it. The next command, e.g. pvs, gets the PVs and VGs from lvmetad, and sees there is no VG using the PV marked in-use (and holding an mda). _check_or_repair_orphan_pv_ext() mistakenly thinks that this is a situation where the in-use flag was wrongly set and "repairs" the PV, and the VG is clobbered.

There are multiple problems:

1. There needs to be a way of reading and returning VG metadata with unknown parts, so that the VG metadata can exist in lvm but not be used. Something similar is done for unknown segment types. The unknown segment types logic needs to be expanded to cover other cases, or the same idea needs to be implemented for other parts of the VG metadata that are unknown. With this in place, pvscan in the example above would get a kind of "opaque" VG metadata back from scanning, and send it to lvmetad.

2. vg_read should not be repairing PVs when it doesn't think the in-use flag is correct. This should be a manual operation. There are too many cases where a mistake or bug can trigger this and remove a VG (I've suggested this before, when this happened to me in a different situation). Even a user error, such as a mistake in zoning devices, could cause lvm to clobber a VG just by reading it.

3. The cache metadata format 2 compatibility needs to use some mechanism described in 1, or maybe there's some other mechanism. Adding to the problem is that it should use mechanisms that already exist in lvm versions that have been released. I'm wondering what methods we've used before when adding new features that are not compatible with old versions.

When not using lvmetad, lvm from 7.3 reports the errors in comment 2 and shows the device as not even being a PV. Because it's not reported as a PV, lvm does not attempt to "repair" the in-use flag, so it doesn't automatically clobber the VG. However, by not thinking the device is a PV, it will also happily allow you to run pvcreate on it. So, the difference between with-lvmetad and without-lvmetad is that with lvmetad, lvm sees the device as an in-use PV, and without lvmetad, lvm sees the device as completely unused. The with-lvmetad state is more correct, and actually what we want, but unfortunately it also happens to trigger the incorrect repair code.

We're discussing alternative solutions, such as appending these flags to the segment type in the on-disk metadata, e.g. "cache+v2", so the old version will invoke the unknown segment type code, which we believe still works. I assume raid is similarly affected by this problem.

> 2. vg_read should not be repairing PVs when it doesn't think the in-use flag
> is correct.
In general, this repair only applies to PVs with no mdas on them, which makes it not quite as problematic (although I still think it's better to require a manual action to remove the PV). To make the repair safer (i.e. less likely to be triggered by other bugs in lvm, as shown in this bz), _check_or_repair_orphan_pv_ext() could read the mda itself directly from disk to verify that no VG is using the PV being repaired. As it stands now, unknown bugs elsewhere in lvm are liable to trigger the repair and wrongly remove a PV.
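As a rough illustration of that hardening, here is a toy model only (not lvm2 code; the struct and helper names are invented for this sketch): before the orphan-PV repair clears the "PV in use" flag, it re-checks, from a direct read of the PV's own metadata area, whether any VG metadata is present at all, and refuses to touch the flag if it is - even when that metadata could not be parsed.

/* Toy model of the hardening proposed above -- not lvm2 code.  All names
 * (struct pv, repair_in_use_flag, mda_bytes_on_disk) are invented here. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct pv {                       /* heavily simplified PV state              */
        const char *name;
        bool in_use;              /* "PV in use" flag from the PV header      */
        bool seen_as_orphan;      /* what the (possibly wrong) cache thinks   */
        size_t mda_bytes_on_disk; /* raw metadata found by a direct disk read */
};

/* Returns true if the flag was cleared, false if repair was skipped. */
static bool repair_in_use_flag(struct pv *pv)
{
        if (!pv->in_use || !pv->seen_as_orphan)
                return false;                    /* nothing to repair */

        if (pv->mda_bytes_on_disk > 0) {
                /* Metadata exists but was not understood -- exactly the case
                 * in this bug -- so leave the PV alone.                      */
                printf("WARNING: %s holds metadata that cannot be used; "
                       "not clearing the in-use flag.\n", pv->name);
                return false;
        }

        pv->in_use = false;                      /* genuinely unused orphan */
        return true;
}

int main(void)
{
        struct pv bugged = { "/dev/mapper/mpatha1", true, true, 4096 };
        repair_in_use_flag(&bugged);             /* repair is skipped */
        return 0;
}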
> > 2. vg_read should not be repairing PVs when it doesn't think the in-use flag
> > is correct.
>
> In general, this repair only applies to PVs with no mda's on them, which
> makes it not quite as problematic
This is still a misunderstanding of the pv-in-use repair case. It's in-use orphan PVs with mdas that are repaired (since in-use and orphan are contradictory, the in-use flag is cleared). lvm expects that since the PV has an mda, it will know about the VG using it, and the PV won't look like an orphan. In this bug, the VG found in the mda is not understood (new flag unrecognized), and bad error handling makes lvm think there is no VG (instead of a VG that's not understood), which triggers the repair case: a PV marked in-use, with an mda, that appears unused by any VG.
I no longer think there are "too many cases" where this repair could be wrongly triggered. I still think there are some (not least of which are unknown lvm bugs), which make me prefer an option for manual repair in the future.
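The misfire described above comes from the read path collapsing "metadata present but not understood" into the same empty result as "no metadata at all". As a toy sketch only (invented names such as read_vg_metadata and vg_read_status; this is not the lvm2 parser), a return contract that keeps the two apart might look like this:

/* Toy sketch, not lvm2 code: make the metadata read path report *why* it has
 * no usable VG, so callers can tell "no VG references this PV" apart from
 * "a VG references it but we could not understand the metadata". */
#include <stdio.h>

enum vg_read_status {
        VG_READ_OK,            /* parsed fine                               */
        VG_READ_NONE,          /* no VG metadata at all                     */
        VG_READ_UNSUPPORTED,   /* e.g. unknown status flag from a newer LVM */
        VG_READ_BAD_CHECKSUM,  /* metadata present but damaged              */
};

struct vg;                     /* opaque in this sketch */

/* Hypothetical parser wrapper: fills *vg on success, otherwise says why not. */
static enum vg_read_status read_vg_metadata(const char *dev, struct vg **vg)
{
        (void)dev;
        *vg = NULL;
        return VG_READ_UNSUPPORTED;   /* what a 7.3 parser would hit here */
}

int main(void)
{
        struct vg *vg;

        switch (read_vg_metadata("/dev/mapper/mpathb1", &vg)) {
        case VG_READ_OK:
                break;
        case VG_READ_NONE:
                /* Only this case may ever be treated as a true orphan. */
                break;
        case VG_READ_UNSUPPORTED:
        case VG_READ_BAD_CHECKSUM:
                /* PV is in use by a VG we cannot read: no auto-repair,
                 * no pvcreate, just report it.                         */
                printf("VG metadata present but unusable; leaving PV alone.\n");
                break;
        }
        return 0;
}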
I think it's impossible for the PV in-use repair code to be made both automatic and correct in all cases. The problematic case is when the VG metadata is damaged on a PV, and no other PVs with metadata are available at the moment. In this case, lvm will identify the PV with damaged metadata as an in-use orphan (the case that the repair code currently repairs automatically by clearing the in-use flag). The in-use flag is the one thing that identifies a PV as used when no metadata referencing it is available, i.e. exactly the kind of indicator that would be needed to protect the PV from being clobbered by the repair code. The automatic repair code does not account for the case where the PV does hold VG metadata, but the metadata has been damaged and can't be used to reference the PV.

I think comment 9 isn't quite correct. lvm should keep a list of PVs with metadata that it can't read properly, and not do auto-repair of the in-use flag on those. This list of problematic PVs should also be used to prevent pvcreate/vgextend from being run on them (like duplicate PVs), and to display them with some new flag in 'pvs'.

(In reply to David Teigland from comment #10)
> I think comment 9 isn't quite correct. lvm should keep a list of PVs with
> metadata that it can't read properly, and not do auto-repair of the in-use
> flag on those.

I agree with this - it's simply not an "orphan PV" anymore if we know there is metadata available but we failed to read it, either because the checksum failed or because it contains metadata that the LVM version in use doesn't understand. The code as it is today happily marks such PVs as orphans. If LVM skips the VG metadata in this case (and marks that device as orphan), the pv-in-use repair code needs to know that, so it can skip the repair in this case. We need a new state for this besides "in VG" and "orphan" PV.

I'm still unclear about which auto-repair we are having trouble with. AFAIK, any PV 'auto-repair' should ONLY be running in a case where the PV was KNOWN to be empty (no allocated extents). (Note: the problem is also closely related to the 'raid' want-to-be transient-failure auto-repair, which is very fragile ATM.) In case the PV is only marked 'IN-USE' but we don't know any other data about it, we should not consider this PV to be empty, so it should not qualify for auto-repair. Running any repair on 'invalid' metadata is not going to work as long as the code cannot differentiate the reason for the metadata read failure, e.g. invalid metadata with a broken checksum, a failing parser hitting unsupported syntax, and many other sorts of failure.

(In reply to Zdenek Kabelac from comment #12)
> I'm still unclear about which auto-repair we are having trouble with.

This code, exactly, in the _vg_read fn:

    if (is_orphan_vg(vgname)) {
            if (use_precommitted) {
                    log_error(INTERNAL_ERROR "vg_read_internal requires vgname "
                              "with pre-commit.");
                    return NULL;
            }
            return _vg_read_orphans(cmd, warn_flags, vgname, consistent);
    }

Then the _vg_read_orphans fn:

    if (!(vginfo = lvmcache_vginfo_from_vgname(orphan_vgname, NULL)))
            return_NULL;
    if (!(fmt = lvmcache_fmt_from_vgname(cmd, orphan_vgname, NULL, 0)))
            return_NULL;
    vg = fmt->orphan_vg;
    ...
    if (!lvmcache_foreach_pv(vginfo, _vg_read_orphan_pv, &baton))
            return_NULL;

...and then _vg_read_orphan_pv calls _check_or_repair_pv_ext, which fixes the "PV in use" flag - if we find some PV has this flag set, we try to fix that, as we're processing "orphans" at this moment - so they're NOT in a VG.

Unfortunately, we're also mapping PVs where we can't read the VG metadata, or VGs where the CRC check failed, to "orphan PVs". That should be mapped onto a new entity like "PV with unknown VG metadata" so we can check properly for this and skip any "PV in use" repairs in this case.

The "PV in use" repair code only fires if:
- at least one metadata area is present
- AND at the same time the PV is marked as "used by a VG"
- AND at the same time it's considered "orphan" by LVM

In this case, the repair code tries to drop the "in use" flag. The problem is in the "at least one metadata area present" condition - if we know there's at least one metadata area and at the same time it's an orphan PV, it can also mean:
- the VG's CRC check failed (and hence we have no valid VG metadata)
- we simply can't read the VG metadata (because it comes from a newer LVM version?)

This was not taken into account when coding the "PV in use" repair part and needs to be fixed.
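A toy sketch of that "new entity" idea: record devices whose metadata area holds VG metadata the running LVM cannot use, the way duplicate PVs are tracked, and consult that record before any in-use repair or pvcreate. The names here (remember_unusable_metadata, allow_in_use_repair, allow_pvcreate) are invented for illustration; this is not the lvm2 implementation.

/* Toy sketch (not lvm2 code) of tracking PVs with unusable VG metadata. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_BAD 16

static const char *bad_md_devs[MAX_BAD];   /* devices with unusable VG metadata */
static int n_bad_md_devs;

static void remember_unusable_metadata(const char *dev)
{
        if (n_bad_md_devs < MAX_BAD)
                bad_md_devs[n_bad_md_devs++] = dev;
}

static bool has_unusable_metadata(const char *dev)
{
        for (int i = 0; i < n_bad_md_devs; i++)
                if (!strcmp(bad_md_devs[i], dev))
                        return true;
        return false;
}

static bool allow_in_use_repair(const char *dev)
{
        return !has_unusable_metadata(dev);    /* skip repair on listed PVs   */
}

static bool allow_pvcreate(const char *dev)
{
        return !has_unusable_metadata(dev);    /* refuse to reinitialise them */
}

int main(void)
{
        /* e.g. recorded when the scan hit an unknown status flag or a bad CRC */
        remember_unusable_metadata("/dev/mapper/mpathb1");

        printf("repair allowed:   %d\n", allow_in_use_repair("/dev/mapper/mpathb1"));
        printf("pvcreate allowed: %d\n", allow_pvcreate("/dev/mapper/mpathb1"));
        return 0;
}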
1) In the short term, we will disable the code that automatically repairs in-use flags.

2) We will consider how situations that are no longer repaired can be handled, e.g. make sure the messages/warnings are always appropriate and it's straightforward for someone to understand what they need to do to resolve them. This might involve command line extensions.

3) We will consider how to handle the low-level checksum/unrecognised metadata situations, perhaps with a new flag/internal state, so they can be distinguished.

4) We will consider whether metadata with a correct checksum but nevertheless unrecognised, either in whole or in part, needs to be handled and preserved in some way - just as we do when a segment type is unrecognised.

5) We will consider whether we can re-enable an automatic in-use repair given some or all of (2)-(4), possibly introducing a multi-step VG metadata commit mechanism (where the VG metadata records PVs that still require their in-use flag cleared).

6) We will audit all the on-disk metadata extensions (especially raid), identify similar changes that are incompatible with old versions of LVM, and introduce segment_type+flags for each of them.

7) We will audit all changes to in-kernel state and ensure the on-disk metadata adequately records the state required for recovery, in line with (6).

The first commit on this branch disables the auto repair: https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-pv-invalid-metadata The second patch is a start of special handling for PVs with invalid metadata, keeping them in a special list of devices like duplicate PVs. This can be used to skip in-use repair on them, and to prevent pvcreate from being run on them.

An lvm2 upstream commit converted 'cache2' support into a new kind of flagging carried in the segtype name - this produces metadata that is readable by older lvm2 code yet seen as unusable. The same change needs to be made for raid. With this patch, part of the problem should be solved, since the user will not be able to 'easily' trigger the incompatibility problem with an unknown status flag. Patch sequence: https://www.redhat.com/archives/lvm-devel/2017-May/msg00085.html added this SEGTYPE_FLAG support upstream. So any new STATUS_FLAG needs to be converted to a SEGTYPE_FLAG, where we know that an unknown segtype works well.
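Why the segtype-based flagging degrades more gracefully can be shown with a small standalone model (invented names, not the real lvm2 segtype registry): an unrecognised segment type falls through to an existing "unknown segment" path and the VG stays readable, though unusable, whereas the old status-flag parser had no fallback and the whole VG read failed.

/* Toy illustration (not lvm2 code) of the SEGTYPE_FLAG vs STATUS_FLAG trade-off. */
#include <stdio.h>
#include <string.h>

static const char *known_segtypes[] = { "striped", "cache", "cache-pool", "thin-pool" };

static const char *lookup_segtype(const char *name)
{
        for (size_t i = 0; i < sizeof(known_segtypes) / sizeof(known_segtypes[0]); i++)
                if (!strcmp(known_segtypes[i], name))
                        return known_segtypes[i];
        /* Old lvm2 already has this path: keep the VG, mark the LV unusable. */
        printf("WARNING: Unrecognised segment type %s\n", name);
        return "unknown";
}

static int parse_status_flag(const char *flag)
{
        /* The 7.3 parser had no fallback here: an unknown flag meant the whole
         * VG could not be read ("Unknown status flag 'METADATA_FORMAT'").    */
        return strcmp(flag, "METADATA_FORMAT") ? 0 : -1;
}

int main(void)
{
        lookup_segtype("cache-pool+METADATA_FORMAT");   /* VG survives    */
        if (parse_status_flag("METADATA_FORMAT") < 0)   /* VG read failed */
                printf("Could not read status flags.\n");
        return 0;
}

The verification output below shows exactly this behaviour on 7.3/7.2: the VG is reported with "Unrecognised segment type cache-pool+METADATA_FORMAT" and refuses modification, instead of being clobbered.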
Fix verified in the latest rpms.

3.10.0-685.el7.x86_64
lvm2-2.02.171-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
lvm2-libs-2.02.171-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
lvm2-cluster-2.02.171-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
device-mapper-1.02.140-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
device-mapper-libs-1.02.140-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
device-mapper-event-1.02.140-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
device-mapper-event-libs-1.02.140-6.el7    BUILT: Wed Jun 21 09:35:03 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

# format 2
# Created on a 7.4 machine and attempted to be activated on this 7.3 machine.
[root@host-130 ~]# pvscan --cache
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
[root@host-130 ~]# vgchange -ay activator1
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Internal error: _emit_target cannot handle segment type cache-pool+METADATA_FORMAT
  0 logical volume(s) in volume group "activator1" now active
[root@host-130 ~]# lvs -a -o +devices
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  LV                  VG         Attr       LSize   Pool          Origin         Data%  Meta%  Cpy%Sync Devices
  cache1              activator1 Cwi---C--- 100.00m [cache1_fast] [cache1_corig]                        cache1_corig(0)
  [cache1_corig]      activator1 owi---C--- 100.00m                                                     /dev/sda2(0)
  [cache1_fast]       activator1 vwi---u---  52.00m
  [cache1_fast_cdata] activator1 -wi-------  52.00m                                                     /dev/sda3(0)
  [cache1_fast_cmeta] activator1 -wi-------   8.00m                                                     /dev/sda3(13)
  [lvol0_pmspare]     activator1 ewi-------   8.00m                                                     /dev/sda3(15)

# Unable to remove/alter the VG
[root@host-130 ~]# lvremove -f activator1
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Cannot change VG activator1 with unknown segments in it!
  Cannot process volume group activator1
[root@host-130 ~]# vgremove activator1
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Cannot change VG activator1 with unknown segments in it!
  Cannot process volume group activator1
[root@host-132 ~]# vgrename activator1 activator7
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Cannot change VG activator1 with unknown segments in it!
[root@host-132 ~]# lvcreate -n foo -L 100M activator1
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Cannot change VG activator1 with unknown segments in it!

# Same thing on 7.2 machine
[root@host-132 ~]# vgremove -f activator1
  WARNING: Unrecognised segment type cache-pool+METADATA_FORMAT
  Cannot change VG activator1 with unknown segments in it!
  Cannot process volume group activator1

# Still usable back on the 7.4 machine
[root@host-127 ~]# pvscan --cache
[root@host-127 ~]# lvchange -ay activator1
[root@host-127 ~]# lvs -a -o +devices
  LV                  VG         Attr       LSize   Pool          Origin         Data%  Meta%  Cpy%Sync Devices
  cache1              activator1 Cwi-a-C--- 100.00m [cache1_fast] [cache1_corig] 0.24   0.63   0.00     cache1_corig(0)
  [cache1_corig]      activator1 owi-aoC--- 100.00m                                                     /dev/sdb2(0)
  [cache1_fast]       activator1 Cwi---C---  52.00m                              0.24   0.63   0.00     cache1_fast_cdata(0)
  [cache1_fast_cdata] activator1 Cwi-ao----  52.00m                                                     /dev/sdb3(0)
  [cache1_fast_cmeta] activator1 ewi-ao----   8.00m                                                     /dev/sdb3(13)
  [lvol0_pmspare]     activator1 ewi-------   8.00m                                                     /dev/sdb3(15)

# format 1
# Created on a 7.4 machine and attempted to be activated on this 7.3 machine.
[root@host-130 ~]# lvchange -ay activator1
[root@host-130 ~]# lvs -a -o +devices
  LV                  VG         Attr       LSize   Pool          Origin         Data%  Meta%  Cpy%Sync Devices
  cache1              activator1 Cwi-a-C--- 100.00m [cache1_fast] [cache1_corig] 0.00   0.54   0.00     cache1_corig(0)
  [cache1_corig]      activator1 owi-aoC--- 100.00m                                                     /dev/sda2(0)
  [cache1_fast]       activator1 Cwi---C---  52.00m                              0.00   0.54   0.00     cache1_fast_cdata(0)
  [cache1_fast_cdata] activator1 Cwi-ao----  52.00m                                                     /dev/sda3(0)
  [cache1_fast_cmeta] activator1 ewi-ao----   8.00m                                                     /dev/sda3(13)
  [lvol0_pmspare]     activator1 ewi-------   8.00m                                                     /dev/sda3(15)

# Same on 7.2 node
[root@host-132 ~]# lvs -a -o +devices
  LV                  VG         Attr       LSize   Pool          Origin         Data%  Meta%  Cpy%Sync Devices
  cache1              activator1 Cwi-a-C--- 100.00m [cache1_fast] [cache1_corig] 0.00   0.54   100.00   cache1_corig(0)
  [cache1_corig]      activator1 owi-aoC--- 100.00m                                                     /dev/sda2(0)
  [cache1_fast]       activator1 Cwi---C---  52.00m                              0.00   0.54   100.00   cache1_fast_cdata(0)
  [cache1_fast_cdata] activator1 Cwi-ao----  52.00m                                                     /dev/sda3(0)
  [cache1_fast_cmeta] activator1 ewi-ao----   8.00m                                                     /dev/sda3(13)
  [lvol0_pmspare]     activator1 ewi-------   8.00m                                                     /dev/sda3(15)

# Alteration is allowed
[root@host-132 ~]# lvremove -f activator1
  Logical volume "cache1_fast" successfully removed
  Logical volume "cache1" successfully removed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2222