Bug 1266932
Summary: "Volume group not found" issues when doing meta intensive operations on shared VGs
Product: Red Hat Enterprise Linux 7
Component: lvm2
lvm2 sub component: LVM lock daemon / lvmlockd
Reporter: Corey Marthaler <cmarthal>
Assignee: David Teigland <teigland>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA
Severity: low
Priority: unspecified
CC: agk, heinzm, jbrassow, prajnoha, teigland, zkabelac
Version: 7.2
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: lvm2-2.02.165-3.el7
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-11-04 04:10:56 UTC
Description
Corey Marthaler
2015-09-28 14:53:30 UTC
This appears to have to do with mixing shared and exclusively locked LVs in the same VG. Bits of the logs from one reproduction show that lvmetad holds VG version 54; the VG lock is acquired, which finds the latest version on disk is 56, so lvmlockd invalidates the VG in lvmetad and tells lvmetad the latest version is 56. This all looks to be correctly recorded in lvmetad. What's not clear is what the command does when it reads the VG. It should read lvmetad version 54, see the invalid flag, then read version 56 from disk. I'm guessing that it's complaining about version 56 on disk. We probably need to enable command debugging to print the actual metadata that it's complaining about.

# VGID to INFO version mapping
vgid_to_info {
    n2PS1S-ZQrj-ueSe-SyDJ-1RmD-3xig-bqKRJb = 0
    BKOtn7-N6vL-r0wN-AgRP-iA1p-FUEm-8JgKrc = 0
    3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2 = 56
    UYZH9P-4742-NB5N-d4Im-RF6C-6q2W-8WkUyO = 0
    f02nyp-JAja-TCbt-DYFZ-dG67-25Kx-VbjT9C = 0
}

# VGID to INFO flags mapping
vgid_to_info {
    n2PS1S-ZQrj-ueSe-SyDJ-1RmD-3xig-bqKRJb = 0
    BKOtn7-N6vL-r0wN-AgRP-iA1p-FUEm-8JgKrc = 0
    3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2 = 1
    UYZH9P-4742-NB5N-d4Im-RF6C-6q2W-8WkUyO = 0
    f02nyp-JAja-TCbt-DYFZ-dG67-25Kx-VbjT9C = 0
}

# VGID to VGNAME mapping
vgid_to_vgname {
    n2PS1S-ZQrj-ueSe-SyDJ-1RmD-3xig-bqKRJb = "activator3"
    BKOtn7-N6vL-r0wN-AgRP-iA1p-FUEm-8JgKrc = "activator2"
    3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2 = "activator1"
    2AAFX5-QPAe-oTxp-O1I0-0VwH-gwBA-cVKNnF = "rhel_mckinley-01"
    UYZH9P-4742-NB5N-d4Im-RF6C-6q2W-8WkUyO = "activator4"
    f02nyp-JAja-TCbt-DYFZ-dG67-25Kx-VbjT9C = "global"
}

# VGNAME to VGID mapping
vgname_to_vgid {
    activator1 = "3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2"
    activator2 = "BKOtn7-N6vL-r0wN-AgRP-iA1p-FUEm-8JgKrc"
    activator4 = "UYZH9P-4742-NB5N-d4Im-RF6C-6q2W-8WkUyO"
    activator3 = "n2PS1S-ZQrj-ueSe-SyDJ-1RmD-3xig-bqKRJb"
    rhel_mckinley-01 = "2AAFX5-QPAe-oTxp-O1I0-0VwH-gwBA-cVKNnF"
    global = "f02nyp-JAja-TCbt-DYFZ-dG67-25Kx-VbjT9C"
}

3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2 {
    id="3Kz9wm-iT3M-UCvn-zkep-RsCb-kVjS-r8gSH2"
    seqno=54
    format="lvm2"
    status=["RESIZEABLE","READ"]
    flags="WRITE_LOCKED"
    lock_type="sanlock"
    lock_args="1.0.0:lvmlock"
    extent_size=8192
    max_lv=0
    max_pv=0
    metadata_copies=0
}

1443466540 recv lvcreate[77724.1549] lock vg "activator1" mode ex flags 0
1443466540 S lvm_activator1 R VGLK action lock ex
1443466540 S lvm_activator1 R VGLK res_lock mode ex
1443466540 S lvm_activator1 R VGLK lock_san acquire /dev/mapper/activator1-lvmlock:69206016
1443466540 S lvm_activator1 R VGLK lock_san get r_version 56
1443466540 S lvm_activator1 R VGLK res_lock lm done r_version 56
1443466540 S lvm_activator1 R VGLK res_lock set lvmetad vg version 56
1443466540 send lvcreate[77724.1549] lock vg rv 0
1443466541 close lvcreate[77724.1549] fd 9
1443466541 S lvm_activator1 R VGLK res_unlock from close
1443466541 S lvm_activator1 R VGLK unlock_san r_version 0 flags 0
1443466541 S lvm_activator1 R VGLK res_unlock lm done

Sep 28 13:55:40 mckinley-01 qarshd[77723]: Running cmdline: lvcreate --activate ey --thinpool pool4 --zero n -L 150M activator1

VG metadata comparison failed
Volume group "activator1" not found

Created attachment 1078080 [details]
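The expected read path described above (cached version 54, invalid flag set, latest version 56 on disk) can be sketched as a small model. This is an illustrative Python sketch, not lvmetad's actual C implementation; the names (CacheEntry, VGCache, read_vg, and so on) are hypothetical:

```python
# Hypothetical model of the lvmetad invalidation flow: when the lock
# manager reports a newer on-disk version, the cached copy is marked
# invalid, and the next reader falls back to reading from disk.

class CacheEntry:
    def __init__(self, metadata, version):
        self.metadata = metadata
        self.version = version
        self.invalid = False  # set when a newer version exists on disk


class VGCache:
    def __init__(self):
        self.entries = {}

    def set_vg_version(self, vgid, latest_version):
        # Called on behalf of lvmlockd after acquiring the VG lock:
        # if the lock reports a newer version, invalidate the cache entry.
        e = self.entries.get(vgid)
        if e and latest_version > e.version:
            e.invalid = True

    def read_vg(self, vgid, read_from_disk):
        e = self.entries.get(vgid)
        if e is None or e.invalid:
            # Cached copy is stale: reread from disk and repopulate.
            metadata, version = read_from_disk(vgid)
            self.entries[vgid] = CacheEntry(metadata, version)
            return self.entries[vgid]
        return e


cache = VGCache()
cache.entries["3Kz9wm"] = CacheEntry({"seqno": 54}, 54)
cache.set_vg_version("3Kz9wm", 56)  # VG lock found version 56 on disk
vg = cache.read_vg("3Kz9wm", lambda vgid: ({"seqno": 56}, 56))
print(vg.version)  # 56: the stale cached version 54 was replaced
```

In this model the command never sees version 54 as current: the invalid flag forces the disk read, which is the behavior the report says should happen.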
-vvvv of failing lvcreate cmd
I've added a commit upstream that will show what exactly is causing the comparison error. You'll still need to run with -vvvv to see those messages.

I reproduced this and have a fix that I'm doing more testing with.

I fixed a metadata comparison bug here: https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=ae0a8740c5899520ec4c94fbf080623167d8ecba We never had enough information about the problem above to say whether it would be fixed by this.

Created attachment 1199597 [details]
verbose output of failing create cmd
This is still reproducible with the latest rpms.
#format_text/format-text.c:548 Read activator3 metadata (64) from /dev/mapper/mpathe1 at 1079296 size 33784
#cache/lvmetad.c:1923 VG activator3 metadata comparison failed for device /dev/mapper/mpathe1 vs /dev/mapper/mpathd1
#cache/lvmetad.c:70 VG activator3 metadata inequality at creation_time / creation_time: 1473449603 / 1473449604
#metadata/vg.c:89 Freeing VG activator3 at 0x7f6bf47a1de0.
#cache/lvmetad.c:1093 VG activator3 from lvmetad not found during rescan.
#metadata/vg.c:89 Freeing VG activator3 at 0x7f6bf474dca0.
#metadata/metadata.c:4793 <backtrace>
#metadata/metadata.c:5821 Volume group "activator3" not found
3.10.0-498.el7.x86_64
lvm2-2.02.165-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
lvm2-libs-2.02.165-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
lvm2-cluster-2.02.165-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-1.02.134-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-libs-1.02.134-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-event-1.02.134-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-event-libs-1.02.134-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7 BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
sanlock-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-1.el7 BUILT: Wed Sep 7 11:04:22 CDT 2016
This actually doesn't require many LVs or VGs. I'm able to reproduce this with a test using just one VG, doing metadata-intensive operations on a few LVs (create, up/down mirror convert, read/write permission changes, up/down activation, deletion).

Do you have a record of the last few commands that were run prior to hitting this, and possibly an lvm log of those commands? There's a good chance that the problem is created by a command that has already completed, not by the command that reports the problem. Also, if this happens again, it might be useful to just stop everything and let me take a look at the system. I'm curious to inspect the metadata on each of the PVs and in the cache at that point.

Indeed, while we can fix the symptom (mismatched creation_time) pretty easily, and it's harmless in the case you've reported, we want to understand what's causing it in case it indicates a more serious problem that needs fixing. The internal VG metadata is translated into a buffer which is attached to the 'fid', and vg_write shares this, so all disks should indeed have the same content and show identical creation_times if the locking and sequence-number incrementing are behaving correctly and the calls to vg_write() and fid initialisation/sharing are done correctly. So we need to see a system in this state and examine the on-disk metadata on each PV (including previous versions), the backup/archive files on each node, and the -vvvv traces of the previous commands that were run, and then reconstruct how the state arose to find the error in the logic.

This appears to be an issue where the lvmcache state is not quite correct for the one command. Perhaps the creation_time is not being invalidated or refreshed in lvmcache when the rest of the metadata is. Analyzing the lvmcache-related messages when reading the two devices with mismatching creation_time may give us a more exact answer (that's the log in comment 10).

The problem has been found.
To compare the metadata between two devs, the code generates a new config tree for each instance of the metadata from the two devs and then compares those config trees. The generation of the config tree inserts the extraneous data (desc and creation_time), which causes a mismatch if it just happens that a one-second boundary is crossed in the process. The actual VG metadata itself matches. The extraneous data should be excluded from the comparison; working on a fix for that now.

The fix is here: https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=d455300d7baff341f1c7fe5776d975369dd3a23a

Marking verified in the latest rpms. Many iterations of the same test that used to reproduce this right away now pass.

3.10.0-498.el7.x86_64
lvm2-2.02.165-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
lvm2-libs-2.02.165-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
lvm2-cluster-2.02.165-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
device-mapper-1.02.134-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
device-mapper-libs-1.02.134-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
device-mapper-event-1.02.134-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
device-mapper-event-libs-1.02.134-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
device-mapper-persistent-data-0.6.3-1.el7 BUILT: Fri Jul 22 05:29:13 CDT 2016
cmirror-2.02.165-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016
sanlock-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
sanlock-lib-3.4.0-1.el7 BUILT: Fri Jun 10 11:41:03 CDT 2016
lvm2-lockd-2.02.165-3.el7 BUILT: Wed Sep 21 08:26:18 CDT 2016

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1445.html
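The root cause described above can be illustrated with a small sketch. This is not the actual lvm2 code, just a hypothetical Python model showing why exporting the same metadata twice across a one-second boundary produces unequal config trees, and how excluding the volatile keys from the comparison fixes it:

```python
# Model of the comparison bug: the exporter stamps each config tree with
# volatile fields (creation_time, desc), so two exports of identical VG
# metadata differ if a one-second boundary is crossed between them.

VOLATILE_KEYS = {"creation_time", "desc"}  # extraneous data added at export


def export_config_tree(vg_metadata, now):
    tree = dict(vg_metadata)
    tree["creation_time"] = now        # stamped at export time
    tree["desc"] = "Generated by LVM2"
    return tree


def trees_equal_naive(a, b):
    return a == b                      # buggy: compares volatile keys too


def trees_equal_fixed(a, b):
    strip = lambda t: {k: v for k, v in t.items() if k not in VOLATILE_KEYS}
    return strip(a) == strip(b)        # fix: exclude the extraneous data


vg = {"seqno": 64, "id": "n2PS1S"}
t1 = export_config_tree(vg, now=1473449603)
t2 = export_config_tree(vg, now=1473449604)  # one second later

print(trees_equal_naive(t1, t2))  # False: spurious "metadata comparison failed"
print(trees_equal_fixed(t1, t2))  # True: the actual VG metadata matches
```

The two timestamps here mirror the ones in the debug trace above (1473449603 vs 1473449604): the underlying VG metadata on both devices is identical, and only the export-time stamps differ.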