Red Hat Bugzilla – Bug 1434054
vgs segfault after re-enabling failed raid10 images when lvmetad is not running
Last modified: 2017-10-31 10:42:36 EDT
RHEL 7.4 failure:

Mar 20 10:31:38 host-092 kernel: vgs[6962]: segfault at 80 ip 00007fa666c7c0c7 sp 00007ffc69b90358 error 4 in libc-2.17.so[7fa666be8000+1b8000]

Core was generated by `vgs'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f67dc5330c7 in __strncpy_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.168-5.el7.x86_64 elfutils-libs-0.168-5.el7.x86_64 glibc-2.17-184.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-9.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 pcre-8.32-17.el7.x86_64 readline-6.2-10.el7.x86_64 systemd-libs-219-32.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  0x00007f67dc5330c7 in __strncpy_sse2 () from /lib64/libc.so.6
#1  0x000055b37a4f6771 in strncpy (__len=32, __src=<optimized out>, __dest=0x7fff0384fd80 "21b0N4GkK13xlvCPo2UlYhlw7bKwHVVU") at /usr/include/bits/string3.h:120
#2  lvmcache_info_from_pvid (pvid=<optimized out>, dev=0x0, valid_only=valid_only@entry=0) at cache/lvmcache.c:717
#3  0x000055b37a55302f in _check_or_repair_pv_ext (inconsistent_pvs=<synthetic pointer>, repair=1, vg=0x55b37c9bc1d0, cmd=0x55b37c90c020) at metadata/metadata.c:4055
#4  _vg_read (cmd=cmd@entry=0x55b37c90c020, vgname=<optimized out>, vgname@entry=0x55b37c9546d0 "helter_skelter", vgid=<optimized out>, vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7fff03850004, precommitted=precommitted@entry=0) at metadata/metadata.c:4625
#5  0x000055b37a553bc6 in vg_read_internal (cmd=cmd@entry=0x55b37c90c020, vgname=vgname@entry=0x55b37c9546d0 "helter_skelter", vgid=vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7fff03850004) at metadata/metadata.c:4791
#6  0x000055b37a555976 in _recover_vg (vgid=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", vg_name=0x55b37c9546d0 "helter_skelter", cmd=0x55b37c90c020) at metadata/metadata.c:5525
#7  _vg_lock_and_read (lockd_state=4, read_flags=262144, status_flags=0, lock_flags=33, vgid=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", vg_name=0x55b37c9546d0 "helter_skelter", cmd=0x55b37c90c020) at metadata/metadata.c:5830
#8  vg_read (cmd=cmd@entry=0x55b37c90c020, vg_name=vg_name@entry=0x55b37c9546d0 "helter_skelter", vgid=vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", read_flags=read_flags@entry=262144, lockd_state=4) at metadata/metadata.c:5916
#9  0x000055b37a4dbb26 in _process_vgnameid_list (process_single_vg=0x55b37a4d1d40 <_vgs_single>, handle=0x55b37c9536e0, arg_tags=0x7fff03850100, arg_vgnames=0x7fff03850110, vgnameids_to_process=0x7fff03850130, read_flags=262144, cmd=0x55b37c90c020) at toollib.c:1946
#10 process_each_vg (cmd=cmd@entry=0x55b37c90c020, argc=<optimized out>, argv=<optimized out>, one_vgname=one_vgname@entry=0x0, use_vgnames=use_vgnames@entry=0x0, read_flags=262144, read_flags@entry=0, include_internal=include_internal@entry=0, handle=handle@entry=0x55b37c9536e0, process_single_vg=0x55b37a4d1d40 <_vgs_single>) at toollib.c:2277
#11 0x000055b37a4d3b7d in _do_report (cmd=cmd@entry=0x55b37c90c020, handle=handle@entry=0x55b37c9536e0, args=args@entry=0x7fff03850300, single_args=single_args@entry=0x7fff03850348) at reporter.c:1183
#12 0x000055b37a4d3d82 in _report (cmd=0x55b37c90c020, argc=0, argv=0x7fff03850970, report_type=<optimized out>) at reporter.c:1428
#13 0x000055b37a4c61c7 in lvm_run_command (cmd=cmd@entry=0x55b37c90c020, argc=0, argc@entry=1, argv=0x7fff03850970, argv@entry=0x7fff03850968) at lvmcmdline.c:2880
#14 0x000055b37a4c703e in lvm2_main (argc=1, argv=0x7fff03850968) at lvmcmdline.c:3408
#15 0x00007f67dc4c0c05 in __libc_start_main () from /lib64/libc.so.6
#16 0x000055b37a4a5dd3 in _start ()

3.10.0-609.el7.x86_64

lvm2-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
lvm2-libs-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
lvm2-cluster-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-libs-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-event-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-event-libs-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016
Quick update that this exists in the latest errata rpms as well.

3.10.0-635.el7.x86_64

lvm2-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-libs-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-cluster-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-libs-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-libs-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

Mar 30 11:56:30 host-077 kernel: vgs[19420]: segfault at 80 ip 00007ff9c18630c7 sp 00007ffd292e8da8 error 4 in libc-2.17.so[7ff9c17cf000+1b8000]

(gdb) bt
#0  0x00007f6a4a1320c7 in __strncpy_sse2 () from /lib64/libc.so.6
#1  0x00005632f56a634e in strncpy (__len=32, __src=<optimized out>, __dest=0x7ffddcdb4e50 "YxoVmYt7UrcZNS9S3ozLNAUymIe5bidz") at /usr/include/bits/string3.h:120
#2  lvmcache_info_from_pvid (pvid=<optimized out>, dev=0x0, valid_only=valid_only@entry=0) at cache/lvmcache.c:717
#3  0x00005632f570093a in _check_or_repair_pv_ext (inconsistent_pvs=<synthetic pointer>, repair=1, vg=0x5632f79624e0, cmd=0x5632f7899020) at metadata/metadata.c:4055
#4  _vg_read (cmd=cmd@entry=0x5632f7899020, vgname=<optimized out>, vgname@entry=0x5632f78e5548 "black_bird", vgid=<optimized out>, vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7ffddcdb50c4, precommitted=precommitted@entry=0) at metadata/metadata.c:4625
#5  0x00005632f57014aa in vg_read_internal (cmd=cmd@entry=0x5632f7899020, vgname=vgname@entry=0x5632f78e5548 "black_bird", vgid=vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7ffddcdb50c4) at metadata/metadata.c:4791
#6  0x00005632f57031bd in _recover_vg (vgid=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", vg_name=0x5632f78e5548 "black_bird", cmd=0x5632f7899020) at metadata/metadata.c:5525
#7  _vg_lock_and_read (lockd_state=4, read_flags=262144, status_flags=0, lock_flags=33, vgid=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", vg_name=0x5632f78e5548 "black_bird", cmd=0x5632f7899020) at metadata/metadata.c:5830
#8  vg_read (cmd=cmd@entry=0x5632f7899020, vg_name=vg_name@entry=0x5632f78e5548 "black_bird", vgid=vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", read_flags=read_flags@entry=262144, lockd_state=4) at metadata/metadata.c:5916
#9  0x00005632f568c324 in _process_vgnameid_list (process_single_vg=0x5632f56829d0 <_vgs_single>, handle=0x5632f78e4760, arg_tags=0x7ffddcdb51b0, arg_vgnames=0x7ffddcdb51c0, vgnameids_to_process=0x7ffddcdb51e0, read_flags=262144, cmd=0x5632f7899020) at toollib.c:1946
#10 process_each_vg (cmd=cmd@entry=0x5632f7899020, argc=<optimized out>, argv=<optimized out>, one_vgname=one_vgname@entry=0x0, use_vgnames=use_vgnames@entry=0x0, read_flags=262144, read_flags@entry=0, include_internal=include_internal@entry=0, handle=handle@entry=0x5632f78e4760, process_single_vg=0x5632f56829d0 <_vgs_single>) at toollib.c:2277
#11 0x00005632f56846af in _do_report (cmd=cmd@entry=0x5632f7899020, handle=handle@entry=0x5632f78e4760, args=args@entry=0x7ffddcdb53b0, single_args=single_args@entry=0x7ffddcdb53f8) at reporter.c:1183
#12 0x00005632f568489f in _report (cmd=0x5632f7899020, argc=0, argv=0x7ffddcdb5a00, report_type=<optimized out>) at reporter.c:1428
#13 0x00005632f56773b5 in lvm_run_command (cmd=cmd@entry=0x5632f7899020, argc=0, argc@entry=1, argv=0x7ffddcdb5a00, argv@entry=0x7ffddcdb59f8) at lvmcmdline.c:2880
#14 0x00005632f5677f83 in lvm2_main (argc=1, argv=0x7ffddcdb59f8) at lvmcmdline.c:3406
#15 0x00007f6a4a0bfc05 in __libc_start_main () from /lib64/libc.so.6
#16 0x00005632f5657e2e in _start ()
I think this is another case where is_missing_pv(pvl->pv) is not always sufficient to check for a missing device, and pvl->pv->dev also needs to be checked for NULL.

Another commit fixing the same kind of segfault: 3c53acb378478f23acf624be8836c0cb24c2724e

Here's a patch that I suspect will fix this, but I've not been able to reproduce the segfault to verify.

diff --git a/lib/metadata/metadata.c b/lib/metadata/metadata.c
index 1b500ed0382a..972306eea853 100644
--- a/lib/metadata/metadata.c
+++ b/lib/metadata/metadata.c
@@ -4039,6 +4039,7 @@ static int _check_or_repair_pv_ext(struct cmd_context *cmd,
                                    struct volume_group *vg,
                                    int repair, int *inconsistent_pvs)
 {
+        char uuid[64] __attribute__((aligned(8)));
         struct lvmcache_info *info;
         uint32_t ext_version, ext_flags;
         struct pv_list *pvl;
@@ -4052,6 +4053,14 @@ static int _check_or_repair_pv_ext(struct cmd_context *cmd,
                 if (is_missing_pv(pvl->pv))
                         continue;
 
+                if (!pvl->pv->dev) {
+                        /* is_missing_pv doesn't catch NULL dev */
+                        memset(&uuid, 0, sizeof(uuid));
+                        id_write_format(&pvl->pv->id, uuid, sizeof(uuid));
+                        log_warn("WARNING: Not repairing PV %s with missing device.", uuid);
+                        continue;
+                }
+
                 if (!(info = lvmcache_info_from_pvid(pvl->pv->dev->pvid, pvl->pv->dev, 0))) {
                         log_error("Failed to find cached info for PV %s.", pv_dev_name(pvl->pv));
                         goto out;
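For context on why the crash lands in strncpy() with fault address 0x80: the caller passes pvl->pv->dev->pvid into lvmcache_info_from_pvid(), and because pvid is an embedded array inside struct device, evaluating dev->pvid on a NULL dev only computes an address; the fault happens when strncpy() reads from it. Below is a minimal, hypothetical stand-alone sketch of that pattern, with simplified types and names that are not the actual LVM2 structures and a padding size chosen only so the fault address matches the observed 0x80. Building and running it deliberately segfaults.

/* Illustrative only: hypothetical simplified stand-in for the LVM2 structures. */
#include <stdio.h>
#include <string.h>

#define ID_LEN 32

struct device {                      /* stand-in for LVM2's struct device */
        char other_fields[0x80];     /* assumed layout: pvid at a small offset */
        char pvid[ID_LEN + 1];       /* pvid is an embedded array, not a pointer */
};

static void info_from_pvid(const char *pvid)
{
        char pvid_s[ID_LEN + 1];

        /* Corresponds to the strncpy() at the top of the backtrace:
         * the read from (NULL + offset of pvid) is what actually faults. */
        strncpy(pvid_s, pvid, ID_LEN);
        pvid_s[ID_LEN] = '\0';
        printf("%s\n", pvid_s);
}

int main(void)
{
        struct device *dev = NULL;   /* pv->dev is NULL for an unknown device */

        /* dev->pvid only does pointer arithmetic (undefined behaviour, but in
         * practice no fault yet), so control reaches the strncpy() above. */
        info_from_pvid(dev->pvid);
        return 0;
}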
This seems like the likely fix, although I've not reproduced the bug to verify it. https://sourceware.org/git/?p=lvm2.git;a=commit;h=d45531712d7c18035975980069a1683c218abec2
This is a correct fix to avoid the segfault. However, the bug report has one puzzling line in it:

  Missing device unknown device reappeared, updating metadata for VG black_bird to version 7.

which is noticeably NOT followed by

  Device still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.

as in the other cases. Why has the code apparently reinstated a missing device?

First, let's find an easier way to reach that path through the code, to get a simpler reproducer:

  Create a VG with 3 PVs: pv1, pv2, pv3.
  Hide pv2.
  Run vgreduce --removemissing - the VG now contains pv1 and pv3, and pv2 has outdated metadata on it.
  Reinstate the hidden PV pv2 and at the same time hide a different PV, pv3.

The code now has to do two things at once: sort out the reinstated pv2, which requires a metadata update, and deal with the missing pv3. This takes it through _check_reappeared_pv(), the function responsible for the messages above, while there is already a PV marked internally as missing (pv3). Soon afterwards, it crashes.

The patch in the last couple of comments shows that pv->dev is NULL for an unknown device. A quick glance at _check_reappeared_pv() shows that it happily concludes that, because NULL == NULL, the unknown device is no longer a MISSING_PV!
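To make the NULL == NULL point concrete, here is a minimal hypothetical sketch of the flawed comparison described above. It is not the real _check_reappeared_pv() source; the types, names, and the exact comparison are simplified stand-ins, but it shows how an unknown device (dev == NULL on both sides) can be mistaken for a reappeared one.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct device { int placeholder; };          /* simplified stand-in */

struct physical_volume {
        struct device *dev;   /* NULL for an unknown device (e.g. the hidden pv3) */
        bool missing;         /* stand-in for the MISSING_PV flag */
};

static void check_reappeared(struct physical_volume *pv, struct device *dev_now)
{
        if (pv->dev == dev_now)       /* NULL == NULL "matches"... */
                pv->missing = false;  /* ...so the PV is wrongly reinstated */
}

int main(void)
{
        struct physical_volume pv3 = { .dev = NULL, .missing = true };

        check_reappeared(&pv3, NULL);
        printf("pv3 missing after check: %d\n", pv3.missing);  /* prints 0 */
        return 0;
}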
https://www.redhat.com/archives/lvm-devel/2017-May/msg00038.html
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=80900dcf76d2c91d8892e54913367bfed1012056
_check_reappeared_pv() incorrectly clears the MISSING_PV flags of PVs with unknown devices. While one caller avoids passing such PVs into the function, the other doesn't. Move the check inside the function so it's not forgotten (see the sketch after the reproducer below).

Without this patch (but with the first patch on this bug), if the normal VG reading code tries to repair inconsistent metadata while there is an unknown PV, it incorrectly considers the missing PVs no longer to be missing and produces incorrect 'pvs' output omitting the missing PV, for example.

Easy reproducer:
  Create a VG with 3 PVs: pv1, pv2, pv3.
  Hide pv2.
  Run vgreduce --removemissing.
  Reinstate the hidden PV pv2 and at the same time hide a different PV, pv3.
  Run 'pvs' - incorrect output.
  Run 'pvs' again - correct output.
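As a rough illustration of what "move the check inside the function" means here, continuing the same simplified, hypothetical stand-ins used in the sketch above (not the actual LVM2 signatures or the literal commit content): the NULL-device guard lives inside the function itself, so a caller that forgets its own check can no longer reinstate an unknown PV.

#include <stdbool.h>
#include <stddef.h>

struct device { int placeholder; };          /* simplified stand-in */

struct physical_volume {
        struct device *dev;   /* NULL for an unknown device */
        bool missing;         /* stand-in for the MISSING_PV flag */
};

/* Sketch of the fixed shape: previously only one of the two callers skipped
 * PVs with no device; now the function refuses to reinstate them itself. */
static void check_reappeared_pv(struct physical_volume *pv)
{
        if (!pv->dev)
                return;           /* unknown device: leave MISSING_PV set */
        pv->missing = false;      /* device really is back: clear the flag */
}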
Marking verified in the latest rpms.

3.10.0-672.el7.x86_64

lvm2-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
lvm2-libs-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
lvm2-cluster-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-libs-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-event-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-event-libs-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

The test case in comment #0 no longer causes the vgs segfault w/o lvmetad running.

================================================================================
Iteration 0.1 started at Tue Jun 6 17:25:02 CDT 2017
================================================================================
Scenario kill_three_synced_raid10_3legs: Kill three legs (none of which share the same stripe leg) of synced 3 leg raid10 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_three_raid10_3legs_1
* sync:               1
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sdg1 /dev/sdc1 /dev/sde1 /dev/sdb1 /dev/sdh1 /dev/sda1
* spanned legs:       0
* manual repair:      0
* no MDA devices:
* failpv(s):          /dev/sdg1 /dev/sde1 /dev/sdh1
* additional snap:    /dev/sdc1
* failnode(s):        host-073
* lvmetad:            0
* raid fault policy:  allocate
******************************************************
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2222
The segfault could happen in any lvm command, not just the 'vgs' command. Otherwise it looks fine.
Thanks, David. Fixed to refer to LVM tools in general.
I made some minor editorial changes and published the description: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/bug_fixes_storage Thanks for the input!