Bug 1434054 - vgs segfault after re-enabling failed raid10 images when lvmetad is not running
Summary: vgs segfault after re-enabling failed raid10 images when lvmetad is not running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
Docs Contact: Marek Suchánek
URL:
Whiteboard:
Depends On: 1412843
Blocks:
 
Reported: 2017-03-20 15:58 UTC by Corey Marthaler
Modified: 2017-10-31 14:42 UTC
CC List: 13 users

Fixed In Version: lvm2-2.02.171-2.el7
Doc Type: Bug Fix
Doc Text:
LVM tools no longer crash due to an incorrect status of PVs

When LVM observes certain types of inconsistencies between the metadata on Physical Volumes (PVs) in a Volume Group (VG), it can repair them automatically. Such inconsistencies occur, for example, when a VG is changed while some of its PVs are temporarily invisible to the system and the PVs later reappear. Prior to this update, when such a repair operation was performed, all of the PVs were sometimes temporarily considered to have returned even when this was not the case. As a consequence, LVM tools sometimes terminated unexpectedly with a segmentation fault. With this update, the described problem no longer occurs.
Clone Of: 1412843
Environment:
Last Closed: 2017-08-01 21:49:49 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2222 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2017-08-01 18:42:41 UTC

Comment 2 Corey Marthaler 2017-03-20 16:00:44 UTC
RHEL 7.4 failure:

Mar 20 10:31:38 host-092 kernel: vgs[6962]: segfault at 80 ip 00007fa666c7c0c7 sp 00007ffc69b90358 error 4 in libc-2.17.so[7fa666be8000+1b8000]


Core was generated by `vgs'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f67dc5330c7 in __strncpy_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.168-5.el7.x86_64 elfutils-libs-0.168-5.el7.x86_64 glibc-2.17-184.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-9.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 pcre-8.32-17.el7.x86_64 readline-6.2-10.el7.x86_64 systemd-libs-219-32.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f67dc5330c7 in __strncpy_sse2 () from /lib64/libc.so.6
#1  0x000055b37a4f6771 in strncpy (__len=32, __src=<optimized out>, __dest=0x7fff0384fd80 "21b0N4GkK13xlvCPo2UlYhlw7bKwHVVU") at /usr/include/bits/string3.h:120
#2  lvmcache_info_from_pvid (pvid=<optimized out>, dev=0x0, valid_only=valid_only@entry=0) at cache/lvmcache.c:717
#3  0x000055b37a55302f in _check_or_repair_pv_ext (inconsistent_pvs=<synthetic pointer>, repair=1, vg=0x55b37c9bc1d0, cmd=0x55b37c90c020) at metadata/metadata.c:4055
#4  _vg_read (cmd=cmd@entry=0x55b37c90c020, vgname=<optimized out>, vgname@entry=0x55b37c9546d0 "helter_skelter", vgid=<optimized out>, 
    vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7fff03850004, precommitted=precommitted@entry=0)
    at metadata/metadata.c:4625
#5  0x000055b37a553bc6 in vg_read_internal (cmd=cmd@entry=0x55b37c90c020, vgname=vgname@entry=0x55b37c9546d0 "helter_skelter", 
    vgid=vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", warn_flags=warn_flags@entry=1, consistent=consistent@entry=0x7fff03850004) at metadata/metadata.c:4791
#6  0x000055b37a555976 in _recover_vg (vgid=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", vg_name=0x55b37c9546d0 "helter_skelter", cmd=0x55b37c90c020) at metadata/metadata.c:5525
#7  _vg_lock_and_read (lockd_state=4, read_flags=262144, status_flags=0, lock_flags=33, vgid=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", vg_name=0x55b37c9546d0 "helter_skelter", 
    cmd=0x55b37c90c020) at metadata/metadata.c:5830
#8  vg_read (cmd=cmd@entry=0x55b37c90c020, vg_name=vg_name@entry=0x55b37c9546d0 "helter_skelter", vgid=vgid@entry=0x55b37c9546a8 "1iipE0RgcjNeCLT2s3O2IBK3bBWOZSSb", 
    read_flags=read_flags@entry=262144, lockd_state=4) at metadata/metadata.c:5916
#9  0x000055b37a4dbb26 in _process_vgnameid_list (process_single_vg=0x55b37a4d1d40 <_vgs_single>, handle=0x55b37c9536e0, arg_tags=0x7fff03850100, arg_vgnames=0x7fff03850110, 
    vgnameids_to_process=0x7fff03850130, read_flags=262144, cmd=0x55b37c90c020) at toollib.c:1946
#10 process_each_vg (cmd=cmd@entry=0x55b37c90c020, argc=<optimized out>, argv=<optimized out>, one_vgname=one_vgname@entry=0x0, use_vgnames=use_vgnames@entry=0x0, read_flags=262144, 
    read_flags@entry=0, include_internal=include_internal@entry=0, handle=handle@entry=0x55b37c9536e0, process_single_vg=0x55b37a4d1d40 <_vgs_single>) at toollib.c:2277
#11 0x000055b37a4d3b7d in _do_report (cmd=cmd@entry=0x55b37c90c020, handle=handle@entry=0x55b37c9536e0, args=args@entry=0x7fff03850300, single_args=single_args@entry=0x7fff03850348)
    at reporter.c:1183
#12 0x000055b37a4d3d82 in _report (cmd=0x55b37c90c020, argc=0, argv=0x7fff03850970, report_type=<optimized out>) at reporter.c:1428
#13 0x000055b37a4c61c7 in lvm_run_command (cmd=cmd@entry=0x55b37c90c020, argc=0, argc@entry=1, argv=0x7fff03850970, argv@entry=0x7fff03850968) at lvmcmdline.c:2880
#14 0x000055b37a4c703e in lvm2_main (argc=1, argv=0x7fff03850968) at lvmcmdline.c:3408
#15 0x00007f67dc4c0c05 in __libc_start_main () from /lib64/libc.so.6
#16 0x000055b37a4a5dd3 in _start ()


3.10.0-609.el7.x86_64

lvm2-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
lvm2-libs-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
lvm2-cluster-2.02.169-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-libs-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-event-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-event-libs-1.02.138-0.1253.el7    BUILT: Fri Mar 17 09:27:00 CDT 2017
device-mapper-persistent-data-0.6.3-1.el7    BUILT: Fri Jul 22 05:29:13 CDT 2016

Comment 3 Corey Marthaler 2017-03-30 19:52:03 UTC
A quick update: this exists in the latest errata RPMs as well.

3.10.0-635.el7.x86_64

lvm2-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-libs-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
lvm2-cluster-2.02.169-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-libs-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-event-libs-1.02.138-3.el7    BUILT: Wed Mar 29 09:17:46 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017




Mar 30 11:56:30 host-077 kernel: vgs[19420]: segfault at 80 ip 00007ff9c18630c7 sp 00007ffd292e8da8 error 4 in libc-2.17.so[7ff9c17cf000+1b8000]

(gdb) bt
#0  0x00007f6a4a1320c7 in __strncpy_sse2 () from /lib64/libc.so.6
#1  0x00005632f56a634e in strncpy (__len=32, __src=<optimized out>, __dest=0x7ffddcdb4e50 "YxoVmYt7UrcZNS9S3ozLNAUymIe5bidz") at /usr/include/bits/string3.h:120
#2  lvmcache_info_from_pvid (pvid=<optimized out>, dev=0x0, valid_only=valid_only@entry=0) at cache/lvmcache.c:717
#3  0x00005632f570093a in _check_or_repair_pv_ext (inconsistent_pvs=<synthetic pointer>, repair=1, vg=0x5632f79624e0, cmd=0x5632f7899020) at metadata/metadata.c:4055
#4  _vg_read (cmd=cmd@entry=0x5632f7899020, vgname=<optimized out>, vgname@entry=0x5632f78e5548 "black_bird", vgid=<optimized out>, vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", warn_flags=warn_flags@entry=1, 
    consistent=consistent@entry=0x7ffddcdb50c4, precommitted=precommitted@entry=0) at metadata/metadata.c:4625
#5  0x00005632f57014aa in vg_read_internal (cmd=cmd@entry=0x5632f7899020, vgname=vgname@entry=0x5632f78e5548 "black_bird", vgid=vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", warn_flags=warn_flags@entry=1, 
    consistent=consistent@entry=0x7ffddcdb50c4) at metadata/metadata.c:4791
#6  0x00005632f57031bd in _recover_vg (vgid=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", vg_name=0x5632f78e5548 "black_bird", cmd=0x5632f7899020) at metadata/metadata.c:5525
#7  _vg_lock_and_read (lockd_state=4, read_flags=262144, status_flags=0, lock_flags=33, vgid=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", vg_name=0x5632f78e5548 "black_bird", cmd=0x5632f7899020) at metadata/metadata.c:5830
#8  vg_read (cmd=cmd@entry=0x5632f7899020, vg_name=vg_name@entry=0x5632f78e5548 "black_bird", vgid=vgid@entry=0x5632f78e5520 "imDXYqjzWo5Fo0nAsbaxnrbuqtlq9ex0", read_flags=read_flags@entry=262144, lockd_state=4)
    at metadata/metadata.c:5916
#9  0x00005632f568c324 in _process_vgnameid_list (process_single_vg=0x5632f56829d0 <_vgs_single>, handle=0x5632f78e4760, arg_tags=0x7ffddcdb51b0, arg_vgnames=0x7ffddcdb51c0, vgnameids_to_process=0x7ffddcdb51e0, read_flags=262144, 
    cmd=0x5632f7899020) at toollib.c:1946
#10 process_each_vg (cmd=cmd@entry=0x5632f7899020, argc=<optimized out>, argv=<optimized out>, one_vgname=one_vgname@entry=0x0, use_vgnames=use_vgnames@entry=0x0, read_flags=262144, read_flags@entry=0, 
    include_internal=include_internal@entry=0, handle=handle@entry=0x5632f78e4760, process_single_vg=0x5632f56829d0 <_vgs_single>) at toollib.c:2277
#11 0x00005632f56846af in _do_report (cmd=cmd@entry=0x5632f7899020, handle=handle@entry=0x5632f78e4760, args=args@entry=0x7ffddcdb53b0, single_args=single_args@entry=0x7ffddcdb53f8) at reporter.c:1183
#12 0x00005632f568489f in _report (cmd=0x5632f7899020, argc=0, argv=0x7ffddcdb5a00, report_type=<optimized out>) at reporter.c:1428
#13 0x00005632f56773b5 in lvm_run_command (cmd=cmd@entry=0x5632f7899020, argc=0, argc@entry=1, argv=0x7ffddcdb5a00, argv@entry=0x7ffddcdb59f8) at lvmcmdline.c:2880
#14 0x00005632f5677f83 in lvm2_main (argc=1, argv=0x7ffddcdb59f8) at lvmcmdline.c:3406
#15 0x00007f6a4a0bfc05 in __libc_start_main () from /lib64/libc.so.6
#16 0x00005632f5657e2e in _start ()

Comment 4 David Teigland 2017-05-09 17:15:47 UTC
I think this is another case where is_missing_pv(pvl->pv) is not always sufficient to check for a missing device, and pvl->pv->dev also needs to be checked for NULL.

Another commit fixing the same kind of segfault: 3c53acb378478f23acf624be8836c0cb24c2724e

Here's a patch that I suspect will fix this, but I've not been able to reproduce the segfault to verify.

diff --git a/lib/metadata/metadata.c b/lib/metadata/metadata.c
index 1b500ed0382a..972306eea853 100644
--- a/lib/metadata/metadata.c
+++ b/lib/metadata/metadata.c
@@ -4039,6 +4039,7 @@ static int _check_or_repair_pv_ext(struct cmd_context *cmd,
                                   struct volume_group *vg,
                                   int repair, int *inconsistent_pvs)
 {
+       char uuid[64] __attribute__((aligned(8)));
        struct lvmcache_info *info;
        uint32_t ext_version, ext_flags;
        struct pv_list *pvl;
@@ -4052,6 +4053,14 @@ static int _check_or_repair_pv_ext(struct cmd_context *cmd,
                if (is_missing_pv(pvl->pv))
                        continue;
 
+               if (!pvl->pv->dev) {
+                       /* is_missing_pv doesn't catch NULL dev */
+                       memset(&uuid, 0, sizeof(uuid));
+                       id_write_format(&pvl->pv->id, uuid, sizeof(uuid));
+                       log_warn("WARNING: Not repairing PV %s with missing device.", uuid);
+                       continue;
+               }
+
                if (!(info = lvmcache_info_from_pvid(pvl->pv->dev->pvid, pvl->pv->dev, 0))) {
                        log_error("Failed to find cached info for PV %s.", pv_dev_name(pvl->pv));
                        goto out;

Comment 5 David Teigland 2017-05-10 15:53:11 UTC
This seems like the likely fix, although I've not reproduced the bug to verify it.

https://sourceware.org/git/?p=lvm2.git;a=commit;h=d45531712d7c18035975980069a1683c218abec2

Comment 6 Alasdair Kergon 2017-05-11 00:32:36 UTC
This is a correct fix to avoid the segfault.

However, the bug report has one puzzling line in it:

  Missing device unknown device reappeared, updating metadata for VG black_bird to version 7.

which is noticeably NOT followed by

  Device still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.

as in the other cases.

Why has the code apparently reinstated a missing device?

Firstly, let's find an easier way to reach that path through the code to get a simpler reproducer:

Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
- The VG now contains pv1 and pv3; pv2 has outdated metadata on it.
Reinstate the hidden PV pv2 and at the same time hide a different PV pv3.
The code now has to do two things at once: sort out the reinstated pv2, which requires a metadata update, and deal with the missing pv3. This takes it through _check_reappeared_pv(), the function responsible for those messages, while a PV (pv3) is already marked internally as missing.
Soon afterwards, it crashes.

The patch in the last couple of comments shows that pv->dev is NULL for an unknown device. A quick glance at _check_reappeared_pv() shows that it happily concludes, because NULL == NULL, that the unknown device is no longer a MISSING_PV!

Comment 8 Alasdair Kergon 2017-05-11 01:39:36 UTC
_check_reappeared_pv() incorrectly clears the MISSING_PV flag of PVs with unknown devices.
While one caller avoids passing such PVs into the function, the other does not. Move the check inside the function so it is not forgotten.

Without this patch (but with the first patch on this bug), if the normal VG reading code tries to repair inconsistent metadata while there is an unknown PV, it incorrectly considers the missing PVs no longer to be missing and produces incorrect 'pvs' output omitting the missing PV, for example.

Easy reproducer:
Create a VG with 3 PVs pv1, pv2, pv3.
Hide pv2.
Run vgreduce --removemissing.
Reinstate the hidden PV pv2 and at the same time hide a different PV pv3.
Run 'pvs' - incorrect output.
Run 'pvs' again - correct output.

Comment 10 Corey Marthaler 2017-06-06 22:53:58 UTC
Marking this verified in the latest RPMs.

3.10.0-672.el7.x86_64
lvm2-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
lvm2-libs-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
lvm2-cluster-2.02.171-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-libs-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-event-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-event-libs-1.02.140-3.el7    BUILT: Wed May 31 08:36:29 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017


The test case in comment #0 no longer causes the vgs segfault without lvmetad running.

================================================================================
Iteration 0.1 started at Tue Jun  6 17:25:02 CDT 2017
================================================================================
Scenario kill_three_synced_raid10_3legs: Kill three legs (none of which share the same stripe leg) of synced 3 leg raid10 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_three_raid10_3legs_1
* sync:               1
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sdg1 /dev/sdc1 /dev/sde1 /dev/sdb1 /dev/sdh1 /dev/sda1
* spanned legs:       0
* manual repair:      0
* no MDA devices:     
* failpv(s):          /dev/sdg1 /dev/sde1 /dev/sdh1
* additional snap:    /dev/sdc1
* failnode(s):        host-073
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Comment 12 errata-xmlrpc 2017-08-01 21:49:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2222

Comment 14 David Teigland 2017-10-30 16:41:02 UTC
The segfault could happen in any lvm command, not just the 'vgs' command.  Otherwise it looks fine.

Comment 15 Marek Suchánek 2017-10-31 11:25:56 UTC
Thanks, David. Fixed to refer to LVM tools in general.

Comment 16 Lenka Špačková 2017-10-31 14:42:36 UTC
I made some minor editorial changes and published the description:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/bug_fixes_storage

Thanks for the input!

