Bug 885811 - [lvmetad] segfaulting with upstream LVM2 (w/ patch attached)
Summary: [lvmetad] segfaulting with upstream LVM2 (w/ patch attached)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Petr Rockai
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 888457
 
Reported: 2012-12-10 17:11 UTC by Marian Csontos
Modified: 2013-02-21 08:15 UTC
CC: 12 users

Fixed In Version: lvm2-2.02.98-5.el6
Doc Type: Bug Fix
Doc Text:
A problem in the VG auto-activation code could cause crashes when a PV without MDAs triggered the activation. The code has been updated to use cached VG metadata instead of relying on the (in this case absent) MDA content of the last discovered PV.
Clone Of:
Environment:
Last Closed: 2013-02-21 08:15:16 UTC
Target Upstream Version:


Attachments (Terms of Use)
Proposed fix (753 bytes, patch)
2012-12-10 17:11 UTC, Marian Csontos


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:0501 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2013-02-20 21:30:45 UTC

Description Marian Csontos 2012-12-10 17:11:47 UTC
Created attachment 660969 [details]
Proposed fix

Description of problem:
lvmetad is frequently crashing in most STS test scenarios, leaving behind VGs which cannot be deactivated.

An incorrect function is used to release memory when handling the error path.

I am not sure which commit uncovered this hidden bug, which has been present since June.

Version-Release number of selected component (if applicable):
lvm2-2.02.98-4.el6

How reproducible:
high (in terms of STS failures)

Steps to Reproduce:
1. run the STS test suite - we do not have a better reproducer yet - prockai or zkabelac are likely to identify one by looking at the backtrace...
  
Actual results:
lvmetad core dumping

Expected results:
no crash

Additional info:
It is always the dm_free call at cache/lvmetad.c:729 that crashes:

    #5  0x00000000004a3181 in lvmetad_pv_found (pvid=<value optimized out>, device=0x11dec78, fmt=<value optimized out>, label_sector=1, vg=0x0, handler=0x42dd10 <_auto_activation_handler>) at cache/lvmetad.c:729
            uuid = "k1ZXEF-0HbB-BoV6-J2GR-8Xku-f3pU-IeFY3F\000\000p2\036\001\000\000\000\000\200", '\000' <repeats 14 times>
            reply = {error = 0, buffer = {allocated = 18371456, used = 0, mem = 0x0}, cft = 0x7fd877840fd0}
            info = <value optimized out>
            pvmeta = 0x11e3e10
            vgmeta = <value optimized out>
            status = <value optimized out>
            result = <value optimized out>

See attached patch for proposed fix.

Comment 2 Marian Csontos 2012-12-10 18:36:18 UTC
The patch fixed the segfault, but tests are still failing because the LV is not deactivated by lvchange -an LV. Investigating.

Comment 3 Zdenek Kabelac 2012-12-11 11:02:24 UTC
Proposed patch merged upstream:

https://www.redhat.com/archives/lvm-devel/2012-December/msg00021.html


Assigning to Petr to track down the real reason for the "Internal error" code path.

It might be worth running the STS tests with the lvm.conf setting:

global/abort_on_internal_errors = 1
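For reference, a minimal lvm.conf fragment enabling this (the section and key match the stock lvm.conf layout; the comment is ours):

```
# /etc/lvm/lvm.conf
global {
    # Abort (and core dump) immediately on any "Internal error:" message,
    # instead of continuing with possibly inconsistent state.
    abort_on_internal_errors = 1
}
```

With this set, the "Internal error: Handler needs existing VG." message seen below becomes a fatal abort, which makes the faulty code path much easier to catch in a test run.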

Comment 4 Marian Csontos 2012-12-11 20:14:17 UTC
Full backtrace:

(gdb) bt full
#0  0x00007fc07e9bb8a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
        resultvar = 0
        pid = <value optimized out>
        selftid = 6170
#1  0x00007fc07e9bd085 in abort () at abort.c:92
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7fff034ebc4e, sa_sigaction = 0x7fff034ebc4e}, sa_mask = {__val = {140733248875600, 140733248879694, 27283200, 5161767, 1, 140464734403416, 140464751448064, 
              140733248880928, 4294967295, 206158430248, 1, 7435432, 0, 0, 0, 0}}, sa_flags = 2138877904, sa_restorer = 0x1}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000000000475a76 in print_log (level=2, file=0x4fd9ae "cache/lvmetad.c", line=728, dm_errno=-1, format=0x4fde90 "Internal error: Handler needs existing VG.") at log/log.c:348
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fff034ec130, reg_save_area = 0x7fff034ec060}}
        buf = "\340N\240\001\000\000\000\000\242R\240\001\000\000\000\000\260\274N\003\377\177\000\060\235\236\370~\300\177\000\000\000\000\000\000\000\000\000\060@\276N\003\b\000\000\000(\000\000\000\000\000\000\000\220N\240\001\002\000\000\000\030\000\000\000\060\000\000\000 \300N\003\377\177\000\000P\277N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\242R\240\001\000\000\000\000\020\275N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000\060\275N\003\377\177\000\000\235\236\370~\300\177\000\000 \275N\003\377\177\000\000;\246\370~\b\000\000\000(\000\000\000\000\000\000\000\220N\240\001\002\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000p\275N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000\220\275N\003\377\177\000\000\235\236\370~\300\177\000\000\200\275N\003\377\177\000\000\353\246\370~\b\000\000\000"...
        locn = "\000nternal error: Handler needs existing VG.\000\000\000\000\000\000\004\070}\037", '\000' <repeats 28 times>, "\002\000\000\000\000\000\000\000\300\264N\003\377\177\000\000\377\377\377\377/devices/virtual/blo\300K\321~\300\177\000\000`h\321~\300\177\000\000get8:0:0/8:0:0:1/block/s\200՞~\300\177\000\000`h\321~\300\177\000\000\260(\233\001", '\000' <repeats 20 times>, "`h\321~\300\177\000\000\000\264N\003\377\177\000\000\207(\255\373\000\000\000\000@\264N\003\377\177\000\000\236Ԝ~\300\177\000\000 \271\233~\300\177\000\000\002\000\000\000\000\000\000\000\060\200\237\001", '\000' <repeats 12 times>"\210"...
        bufused = 0
        n = 42
        message = 0x7fff034eac50 ""
        trformat = 0x4fde90 "Internal error: Handler needs existing VG."
        newbuf = 0x19f9c50 "Internal error: Handler needs existing VG."
        use_stderr = 0
        log_once = 0
        fatal_internal_error = 1
        msglen = 42
#3  0x00000000004cb104 in lvmetad_pv_found (pvid=0x1a0fa28, device=0x1a0f9d8, fmt=0x19c85e0, label_sector=1, vg=0x0, handler=0x434be5 <_auto_activation_handler>) at cache/lvmetad.c:728
        uuid = "jdf3sY-uFzI-jMBS-87Ct-UyDn-Wdcd-PZcywz\000\000:\236\370~\300\177\000\000\000\000\000\000\000\000\000\000\220N\240\001\000\000\000"
        reply = {error = 0, buffer = {allocated = 2138877904, used = 0, mem = 0x1 <Address 0x1 out of bounds>}, cft = 0x0}
        info = 0x1a042c0
        pvmeta = 0x1a04f00
        vgmeta = 0x7fff034ec230
        status = 0x19c85e0 "\240c\232\001"
        result = 0
#4  0x00000000004cb8bb in lvmetad_pvscan_single (cmd=0x19a6380, dev=0x1a0f9d8, handler=0x434be5 <_auto_activation_handler>) at cache/lvmetad.c:862
        label = 0x19fe1d0
        info = 0x1a042c0
        baton = {vg = 0x0, fid = 0x19f7df0}
        fic = {type = 0, context = {pv_id = 0x0, vg_ref = {vg_name = 0x0, vg_id = 0x0}, private = 0x0}}
#5  0x000000000043512f in _pvscan_lvmetad (cmd=0x19a6380, argc=-1, argv=0x7fff034ec650) at pvscan.c:216
        ret = 1
        dev = 0x1a0f9d8
        pv_name = 0x0
        major = 253
        minor = 3
        devno_args = 1
        current_group = 0x19f5de0
        devno = 64771
        buf = 0x0
        handler = 0x434be5 <_auto_activation_handler>
#6  0x000000000043523e in pvscan (cmd=0x19a6380, argc=0, argv=0x7fff034ec650) at pvscan.c:248
        new_pvs_found = 0
        pvs_found = 0
        pvslist = 0x100000000
        pvl = 0x19afe60
        pv = 0x72d2e0
        size_total = 0
        size_new = 0
        len = 0
#7  0x0000000000429144 in lvm_run_command (cmd=0x19a6380, argc=0, argv=0x7fff034ec650) at lvmcmdline.c:1120
        ret = 0
        locking_type = -1
        monitoring = 1
        old_cft = 0xffff
#8  0x000000000042a5e7 in lvm2_main (argc=8, argv=0x7fff034ec610) at lvmcmdline.c:1554
        base = 0x7fff034edc83 "lvm"
        ret = 0
        alias = 0
        cmd = 0x19a6380

When this happens in lvmetad, the rest of the system seems to go on as if nothing has happened, so there is little use in running lvmdump, as we may be catching something unrelated.

Comment 5 Petr Rockai 2012-12-11 22:12:20 UTC
Now I am thoroughly confused. You say lvmetad has crashed, but the backtrace is certainly not from lvmetad, since lvmetad doesn't even link to lib/. Is this pvscan --cache crashing? Getting the exact command line and some context (what the device was, etc.) would definitely be useful. An lvmdump is definitely relevant; this is not really an lvmetad bug, more like something bad happening in pvscan --cache, which passes in a NULL VG pointer for whatever reason when a non-NULL VG is expected. Presumably k1ZXEF-0HbB-BoV6-J2GR-8Xku-f3pU-IeFY3F is the PV that triggers the problem in the original report, but a backtrace along with the command line of the pvscan --cache that crashed, plus an lvmdump from around that time, would definitely be useful. (If you still have the crashing version, the command that crashed should be mentioned in the system logs.)

Comment 6 Peter Rajnoha 2012-12-12 11:30:21 UTC
So I think we've found the root cause of the problem. The activation_handler needs a VG struct:

  typedef int (*activation_handler) (struct volume_group *vg, int partial, enum activation_change activate);

BUT, if the PV that makes the VG complete and ready for autoactivation has no MDA, _lvmetad_pv_found does not have any struct volume_group to use: the vg passed into _lvmetad_pv_found is NULL, and that NULL value would be passed on to the activation handler as well. Mornfall is preparing a patch that will pass the vgid to the handler instead, so the struct volume_group can be looked up inside the handler...

Comment 7 Peter Rajnoha 2012-12-12 11:58:17 UTC
Reproducer:

(make sure you have use_lvmetad=1 and lvmetad is running!)

[root@rhel6-a ~]# service lvm2-lvmetad start
Starting LVM metadata daemon:                              [  OK  ]

[root@rhel6-a ~]# pvcreate /dev/sda
  Physical volume "/dev/sda" successfully created

[root@rhel6-a ~]# pvcreate --metadatacopies 0 /dev/sdb
  Physical volume "/dev/sdb" successfully created

[root@rhel6-a ~]# vgcreate vg /dev/sda /dev/sdb
  Volume group "vg" successfully created

[root@rhel6-a ~]# pvscan --cache -aay /dev/sda
  0 logical volume(s) in volume group "vg" now active

[root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
  Internal error: Handler needs existing VG.
  Update of lvmetad failed. This is a serious problem.
  It is strongly recommended that you restart lvmetad immediately.

Comment 8 Peter Rajnoha 2012-12-12 12:06:30 UTC
(In reply to comment #7)
> [root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
>   Internal error: Handler needs existing VG.
>   Update of lvmetad failed. This is a serious problem.
>   It is strongly recommended that you restart lvmetad immediately.

This is with current upstream (without the latest patch which fixes it). And with latest rhel 6.4 build lvm2-2.02.98-4.el6 (the original problem reported in this BZ):

[root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
  Internal error: Handler needs existing VG.
*** glibc detected *** pvscan: free(): invalid next size (fast): 0x00000000020b3f70 ***
======= Backtrace: =========
/lib64/libc.so.6[0x348e2760e6]
/lib64/libc.so.6[0x348e278c13]
pvscan(lvmetad_pv_found+0x3a1)[0x4a31f1]
pvscan(lvmetad_pvscan_single+0xf0)[0x4a27f0]
pvscan(pvscan+0x6ce)[0x42e8ee]
pvscan(lvm_run_command+0xbbc)[0x42256c]
pvscan(lvm2_main+0x6a8)[0x4251d8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x348e21ecdd]
pvscan[0x4145b9]
======= Memory map: ========
00400000-004e8000 r-xp 00000000 fd:00 259506                             /sbin/lvm
006e8000-006f1000 rw-p 000e8000 fd:00 259506                             /sbin/lvm
006f1000-0070e000 rw-p 00000000 00:00 0 
01481000-020c6000 rw-p 00000000 00:00 0                                  [heap]
348de00000-348de20000 r-xp 00000000 fd:00 259111                         /lib64/ld-2.12.so
348e01f000-348e020000 r--p 0001f000 fd:00 259111                         /lib64/ld-2.12.so
348e020000-348e021000 rw-p 00020000 fd:00 259111                         /lib64/ld-2.12.so
348e021000-348e022000 rw-p 00000000 00:00 0 
348e200000-348e38a000 r-xp 00000000 fd:00 259125                         /lib64/libc-2.12.so
348e38a000-348e589000 ---p 0018a000 fd:00 259125                         /lib64/libc-2.12.so
348e589000-348e58d000 r--p 00189000 fd:00 259125                         /lib64/libc-2.12.so
348e58d000-348e58e000 rw-p 0018d000 fd:00 259125                         /lib64/libc-2.12.so
348e58e000-348e593000 rw-p 00000000 00:00 0 
348ea00000-348ea02000 r-xp 00000000 fd:00 264639                         /lib64/libdl-2.12.so
348ea02000-348ec02000 ---p 00002000 fd:00 264639                         /lib64/libdl-2.12.so
348ec02000-348ec03000 r--p 00002000 fd:00 264639                         /lib64/libdl-2.12.so
348ec03000-348ec04000 rw-p 00003000 fd:00 264639                         /lib64/libdl-2.12.so
348fa00000-348fa3a000 r-xp 00000000 fd:00 263709                         /lib64/libreadline.so.6.0
348fa3a000-348fc3a000 ---p 0003a000 fd:00 263709                         /lib64/libreadline.so.6.0
348fc3a000-348fc42000 rw-p 0003a000 fd:00 263709                         /lib64/libreadline.so.6.0
348fc42000-348fc43000 rw-p 00000000 00:00 0 
348fe00000-348fe1d000 r-xp 00000000 fd:00 264643                         /lib64/libselinux.so.1
348fe1d000-349001c000 ---p 0001d000 fd:00 264643                         /lib64/libselinux.so.1
349001c000-349001d000 r--p 0001c000 fd:00 264643                         /lib64/libselinux.so.1
349001d000-349001e000 rw-p 0001d000 fd:00 264643                         /lib64/libselinux.so.1
349001e000-349001f000 rw-p 00000000 00:00 0 
3498200000-349823b000 r-xp 00000000 fd:00 264673                         /lib64/libsepol.so.1
349823b000-349843b000 ---p 0003b000 fd:00 264673                         /lib64/libsepol.so.1
349843b000-349843c000 r--p 0003b000 fd:00 264673                         /lib64/libsepol.so.1
349843c000-349843d000 rw-p 0003c000 fd:00 264673                         /lib64/libsepol.so.1
349a200000-349a216000 r-xp 00000000 fd:00 264645                         /lib64/libgcc_s-4.4.7-20120601.so.1
349a216000-349a415000 ---p 00016000 fd:00 264645                         /lib64/libgcc_s-4.4.7-20120601.so.1
349a415000-349a416000 rw-p 00015000 fd:00 264645                         /lib64/libgcc_s-4.4.7-20120601.so.1
349b600000-349b60c000 r-xp 00000000 fd:00 264674                         /lib64/libudev.so.0.5.1
349b60c000-349b80c000 ---p 0000c000 fd:00 264674                         /lib64/libudev.so.0.5.1
349b80c000-349b80d000 r--p 0000c000 fd:00 264674                         /lib64/libudev.so.0.5.1
349b80d000-349b80e000 rw-p 0000d000 fd:00 264674                         /lib64/libudev.so.0.5.1
349f200000-349f21d000 r-xp 00000000 fd:00 264681                         /lib64/libtinfo.so.5.7
349f21d000-349f41d000 ---p 0001d000 fd:00 264681                         /lib64/libtinfo.so.5.7
349f41d000-349f421000 rw-p 0001d000 fd:00 264681                         /lib64/libtinfo.so.5.7
7fef3df73000-7fef3df78000 rw-p 00000000 00:00 0 
7fef3df78000-7fef3dfae000 r-xp 00000000 fd:00 259254                     /lib64/libdevmapper.so.1.02
7fef3dfae000-7fef3e1ae000 ---p 00036000 fd:00 259254                     /lib64/libdevmapper.so.1.02
7fef3e1ae000-7fef3e1b1000 rw-p 00036000 fd:00 259254                     /lib64/libdevmapper.so.1.02
7fef3e1b1000-7fef3e1b2000 rw-p 00000000 00:00 0 
7fef3e1b2000-7fef3e1b7000 r-xp 00000000 fd:00 259338                     /lib64/libdevmapper-event.so.1.02
7fef3e1b7000-7fef3e3b6000 ---p 00005000 fd:00 259338                     /lib64/libdevmapper-event.so.1.02
7fef3e3b6000-7fef3e3b7000 rw-p 00004000 fd:00 259338                     /lib64/libdevmapper-event.so.1.02
7fef3e3b7000-7fef3e3b8000 rw-p 00000000 00:00 0 
7fef3e3ce000-7fef3e3d0000 rw-p 00000000 00:00 0 
7fff09294000-7fff092a9000 rw-p 00000000 00:00 0                          [stack]
7fff092ec000-7fff092ed000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted (core dumped)

Comment 9 Nenad Peric 2012-12-12 12:19:33 UTC
Acking with test suggestions from Comment #7 and Comment #8

Comment 10 Petr Rockai 2012-12-12 12:23:45 UTC
I have pushed the fix upstream. The relevant commitid is f14f2d4378c066e50e2420e30407c8673cc7112a.

Comment 13 Nenad Peric 2013-01-22 13:01:56 UTC
Ran combinations of the tests suggested in Comment #7 and did not run into any issues.
Since bug 886695 and bug 889454 are closed and verified, after performing the tests suggested in the above comments, I am marking this one as verified as well.

verified with:

lvm2-2.02.98-8.el6.x86_64
lvm2-libs-2.02.98-8.el6.x86_64
device-mapper-1.02.77-8.el6.x86_64
udev-147-2.46.el6.x86_64
kernel-2.6.32-355.el6.x86_64

Comment 14 errata-xmlrpc 2013-02-21 08:15:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html

