Bug 885811
| Summary: | [lvmetad] segfaulting with upstream LVM2 (w/ patch attached) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Marian Csontos <mcsontos> |
| Component: | lvm2 | Assignee: | Petr Rockai <prockai> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | CC: | agk, cmarthal, dwysocha, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, tlavigne, zkabelac |
| Version: | 6.4 | Keywords: | TestBlocker |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Fixed In Version: | lvm2-2.02.98-5.el6 |
| Doc Type: | Bug Fix | Story Points: | --- |
| Doc Text: | A problem in the VG auto-activation code could cause crashes when a PV without MDAs triggered the activation. The code has been updated to use cached VG metadata instead of relying on the (in this case absent) MDA content of the last discovered PV. | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-21 08:15:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Bug Depends On: | | Bug Blocks: | 888457 |
| Attachments: | Proposed fix (attachment 660969, see the description below) | | |
The patch fixed the segfault, but the tests are still failing because the LV is not deactivated by lvchange -an LV. Investigating.

Proposed patch merged upstream: https://www.redhat.com/archives/lvm-devel/2012-December/msg00021.html

Assigning to Petr to track down the real reason for the "Internal error" code path. It might be worth running the STS tests with the lvm.conf setting global/abort_on_internal_errors = 1.

Full backtrace:
(gdb) bt full
#0 0x00007fc07e9bb8a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
resultvar = 0
pid = <value optimized out>
selftid = 6170
#1 0x00007fc07e9bd085 in abort () at abort.c:92
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x7fff034ebc4e, sa_sigaction = 0x7fff034ebc4e}, sa_mask = {__val = {140733248875600, 140733248879694, 27283200, 5161767, 1, 140464734403416, 140464751448064,
140733248880928, 4294967295, 206158430248, 1, 7435432, 0, 0, 0, 0}}, sa_flags = 2138877904, sa_restorer = 0x1}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x0000000000475a76 in print_log (level=2, file=0x4fd9ae "cache/lvmetad.c", line=728, dm_errno=-1, format=0x4fde90 "Internal error: Handler needs existing VG.") at log/log.c:348
ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fff034ec130, reg_save_area = 0x7fff034ec060}}
buf = "\340N\240\001\000\000\000\000\242R\240\001\000\000\000\000\260\274N\003\377\177\000\060\235\236\370~\300\177\000\000\000\000\000\000\000\000\000\060@\276N\003\b\000\000\000(\000\000\000\000\000\000\000\220N\240\001\002\000\000\000\030\000\000\000\060\000\000\000 \300N\003\377\177\000\000P\277N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\242R\240\001\000\000\000\000\020\275N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000\060\275N\003\377\177\000\000\235\236\370~\300\177\000\000 \275N\003\377\177\000\000;\246\370~\b\000\000\000(\000\000\000\000\000\000\000\220N\240\001\002\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000p\275N\003\377\177\000\000\235\236\370~\b\000\000\000\340N\240\001\000\000\000\000\252S\240\001\000\000\000\000\220\275N\003\377\177\000\000\235\236\370~\300\177\000\000\200\275N\003\377\177\000\000\353\246\370~\b\000\000\000"...
locn = "\000nternal error: Handler needs existing VG.\000\000\000\000\000\000\004\070}\037", '\000' <repeats 28 times>, "\002\000\000\000\000\000\000\000\300\264N\003\377\177\000\000\377\377\377\377/devices/virtual/blo\300K\321~\300\177\000\000`h\321~\300\177\000\000get8:0:0/8:0:0:1/block/s\200՞~\300\177\000\000`h\321~\300\177\000\000\260(\233\001", '\000' <repeats 20 times>, "`h\321~\300\177\000\000\000\264N\003\377\177\000\000\207(\255\373\000\000\000\000@\264N\003\377\177\000\000\236Ԝ~\300\177\000\000 \271\233~\300\177\000\000\002\000\000\000\000\000\000\000\060\200\237\001", '\000' <repeats 12 times>"\210"...
bufused = 0
n = 42
message = 0x7fff034eac50 ""
trformat = 0x4fde90 "Internal error: Handler needs existing VG."
newbuf = 0x19f9c50 "Internal error: Handler needs existing VG."
use_stderr = 0
log_once = 0
fatal_internal_error = 1
msglen = 42
#3 0x00000000004cb104 in lvmetad_pv_found (pvid=0x1a0fa28, device=0x1a0f9d8, fmt=0x19c85e0, label_sector=1, vg=0x0, handler=0x434be5 <_auto_activation_handler>) at cache/lvmetad.c:728
uuid = "jdf3sY-uFzI-jMBS-87Ct-UyDn-Wdcd-PZcywz\000\000:\236\370~\300\177\000\000\000\000\000\000\000\000\000\000\220N\240\001\000\000\000"
reply = {error = 0, buffer = {allocated = 2138877904, used = 0, mem = 0x1 <Address 0x1 out of bounds>}, cft = 0x0}
info = 0x1a042c0
pvmeta = 0x1a04f00
vgmeta = 0x7fff034ec230
status = 0x19c85e0 "\240c\232\001"
result = 0
#4 0x00000000004cb8bb in lvmetad_pvscan_single (cmd=0x19a6380, dev=0x1a0f9d8, handler=0x434be5 <_auto_activation_handler>) at cache/lvmetad.c:862
label = 0x19fe1d0
info = 0x1a042c0
baton = {vg = 0x0, fid = 0x19f7df0}
fic = {type = 0, context = {pv_id = 0x0, vg_ref = {vg_name = 0x0, vg_id = 0x0}, private = 0x0}}
#5 0x000000000043512f in _pvscan_lvmetad (cmd=0x19a6380, argc=-1, argv=0x7fff034ec650) at pvscan.c:216
ret = 1
dev = 0x1a0f9d8
pv_name = 0x0
major = 253
minor = 3
devno_args = 1
current_group = 0x19f5de0
devno = 64771
buf = 0x0
handler = 0x434be5 <_auto_activation_handler>
#6 0x000000000043523e in pvscan (cmd=0x19a6380, argc=0, argv=0x7fff034ec650) at pvscan.c:248
new_pvs_found = 0
pvs_found = 0
pvslist = 0x100000000
pvl = 0x19afe60
pv = 0x72d2e0
size_total = 0
size_new = 0
len = 0
#7 0x0000000000429144 in lvm_run_command (cmd=0x19a6380, argc=0, argv=0x7fff034ec650) at lvmcmdline.c:1120
ret = 0
locking_type = -1
monitoring = 1
old_cft = 0xffff
#8 0x000000000042a5e7 in lvm2_main (argc=8, argv=0x7fff034ec610) at lvmcmdline.c:1554
base = 0x7fff034edc83 "lvm"
ret = 0
alias = 0
cmd = 0x19a6380
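Reading the frames above together: frame #3 shows lvmetad_pv_found() being called with vg=0x0 while an activation handler (_auto_activation_handler) is set, and frame #2 shows print_log() aborting because fatal_internal_error = 1 (i.e. the abort_on_internal_errors behaviour suggested earlier was in effect, turning "Internal error: Handler needs existing VG." into an abort()). A minimal sketch of this failure pattern - with simplified, hypothetical types and names, not the actual LVM2 code - looks like this:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real LVM2 types. */
struct volume_group { const char *name; };
typedef int (*activation_handler)(struct volume_group *vg, int partial, int activate);

static int abort_on_internal_errors = 1;   /* global/abort_on_internal_errors = 1 */

static void internal_error(const char *msg)
{
    fprintf(stderr, "Internal error: %s\n", msg);
    if (abort_on_internal_errors)
        abort();                           /* frames #0-#2: abort() via print_log */
}

/* Simplified shape of the failing path in frame #3: when the scanned PV
 * carries no MDA, the caller has no parsed VG and passes vg == NULL,
 * yet the handler interface requires an existing VG struct. */
static int pv_found_notify(struct volume_group *vg, activation_handler handler)
{
    if (handler) {
        if (!vg) {
            internal_error("Handler needs existing VG.");
            return 0;
        }
        return handler(vg, 0, 1);
    }
    return 1;
}

static int dummy_handler(struct volume_group *vg, int partial, int activate)
{
    (void)partial; (void)activate;
    printf("activating VG %s\n", vg->name);
    return 1;
}

int main(void)
{
    /* Mimics frame #3: handler set, but vg == NULL because the scanned
     * PV carried no MDA from which a VG could be parsed. */
    return pv_found_notify(NULL, dummy_handler) ? 0 : 1;
}

With abort_on_internal_errors unset, the same path would only log the message; with it set, the process aborts exactly as in frames #0-#2 above.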
When this happens in lvmetad, the rest of the system seems to carry on as if nothing had happened, so there is little use in running lvmdump, as we may be catching something unrelated.
Now I am thoroughly confused. You say lvmetad has crashed, but the backtrace is certainly not from lvmetad, since lvmetad does not even link to lib/. Is this pvscan --cache crashing? Getting the exact command line and some context (what the device was, etc.) would definitely be useful. An lvmdump is definitely relevant; this is not really an lvmetad bug, more like something bad happening in pvscan --cache, which passes in a NULL VG pointer for whatever reason when a non-NULL VG is expected. Presumably k1ZXEF-0HbB-BoV6-J2GR-8Xku-f3pU-IeFY3F is the PV that triggers the problem in the original report, but a backtrace along with the command line of the pvscan --cache that crashed, plus an lvmdump from around that time, would definitely be useful. (If you still have the crashing version, I think the command that crashed will be mentioned in the system logs.)

So I think we have found the root cause of the problem. The activation_handler needs a VG struct:

typedef int (*activation_handler) (struct volume_group *vg, int partial, enum activation_change activate);

BUT, if the PV that makes the VG complete and ready for autoactivation has no MDA, the _lvmetad_pv_found fn does not have any struct volume_group to use; it is NULL (passed into _lvmetad_pv_found), and that NULL value would be passed to the activation handler as well. Mornfall is preparing a patch that will pass the vgid to the handler instead, and then we will get the struct volume_group inside the handler (a sketch of this idea appears after the closing comments below).

Reproducer (make sure you have use_lvmetad=1 and lvmetad is running!):

[root@rhel6-a ~]# service lvm2-lvmetad start
Starting LVM metadata daemon: [ OK ]
[root@rhel6-a ~]# pvcreate /dev/sda
Physical volume "/dev/sda" successfully created
[root@rhel6-a ~]# pvcreate --metadatacopies 0 /dev/sdb
Physical volume "/dev/sdb" successfully created
[root@rhel6-a ~]# vgcreate vg /dev/sda /dev/sdb
Volume group "vg" successfully created
[root@rhel6-a ~]# pvscan --cache -aay /dev/sda
0 logical volume(s) in volume group "vg" now active
[root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
Internal error: Handler needs existing VG.
Update of lvmetad failed. This is a serious problem.
It is strongly recommended that you restart lvmetad immediately.

(In reply to comment #7)
> [root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
> Internal error: Handler needs existing VG.
> Update of lvmetad failed. This is a serious problem.
> It is strongly recommended that you restart lvmetad immediately.

This is with current upstream (without the latest patch, which fixes it). And with the latest RHEL 6.4 build, lvm2-2.02.98-4.el6 (the original problem reported in this BZ):

[root@rhel6-a ~]# pvscan --cache -aay /dev/sdb
Internal error: Handler needs existing VG.
*** glibc detected *** pvscan: free(): invalid next size (fast): 0x00000000020b3f70 ***
======= Backtrace: =========
/lib64/libc.so.6[0x348e2760e6]
/lib64/libc.so.6[0x348e278c13]
pvscan(lvmetad_pv_found+0x3a1)[0x4a31f1]
pvscan(lvmetad_pvscan_single+0xf0)[0x4a27f0]
pvscan(pvscan+0x6ce)[0x42e8ee]
pvscan(lvm_run_command+0xbbc)[0x42256c]
pvscan(lvm2_main+0x6a8)[0x4251d8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x348e21ecdd]
pvscan[0x4145b9]
======= Memory map: ========
00400000-004e8000 r-xp 00000000 fd:00 259506 /sbin/lvm
006e8000-006f1000 rw-p 000e8000 fd:00 259506 /sbin/lvm
006f1000-0070e000 rw-p 00000000 00:00 0
01481000-020c6000 rw-p 00000000 00:00 0 [heap]
348de00000-348de20000 r-xp 00000000 fd:00 259111 /lib64/ld-2.12.so
348e01f000-348e020000 r--p 0001f000 fd:00 259111 /lib64/ld-2.12.so
348e020000-348e021000 rw-p 00020000 fd:00 259111 /lib64/ld-2.12.so
348e021000-348e022000 rw-p 00000000 00:00 0
348e200000-348e38a000 r-xp 00000000 fd:00 259125 /lib64/libc-2.12.so
348e38a000-348e589000 ---p 0018a000 fd:00 259125 /lib64/libc-2.12.so
348e589000-348e58d000 r--p 00189000 fd:00 259125 /lib64/libc-2.12.so
348e58d000-348e58e000 rw-p 0018d000 fd:00 259125 /lib64/libc-2.12.so
348e58e000-348e593000 rw-p 00000000 00:00 0
348ea00000-348ea02000 r-xp 00000000 fd:00 264639 /lib64/libdl-2.12.so
348ea02000-348ec02000 ---p 00002000 fd:00 264639 /lib64/libdl-2.12.so
348ec02000-348ec03000 r--p 00002000 fd:00 264639 /lib64/libdl-2.12.so
348ec03000-348ec04000 rw-p 00003000 fd:00 264639 /lib64/libdl-2.12.so
348fa00000-348fa3a000 r-xp 00000000 fd:00 263709 /lib64/libreadline.so.6.0
348fa3a000-348fc3a000 ---p 0003a000 fd:00 263709 /lib64/libreadline.so.6.0
348fc3a000-348fc42000 rw-p 0003a000 fd:00 263709 /lib64/libreadline.so.6.0
348fc42000-348fc43000 rw-p 00000000 00:00 0
348fe00000-348fe1d000 r-xp 00000000 fd:00 264643 /lib64/libselinux.so.1
348fe1d000-349001c000 ---p 0001d000 fd:00 264643 /lib64/libselinux.so.1
349001c000-349001d000 r--p 0001c000 fd:00 264643 /lib64/libselinux.so.1
349001d000-349001e000 rw-p 0001d000 fd:00 264643 /lib64/libselinux.so.1
349001e000-349001f000 rw-p 00000000 00:00 0
3498200000-349823b000 r-xp 00000000 fd:00 264673 /lib64/libsepol.so.1
349823b000-349843b000 ---p 0003b000 fd:00 264673 /lib64/libsepol.so.1
349843b000-349843c000 r--p 0003b000 fd:00 264673 /lib64/libsepol.so.1
349843c000-349843d000 rw-p 0003c000 fd:00 264673 /lib64/libsepol.so.1
349a200000-349a216000 r-xp 00000000 fd:00 264645 /lib64/libgcc_s-4.4.7-20120601.so.1
349a216000-349a415000 ---p 00016000 fd:00 264645 /lib64/libgcc_s-4.4.7-20120601.so.1
349a415000-349a416000 rw-p 00015000 fd:00 264645 /lib64/libgcc_s-4.4.7-20120601.so.1
349b600000-349b60c000 r-xp 00000000 fd:00 264674 /lib64/libudev.so.0.5.1
349b60c000-349b80c000 ---p 0000c000 fd:00 264674 /lib64/libudev.so.0.5.1
349b80c000-349b80d000 r--p 0000c000 fd:00 264674 /lib64/libudev.so.0.5.1
349b80d000-349b80e000 rw-p 0000d000 fd:00 264674 /lib64/libudev.so.0.5.1
349f200000-349f21d000 r-xp 00000000 fd:00 264681 /lib64/libtinfo.so.5.7
349f21d000-349f41d000 ---p 0001d000 fd:00 264681 /lib64/libtinfo.so.5.7
349f41d000-349f421000 rw-p 0001d000 fd:00 264681 /lib64/libtinfo.so.5.7
7fef3df73000-7fef3df78000 rw-p 00000000 00:00 0
7fef3df78000-7fef3dfae000 r-xp 00000000 fd:00 259254 /lib64/libdevmapper.so.1.02
7fef3dfae000-7fef3e1ae000 ---p 00036000 fd:00 259254 /lib64/libdevmapper.so.1.02
7fef3e1ae000-7fef3e1b1000 rw-p 00036000 fd:00 259254 /lib64/libdevmapper.so.1.02
7fef3e1b1000-7fef3e1b2000 rw-p 00000000 00:00 0
7fef3e1b2000-7fef3e1b7000 r-xp 00000000 fd:00 259338 /lib64/libdevmapper-event.so.1.02
7fef3e1b7000-7fef3e3b6000 ---p 00005000 fd:00 259338 /lib64/libdevmapper-event.so.1.02
7fef3e3b6000-7fef3e3b7000 rw-p 00004000 fd:00 259338 /lib64/libdevmapper-event.so.1.02
7fef3e3b7000-7fef3e3b8000 rw-p 00000000 00:00 0
7fef3e3ce000-7fef3e3d0000 rw-p 00000000 00:00 0
7fff09294000-7fff092a9000 rw-p 00000000 00:00 0 [stack]
7fff092ec000-7fff092ed000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)

Acking with test suggestions from Comment #7 and Comment #8.

I have pushed the fix upstream. The relevant commitid is f14f2d4378c066e50e2420e30407c8673cc7112a.

Ran combinations of the suggested tests in Comment #7 and did not run into any issue. Since bug 886695 and bug 889454 are closed and verified, after performing the tests suggested in the above comments, I am marking this one as verified as well.

Verified with:
lvm2-2.02.98-8.el6.x86_64
lvm2-libs-2.02.98-8.el6.x86_64
device-mapper-1.02.77-8.el6.x86_64
udev-147-2.46.el6.x86_64
kernel-2.6.32-355.el6.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html
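To make the direction of the fix concrete (the vgid-based handler mentioned in the root-cause comment above, and the cached-metadata lookup described in the Doc Text): below is a rough sketch of the idea, with hypothetical names and simplified types - not the code of commit f14f2d43 itself.

#include <stdio.h>
#include <string.h>

struct volume_group { char vgid[40]; };

/* Hypothetical stand-in for a lookup of complete, cached VG metadata
 * from lvmetad; the cache holds the full VG even when the PV that
 * just completed it carries no MDA of its own. */
static struct volume_group *cached_vg_lookup(const char *vgid)
{
    static struct volume_group vg;
    strncpy(vg.vgid, vgid, sizeof(vg.vgid) - 1);
    return &vg;
}

/* Reworked handler interface: it receives a vgid instead of a
 * (possibly NULL) struct volume_group pointer, so a PV without MDAs
 * can still trigger auto-activation. */
typedef int (*activation_handler)(const char *vgid, int partial, int activate);

static int auto_activation_handler(const char *vgid, int partial, int activate)
{
    struct volume_group *vg;

    (void)partial; (void)activate;
    vg = cached_vg_lookup(vgid);   /* resolved inside the handler; no NULL hand-off */
    if (!vg)
        return 0;
    printf("auto-activating VG %s\n", vg->vgid);
    return 1;
}

int main(void)
{
    /* Only the vgid is known when the completing PV has no MDA. */
    return auto_activation_handler("vg-uuid-here", 0, 1) ? 0 : 1;
}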
Created attachment 660969 [details]
Proposed fix

Description of problem:
lvmetad is frequently crashing in most of the STS test scenarios, leaving behind VGs which cannot be deactivated. An incorrect function is used to release memory when handling the error path. I am not sure which commit was the trigger that uncovered the hidden bug, which has been there since June.

Version-Release number of selected component (if applicable):
lvm2-2.02.98-4.el6

How reproducible:
High (in terms of STS failures).

Steps to Reproduce:
1. Run the STS test suite - we do not have a better reproducer yet - prockai or zkabelac are likely to identify one by looking at the backtrace...

Actual results:
lvmetad core dumping.

Expected results:
No crash.

Additional info:
It is always the dm_free call at cache/lvmetad.c:729 crashing:

#5 0x00000000004a3181 in lvmetad_pv_found (pvid=<value optimized out>, device=0x11dec78, fmt=<value optimized out>, label_sector=1, vg=0x0, handler=0x42dd10 <_auto_activation_handler>) at cache/lvmetad.c:729
        uuid = "k1ZXEF-0HbB-BoV6-J2GR-8Xku-f3pU-IeFY3F\000\000p2\036\001\000\000\000\000\200", '\000' <repeats 14 times>
        reply = {error = 0, buffer = {allocated = 18371456, used = 0, mem = 0x0}, cft = 0x7fd877840fd0}
        info = <value optimized out>
        pvmeta = 0x11e3e10
        vgmeta = <value optimized out>
        status = <value optimized out>
        result = <value optimized out>

See the attached patch for the proposed fix.
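As background for the "free(): invalid next size (fast)" abort in comment #8: that glibc message means the allocator found a corrupted neighbouring chunk header while freeing a small (fastbin-sized) block - the typical result of releasing memory through the wrong function or writing past an allocation, which matches the description above of an incorrect function being used to release memory on the error path. A small, self-contained, deliberately buggy demonstration of the message (not the LVM2 code; on RHEL 6's glibc 2.12 this aborts at free(); newer glibc versions may detect it differently):

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(16);  /* small (fastbin-sized) chunk */
    char *b = malloc(16);  /* its neighbour on the heap */

    if (!a || !b)
        return 1;

    /* Writing past 'a' clobbers the size field of the next chunk... */
    memset(a, 0, 64);

    /* ...so glibc's consistency check fires here:
     * "free(): invalid next size (fast)", followed by abort(). */
    free(a);
    free(b);
    return 0;
}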