Bug 889361
Summary: lvmetad dies (general protection error in libc-2.12.so) during mirror conversion regression testing
Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Petr Rockai <prockai>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ddumas, dwysocha, heinzm, jbrassow, jharriga, mcsontos, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2013-07-10 21:24:12 UTC
Bug Blocks: 960054
Description
Corey Marthaler
2012-12-20 22:38:43 UTC
Created attachment 666979 [details]
log during lvmetad failure
Also saw this while converting from 3-way corelog to 2-way disklog.

Dec 20 15:17:35 hayes-01 qarshd[5599]: Running cmdline: lvcreate -n centi_base --mirrorlog core -m 2 -L 700M centipede
Dec 20 15:17:36 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
[...]
Dec 20 15:18:06 hayes-01 lvm[3104]: centipede-centi_base is now in-sync.
[...]
Dec 20 15:18:14 hayes-01 qarshd[5676]: Running cmdline: lvconvert --mirrorlog disk -m 1 centipede/centi_base
Dec 20 15:18:14 hayes-01 lvm[3104]: No longer monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:14 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:25 hayes-01 lvm[3104]: No longer monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:25 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:26 hayes-01 xinetd[1607]: EXIT: qarsh status=0 pid=5676 duration=12(sec)
Dec 20 15:18:27 hayes-01 kernel: lvmetad[5751] general protection ip:37af476205 sp:7fcd0a1fba80 error:0 in libc-2.12.so[37af400000+18a000]
Dec 20 15:18:27 hayes-01 udevd-work[5685]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 280' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5682]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 281' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5725]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 276' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5726]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 275' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5730]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 273' unexpected exit with status 0x000d

Nenad is running these tests to see if he can reproduce the bug and get a corresponding backtrace.
Upstream tests already cover these scenarios, but I haven't seen them trigger this bug yet. Maybe something STS-specific is playing a role here. I'll mark this as needinfo because I can't reproduce it and I need more data to debug; hopefully Nenad has more luck.

BZ removed NEEDINFO when I added myself to CC; I was not expecting it to remove the flag without a comment being added. I have re-added it until we manage to get more data.

I think I finally have something to go on here. I was able to reproduce this on two different machines yesterday. Let me know if there is more information that you'd like now that we have core files.

Jan 8 11:36:49 taft-01 kernel: lvmetad[9475] general protection ip:30b6676205 sp:7fce2ebfca80 error:0 in libc-2.12.so[30b6600000+18a000]

Core was generated by `lvmetad'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000030b6676205 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x00000030b6676205 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00000030b6679385 in _int_malloc () from /lib64/libc.so.6
#2  0x00000030b667a911 in malloc () from /lib64/libc.so.6
#3  0x0000003c57e231e4 in _new_chunk (p=0x7fce380014c0, s=16384) at mm/pool-fast.c:281
#4  0x0000003c57e23389 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:101
#5  0x0000003c57e23496 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x0000003c57e20564 in dm_config_create () at libdm-config.c:102
#7  0x0000003c57e2237e in dm_config_from_string (config_settings=0x7fce38001150 "request = \"hello\"\n") at libdm-config.c:181
#8  0x000000000040596f in client_thread (baton=0x2ed6d10) at daemon-server.c:382
#9  0x00000030b6a07851 in start_thread () from /lib64/libpthread.so.0
#10 0x00000030b66e890d in clone () from /lib64/libc.so.6

Jan 9 03:52:29 qalvm-01 kernel: lvmetad[7906] general protection ip:3cc5e76205 sp:7fdef6d79a80 error:0 in libc-2.12.so[3cc5e00000+18a000]

#0  0x0000003cc5e76205 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cc5e76205 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003cc5e79385 in _int_malloc () from /lib64/libc.so.6
#2  0x0000003cc5e7a911 in malloc () from /lib64/libc.so.6
#3  0x000000314ca231e4 in _new_chunk (p=0x7fded000cdf0, s=16384) at mm/pool-fast.c:281
#4  0x000000314ca23389 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:101
#5  0x000000314ca23496 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x000000314ca20564 in dm_config_create () at libdm-config.c:102
#7  0x000000314ca2237e in dm_config_from_string (config_settings=0x7fded000ce40 "request = \"hello\"\n") at libdm-config.c:181
#8  0x000000000040596f in client_thread (baton=0x2b07f40) at daemon-server.c:382
#9  0x0000003cc6207851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003cc5ee890d in clone () from /lib64/libc.so.6

Hm. This doesn't look so good. I wonder if there's a chance to get this running in valgrind. Basically, the core will only tell us that something corrupted the heap, which sadly isn't enough to fix the bug. I'll run some tests with valgrind myself: even though we don't see crashes, valgrind might find something. One thing that would be quite useful to know is whether the lvmetad instance in question was running only for the duration of that single mirror test, or is being re-used for multiple tests.

Unfortunately the valgrind testing needs a separate configure build option, which is not enabled in a normal build; otherwise the test will usually abort on memory checking.

Re comment 7: I don't understand, what do you mean? Running lvmetad in valgrind is perfectly OK in any build config. And since this is a libc heap corruption, valgrind should certainly detect it, even without pool support, right? Anyway, I was unable to reproduce this and hence need more info (i.e. said valgrind output might get me going).
This problem is very hard to track down and reproduce; setting the devel conditional nak (reproducer needed). Considering where we are in the 6.4 schedule, it seems this will be moved to 6.5.

This may be related to a low-memory situation; see Bug 889651 for details. I have always seen the

kernel: lvmetad[7906] general protection ip:3cc5e76205 sp:7fdef6d79a80 error:0 in libc-2.12.so[3cc5e00000+18a000]

message accompanied by a failure in dmeventd, as reported in the above bug.

Hit this issue when attempting to test for bug 862253.

# run 20 pvs in parallel:
[...]
Skipping volume group raid_sanity
Volume group "raid_sanity" not found
Request to list PVs in lvmetad gave response Connection reset by peer.

Jan 24 15:30:10 hayes-03 kernel: lvmetad[25889] general protection ip:3a97a76205 sp:7f1b1cdf9b90 error:0 in libc-2.12.so[3a97a00000+18a000]

[root@hayes-03 ~]# pvscan
  WARNING: Failed to connect to lvmetad: Connection refused. Falling back to internal scanning.
  PV /dev/etherd/e1.1p1    VG raid_sanity   lvm2 [908.23 GiB / 907.93 GiB free]
  PV /dev/etherd/e1.1p10   VG raid_sanity   lvm2 [908.23 GiB / 908.23 GiB free]
  PV /dev/etherd/e1.1p2    VG raid_sanity   lvm2 [908.23 GiB / 908.23 GiB free]

Closing: unable to reproduce. A well-hidden bug in buffering has since been fixed; it could plausibly have been the source of these hard-to-reproduce problems, so I assume this should now be OK. Here is the related patch: https://www.redhat.com/archives/lvm-devel/2013-June/msg00416.html

This bug may still exist; I filed new bug 1016322 for what appears to be a similar instance of this bug.