Bug 1016322
| Summary: | lvmetad dies (general protection error in libc-2.12.so) during pv creation | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Petr Rockai <prockai> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.5 | CC: | agk, cmarthal, dwysocha, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, tlavigne, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.100-6.el6 | Doc Type: | Bug Fix |
| Doc Text: | Cause: A race condition in pool destruction in libdevmapper.so. Consequence: lvmetad would sometimes crash due to heap corruption, especially under heavier concurrent loads (multiple LVM commands executing at once). Fix: The race condition has been fixed by introducing correct locking. Result: lvmetad no longer suffers heap corruption and subsequent crashes. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-11-21 23:29:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2013-10-07 22:50:19 UTC
lvmetad must not crash.

Reproduced again; however, I'm still trying to get a valid core file. I'm not sure what the deal is with the "Package 'lvm2' isn't signed" issue.

Oct 8 04:03:06 hayes-03 qarshd[4108]: Running cmdline: pvremove /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct 8 04:03:06 hayes-03 qarshd[4121]: Running cmdline: pvcreate --setphysicalvolumesize 300M /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct 8 04:03:07 hayes-03 kernel: __ratelimit: 127 callbacks suppressed
Oct 8 04:03:07 hayes-03 kernel: lvmetad[4133]: segfault at 7f3fe963b568 ip 0000003be1076205 sp 00007f0a330a5a80 error 4 in libc-2.12.so[3be1000000+18a000]
Oct 8 04:03:08 hayes-03 abrtd: Directory 'ccpp-2013-10-08-04:03:07-16471' creation detected
Oct 8 04:03:08 hayes-03 abrt[4154]: Saved core dump of pid 16471 (/sbin/lvmetad) to /var/spool/abrt/ccpp-2013-10-08-04:03:07-16471 (155684864 bytes)
Oct 8 04:03:09 hayes-03 abrtd: Package 'lvm2' isn't signed with proper key
Oct 8 04:03:09 hayes-03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471' exited with 1
Oct 8 04:03:09 hayes-03 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471', deleting

I have fixed a few invalid memory reads and a possible race in lvmetad, git 830d233..808a5d9 -- any of those could in theory cause a SEGV. A core would help to confirm whether any of them was the culprit. Either way, those fixes are worthwhile even if we haven't seen any directly related crashes, so the next build should contain them. I am tentatively bumping this to POST so that the commits get pulled into a build. More testing is needed, though.

Here we go (I had forgotten to set "no" in abrt-action-save-package-data.conf):

Core was generated by `lvmetad'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.128.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libsepol-2.0.41-4.el6.x86_64 libudev-147-2.48.el6.x86_64
(gdb) bt
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007f6cacf9c405 in _int_malloc () from /lib64/libc.so.6
#2  0x00007f6cacf9d991 in malloc () from /lib64/libc.so.6
#3  0x00007f6cad4f9044 in _new_chunk (p=0x7f6ca8002d70, s=16384) at mm/pool-fast.c:282
#4  0x00007f6cad4f91e9 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:102
#5  0x00007f6cad4f92f6 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x00007f6cad4f61b4 in dm_config_create () at libdm-config.c:103
#7  0x00007f6cad4f81de in dm_config_from_string (config_settings=0x7f6ca8001110 "request = \"hello\"\n") at libdm-config.c:180
#8  0x0000000000405baf in client_thread (baton=0x303e530) at daemon-server.c:390
#9  0x00007f6cad2be9d1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f6cad00ba8d in clone () from /lib64/libc.so.6

Hm. This is rather unfortunate, as it seems this is a write corruption, not a read problem, and as such it is far less likely to be fixed by the patches referenced above. Unfortunately, the backtrace itself is of little use beyond telling us that it is a different problem.
One of two things would be very helpful in figuring the problem out:

- If you could stop lvmetad before running the offending test, re-start it manually within valgrind (valgrind --num-callers=32 lvmetad -f), then run the failing test(s) and post all the output valgrind produced (you can redirect everything to a file to make it easier, with >& file).
- Alternatively, a list of all LVM commands that have been executed from the point lvmetad started up until the crash (if at all possible, also including commands executed by udev), ideally with timestamps.

Unfortunately, none of our regression tests triggers the problem and I can't reproduce it either. (Sorry, the needinfo request was lost in a bugzilla collision.)

I have just noticed this:

Recreating PVs/VG with smaller sizes
No physical volume found in lvmetad cache for /dev/sdc6
Physical volume /dev/sdc6 not found

Corey, do you run a succession of tests here? Can you try to run the test in isolation (restarting lvmetad just before)? A list of tests that lead to the segfault would be quite useful as well. It's probably not this particular test that trips the problem, but some earlier one, and this one just gets the flak. I have STS sources now and I can probably extract command lists from that, for particular tests.

This is the minimum that's required to cause this. It usually takes anywhere from 20-40 iterations and never happens on the first iteration:
# $PVS = a list of random devices (in my case 6)
# $VG  = a VG created out of those 6 devices using their full size
while true; do
    # Recreating PVs/VG with smaller sizes
    vgremove $VG
    pvremove $PVS
    pvcreate --setphysicalvolumesize 500M $PVS
    vgcreate $VG $PVS

    # Restoring VG to default extent size
    vgremove $VG
    pvremove $PVS
    pvcreate $PVS
    vgcreate $VG $PVS
done
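
The backtrace above bottoms out in libdevmapper's pool allocator, so the same class of race can in principle be exercised without the whole LVM stack: a small program that hammers the public dm_pool_create()/dm_pool_zalloc()/dm_pool_destroy() calls from several threads mimics what concurrent lvmetad client threads do. The following is only an illustrative sketch (it was not used in this report, and the thread and iteration counts are arbitrary); on a libdevmapper build without the locking fix it would be expected to crash or show valgrind errors eventually, while a fixed build should loop to completion.

```c
/* Hypothetical standalone stress test for concurrent pool create/destroy.
 * Build (library names may vary): gcc pool-race.c -o pool-race -ldevmapper -lpthread
 */
#include <libdevmapper.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS    8
#define ITERATIONS 100000

static void *worker(void *arg)
{
	int i;

	(void) arg;
	for (i = 0; i < ITERATIONS; i++) {
		/* one short-lived pool per iteration, much like dm_config_from_string() uses */
		struct dm_pool *mem = dm_pool_create("race-test", 1024);

		if (!mem) {
			fprintf(stderr, "dm_pool_create failed\n");
			exit(1);
		}
		if (!dm_pool_zalloc(mem, 32) || !dm_pool_zalloc(mem, 64)) {
			fprintf(stderr, "dm_pool_zalloc failed\n");
			exit(1);
		}
		dm_pool_destroy(mem);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];
	int i;

	for (i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);

	printf("completed %d x %d pool create/destroy cycles without crashing\n",
	       THREADS, ITERATIONS);
	return 0;
}
```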
Tracked down to a data race in dm_pool_{create,destroy} and fixed upstream in b5aad86710be57833879ac0e8609021828949682.
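
To make that one-line diagnosis a bit more concrete: libdevmapper keeps process-global bookkeeping of live pools (the kind of registry that leak-checking helpers such as dm_pools_check_leaks() rely on), and if pool creation and destruction link and unlink entries in that registry without serialisation, two lvmetad client threads can corrupt the list and the surrounding heap, which matches the malloc_consolidate() crash above. The sketch below only shows the general shape of such a fix, with hypothetical names; it is not the actual upstream commit.

```c
/* Illustrative only: serialise updates to a shared registry of live pools.
 * The identifiers (pool_node, registry, registry_mutex, ...) are made up for
 * this sketch and do not correspond to the names used in libdevmapper.
 */
#include <pthread.h>

struct pool_node {                      /* a pool's link into the global registry */
	struct pool_node *next, *prev;
};

static struct pool_node registry = { &registry, &registry };   /* empty circular list */
static pthread_mutex_t registry_mutex = PTHREAD_MUTEX_INITIALIZER;

void register_pool(struct pool_node *p)         /* called on pool creation */
{
	pthread_mutex_lock(&registry_mutex);    /* without this lock, two threads    */
	p->next = registry.next;                /* can interleave these pointer      */
	p->prev = &registry;                    /* updates and corrupt the list and  */
	registry.next->prev = p;                /* neighbouring heap chunks          */
	registry.next = p;
	pthread_mutex_unlock(&registry_mutex);
}

void unregister_pool(struct pool_node *p)       /* called on pool destruction */
{
	pthread_mutex_lock(&registry_mutex);
	p->prev->next = p->next;
	p->next->prev = p->prev;
	pthread_mutex_unlock(&registry_mutex);
}
```

A single process-wide mutex is a reasonable shape for this kind of fix because registration only happens on pool create/destroy, which is rare compared to allocations from a pool, so the lock does not sit on a hot path.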
Fix verified in the latest build.

2.6.32-410.el6.x86_64
lvm2-2.02.100-6.el6                      BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-libs-2.02.100-6.el6                 BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-cluster-2.02.100-6.el6              BUILT: Wed Oct 16 07:26:00 CDT 2013
udev-147-2.50.el6                        BUILT: Fri Oct 11 05:58:10 CDT 2013
device-mapper-1.02.79-6.el6              BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-libs-1.02.79-6.el6         BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-1.02.79-6.el6        BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-libs-1.02.79-6.el6   BUILT: Wed Oct 16 07:26:00 CDT 2013
cmirror-2.02.100-6.el6                   BUILT: Wed Oct 16 07:26:00 CDT 2013

Recreating PVs/VG with smaller sizes ( i=20 )
  PV         VG            Fmt  Attr PSize   PFree
  /dev/sda2  vg_harding02  lvm2 a--   92.67g      0
  /dev/sdb1  harding       lvm2 a--  496.00m 496.00m
  /dev/sdb2  harding       lvm2 a--  496.00m 496.00m
  /dev/sdb3  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc1  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc2  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc3  harding       lvm2 a--  496.00m 496.00m

Restoring VG to default extent size
  PV         VG            Fmt  Attr PSize  PFree
  /dev/sda2  vg_harding02  lvm2 a--  92.67g     0
  /dev/sdb1  harding       lvm2 a--  18.63g 18.63g
  /dev/sdb2  harding       lvm2 a--  18.63g 18.63g
  /dev/sdb3  harding       lvm2 a--  18.63g 18.63g
  /dev/sdc1  harding       lvm2 a--  18.63g 18.63g
  /dev/sdc2  harding       lvm2 a--  18.63g 18.63g
  /dev/sdc3  harding       lvm2 a--  18.63g 18.63g

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1704.html