Red Hat Bugzilla – Bug 1016322
lvmetad dies (general protection error in libc-2.12.so) during pv creation
Last modified: 2015-07-04 03:02:30 EDT
Description of problem:
This may very well just be a reproduction of bug 889361. As you can see from the log, I missed getting an actual core due to an unsigned lvm2 package (?). I'll work to reproduce this and provide more information.

SCENARIO - [contiguous_opt_check]
Sanity check for the contiguous allocation option

Recreating PVs/VG with smaller sizes
  No physical volume found in lvmetad cache for /dev/sdc6
  Physical volume /dev/sdc6 not found

Attempting mirror creation without enough contiguous space
harding-02: lvcreate -m 1 -n contig_without -C y -L 400M mirror_sanity
  Insufficient suitable contiguous allocatable extents for logical volume contig_without: 200 more required

Attempting mirror creation with enough contiguous space
harding-02: lvcreate -m 1 -n contig_with -C y -L 200M mirror_sanity

Deactivating mirror contig_with... and removing

Restoring VG to default extent size
  Request to lookup PV /dev/sdb1 in lvmetad gave response Connection reset by peer.
  Request to update PV nIWg5x-U9Og-G02K-rwfK-igtN-EqAT-BBLBrX in lvmetad gave response Connection reset by peer.
  Failed to write physical volume "/dev/sdb1"
unable to create PVs

Oct 7 08:11:50 harding-02 qarshd[58203]: Running cmdline: pvcreate /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb5 /dev/sdb6 /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc5 /dev/sdc6
Oct 7 08:11:50 harding-02 kernel: __ratelimit: 212 callbacks suppressed
Oct 7 08:11:50 harding-02 kernel: lvmetad[58231] general protection ip:7fe0414f6285 sp:7fe0397faa80 error:0 in libc-2.12.so[7fe041480000+18a000]
Oct 7 08:11:50 harding-02 abrtd: Directory 'ccpp-2013-10-07-08:11:50-1862' creation detected
Oct 7 08:11:50 harding-02 abrt[58239]: Saved core dump of pid 1862 (/sbin/lvmetad) to /var/spool/abrt/ccpp-2013-10-07-08:11:50-1862 (134029312 bytes)
Oct 7 08:11:50 harding-02 abrtd: Package 'lvm2' isn't signed with proper key
Oct 7 08:11:50 harding-02 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-10-07-08:11:50-1862' exited with 1
Oct 7 08:11:50 harding-02 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2013-10-07-08:11:50-1862'

Version-Release number of selected component (if applicable):
2.6.32-410.el6.x86_64
lvm2-2.02.100-4.el6                        BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-libs-2.02.100-4.el6                   BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-cluster-2.02.100-4.el6                BUILT: Fri Sep 27 09:05:32 CDT 2013
udev-147-2.48.el6                          BUILT: Fri Aug 9 06:09:50 CDT 2013
device-mapper-1.02.79-4.el6                BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-libs-1.02.79-4.el6           BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-1.02.79-4.el6          BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-libs-1.02.79-4.el6     BUILT: Fri Sep 27 09:05:32 CDT 2013
cmirror-2.02.100-4.el6                     BUILT: Fri Sep 27 09:05:32 CDT 2013
lvmetad must not crash
Reproduced again; however, I'm still trying to get a valid core file. I'm not sure what the deal is with the "lvm2 package isn't signed" issue.

Oct 8 04:03:06 hayes-03 qarshd[4108]: Running cmdline: pvremove /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct 8 04:03:06 hayes-03 qarshd[4121]: Running cmdline: pvcreate --setphysicalvolumesize 300M /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct 8 04:03:07 hayes-03 kernel: __ratelimit: 127 callbacks suppressed
Oct 8 04:03:07 hayes-03 kernel: lvmetad[4133]: segfault at 7f3fe963b568 ip 0000003be1076205 sp 00007f0a330a5a80 error 4 in libc-2.12.so[3be1000000+18a000]
Oct 8 04:03:08 hayes-03 abrtd: Directory 'ccpp-2013-10-08-04:03:07-16471' creation detected
Oct 8 04:03:08 hayes-03 abrt[4154]: Saved core dump of pid 16471 (/sbin/lvmetad) to /var/spool/abrt/ccpp-2013-10-08-04:03:07-16471 (155684864 bytes)
Oct 8 04:03:09 hayes-03 abrtd: Package 'lvm2' isn't signed with proper key
Oct 8 04:03:09 hayes-03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471' exited with 1
Oct 8 04:03:09 hayes-03 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471', deleting
I have fixed a few invalid memory reads and a possible race in lvmetad, git 830d233..808a5d9 -- any of those could in theory cause a SEGV. A core could help to confirm whether any of those were the culprit. Else, I guess those fixes are worthwhile even if we didn't run into any directly related crashes, so maybe the next build will contain those. I tentatively bump this to POST so that the commits get pulled into a build. More testing needed though.
Here we go (I forgot to add "no" to abrt-action-save-package-data.conf):

Core was generated by `lvmetad'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.128.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libsepol-2.0.41-4.el6.x86_64 libudev-147-2.48.el6.x86_64
(gdb) bt
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007f6cacf9c405 in _int_malloc () from /lib64/libc.so.6
#2  0x00007f6cacf9d991 in malloc () from /lib64/libc.so.6
#3  0x00007f6cad4f9044 in _new_chunk (p=0x7f6ca8002d70, s=16384) at mm/pool-fast.c:282
#4  0x00007f6cad4f91e9 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:102
#5  0x00007f6cad4f92f6 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x00007f6cad4f61b4 in dm_config_create () at libdm-config.c:103
#7  0x00007f6cad4f81de in dm_config_from_string (config_settings=0x7f6ca8001110 "request = \"hello\"\n") at libdm-config.c:180
#8  0x0000000000405baf in client_thread (baton=0x303e530) at daemon-server.c:390
#9  0x00007f6cad2be9d1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f6cad00ba8d in clone () from /lib64/libc.so.6
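(Side note for anyone else hitting the "unsigned package" problem: abrt deletes the core if the owning package fails the GPG check. The knob referred to above is presumably the GPG-check setting in abrt-action-save-package-data.conf, normally under /etc/abrt/; the key name below is assumed from the stock RHEL 6 abrt configuration, so double-check it on your host:

  # /etc/abrt/abrt-action-save-package-data.conf  (key name assumed)
  OpenGPGCheck = no

With that set, abrt should keep cores from unsigned test builds instead of discarding them.)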
Hm. This is rather unfortunate: it seems to be a write corruption, not a read problem, and as such it is far less likely to be fixed by the patches referenced above. The backtrace itself is of little use beyond telling us that it's a different problem.

One of two things would be very helpful in figuring the problem out:

- If you could stop lvmetad before running the offending test and re-start it manually within valgrind (valgrind --num-callers=32 lvmetad -f), then run the failing test(s) and post all the output valgrind produced (you can redirect everything to a file to make it easier, with >& file). A sketch of the commands follows below.

- Alternatively, a list of all LVM commands that have been executed from the point lvmetad started up until the crash (if at all possible, also including commands executed by udev), ideally with timestamps.

Unfortunately, none of our regression tests triggers the problem and I can't reproduce it either.
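A minimal sketch of the first option on RHEL 6 (the lvm2-lvmetad init script name and the log path are assumptions; adjust to your setup):

  # stop the system lvmetad instance (assumed init script name)
  service lvm2-lvmetad stop

  # re-start lvmetad in the foreground under valgrind, capturing everything to a file
  valgrind --num-callers=32 lvmetad -f >& /tmp/lvmetad-valgrind.log

Then run the failing test(s) from another shell and attach /tmp/lvmetad-valgrind.log to this bug.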
(sorry, the needinfo request was lost in a bugzilla collision)
I have just noticed this:

  Recreating PVs/VG with smaller sizes
  No physical volume found in lvmetad cache for /dev/sdc6
  Physical volume /dev/sdc6 not found

Corey, do you run a succession of tests here? Can you try to run the test in isolation (restarting lvmetad just before)? A list of the tests that lead to the segfault would be quite useful as well. It's probably not this particular test that trips the problem but some earlier one, and this one just gets the flak. I have the STS sources now, so I can probably extract command lists from them for particular tests.
This is the minimum that's required to cause this. It usually takes anywhere from 20-40 iterations and never happens on the first iteration. A runnable shell version of the loop is sketched below.

$PVS = a list of random devices (in my case 6)
$VG  = a VG created out of those 6 devices using their full size

while true {
    # Recreating PVs/VG with smaller sizes
    vgremove $VG
    pvremove $PVS
    pvcreate --setphysicalvolumesize 500M $PVS
    vgcreate $VG $PVS

    # Restoring VG to default extent size
    vgremove $VG
    pvremove $PVS
    pvcreate $PVS
    vgcreate $VG $PVS
}
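For convenience, a runnable bash version of the loop above (the device list and VG name are placeholders, not the exact devices from my run; substitute whatever spare partitions you have):

  #!/bin/bash
  # Placeholder devices and VG name -- adjust to the test machine.
  PVS="/dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdc1 /dev/sdc2 /dev/sdc3"
  VG=test_vg

  # Initial state: PVs created with their full size, one VG on top of them.
  pvcreate $PVS
  vgcreate $VG $PVS

  i=0
  while true; do
      i=$((i + 1))
      echo "Iteration $i: recreating PVs/VG with smaller sizes"
      vgremove $VG
      pvremove $PVS
      pvcreate --setphysicalvolumesize 500M $PVS
      vgcreate $VG $PVS

      echo "Iteration $i: restoring VG to default extent size"
      vgremove $VG
      pvremove $PVS
      pvcreate $PVS
      vgcreate $VG $PVS
  done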
Tracked down to a data race in dm_pool_{create,destroy} and fixed upstream in b5aad86710be57833879ac0e8609021828949682.
Fix verified in the latest build.

2.6.32-410.el6.x86_64
lvm2-2.02.100-6.el6                        BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-libs-2.02.100-6.el6                   BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-cluster-2.02.100-6.el6                BUILT: Wed Oct 16 07:26:00 CDT 2013
udev-147-2.50.el6                          BUILT: Fri Oct 11 05:58:10 CDT 2013
device-mapper-1.02.79-6.el6                BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-libs-1.02.79-6.el6           BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-1.02.79-6.el6          BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-libs-1.02.79-6.el6     BUILT: Wed Oct 16 07:26:00 CDT 2013
cmirror-2.02.100-6.el6                     BUILT: Wed Oct 16 07:26:00 CDT 2013

Recreating PVs/VG with smaller sizes ( i=20 )
  PV         VG            Fmt  Attr PSize   PFree
  /dev/sda2  vg_harding02  lvm2 a--   92.67g       0
  /dev/sdb1  harding       lvm2 a--  496.00m 496.00m
  /dev/sdb2  harding       lvm2 a--  496.00m 496.00m
  /dev/sdb3  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc1  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc2  harding       lvm2 a--  496.00m 496.00m
  /dev/sdc3  harding       lvm2 a--  496.00m 496.00m

Restoring VG to default extent size
  PV         VG            Fmt  Attr PSize   PFree
  /dev/sda2  vg_harding02  lvm2 a--   92.67g      0
  /dev/sdb1  harding       lvm2 a--   18.63g  18.63g
  /dev/sdb2  harding       lvm2 a--   18.63g  18.63g
  /dev/sdb3  harding       lvm2 a--   18.63g  18.63g
  /dev/sdc1  harding       lvm2 a--   18.63g  18.63g
  /dev/sdc2  harding       lvm2 a--   18.63g  18.63g
  /dev/sdc3  harding       lvm2 a--   18.63g  18.63g
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1704.html