Bug 1016322 - lvmetad dies (general protection error in libc-2.12.so) during pv creation
lvmetad dies (general protection error in libc-2.12.so) during pv creation
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.5
x86_64 Linux
high Severity high
: rc
: ---
Assigned To: Petr Rockai
Cluster QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-10-07 18:50 EDT by Corey Marthaler
Modified: 2015-07-04 03:02 EDT (History)
12 users (show)

See Also:
Fixed In Version: lvm2-2.02.100-6.el6
Doc Type: Bug Fix
Doc Text:
Cause: A race condition in pool destruction in libdevmapper.so. Consequence: lvmetad would sometimes crash due to heap corruption, especially under heavier concurrent loads (multiple LVM commands executing at once). Fix: The race condition has been fixed by introducing correct locking. Result: lvmetad no longer suffers heap corruption and subsequent crashes.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 18:29:07 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2013-10-07 18:50:19 EDT
Description of problem:
This may very well just be a reproduction of bug 889361. As you can see from the log, i missed getting an actual core due to an unsigned lvm2 package??? I'll work to reproduce this and provide more information.


 SCENARIO - [contiguous_opt_check]
 Sanity check for the contiguous allocation option
 Recreating PVs/VG with smaller sizes
   No physical volume found in lvmetad cache for /dev/sdc6
   Physical volume /dev/sdc6 not found
 Attempting mirror creation without enough contiguous space
 harding-02: lvcreate -m 1 -n contig_without -C y -L 400M mirror_sanity
   Insufficient suitable contiguous allocatable extents for logical volume contig_without: 200 more required
 Attempting mirror creation with enough contiguous space
 harding-02: lvcreate -m 1 -n contig_with -C y -L 200M mirror_sanity
 Deactivating mirror contig_with... and removing
 Restoring VG to default extent size
   Request to lookup PV /dev/sdb1 in lvmetad gave response Connection reset by peer.
   Request to update PV nIWg5x-U9Og-G02K-rwfK-igtN-EqAT-BBLBrX in lvmetad gave response Connection reset by peer.
   Failed to write physical volume "/dev/sdb1"   
 unable to create PVs

Oct  7 08:11:50 harding-02 qarshd[58203]: Running cmdline: pvcreate /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb5 /dev/sdb6 /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc5 /dev/sdc6
Oct  7 08:11:50 harding-02 kernel: __ratelimit: 212 callbacks suppressed
Oct  7 08:11:50 harding-02 kernel: lvmetad[58231] general protection ip:7fe0414f6285 sp:7fe0397faa80 error:0 in libc-2.12.so[7fe041480000+18a000]
Oct  7 08:11:50 harding-02 abrtd: Directory 'ccpp-2013-10-07-08:11:50-1862' creation detected
Oct  7 08:11:50 harding-02 abrt[58239]: Saved core dump of pid 1862 (/sbin/lvmetad) to /var/spool/abrt/ccpp-2013-10-07-08:11:50-1862 (134029312 bytes)
Oct  7 08:11:50 harding-02 abrtd: Package 'lvm2' isn't signed with proper key
Oct  7 08:11:50 harding-02 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-10-07-08:11:50-1862' exited with 1
Oct  7 08:11:50 harding-02 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2013-10-07-08:11:50-1862'


Version-Release number of selected component (if applicable):
2.6.32-410.el6.x86_64

lvm2-2.02.100-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-libs-2.02.100-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-cluster-2.02.100-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
udev-147-2.48.el6    BUILT: Fri Aug  9 06:09:50 CDT 2013
device-mapper-1.02.79-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-libs-1.02.79-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-1.02.79-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-libs-1.02.79-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
cmirror-2.02.100-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
Comment 2 Alasdair Kergon 2013-10-08 10:32:00 EDT
lvmetad must not crash
Comment 3 Corey Marthaler 2013-10-08 17:30:23 EDT
Reproduced again, however I'm still trying to get a valid core file. I'm not sure what the deal is with the lvm2 package isn't signed issue.


Oct  8 04:03:06 hayes-03 qarshd[4108]: Running cmdline: pvremove /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct  8 04:03:06 hayes-03 qarshd[4121]: Running cmdline: pvcreate --setphysicalvolumesize 300M /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
Oct  8 04:03:07 hayes-03 kernel: __ratelimit: 127 callbacks suppressed
Oct  8 04:03:07 hayes-03 kernel: lvmetad[4133]: segfault at 7f3fe963b568 ip 0000003be1076205 sp 00007f0a330a5a80 error 4 in libc-2.12.so[3be1000000+18a000]
Oct  8 04:03:08 hayes-03 abrtd: Directory 'ccpp-2013-10-08-04:03:07-16471' creation detected
Oct  8 04:03:08 hayes-03 abrt[4154]: Saved core dump of pid 16471 (/sbin/lvmetad) to /var/spool/abrt/ccpp-2013-10-08-04:03:07-16471 (155684864 bytes)
Oct  8 04:03:09 hayes-03 abrtd: Package 'lvm2' isn't signed with proper key
Oct  8 04:03:09 hayes-03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471' exited with 1
Oct  8 04:03:09 hayes-03 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2013-10-08-04:03:07-16471', deleting
Comment 4 Petr Rockai 2013-10-08 17:56:32 EDT
I have fixed a few invalid memory reads and a possible race in lvmetad, git 830d233..808a5d9 -- any of those could in theory cause a SEGV. A core could help to confirm whether any of those were the culprit. Else, I guess those fixes are worthwhile even if we didn't run into any directly related crashes, so maybe the next build will contain those.

I tentatively bump this to POST so that the commits get pulled into a build. More testing needed though.
Comment 5 Corey Marthaler 2013-10-08 18:13:51 EDT
Here we go, (I forgot to add "no" to abrt-action-save-package-data.conf):

Core was generated by `lvmetad'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.128.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libsepol-2.0.41-4.el6.x86_64 libudev-147-2.48.el6.x86_64
(gdb) bt
#0  0x00007f6cacf99285 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007f6cacf9c405 in _int_malloc () from /lib64/libc.so.6
#2  0x00007f6cacf9d991 in malloc () from /lib64/libc.so.6
#3  0x00007f6cad4f9044 in _new_chunk (p=0x7f6ca8002d70, s=16384) at mm/pool-fast.c:282
#4  0x00007f6cad4f91e9 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:102
#5  0x00007f6cad4f92f6 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x00007f6cad4f61b4 in dm_config_create () at libdm-config.c:103
#7  0x00007f6cad4f81de in dm_config_from_string (config_settings=0x7f6ca8001110 "request = \"hello\"\n") at libdm-config.c:180
#8  0x0000000000405baf in client_thread (baton=0x303e530) at daemon-server.c:390
#9  0x00007f6cad2be9d1 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f6cad00ba8d in clone () from /lib64/libc.so.6
Comment 6 Petr Rockai 2013-10-09 06:12:54 EDT
Hm. This is rather unfortunate, as it seems this is a write corruption, not a read problem, and as such is far less likely to be fixed by the patches referenced above. Unfortunately, the backtrace itself is of little use beyond telling us that it's a different problem.

One of two things would be very helpful in figuring the problem out:
- if you could stop lvmetad before running the offending test, and re-start it manually within valgrind: valgrind --num-callers=32 lvmetad -f, then run the failing test(s) and post all the output valgrind produced (you can redirect everything to a file to make it easier, with >& file)
- alternatively, a list of all LVM commands that have been executed from the point lvmetad started up until the crash (if at all possible, also including commands executed by udev), ideally with timestamps

Unfortunately, none of our regression tests triggers the problem and I can't reproduce it either.
Comment 8 Petr Rockai 2013-10-09 09:03:19 EDT
(sorry, the needinfo request was lost in a bugzilla collision)
Comment 9 Petr Rockai 2013-10-09 10:59:59 EDT
I have just noticed this:

Recreating PVs/VG with smaller sizes
   No physical volume found in lvmetad cache for /dev/sdc6
   Physical volume /dev/sdc6 not found

Corey, do you run a succession of tests here? Can you try to run the test in isolation (restarting lvmetad just before)? A list of tests that lead to the segfault would be quite useful as well. It's probably not this particular test that trips the problem, but some earlier one and this one just gets the flak. I have STS sources now and I can probably extract command lists from that, for particular tests.
Comment 11 Corey Marthaler 2013-10-09 13:24:50 EDT
This is the minimum that's required to cause this. It usually take anywhere from 20-40 iterations and never happens on the first iteration.


$PVS = a list of random devices (in my case 6)
$VG = a vg created out of those 6 devices using their full size

while true {
        # Recreating PVs/VG with smaller sizes\n";
        vgremove $VG
        pvremove $PVS
        pvcreate --setphysicalvolumesize 500M $PVS
        vgcreate $VG $PVS

        # Restoring VG to default extent size\n";
        vgremove $VG
        pvremove $PVS
        pvcreate $PVS
        vgcreate $VG $PVS
}
Comment 12 Petr Rockai 2013-10-09 18:39:23 EDT
Tracked down to a data race in dm_pool_{create,destroy} and fixed upstream in b5aad86710be57833879ac0e8609021828949682.
Comment 14 Corey Marthaler 2013-10-17 17:02:35 EDT
Fix verified in the latest build.

2.6.32-410.el6.x86_64
lvm2-2.02.100-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-libs-2.02.100-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
lvm2-cluster-2.02.100-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
udev-147-2.50.el6    BUILT: Fri Oct 11 05:58:10 CDT 2013
device-mapper-1.02.79-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-libs-1.02.79-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-1.02.79-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
device-mapper-event-libs-1.02.79-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013
cmirror-2.02.100-6.el6    BUILT: Wed Oct 16 07:26:00 CDT 2013


Recreating PVs/VG with smaller sizes ( i=20 )
  PV         VG           Fmt  Attr PSize   PFree  
  /dev/sda2  vg_harding02 lvm2 a--   92.67g      0 
  /dev/sdb1  harding      lvm2 a--  496.00m 496.00m
  /dev/sdb2  harding      lvm2 a--  496.00m 496.00m
  /dev/sdb3  harding      lvm2 a--  496.00m 496.00m
  /dev/sdc1  harding      lvm2 a--  496.00m 496.00m
  /dev/sdc2  harding      lvm2 a--  496.00m 496.00m
  /dev/sdc3  harding      lvm2 a--  496.00m 496.00m
Restoring VG to default extent size
  PV         VG           Fmt  Attr PSize  PFree 
  /dev/sda2  vg_harding02 lvm2 a--  92.67g     0 
  /dev/sdb1  harding      lvm2 a--  18.63g 18.63g
  /dev/sdb2  harding      lvm2 a--  18.63g 18.63g
  /dev/sdb3  harding      lvm2 a--  18.63g 18.63g
  /dev/sdc1  harding      lvm2 a--  18.63g 18.63g
  /dev/sdc2  harding      lvm2 a--  18.63g 18.63g
  /dev/sdc3  harding      lvm2 a--  18.63g 18.63g
Comment 15 errata-xmlrpc 2013-11-21 18:29:07 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1704.html

Note You need to log in before you can comment on or make changes to this bug.