Bug 889361

Summary: lvmetad dies (general protection error in libc-2.12.so) during mirror conversion regression testing
Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.4
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Petr Rockai <prockai>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ddumas, dwysocha, heinzm, jbrassow, jharriga, mcsontos, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 960054
Last Closed: 2013-07-10 21:24:12 UTC

Attachments:
log during lvmetad failure (flags: none)

Description Corey Marthaler 2012-12-20 22:38:43 UTC
Description of problem:
./centipede -o qalvm-01 -l /home/msp/cmarthal/work/sts/sts-root -r /usr/tests/sts-rhel6.4 -i 1

[...]

Converting from mirror -> raid1
lvconvert --type raid1 centipede/centi_base
lvconvert -m 0 centipede/centi_base /dev/vdg2 /dev/vdc1
sufficient pvs convert case failed on qalvm-01
  Volume group "centipede" not found


log (with udev debugging on):
Dec 20 16:16:45 qalvm-01 qarshd[10672]: Running cmdline: lvconvert -m 0 centipede/centi_base /dev/vdg2 /dev/vdc1
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid' (stdout) 'ID_FS_UUID=DiReDc-rlW3-3d8Q-QRvw-90yO-j0e5-6MKwUA'
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid' (stdout) 'ID_FS_UUID_ENC=DiReDc-rlW3-3d8Q-QRvw-90yO-j0e5-6MKwUA'
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid' (stdout) 'ID_FS_VERSION=LVM2\x20001'
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid' (stdout) 'ID_FS_TYPE=LVM2_member'
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid' (stdout) 'ID_FS_USAGE=raid'
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: '/sbin/blkid -o udev -p /dev/vdf2' returned with exitcode 0
Dec 20 16:16:45 qalvm-01 udevd-work[10661]: RUN '$env{LVM_SBIN_PATH}/lvm pvscan --cache --activate ay --major $major --minor $minor' /lib/udev/rules.d/69-dm-lvm-metad.rules:25

[...]

Dec 20 16:16:46 qalvm-01 kernel: lvmetad[10680] general protection ip:3cc5e76205 sp:7f3e32954a80 error:0 in libc-2.12.so[3cc5e00000+18a000]

[...]

Dec 20 16:16:46 qalvm-01 udevd[606]: seq 4240 done with 0
Dec 20 16:16:46 qalvm-01 udevd[606]: seq 4244 done with 0
Dec 20 16:16:46 qalvm-01 udevd-work[10620]: '/sbin/lvm' (stderr) '  Request to update PV E0c33L-6jD4-RUaN-gc1G-45Of-rwYn-t5NplQ in lvmetad gave response Connection reset by peer.'
Dec 20 16:16:46 qalvm-01 udevd-work[10620]: '/sbin/lvm' (stderr) '  Update of lvmetad failed. This is a serious problem.'
Dec 20 16:16:46 qalvm-01 udevd-work[10620]: '/sbin/lvm' (stderr) '  It is strongly recommended that you restart lvmetad immediately.'
Dec 20 16:16:46 qalvm-01 udevd[606]: seq 4239 done with 0
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: '/sbin/lvm' (stderr) '  Request to update PV DfKltW-2dXZ-JZ1i-XEEF-N7o7-cd8M-PWVCmo in lvmetad gave response Connection reset by peer.'
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: '/sbin/lvm' (stderr) '  Update of lvmetad failed. This is a serious problem.'
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: '/sbin/lvm' (stderr) '  It is strongly recommended that you restart lvmetad immediately.'
Dec 20 16:16:46 qalvm-01 udevd[606]: seq 4245 done with 0
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: '/sbin/lvm pvscan --cache --activate ay --major 252 --minor 114' returned with exitcode 5
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: passed 652 bytes to monitor 0x7f758b07d030
Dec 20 16:16:46 qalvm-01 udevd-work[10663]: adding watch on '/dev/vdh2'


I'll upload the entire log.


Version-Release number of selected component (if applicable):
2.6.32-339.el6.x86_64

lvm2-2.02.98-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
lvm2-libs-2.02.98-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
lvm2-cluster-2.02.98-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
udev-147-2.43.el6    BUILT: Thu Oct 11 05:59:38 CDT 2012
device-mapper-1.02.77-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-libs-1.02.77-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-event-1.02.77-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-event-libs-1.02.77-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012
cmirror-2.02.98-6.el6    BUILT: Thu Dec 20 07:00:04 CST 2012

Comment 1 Corey Marthaler 2012-12-20 22:48:57 UTC
Created attachment 666979 [details]
log during lvmetad failure

Comment 2 Corey Marthaler 2012-12-20 22:59:00 UTC
Also saw this while converting from a 3-way corelog to a 2-way disklog.

Dec 20 15:17:35 hayes-01 qarshd[5599]: Running cmdline: lvcreate -n centi_base --mirrorlog core -m 2 -L 700M centipede
Dec 20 15:17:36 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
[...]
Dec 20 15:18:06 hayes-01 lvm[3104]: centipede-centi_base is now in-sync.
[...]
Dec 20 15:18:14 hayes-01 qarshd[5676]: Running cmdline: lvconvert --mirrorlog disk -m 1 centipede/centi_base
Dec 20 15:18:14 hayes-01 lvm[3104]: No longer monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:14 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:25 hayes-01 lvm[3104]: No longer monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:25 hayes-01 lvm[3104]: Monitoring mirror device centipede-centi_base for events.
Dec 20 15:18:26 hayes-01 xinetd[1607]: EXIT: qarsh status=0 pid=5676 duration=12(sec)
Dec 20 15:18:27 hayes-01 kernel: lvmetad[5751] general protection ip:37af476205 sp:7fcd0a1fba80 error:0 in libc-2.12.so[37af400000+18a000]
Dec 20 15:18:27 hayes-01 udevd-work[5685]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 280' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5682]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 281' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5725]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 276' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5726]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 275' unexpected exit with status 0x000d
Dec 20 15:18:27 hayes-01 udevd-work[5730]: '/sbin/lvm pvscan --cache --activate ay --major 152 --minor 273' unexpected exit with status 0x000d

Comment 3 Petr Rockai 2013-01-07 14:03:11 UTC
Nenad is running these tests to see if he can reproduce the bug and get a corresponding backtrace. Upstream tests already cover these scenarios, but I haven't seen them trigger this bug yet, so perhaps something STS-specific is playing a role here. I'm marking this as needinfo because I can't reproduce it and need more data to debug; hopefully Nenad has more luck.

Comment 4 Nenad Peric 2013-01-09 13:18:19 UTC
Bugzilla removed the NEEDINFO flag when I added myself to CC; I was not expecting it to clear the flag without a comment being added.

I have re-added it until we manage to get more data.

Comment 5 Corey Marthaler 2013-01-09 17:03:43 UTC
I think I finally have something to go on here. I was able to reproduce this on two different machines yesterday. Let me know if there is more information you'd like now that we have core files.

Jan  8 11:36:49 taft-01 kernel: lvmetad[9475] general protection ip:30b6676205 sp:7fce2ebfca80 error:0 in libc-2.12.so[30b6600000+18a000]

Core was generated by `lvmetad'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000030b6676205 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x00000030b6676205 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00000030b6679385 in _int_malloc () from /lib64/libc.so.6
#2  0x00000030b667a911 in malloc () from /lib64/libc.so.6
#3  0x0000003c57e231e4 in _new_chunk (p=0x7fce380014c0, s=16384) at mm/pool-fast.c:281
#4  0x0000003c57e23389 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:101
#5  0x0000003c57e23496 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x0000003c57e20564 in dm_config_create () at libdm-config.c:102
#7  0x0000003c57e2237e in dm_config_from_string (config_settings=0x7fce38001150 "request = \"hello\"\n") at libdm-config.c:181
#8  0x000000000040596f in client_thread (baton=0x2ed6d10) at daemon-server.c:382
#9  0x00000030b6a07851 in start_thread () from /lib64/libpthread.so.0
#10 0x00000030b66e890d in clone () from /lib64/libc.so.6




Jan  9 03:52:29 qalvm-01 kernel: lvmetad[7906] general protection ip:3cc5e76205 sp:7fdef6d79a80 error:0 in libc-2.12.so[3cc5e00000+18a000]

#0  0x0000003cc5e76205 in malloc_consolidate () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cc5e76205 in malloc_consolidate () from /lib64/libc.so.6
#1  0x0000003cc5e79385 in _int_malloc () from /lib64/libc.so.6
#2  0x0000003cc5e7a911 in malloc () from /lib64/libc.so.6
#3  0x000000314ca231e4 in _new_chunk (p=0x7fded000cdf0, s=16384) at mm/pool-fast.c:281
#4  0x000000314ca23389 in dm_pool_alloc_aligned (p=<value optimized out>, s=32, alignment=8) at mm/pool-fast.c:101
#5  0x000000314ca23496 in dm_pool_zalloc (p=<value optimized out>, s=32) at mm/pool.c:72
#6  0x000000314ca20564 in dm_config_create () at libdm-config.c:102
#7  0x000000314ca2237e in dm_config_from_string (config_settings=0x7fded000ce40 "request = \"hello\"\n") at libdm-config.c:181
#8  0x000000000040596f in client_thread (baton=0x2b07f40) at daemon-server.c:382
#9  0x0000003cc6207851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003cc5ee890d in clone () from /lib64/libc.so.6
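
In case the full per-thread state helps, it can be pulled from these cores with stock gdb commands (a sketch; the core path below is hypothetical, and the lvm2 and glibc debuginfo packages would need to be installed):

gdb /sbin/lvmetad /path/to/core.lvmetad
(gdb) thread apply all bt full
(gdb) info registers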

Comment 6 Petr Rockai 2013-01-10 04:17:43 UTC
Hm. This doesn't look so good. I wonder if there's a chance to get this running under valgrind. Basically, the core only tells us that something corrupted the heap, which sadly isn't enough to fix the bug. I'll run some tests with valgrind myself: even though we don't see crashes, valgrind might still find something. One thing that would be quite useful to know is whether the lvmetad instance in question was running only for the duration of that single mirror test, or whether it is being re-used across multiple tests.
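
For reference, a minimal sketch of the kind of valgrind run I have in mind (assuming the lvm2-lvmetad init script is what keeps the daemon running on these hosts and that the -f foreground switch is available in this build; the log path is arbitrary):

service lvm2-lvmetad stop
valgrind --leak-check=full --track-origins=yes --log-file=/tmp/lvmetad-valgrind.log lvmetad -f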

Comment 7 Zdenek Kabelac 2013-01-10 08:20:10 UTC
Unfortunately, testing under valgrind needs a separate configure build option activated for the build; otherwise the test will usually abort on memory checking.

Comment 8 Petr Rockai 2013-01-16 15:19:31 UTC
Re comment 7: I don't understand what you mean. Running lvmetad under valgrind is perfectly OK in any build configuration, and since this is libc heap corruption, valgrind should certainly detect it even without pool support, right? Anyway, I was unable to reproduce this and hence need more info (i.e. the valgrind output mentioned above might get me going).

Comment 9 Peter Rajnoha 2013-01-16 15:21:02 UTC
This problem is very hard to track down and reproduce, so I'm setting the devel conditional NAK (needs a reproducer). Considering where we are in the 6.4 schedule, it seems this will be moved to 6.5.

Comment 11 Marian Csontos 2013-01-22 12:02:14 UTC
This may be related to a low-memory situation - see Bug 889651 for details.

I have always seen the 

    kernel: lvmetad[7906] general protection ip:3cc5e76205 sp:7fdef6d79a80 error:0 in libc-2.12.so[3cc5e00000+18a000]

accompanied by a failure in dmeventd, as reported in the above bug.
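
If memory pressure is indeed in play, something along these lines on the affected hosts around the crash timestamps should show it (a sketch; /var/log/messages is the standard RHEL 6 location, and sar output only exists if sysstat was collecting):

grep -iE 'out of memory|oom-killer' /var/log/messages
sar -r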

Comment 12 Corey Marthaler 2013-01-24 21:49:48 UTC
Hit this issue when attempting to test for bug 862253.

# run 20 pvs in parallel:

  [...]
  Skipping volume group raid_sanity
  Volume group "raid_sanity" not found
  Request to list PVs in lvmetad gave response Connection reset by peer.

Jan 24 15:30:10 hayes-03 kernel: lvmetad[25889] general protection ip:3a97a76205 sp:7f1b1cdf9b90 error:0 in libc-2.12.so[3a97a00000+18a000]


[root@hayes-03 ~]# pvscan
  WARNING: Failed to connect to lvmetad: Connection refused. Falling back to internal scanning.
  PV /dev/etherd/e1.1p1    VG raid_sanity   lvm2 [908.23 GiB / 907.93 GiB free]
  PV /dev/etherd/e1.1p10   VG raid_sanity   lvm2 [908.23 GiB / 908.23 GiB free]
  PV /dev/etherd/e1.1p2    VG raid_sanity   lvm2 [908.23 GiB / 908.23 GiB free]
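
For reference, the parallel load was roughly the following (a sketch, not the exact test wrapper used):

for i in $(seq 1 20); do pvs & done; wait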

Comment 14 John Harrigan 2013-07-10 21:24:12 UTC
Closing. Unable to reproduce.

Comment 15 Zdenek Kabelac 2013-07-11 07:51:02 UTC
A well-hidden bug in buffering has been fixed since then - it could likely have been the source of these hard-to-reproduce problems - so I assume this should now be OK.

Comment 16 Zdenek Kabelac 2013-07-11 08:05:48 UTC
Here is related patch:

https://www.redhat.com/archives/lvm-devel/2013-June/msg00416.html

Comment 17 Corey Marthaler 2013-10-07 22:51:49 UTC
This bug may still exist; I have filed new bug 1016322 for what appears to be a similar instance of this issue.