Bug 438434 - clvmd fails to start on 58th node
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Christine Caulfield
QA Contact: GFS Bugs
Keywords: TestBlocker
Reported: 2008-03-20 16:54 EDT by Nate Straz
Modified: 2010-01-11 23:09 EST
Fixed In Version: RHBA-2008-0379
Doc Type: Bug Fix
Last Closed: 2008-05-21 10:26:41 EDT

Attachments
clvmd core file (52.05 KB, application/x-gzip)
2008-03-20 16:59 EDT, Nate Straz

Description Nate Straz 2008-03-20 16:54:44 EDT
Description of problem:

When starting clvmd on a 60-node cluster, clvmd aborts with a glibc malloc() memory-corruption error right after CMAN initialisation:

[root@west-15 log]# clvmd -d 
CLVMD[aaabc300]: Mar 20 15:46:45 CLVMD started
CLVMD[aaabc300]: Mar 20 15:46:45 Connected to CMAN
CLVMD[aaabc300]: Mar 20 15:46:45 CMAN initialisation complete
*** glibc detected *** clvmd: malloc(): memory corruption: 0x000000001e348750 ***
======= Backtrace: =========
/lib64/libc.so.6[0x30dd071d11]
/lib64/libc.so.6(__libc_malloc+0x7d)[0x30dd072eed]
/lib64/libc.so.6[0x30dd06128a]
/usr/lib64/libdlm.so.2[0x30df401acd]
/usr/lib64/libdlm.so.2[0x30df401d16]
/usr/lib64/libdlm.so.2[0x30df4024c2]
clvmd(init_cman_cluster+0x142)[0x411b52]
clvmd(main+0x4d4)[0x40f364]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x30dd01d8b4]
clvmd[0x40c309]
======= Memory map: ========
00400000-00465000 r-xp 00000000 fd:00 8106509    /usr/sbin/clvmd
00664000-00666000 rw-p 00064000 fd:00 8106509    /usr/sbin/clvmd
00666000-0066b000 rw-p 00666000 00:00 0
00865000-00869000 rw-p 00065000 fd:00 8106509    /usr/sbin/clvmd
1e348000-1e369000 rw-p 1e348000 00:00 0
30dcc00000-30dcc1a000 r-xp 00000000 fd:00 3113247    /lib64/ld-2.5.so
30dce1a000-30dce1b000 r--p 0001a000 fd:00 3113247    /lib64/ld-2.5.so
30dce1b000-30dce1c000 rw-p 0001b000 fd:00 3113247    /lib64/ld-2.5.so
30dd000000-30dd14a000 r-xp 00000000 fd:00 3113248    /lib64/libc-2.5.so
30dd14a000-30dd34a000 ---p 0014a000 fd:00 3113248    /lib64/libc-2.5.so
30dd34a000-30dd34e000 r--p 0014a000 fd:00 3113248    /lib64/libc-2.5.so
30dd34e000-30dd34f000 rw-p 0014e000 fd:00 3113248    /lib64/libc-2.5.so
30dd34f000-30dd354000 rw-p 30dd34f000 00:00 0
30dd400000-30dd402000 r-xp 00000000 fd:00 3113249    /lib64/libdl-2.5.so
30dd402000-30dd602000 ---p 00002000 fd:00 3113249    /lib64/libdl-2.5.so
30dd602000-30dd603000 r--p 00002000 fd:00 3113249    /lib64/libdl-2.5.so
30dd603000-30dd604000 rw-p 00003000 fd:00 3113249    /lib64/libdl-2.5.so
30dd800000-30dd814000 r-xp 00000000 fd:00 3113153    /lib64/libdevmapper.so.1.02
30dd814000-30dda14000 ---p 00014000 fd:00 3113153    /lib64/libdevmapper.so.1.02
30dda14000-30dda16000 rw-p 00014000 fd:00 3113153    /lib64/libdevmapper.so.1.02
30ddc00000-30ddc15000 r-xp 00000000 fd:00 3113253    /lib64/libpthread-2.5.so
30ddc15000-30dde14000 ---p 00015000 fd:00 3113253    /lib64/libpthread-2.5.so
30dde14000-30dde15000 r--p 00014000 fd:00 3113253    /lib64/libpthread-2.5.so
30dde15000-30dde16000 rw-p 00015000 fd:00 3113253    /lib64/libpthread-2.5.so
30dde16000-30dde1a000 rw-p 30dde16000 00:00 0
30de000000-30de004000 r-xp 00000000 fd:00 3113151    /lib64/libdevmapper-event.so.1.02
30de004000-30de203000 ---p 00004000 fd:00 3113151    /lib64/libdevmapper-event.so.1.02
30de203000-30de204000 rw-p 00003000 fd:00 3113151    /lib64/libdevmapper-event.so.1.02
30de400000-30de415000 r-xp 00000000 fd:00 3113258    /lib64/libselinux.so.1
30de415000-30de615000 ---p 00015000 fd:00 3113258    /lib64/libselinux.so.1
30de615000-30de617000 rw-p 00015000 fd:00 3113258    /lib64/libselinux.so.1
30de617000-30de618000 rw-p 30de617000 00:00 0
30de800000-30de83b000 r-xp 00000000 fd:00 3113257    /lib64/libsepol.so.1
30de83b000-30dea3b000 ---p 0003b000 fd:00 3113257    /lib64/libsepol.so.1
30dea3b000-30dea3c000 rw-p 0003b000 fd:00 3113257    /lib64/libsepol.so.1
30dea3c000-30dea46000 rw-p 30dea3c000 00:00 0
30dec00000-30dec07000 r-xp 00000000 fd:00 3113254    /lib64/librt-2.5.so
30dec07000-30dee07000 ---p 00007000 fd:00 3113254    /lib64/librt-2.5.so
30dee07000-30dee08000 r--p 00007000 fd:00 3113254    /lib64/librt-2.5.so
30dee08000-30dee09000 rw-p 00008000 fd:00 3113254    /lib64/librt-2.5.so
30df000000-30df005000 r-xp 00000000 fd:00 8105890    /usr/lib64/libcman.so.2.0.80
30df005000-30df204000 ---p 00005000 fd:00 8105890    /usr/lib64/libcman.so.2.0.80
30df204000-30df205000 rw-p 00004000 fd:00 8105890    /usr/lib64/libcman.so.2.0.80
30df400000-30df405000 r-xp 00000000 fd:00 8111103    /usr/lib64/libdlm.so.2.0.80
30df405000-30df604000 ---p 00005000 fd:00 8111103    /usr/lib64/libdlm.so.2.0.80
30df604000-30df605000 rw-p 00004000 fd:00 8111103    /usr/lib64/libdlm.so.2.0.80
30df800000-30df84f000 r-xp 00000000 fd:00 8102911    /usr/lib64/libncurses.so.5.5
30df84f000-30dfa4e000 ---p 0004f000 fd:00 8102911    /usr/lib64/libncurses.so.5.5
30dfa4e000-30dfa5c000 rw-p 0004e000 fd:00 8102911    /usr/lib64/libncurses.so.5.5
30dfa5c000-30dfa5d000 rw-p 30dfa5c000 00:00 0
30e0000000-30e0035000 r-xp 00000000 fd:00 8101492    /usr/lib64/libreadline.so.5.1
30e0035000-30e0234000 ---p 00035000 fd:00 8101492    /usr/lib64/libreadline.so.5.1
30e0234000-30e023c000 rw-p 00034000 fd:00 8101492    /usr/lib64/libreadline.so.5.1
30e023c000-30e023d000 rw-p 30e023c000 00:00 0
2aaaaaaab000-2aaaaaaac000 rw-p 2aaaaaaab000 00:00 0
2aaaaaab7000-2aaaaaabd000 rw-p 2aaaaaab7000 00:00 0
2aaaaaabd000-2aaaaaaca000 r-xp 00000000 fd:00 3113256    /lib64/libgcc_s-4.1.2-20080102.so.1
2aaaaaaca000-2aaaaacca000 ---p 0000d000 fd:00 3113256    /lib64/libgcc_s-4.1.2-20080102.so.1
2aaaaacca000-2aaaaaccb000 rw-p 0000d000 fd:00 3113256    /lib64/libgcc_s-4.1.2-20080102.so.1
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff00135000-7fff0014b000 rw-p 7fff00135000 00:00 0    [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0    [vdso]
Aborted


Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.32-1.el5
cman-2.0.80-1.el5
lvm2-2.02.32-1.el5


How reproducible:
100%

Steps to Reproduce:
1. Start clvmd on all nodes of a 60-node cluster.

Actual results:
(gdb) bt
#0  0x00000030dd030145 in raise () from /lib64/libc.so.6
#1  0x00000030dd031be0 in abort () from /lib64/libc.so.6
#2  0x00000030dd06a3cb in __libc_message () from /lib64/libc.so.6
#3  0x00000030dd071d11 in _int_malloc () from /lib64/libc.so.6
#4  0x00000030dd072eed in malloc () from /lib64/libc.so.6
#5  0x00000030dd06128a in __fopen_internal () from /lib64/libc.so.6
#6  0x00000030df401acd in dlm_library_version () from /usr/lib64/libdlm.so.2
#7  0x00000030df401d16 in dlm_library_version () from /usr/lib64/libdlm.so.2
#8  0x00000030df4024c2 in dlm_ls_deadlock_cancel () from /usr/lib64/libdlm.so.2
#9  0x0000000000411b52 in init_cman_cluster ()
#10 0x000000000040f364 in main ()


Expected results:
clvmd should start

Additional info:
Comment 1 Nate Straz 2008-03-20 16:59:05 EDT
Created attachment 298748
clvmd core file
Comment 2 Nate Straz 2008-03-20 17:11:42 EDT
Upon further inspection, it appears that the clvmd processes on all nodes have
exited and left their DLM lockspaces behind. Restarting clvmd on any node
produces a core file.
Comment 3 Nate Straz 2008-03-24 11:20:15 EDT
Adding the TestBlocker flag: I need clvmd in order to mount anything, because
the four disks come up in a different order on some nodes.
Comment 4 Christine Caulfield 2008-03-25 06:43:59 EDT
This is one of those "AAaargh!" bugs. The basic fix is:

-               int *new_updown = realloc(node_updown, new_size);
+               int *new_updown = realloc(node_updown, sizeof(int) * new_size);

The check-in below also fixes how the initial size of the array is set.

Checking in daemons/clvmd/clvmd-cman.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd-cman.c,v  <--  clvmd-cman.c
new revision: 1.21; previous revision: 1.20
done
Comment 10 Nate Straz 2008-04-01 10:46:33 EDT
I am still hitting realloc problems, and clvmd does not stay up on all nodes at
high node counts.
Comment 11 Christine Caulfield 2008-04-01 11:02:13 EDT
It turns out the initial allocation can be wrong too. This patch fixes it:

Checking in daemons/clvmd/clvmd-cman.c;
/cvs/lvm2/LVM2/daemons/clvmd/clvmd-cman.c,v  <--  clvmd-cman.c
new revision: 1.22; previous revision: 1.21
done
Comment 15 errata-xmlrpc 2008-05-21 10:26:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0379.html
