Bug 177467 - clustat hangs during rgmanager recovery testing w/ gulm
Summary: clustat hangs during rgmanager recovery testing w/ gulm
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gulm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Chris Feist
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-01-10 21:37 UTC by Nate Straz
Modified: 2009-04-16 20:02 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-14 00:00:41 UTC
Embargoed:



Description Nate Straz 2006-01-10 21:37:09 UTC
Description of problem:

While running recovery tests I get into a state where clustat hangs in a
recv().  It appears to be talking to clurgmgrd, which is stuck in the gulm lock code.

Version-Release number of selected component (if applicable):
rgmanager-1.9.44-0
gulm-1.0.5-0

How reproducible:
Takes a couple tries

Steps to Reproduce:
1. Reboot a few nodes
2. Allow recovery to work
3. Run clustat

Or run the derringer test case.
  
Actual results:

-- Crash output:

      KERNEL: /usr/lib/debug/lib/modules/2.6.9-27.EL/vmlinux
    DUMPFILE: /dev/crash
        CPUS: 1
        DATE: Tue Jan 10 15:30:05 2006
      UPTIME: 04:20:28
LOAD AVERAGE: 0.08, 0.02, 0.01
       TASKS: 66
    NODENAME: morph-03
     RELEASE: 2.6.9-27.EL
     VERSION: #1 Tue Dec 20 19:13:33 EST 2005
     MACHINE: i686  (2399 Mhz)
      MEMORY: 2 GB
         PID: 3701
     COMMAND: "crash"
        TASK: f6560bd0  [THREAD_INFO: f6189000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash> bt 2858
PID: 2858   TASK: f5e34c50  CPU: 0   COMMAND: "clurgmgrd"
 #0 [f5e24c78] schedule at c030ed50
 #1 [f5e24cfc] sk_wait_data at c02af4a2
 #2 [f5e24d54] tcp_recvmsg at c02d849d
 #3 [f5e24d8c] sock_common_recvmsg at c02afc2e
 #4 [f5e24db0] sock_recvmsg at c02ac437
 #5 [f5e24e9c] sys_recvfrom at c02ad6eb
 #6 [f5e24f60] sys_recv at c02ad754
 #7 [f5e24f7c] sys_socketcall at c02ade36
 #8 [f5e24fc0] system_call at c03115a8
    EAX: 00000066  EBX: 0000000a  ECX: bfeaff30  EDX: 00ab9518 
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 097e3b48 
    SS:  007b      ESP: bfeaff24  EBP: bfeaff58 
    CS:  0073      EIP: 00b207a2  ERR: 00000066  EFLAGS: 00000293 
crash> bt 3634
PID: 3634   TASK: f5fb0dd0  CPU: 0   COMMAND: "clustat"
 #0 [f618bc78] schedule at c030ed50
 #1 [f618bcfc] sk_wait_data at c02af4a2
 #2 [f618bd54] tcp_recvmsg at c02d849d
 #3 [f618bd8c] sock_common_recvmsg at c02afc2e
 #4 [f618bdb0] sock_recvmsg at c02ac437
 #5 [f618be9c] sys_recvfrom at c02ad6eb
 #6 [f618bf60] sys_recv at c02ad754
 #7 [f618bf7c] sys_socketcall at c02ade36
 #8 [f618bfc0] system_call at c03115a8
    EAX: 00000066  EBX: 0000000a  ECX: bfea83f0  EDX: 08052504 
    DS:  007b      ESI: bfea85c4  ES:  007b      EDI: bfea8550 
    SS:  007b      ESP: bfea83e8  EBP: bfea8408 
    CS:  0073      EIP: 00b207a2  ERR: 00000066  EFLAGS: 00000246 
crash> 

-- strace output
[root@morph-03 ~]# strace -p 2858
Process 2858 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 2858 detached
[root@morph-03 ~]# strace -p 3634
Process 3634 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 3634 detached

-- GDB backtraces
[root@morph-03 ~]# gdb /usr/sbin/clustat 
GNU gdb Red Hat Linux (6.3.0.0-1.91rh)
...
(gdb) attach 3634
Attaching to program: /usr/sbin/clustat, process 3634
...
(gdb) bt
#0  0x00b207a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00d8b771 in recv () from /lib/tls/libpthread.so.0
#2  0x0804bfa1 in msg_peek ()
#3  0x0804fbbe in msg_receive_simple (fd=5, buf=0xbfea849c, timeout=10)
    at msgsimple.c:86
#4  0x080498f9 in rg_state_list (local_node_id=18446462599795314442)
    at clustat.c:60
#5  0x0804ad63 in main (argc=2, argv=0xbfea85c4) at clustat.c:620
#6  0x00b4de23 in __libc_start_main () from /lib/tls/libc.so.6
#7  0x08049795 in _start ()

[root@morph-03 ~]# gdb /usr/sbin/clurgmgrd 
GNU gdb Red Hat Linux (6.3.0.0-1.91rh)
...
(gdb) attach 2858
Attaching to program: /usr/sbin/clurgmgrd, process 2858
...
(gdb) bt
#0  0x00b207a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00d8b798 in recv () from /lib/tls/libpthread.so.0
#2  0x00ab831e in xdr_recv (fd=5, buf=0x97e3b48, len=1) at src/xdr_io.c:150
#3  0x00ab7995 in get_next (xdr=0x97e4868) at src/xdr_base.c:467
#4  0x00ab7bf2 in xdr_dec_uint32 (xdr=0x97e4868, i=0xbfeafff4)
    at src/xdr_base.c:546
#5  0x00ab6139 in lg_lock_handle_messages (lgp=0x97e9c00, cbp=0xbfeb0050, 
    misc=0xbfeb0040) at lib/lg_lock.c:69
#6  0x00c9d317 in gulm_lock () from /lib/magma/magma_gulm.so
#7  0x0805f024 in cp_lock ()
#8  0x0805e130 in clu_lock ()
#9  0x0804f521 in rg_lock (name=0x97e7818 "10.15.84.89", p=0xbfeb029c)
    at rg_state.c:140
#10 0x0804db30 in eval_groups (local=1, nodeid=18446462599795314442, 
    nodeStatus=1) at groups.c:338
#11 0x0804bf00 in node_event (local=1, nodeID=18446462599795314442, 
    nodeStatus=1) at main.c:158
#12 0x0804c152 in membership_update () at main.c:233
#13 0x0804d337 in main (argc=1, argv=0xbfeb03f4) at main.c:736
#14 0x00b4de23 in __libc_start_main () from /lib/tls/libc.so.6
#15 0x0804a645 in _start ()

Expected results:

clustat should never hang

Additional info:

Comment 1 Nate Straz 2006-01-10 21:58:09 UTC
FYI, here is why I jumped from clustat to clurgmgrd:

> Start with what clustat is doing

[root@morph-03 ~]# ps -C clustat
  PID TTY          TIME CMD
 3634 ?        00:00:00 clustat
[root@morph-03 ~]# strace  -p 3634
Process 3634 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 3634 detached

> Look at what fd 5 is

[root@morph-03 ~]# lsof -p 3634
COMMAND  PID USER   FD   TYPE DEVICE    SIZE    NODE NAME
clustat 3634 root  cwd    DIR  253,0    4096       2 /
clustat 3634 root  rtd    DIR  253,0    4096       2 /
clustat 3634 root  txt    REG  253,0   42628 1784514 /usr/sbin/clustat
clustat 3634 root  mem    REG  253,0   79488  472443 /lib/libresolv-2.3.4.so
clustat 3634 root  mem    REG  253,0   11940  491004 /lib/magma/magma_gulm.so
clustat 3634 root  mem    REG  253,0   59755 1775641 /usr/lib/libgulm.so.1.0.0
clustat 3634 root  mem    REG  253,0   21280  472396 /lib/libnss_dns-2.3.4.so
clustat 3634 root  mem    REG  253,0   45889  473911 /lib/libnss_files-2.3.4.so
clustat 3634 root  mem    REG  253,0  106397  472354 /lib/ld-2.3.4.so
clustat 3634 root  mem    REG  253,0 1454546  472397 /lib/tls/libc-2.3.4.so
clustat 3634 root  mem    REG  253,0   15324  472399 /lib/libdl-2.3.4.so
clustat 3634 root  mem    REG  253,0   93940  472400 /lib/tls/libpthread-2.3.4.so
clustat 3634 root  mem    REG  253,0  879961 1782980 /usr/lib/libncurses.so.5.4
clustat 3634 root    0u  IPv4   8553             TCP
morph-03.lab.msp.redhat.com:32803->tin.msp.redhat.com:5013 (ESTABLISHED)
clustat 3634 root    1u  IPv4   8554             TCP
morph-03.lab.msp.redhat.com:32804->tin.msp.redhat.com:5014 (ESTABLISHED)
clustat 3634 root    2u  IPv4   8555             TCP
morph-03.lab.msp.redhat.com:32805->tin.msp.redhat.com:5015 (ESTABLISHED)
clustat 3634 root    3u  IPv6   8558             TCP [::1]:32806->[::1]:40040
(ESTABLISHED)
clustat 3634 root    4u  IPv6   8560             TCP [::1]:32807->[::1]:40042
(ESTABLISHED)
clustat 3634 root    5u  IPv4   8592             TCP
morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)

> See who else is connected to the socket

[root@morph-03 ~]# lsof -i 4tcp:32808
COMMAND  PID USER   FD   TYPE DEVICE SIZE NODE NAME
clustat 3634 root    5u  IPv4   8592       TCP
morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)
[root@morph-03 ~]# lsof -i 4tcp:41966
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
clurgmgrd 2858 root    7u  IPv4   7968       TCP *:41966 (LISTEN)
clustat   3634 root    5u  IPv4   8592       TCP
morph-03.lab.msp.redhat.com:32808->morph-03.lab.msp.redhat.com:41966 (ESTABLISHED)

> No one seems to be, but clurgmgrd is listening.  What is clurgmgrd up to?

[root@morph-03 ~]# strace -p 2858
Process 2858 attached - interrupt to quit
recv(5,  <unfinished ...>
Process 2858 detached

> Stuck in a recv too...

[root@morph-03 ~]# lsof -p 2858
COMMAND    PID USER   FD   TYPE     DEVICE    SIZE    NODE NAME
clurgmgrd 2858 root  cwd    DIR      253,0    4096       2 /
clurgmgrd 2858 root  rtd    DIR      253,0    4096       2 /
clurgmgrd 2858 root  txt    REG      253,0  122872 1785629 /usr/sbin/clurgmgrd
clurgmgrd 2858 root  mem    REG      253,0   45889  473911
/lib/libnss_files-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0   79488  472443 /lib/libresolv-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0  965536 1788673
/usr/lib/libxml2.so.2.6.16
clurgmgrd 2858 root  mem    REG      253,0   31184  472484
/lib/libgcc_s-3.4.5-20051201.so.1
clurgmgrd 2858 root  mem    REG      253,0   59755 1775641 /usr/lib/libgulm.so.1.0.0
clurgmgrd 2858 root  mem    REG      253,0  106397  472354 /lib/ld-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0 1454546  472397 /lib/tls/libc-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0   15324  472399 /lib/libdl-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0  178019  472401 /lib/tls/libm-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0   11940  491004 /lib/magma/magma_gulm.so
clurgmgrd 2858 root  mem    REG      253,0   93940  472400
/lib/tls/libpthread-2.3.4.so
clurgmgrd 2858 root  mem    REG      253,0   63624 1783259 /usr/lib/libz.so.1.2.1.2
clurgmgrd 2858 root  mem    REG      253,0   21280  472396 /lib/libnss_dns-2.3.4.so
clurgmgrd 2858 root    0u   CHR        1,3            1729 /dev/null
clurgmgrd 2858 root    1u   CHR        1,3            1729 /dev/null
clurgmgrd 2858 root    2u   CHR        1,3            1729 /dev/null
clurgmgrd 2858 root    3u  unix 0xf5ec3d00            6670 socket
clurgmgrd 2858 root    4u  IPv6       7963             TCP
[::1]:32798->[::1]:40040 (ESTABLISHED)
clurgmgrd 2858 root    5u  IPv6       7965             TCP
[::1]:32799->[::1]:40042 (ESTABLISHED)
clurgmgrd 2858 root    6u  IPv6       7967             TCP *:41967 (LISTEN)
clurgmgrd 2858 root    7u  IPv4       7968             TCP *:41966 (LISTEN)
clurgmgrd 2858 root    8u  IPv6       7979             TCP *:41969 (LISTEN)
clurgmgrd 2858 root    9u  IPv4       7980             TCP *:41968 (LISTEN)
[root@morph-03 ~]# lsof -i 6:32799
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
lock_gulm 2339 root   10u  IPv6   7966       TCP [::1]:40042->[::1]:32799
(ESTABLISHED)
clurgmgrd 2858 root    5u  IPv6   7965       TCP [::1]:32799->[::1]:40042
(ESTABLISHED)

> ... talking with lock_gulm ...

[root@morph-03 ~]# strace -p 2339
Process 2339 attached - interrupt to quit
gettimeofday({1136930470, 970353}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN},
{fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11,
events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930471, 970910}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN},
{fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11,
events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930472, 971728}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN},
{fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11,
events=POLLIN}], 7, 1000) = 0
gettimeofday({1136930473, 972600}, NULL) = 0
poll( <unfinished ...>
Process 2339 detached
[root@morph-03 ~]# lsof -p 2339
COMMAND    PID USER   FD   TYPE     DEVICE    SIZE    NODE NAME
lock_gulm 2339 root  cwd    DIR      253,0    4096       2 /
lock_gulm 2339 root  rtd    DIR      253,0    4096       2 /
lock_gulm 2339 root  txt    REG      253,0  205708  521381 /sbin/lock_gulmd
lock_gulm 2339 root  mem    REG      253,0   79488  472443 /lib/libresolv-2.3.4.so
lock_gulm 2339 root  mem    REG      253,0   21280  472396 /lib/libnss_dns-2.3.4.so
lock_gulm 2339 root  mem    REG      253,0  106397  472354 /lib/ld-2.3.4.so
lock_gulm 2339 root  mem    REG      253,0 1454546  472397 /lib/tls/libc-2.3.4.so
lock_gulm 2339 root  mem    REG      253,0   45889  473911
/lib/libnss_files-2.3.4.so
lock_gulm 2339 root    0u   CHR        1,3            1729 /dev/null
lock_gulm 2339 root    1u   CHR        1,3            1729 /dev/null
lock_gulm 2339 root    2u   CHR        1,3            1729 /dev/null
lock_gulm 2339 root    3u  unix 0xf64ba800            5299 socket
lock_gulm 2339 root    4wW  REG      253,0       5 1693977
/var/run/sistina/lock_gulmd_LTPX.pid
lock_gulm 2339 root    5u  IPv6       5301             TCP *:40042 (LISTEN)
lock_gulm 2339 root    6u  IPv6       5302             TCP
[::1]:32778->[::1]:40040 (ESTABLISHED)
lock_gulm 2339 root    7u  IPv6       5307             TCP
[::1]:40042->[::1]:32780 (ESTABLISHED)
lock_gulm 2339 root    8u  IPv6       5347             TCP
morph-03.lab.msp.redhat.com:32786->morph-05.lab.msp.redhat.com:41040 (ESTABLISHED)
lock_gulm 2339 root    9u  IPv6       5523             TCP
[::1]:40042->[::1]:32793 (ESTABLISHED)
lock_gulm 2339 root   10u  IPv6       7966             TCP
[::1]:40042->[::1]:32799 (ESTABLISHED)
lock_gulm 2339 root   11u  IPv6       8561             TCP
[::1]:40042->[::1]:32807 (ESTABLISHED)


Comment 3 Nate Straz 2006-01-11 21:47:51 UTC
I hit it again and was able to dump the lock table by sending SIGUSR1 to all
lock_gulmd processes.  (You might have to increase verbosity with gulm_tool setverb.)
I found one lock where people were waiting.

key            : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
LVBlen         : 0
LVB            :
HolderCount    : 0
Holders        :
LVBHolderCount : 0
LVBHolders     :
ExpiredCount   : 1
ExpiredHolders :
 - name   : morph-05
   subid  : 2842
   state  : gio_lck_st_Exclusive
   start  : 0
   stop   : 18446744073709551615
   flags  : 0x32
   idx    : 2
reply_waiter   :
Waiters        :
 - key        : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name       : morph-04
   subid      : 2843
   state      : gio_lck_st_Exclusive
   flags      : 0x32
   start      : 0
   stop       : 18446744073709551615
   LVB        : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx        : 5
 - key        : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name       : morph-03
   subid      : 2854
   state      : gio_lck_st_Exclusive
   flags      : 0x32
   start      : 0
   stop       : 18446744073709551615
   LVB        : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx        : 4
 - key        : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name       : morph-01
   subid      : 2846
   state      : gio_lck_st_Exclusive
   flags      : 0x32
   start      : 0
   stop       : 18446744073709551615
   LVB        : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx        : 2
 - key        : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name       : morph-02
   subid      : 2901
   state      : gio_lck_st_Exclusive
   flags      : 0x32
   start      : 0
   stop       : 18446744073709551615
   LVB        : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx        : 6
 - key        : 'dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI='
   name       : morph-05
   subid      : 2847
   state      : gio_lck_st_Exclusive
   flags      : 0x32
   start      : 0
   stop       : 18446744073709551615
   LVB        : 'dXNybTo6Z3VsbQ=='
   Slave_rply : 0x0
   Slave_sent : 0x0
   idx        : 8
High_Waiters   :
Action_Waiters :
State_Waiters  :
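
The lock keys in the dump are base64.  Decoding them (a quick sketch below,
assuming OpenSSL's EVP_DecodeBlock is available) shows this is the same
resource-group lock that rg_lock() is waiting on in the clurgmgrd backtrace:

#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *key = "dXNybXVzcm06OnJnPSIxMC4xNS44NC44OSI=";
    unsigned char out[128] = { 0 };

    /* EVP_DecodeBlock() returns the decoded length (padded with NUL
     * bytes) or -1 on malformed input. */
    if (EVP_DecodeBlock(out, (const unsigned char *)key,
                        (int)strlen(key)) < 0) {
        fprintf(stderr, "bad base64\n");
        return 1;
    }
    /* Prints: usrmusrm::rg="10.15.84.89" -- the rg lock name seen in
     * the rg_lock() frame of the clurgmgrd backtrace. */
    printf("%s\n", out);
    return 0;
}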


Comment 4 Lon Hohberger 2006-01-11 21:56:51 UTC
So, on morph-05, the lock had somehow expired, which sent everything else
into a lock wait state.  It looks like morph-05 was also trying to acquire
the lock.



Comment 5 Nate Straz 2006-01-11 22:05:47 UTC
Here is the derringer output that goes with this lock dump:

Iteration 4 started at Wed Jan 11 15:39:55 CST 2006
        Gulm Status
        ===========
        morph-01: Slave
        morph-02: Client
        morph-03: Slave
        morph-04: Client
        morph-05: Master
For this iteration we're gonna RECOVER MACHINES
Those machines facing the derringer=morph-03 morph-02 morph-01 morph-05
Whoops, can't kill every server... morph-01 has been pardoned
Those NOW facing the derringer=morph-03 morph-02 morph-05
Feeling lucky morph-03? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
Feeling lucky morph-02? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
Feeling lucky morph-05? Well do ya? Go'head make my day...
No heartbeat from remote host
Remote command exited with unknown state
        morph-03... DEAD
        morph-02... DEAD
        morph-05... DEAD


Comment 6 Lon Hohberger 2006-01-11 22:10:37 UTC
The lock masters are inquorate (2 are dead out of 3).  Fencing will not occur;
the whole cluster hangs at this point.


Comment 7 Lon Hohberger 2006-01-11 22:19:05 UTC
I have a fix which will make clustat call select() before doing the recv() in
msg_peek(), but that's not going to solve it in all cases :)
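For illustration, a minimal sketch of that kind of fix (a hypothetical helper,
not the actual msg_peek() from the rgmanager sources): wait in select() with a
timeout first, so the MSG_PEEK recv() is only issued once data (or EOF) is
actually pending and a silent peer can't block the caller forever.

#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <errno.h>

ssize_t msg_peek_timeout(int fd, void *buf, size_t len, int timeout_secs)
{
    fd_set rfds;
    struct timeval tv = { .tv_sec = timeout_secs, .tv_usec = 0 };
    int ret;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    ret = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ret == 0) {
        errno = ETIMEDOUT;      /* peer sent nothing in time */
        return -1;
    }
    if (ret < 0)
        return -1;              /* select() itself failed */

    /* Data or EOF is pending, so this recv() will not block. */
    return recv(fd, buf, len, MSG_PEEK);
}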

Comment 8 Nate Straz 2006-01-11 23:12:23 UTC
(In reply to comment #6)
> The lock masters are inquorate (2 are dead out of 3).  Fencing will not occur;
> the whole cluster hangs at this point.

Shouldn't this clear up after morph-03 and morph-05 come back up?  I'm
still stuck in this state after the cluster regains quorum.


Comment 9 Lon Hohberger 2006-01-12 15:14:33 UTC
Yes, it should.  If we're stuck in recv() inside libgulm, I'm not sure how to
fix it.  I'll have to look into it further.


Comment 10 Nate Straz 2006-02-20 17:49:56 UTC
I hit this again during regression testing.

rgmanager-1.9.46-0
gulm-1.0.6-0


Comment 11 Lon Hohberger 2006-02-20 18:09:45 UTC
Ok, looking at the libgulm code, when the gulm master cluster loses or regains
quorum, it appears that clients using libgulm are never notified of either
event, and the socket is not closed.

This is why clurgmgrd hangs in that case: there's no notification of either
loss of quorum or regain, and libgulm doesn't seem to retry.  It just blocks in
recv() forever.
Comment 12 dex chen 2006-02-21 18:35:05 UTC
We ran into the same problem.  When can we expect the fix?

Comment 13 Lon Hohberger 2006-02-22 18:35:51 UTC
This is actually a problem with gulm not notifying clients correctly after a
loss of quorum, as far as I can tell.

Comment 17 Chris Feist 2006-11-14 00:00:41 UTC
We're not currently planning on fixing this issue because it requires a large
number of code changes.  If it becomes a serious issue, we will re-evaluate.

