Bug 200566 - A node silently fails to join fence group and is eventually fenced.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE

Reported: 2006-07-28 13:55 EDT by Robert Peterson
Modified: 2009-04-16 18:49 EDT
Fixed In Version: 5
Doc Type: Bug Fix
Last Closed: 2006-09-20 11:51:13 EDT


Description Robert Peterson 2006-07-28 13:55:51 EDT
Description of problem:
About half the time I reboot my RHEL5/FC6 cluster, one of my nodes
gets fenced after the cman init script completes successfully.
group_tool -v shows a node stuck trying to join the fence domain.

Version-Release number of selected component (if applicable):
Current development tree for Cluster Suite

How reproducible:
Reproduces about 50% of the time.

Steps to Reproduce:
1. Reboot all nodes in a 3-node cluster
2. On all nodes, do: service cman start
3. Use group_tool -v to see if they're all there.
  
Actual results:
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 1 100010001 0

Expected results:
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 none
[1 2 3]
dlm              1     clvmd    00020001 none
[1 2 3]
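
The difference between the actual and expected results can be checked mechanically. The following is a minimal Python sketch, not part of the cluster tools: the function names parse_group_tool and stuck_groups are hypothetical, and the column layout is assumed from the output pasted in this report. It treats any group whose state is not "none" as still in transition.

```python
# Hypothetical helper: the column layout is assumed from the group_tool -v
# output pasted in this bug, not taken from the group_tool source.

ACTUAL = """\
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 1 100010001 0
"""

EXPECTED = """\
type             level name     id       state node id local_done
fence            0     default  00010001 none
[1 2 3]
dlm              1     clvmd    00020001 none
[1 2 3]
"""


def parse_group_tool(text):
    """Parse pasted `group_tool -v` output into a list of group records."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the column header, and shell prompts.
        if not line or line.startswith("type") or line.startswith("[root@"):
            continue
        if line.startswith("["):
            # A member list such as "[1 2 3]" belongs to the preceding group.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        groups.append({
            "type": fields[0],
            "level": int(fields[1]),
            "name": fields[2],
            "id": fields[3],
            "state": fields[4],
            "members": [],
        })
    return groups


def stuck_groups(text):
    """Return the groups whose state is anything other than 'none'."""
    return [g for g in parse_group_tool(text) if g["state"] != "none"]


if __name__ == "__main__":
    for g in stuck_groups(ACTUAL):
        print(g["type"], g["name"], "is in state", g["state"])
```

Feeding it the output captured on each node after "service cman start" gives a quick yes/no on whether the fence group finished forming.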

Additional info:
Below is output from all three nodes showing the services starting,
followed by output from cman_tool nodes, cman_tool status,
group_tool -v, and group_tool dump fence.  The group_tool dump
hung on the node I suspected of having the problem (camel), so I
attached gdb to fenced and took a backtrace.

--------------------------------------------------------------
[root@merit ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@merit ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 2 200020001 1
[1 2]
[root@merit ~]#  cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    940   2006-07-28 12:41:15  camel
   2   M    932   2006-07-28 12:41:15  merit
   3   M    944   2006-07-28 12:41:17  winston
[root@merit ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: merit
Node ID: 2
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.54
[root@merit ~]# group_tool dump fence
1154104877 our_nodeid 2 our_name merit
1154104877 listen 1 member 2 groupd 4
1154104878 client 3: join default
1154104878 delay post_join 1800s post_fail 0s
1154104878 added 3 nodes from ccs
1154104959 client 3: dump
--------------------------------------------------------------

[root@winston ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@winston ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 3 300030001 1
[1 2 3]
[root@winston ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    944   2006-07-28 12:40:38  camel
   2   M    944   2006-07-28 12:40:38  merit
   3   M    932   2006-07-28 12:40:38  winston
[root@winston ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: winston
Node ID: 3
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.53
[root@winston ~]# group_tool dump fence
1154104840 our_nodeid 3 our_name winston
1154104840 listen 1 member 2 groupd 4
1154104841 client 3: join default
1154104841 delay post_join 1800s post_fail 0s
1154104841 added 3 nodes from ccs
1154104914 client 3: dump
--------------------------------------------------------------

[root@camel ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 1 100010001 0
[1]

[root@camel ~]#  cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    936   2006-07-28 12:40:48  camel
   2   M    940   2006-07-28 12:40:50  merit
   3   M    944   2006-07-28 12:40:52  winston
[root@camel ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: camel
Node ID: 1
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.52
[root@camel ~]# group_tool dump fence
(hung, so I used ctrl-c to break out of it)
[root@camel ~]# ps ax | grep fenced
 2215 ?        Ss     0:00 /sbin/fenced
 2245 pts/0    S+     0:00 grep fenced
[root@camel ~]# cd /home/devel/cluster/fence/fenced/
[root@camel ../cluster/fence/fenced]# gdb ./fenced 2215
GNU gdb Red Hat Linux (6.3.0.0-1.131.FC6rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db
library "/lib/libthread_db.so.1".

Attaching to program: /home/devel/cluster/fence/fenced/fenced, process 2215
Failed to read a valid object file image from memory.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
0xb7fcc410 in ?? ()
(gdb) bt
#0  0xb7fcc410 in ?? ()
#1  0xbf8d1548 in ?? ()
#2  0x4e3a7ff4 in ?? () from /lib/libc.so.6
#3  0xbf8d1534 in ?? ()
#4  0x4e303550 in __nanosleep_nocancel () from /lib/libc.so.6
#5  0x4e30339b in sleep () from /lib/libc.so.6
#6  0x0804b59e in delay_fencing (fd=0x8167090, start_type=2) at recover.c:244
#7  0x0804b7af in fence_victims (fd=0x8167090, start_type=2) at recover.c:295
#8  0x0804bd9f in do_recovery (fd=0x8167090, start_type=2, member_count=1,
nodeids=0x80534a0) at recover.c:398
#9  0x0804c190 in process_groupd () at group.c:129
#10 0x0804a89a in loop () at main.c:431
#11 0x0804af74 in main (argc=0, argv=0x346556a7) at main.c:599
Comment 1 Robert Peterson 2006-07-28 14:15:22 EDT
[root@camel ../cluster/fence/fenced]# group_tool dump fence
1154104850 our_nodeid 1 our_name camel
1154104850 listen 1 member 2 groupd 4
1154104851 client 3: join default
1154104851 delay post_join 1800s post_fail 0s
1154104851 added 3 nodes from ccs
1154104851 setid default 65537
1154104851 start default 1 members 1
1154104851 do_recovery stop 0 start 1 finish 0
1154104851 node "winston" not a cman member, cn 1
1154104851 add first victim winston
1154104851 node "merit" not in groupd cpg
1154104851 add first victim merit
1154104852 node "merit" not in groupd cpg
1154104852 node "winston" not a cman member, cn 1
1154104853 reduce victim merit
1154104853 node "winston" not a cman member, cn 1
(last line repeats until the post_join_delay time passes)
1154106653 node "winston" not a cman member, cn 1
1154106653 delay of 1802s leaves 1 victims
1154106653 node "winston" not a cman member, cn 1
1154106653 fencing node winston
1154106664 finish default 1
1154106664 stop default
1154106664 start default 2 members 2 1
1154106664 do_recovery stop 1 start 2 finish 1
1154106664 finish default 2
1154107337 client 3: dump
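
The timestamps in the dump above are seconds since the epoch, so the length of the wait can be read straight off them. A minimal Python sketch follows, assuming only the "timestamp message" line layout visible in this dump; the helper name elapsed_between is hypothetical. It measures the time from the first "add first victim" line to the "delay of" line, which should agree with the 1802s that fenced itself reports (post_join delay of 1800s plus loop overhead).

```python
# Hypothetical helper: assumes each dump line is "<epoch seconds> <message>",
# as in the group_tool dump fence output pasted in this bug.

DUMP = """\
1154104851 start default 1 members 1
1154104851 do_recovery stop 0 start 1 finish 0
1154104851 node "winston" not a cman member, cn 1
1154104851 add first victim winston
1154104851 node "merit" not in groupd cpg
1154104851 add first victim merit
1154104853 reduce victim merit
(last line repeats until the post_join_delay time passes)
1154106653 delay of 1802s leaves 1 victims
1154106653 fencing node winston
1154106664 finish default 1
"""


def elapsed_between(dump_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the last
    line containing end_marker, or None if either marker is missing."""
    first = None
    last = None
    for line in dump_text.splitlines():
        parts = line.split(None, 1)
        # Ignore anything that does not begin with a numeric timestamp.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        timestamp, message = int(parts[0]), parts[1]
        if first is None and start_marker in message:
            first = timestamp
        if end_marker in message:
            last = timestamp
    if first is None or last is None:
        return None
    return last - first


if __name__ == "__main__":
    print(elapsed_between(DUMP, "add first victim", "delay of"))  # 1802
    print(elapsed_between(DUMP, "fencing node", "finish default"))  # 11
```

The second figure, 11 seconds, is the time between "fencing node winston" and "finish default 1" in the same dump.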
Comment 2 Robert Peterson 2006-09-20 11:51:13 EDT
This problem went away with the RHEL5/FC6 changes to the group daemons.
Comment 3 Nate Straz 2007-12-13 12:22:08 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.
