Description of problem:
Roughly half the time I reboot my RHEL5/FC6 cluster, a node gets fenced after the cman init script completes successfully. group_tool -v shows a node stuck trying to join the fence domain.

Version-Release number of selected component (if applicable):
Current development tree for Cluster Suite

How reproducible:
Recreates 50% of the time.

Steps to Reproduce:
1. Reboot all nodes in a 3-node cluster
2. On all nodes, do: service cman start
3. Use group_tool -v to see if they're all there.

Actual results:
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 1 100010001 0

Expected results:
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 none
[1 2 3]
dlm              1     clvmd    00020001 none
[1 2 3]

Additional info:
Below is the output from all three nodes showing the services starting, followed by the output of cman_tool nodes, cman_tool status, group_tool -v, and group_tool dump fence. The group_tool dump hung on the node I suspect was having the problem, so I attached gdb to fenced and took a backtrace.
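Step 3 can be checked mechanically. The following is only a sketch: stuck_groups is a hypothetical helper, not part of any cluster tool, and the column layout (type, level, name, id, state) is assumed from the group_tool -v samples in this report. It treats any group whose state is not "none" as still in transition.

```python
# Hypothetical helper: flag groups that have not settled in `group_tool -v` output.
# The column layout is an assumption taken from the samples in this report.

def stuck_groups(output):
    """Return (type, name, state) for every group whose state is not 'none'."""
    stuck = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and member lists such as "[1 2 3]".
        if len(fields) < 5 or fields[0] == "type" or fields[0].startswith("["):
            continue
        group_type, _level, name, _group_id, state = fields[:5]
        if state != "none":
            stuck.append((group_type, name, state))
    return stuck


# Sample text copied from the "Actual results" and "Expected results" sections.
actual = """\
type level name id state node id local_done
fence 0 default 00010001 JOIN_START_WAIT 1 100010001 0
"""

expected = """\
type level name id state node id local_done
fence 0 default 00010001 none
[1 2 3]
dlm 1 clvmd 00020001 none
[1 2 3]
"""

print(stuck_groups(actual))    # the fence group is still joining
print(stuck_groups(expected))  # nothing is stuck
```

On a live node the same function could be fed the captured output of group_tool -v instead of the pasted samples.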
--------------------------------------------------------------
[root@merit ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@merit ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 2 200020001 1
[1 2]
[root@merit ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    940   2006-07-28 12:41:15  camel
   2   M    932   2006-07-28 12:41:15  merit
   3   M    944   2006-07-28 12:41:17  winston
[root@merit ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: merit
Node ID: 2
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.54
[root@merit ~]# group_tool dump fence
1154104877 our_nodeid 2 our_name merit
1154104877 listen 1 member 2 groupd 4
1154104878 client 3: join default
1154104878 delay post_join 1800s post_fail 0s
1154104878 added 3 nodes from ccs
1154104959 client 3: dump
--------------------------------------------------------------
[root@winston ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@winston ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00000000 JOIN_STOP_WAIT 3 300030001 1
[1 2 3]
[root@winston ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    944   2006-07-28 12:40:38  camel
   2   M    944   2006-07-28 12:40:38  merit
   3   M    932   2006-07-28 12:40:38  winston
[root@winston ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: winston
Node ID: 3
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.53
[root@winston ~]# group_tool dump fence
1154104840 our_nodeid 3 our_name winston
1154104840 listen 1 member 2 groupd 4
1154104841 client 3: join default
1154104841 delay post_join 1800s post_fail 0s
1154104841 added 3 nodes from ccs
1154104914 client 3: dump
--------------------------------------------------------------
[root@camel ~]# service cman start
Starting cluster:                                          [  OK  ]
[root@camel ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 JOIN_START_WAIT 1 100010001 0
[1]
[root@camel ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    936   2006-07-28 12:40:48  camel
   2   M    940   2006-07-28 12:40:50  merit
   3   M    944   2006-07-28 12:40:52  winston
[root@camel ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 944
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: camel
Node ID: 1
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.52
[root@camel ~]# group_tool dump fence
(hung, so I used ctrl-c to break out of it)
[root@camel ~]# ps ax | grep fenced
 2215 ?        Ss     0:00 /sbin/fenced
 2245 pts/0    S+     0:00 grep fenced
[root@camel ~]# cd /home/devel/cluster/fence/fenced/
[root@camel ../cluster/fence/fenced]# gdb ./fenced 2215
GNU gdb Red Hat Linux (6.3.0.0-1.131.FC6rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".

Attaching to program: /home/devel/cluster/fence/fenced/fenced, process 2215
Failed to read a valid object file image from memory.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
0xb7fcc410 in ?? ()
(gdb) bt
#0  0xb7fcc410 in ?? ()
#1  0xbf8d1548 in ?? ()
#2  0x4e3a7ff4 in ?? () from /lib/libc.so.6
#3  0xbf8d1534 in ?? ()
#4  0x4e303550 in __nanosleep_nocancel () from /lib/libc.so.6
#5  0x4e30339b in sleep () from /lib/libc.so.6
#6  0x0804b59e in delay_fencing (fd=0x8167090, start_type=2) at recover.c:244
#7  0x0804b7af in fence_victims (fd=0x8167090, start_type=2) at recover.c:295
#8  0x0804bd9f in do_recovery (fd=0x8167090, start_type=2, member_count=1, nodeids=0x80534a0) at recover.c:398
#9  0x0804c190 in process_groupd () at group.c:129
#10 0x0804a89a in loop () at main.c:431
#11 0x0804af74 in main (argc=0, argv=0x346556a7) at main.c:599
[root@camel ../cluster/fence/fenced]# group_tool dump fence
1154104850 our_nodeid 1 our_name camel
1154104850 listen 1 member 2 groupd 4
1154104851 client 3: join default
1154104851 delay post_join 1800s post_fail 0s
1154104851 added 3 nodes from ccs
1154104851 setid default 65537
1154104851 start default 1 members 1
1154104851 do_recovery stop 0 start 1 finish 0
1154104851 node "winston" not a cman member, cn 1
1154104851 add first victim winston
1154104851 node "merit" not in groupd cpg
1154104851 add first victim merit
1154104852 node "merit" not in groupd cpg
1154104852 node "winston" not a cman member, cn 1
1154104853 reduce victim merit
1154104853 node "winston" not a cman member, cn 1
(last line repeats until the post_join_delay time passes)
1154106653 node "winston" not a cman member, cn 1
1154106653 delay of 1802s leaves 1 victims
1154106653 node "winston" not a cman member, cn 1
1154106653 fencing node winston
1154106664 finish default 1
1154106664 stop default
1154106664 start default 2 members 2 1
1154106664 do_recovery stop 1 start 2 finish 1
1154106664 finish default 2
1154107337 client 3: dump
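The timestamps in camel's dump can be used to cross-check how long fenced waited before fencing winston. This is a minimal sketch that assumes every dump line has the form "<epoch seconds> <message>", as in the output above; parse and seconds_between are hypothetical helper names, and only four lines from the dump are reproduced.

```python
# Sketch: measure how long fenced waited before fencing, using dump timestamps.
# Assumes each line is "<epoch seconds> <message>"; helper names are hypothetical.

def parse(dump):
    """Turn dump text into a list of (timestamp, message) tuples."""
    events = []
    for line in dump.splitlines():
        stamp, _, message = line.strip().partition(" ")
        if stamp.isdigit():  # ignore notes such as "(last line repeats ...)"
            events.append((int(stamp), message))
    return events


def seconds_between(events, first_prefix, second_prefix):
    """Seconds from the first message starting with first_prefix to the
    first message starting with second_prefix."""
    start = next(t for t, m in events if m.startswith(first_prefix))
    end = next(t for t, m in events if m.startswith(second_prefix))
    return end - start


# Four lines copied from camel's dump.
dump = """\
1154104851 delay post_join 1800s post_fail 0s
1154104851 add first victim winston
1154106653 delay of 1802s leaves 1 victims
1154106653 fencing node winston
"""

events = parse(dump)
waited = seconds_between(events, "add first victim", "delay of")
print(waited)  # 1802
```

The computed 1802 seconds agrees with the "delay of 1802s" message in the dump and with the configured post_join delay of 1800s.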
This problem went away with the RHEL5/FC6 changes to the group daemons.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.