Description of problem:
I had a three node cluster (link-01, link-02, link-08) up and running I/O to one gfs filesystem. Revolver then shot link-01 and then brought it back into the cluster. The mount of the gfs filesystem however hung.

mounting /dev/vg/myvol on /mnt/gfs on link-01 [HANG]

[root@link-01 ~]# ps -elf | grep mount
1 S root      1998     1  0  85  0 -  7390 rt_sig 07:55 ?      00:00:00 automount
0 S root      2466  2465  0  80  0 - 17703 wait   07:57 ?      00:00:00 mount -t gfs -o debug /dev/vg/myvol /mnt/gfs
4 S root      2467  2466  0  78  0 -   906 unix_s 07:57 ?      00:00:00 /sbin/mount.gfs /dev/vg/myvol /mnt/gfs -o rw,debug
0 S root      2853  2487  0  78  0 - 15051 pipe_w 09:42 pts/0  00:00:00 grep mount

[root@link-01 ~]# strace -p 2467
Process 2467 attached - interrupt to quit
read(3,  <unfinished ...>
Process 2467 detached

[root@link-01 ~]# strace -p 2466
Process 2466 attached - interrupt to quit
wait4(-1,  <unfinished ...>
Process 2466 detached

[root@link-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 07:56:35  link-01
   2   M    324   2006-08-23 07:56:35  link-02
   8   M    324   2006-08-23 07:56:35  link-08

[root@link-01 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]

[root@link-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 07:57:16  link-01
   2   M    316   2006-08-23 07:45:52  link-02
   8   M    316   2006-08-23 07:45:52  link-08

[root@link-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
dlm              1     gfs      00090001 none
[2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]

[root@link-08 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 08:03:04  link-01
   2   M    316   2006-08-23 07:51:39  link-02
   8   M    280   2006-08-23 06:03:39  link-08

[root@link-08 tmp]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
dlm              1     gfs      00090001 none
[2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]
Version-Release number of selected component (if applicable):
[root@link-01 ~]# uname -ar
Linux link-01 2.6.17-1.2519.4.5.el5 #1 SMP Thu Aug 17 19:10:18 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@link-01 ~]# rpm -q cman
cman-2.0.11-1.fc6
Here are the logs from each of the machines:

LINK-01:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] New Configuration:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Left:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Joined:
Aug 23 07:56:35 link-01 openais[2375]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] New Configuration:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Left:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Joined:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 07:56:35 link-01 openais[2375]: [CMAN ] quorum regained, resuming activity
Aug 23 07:56:35 link-01 openais[2375]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 07:56:35 link-01 openais[2375]: [TOTEM] entering OPERATIONAL state.
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CPG  ] got joinlist message from node 2
Aug 23 07:56:35 link-01 openais[2375]: [CPG  ] got joinlist message from node 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: recover 1
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 2
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 1
Aug 23 08:03:35 link-01 kernel: dlm: Initiating association with node 2
Aug 23 08:03:35 link-01 kernel: dlm: got new/restarted association 1 nodeid 2
Aug 23 08:03:35 link-01 kernel: dlm: Initiating association with node 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: total members 3 error 0
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: dlm_recover_directory
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: dlm_recover_directory 0 entries
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: recover 1 done: 40 ms

LINK-02:
Aug 23 07:57:29 link-02 openais[2545]: [CLM  ] New Configuration:
Aug 23 07:57:27 link-02 xinetd[2194]: START: qarsh pid=2764 from=10.15.80.47
Aug 23 07:57:30 link-02 qarshd[2764]: Running cmdline: cman_tool services
Aug 23 07:57:30 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 07:57:32 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2761 duration=14(sec)
Aug 23 07:57:32 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 07:57:32 link-02 qarshd[2767]: Talking to peer 10.15.80.47:54296
Aug 23 07:57:32 link-02 xinetd[2194]: START: qarsh pid=2767 from=10.15.80.47
Aug 23 07:57:34 link-02 openais[2545]: [CLM  ] Members Left:
Aug 23 07:57:34 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2764 duration=7(sec)
Aug 23 07:57:34 link-02 qarshd[2767]: Running cmdline: cman_tool services
Aug 23 07:57:34 link-02 openais[2545]: [CLM  ] Members Joined:
Aug 23 07:57:36 link-02 openais[2545]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 07:57:36 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2767 duration=4(sec)
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ] New Configuration:
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 07:57:39 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 07:57:40 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] Members Left:
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] Members Joined:
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 07:57:41 link-02 openais[2545]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 07:57:41 link-02 openais[2545]: [TOTEM] entering OPERATIONAL state.
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 07:57:42 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 07:57:42 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:45 link-02 openais[2545]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:45 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:45 link-02 openais[2545]: [CPG  ] got joinlist message from node 2
Aug 23 07:57:46 link-02 openais[2545]: [CPG  ] got joinlist message from node 8

LINK-08:
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] position [2] member 10.15.89.158:
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] previous ring seq 320 rep 10.15.89.152
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] aru 961db high delivered 961d9 received flag 0
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] Did not need to originate any messages in recovery.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] New Configuration:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Left:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Joined:
Aug 23 08:03:04 link-08 openais[2567]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] New Configuration:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.152)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.158)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Left:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Joined:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] r(0) ip(10.15.89.151)
Aug 23 08:03:04 link-08 openais[2567]: [SYNC ] This node is within the primary component and will provide service.
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] entering OPERATIONAL state.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CPG  ] got joinlist message from node 2
Aug 23 08:03:04 link-08 openais[2567]: [CPG  ] got joinlist message from node 8
Aug 23 08:03:05 link-08 openais[2567]: [IPC  ] Library queue is full, disconnecting library connection.
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle 6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle 6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg connection died
Aug 23 08:03:05 link-08 gfs_controld[2595]: cluster is down, exiting
Reproduced this.

[root@link-01 ~]# group_tool -v
type             level name     id       state                 node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
dlm              1     gfs      00020001 none
[1 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 1
[1 2 8]

[root@link-02 ~]# group_tool -v
type             level name     id       state                 node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 1
[1 2 8]

[root@link-08 ~]# group_tool -v
type             level name     id       state                 node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
dlm              1     gfs      00020001 none
[1 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 0
[1 2 8]
Marking this as a beta blocker since I can't get beyond this during x86_64 recovery testing.
Recovery is no longer a part of the beta criteria.
I tried this and got a different error:

[root@bench-18 pcaulfie]# mount -tgfs2 /dev/vg/gfs2 /mnt/gfs
/sbin/mount.gfs2: error: jid is -2, try rw
/sbin/mount.gfs2: error mounting lockproto lock_dlm
[root@bench-18 pcaulfie]# mount -tgfs2 /dev/vg/gfs2 /mnt/gfs
/sbin/mount.gfs2: lock_dlm_join: gfs_controld join error: -17
/sbin/mount.gfs2: error mounting lockproto lock_dlm

The gfs_controld log looks like this:

1156497285 gfs2a send_options len 1296 "rw"
1156497285 gfs2a start_done 3
1156497285 gfs2a receive_options from 18 len 1296 last_cb 2
1156497285 gfs2a receive_journals from 12 to 18 len 356 count 7 cb 2
1156497285 receive nodeid 12 jid 0 opts 1
1156497285 receive nodeid 13 jid 1 opts 1
1156497285 receive nodeid 14 jid 2 opts 1
1156497285 receive nodeid 15 jid 3 opts 1
1156497285 receive nodeid 17 jid 5 opts 1
1156497285 receive nodeid 18 jid -2 opts 1
1156497285 receive nodeid 19 jid 7 opts 1
1156497285 gfs2a start_participant_init_2 our_jid=-2
1156497285 notify_mount_client: error: jid is -2, try rw
1156497285 gfs2a leaving due to mount error: error: jid is -2, try rw
1156497285 groupd callback: finish gfs2a
1156497285 gfs2a set /sys/fs/gfs2/Patrick2:gfs2a/lock_module/block to 0
1156497285 gfs2a set open /sys/fs/gfs2/Patrick2:gfs2a/lock_module/block error -1 2
1156497296 client 6: join /mnt/gfs gfs2 lock_dlm Patrick2:gfs2a rw
1156497296 mount: /mnt/gfs gfs2 lock_dlm Patrick2:gfs2a rw
1156497296 mount: failed -17
1156497296 client 6:
1156497296 client 6 fd 13 read error -1 104
Patrick, the -2 that comes back as the jid is a "magic number" meaning the other nodes don't think this new node should be allowed to mount the fs. We need to see the gfs_controld log from the other nodes to know why they think this new mounter shouldn't be allowed. (One reason for not allowing it is that it's trying to mount readonly in certain cases, but that's not what's happening here.) I've never seen any errors from this section of the code before (which means we're due for one!). I think a different bug should be opened for this.

The big problem in the original bug report is that openais IPC and/or CPG died:

Aug 23 08:03:05 link-08 openais[2567]: [IPC  ] Library queue is full, disconnecting library connection.
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle 6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle 6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg connection died
Aug 23 08:03:05 link-08 gfs_controld[2595]: cluster is down, exiting

I'm guessing that the test was doing a lot of plock operations, which results in a really high volume of cpg messages going through openais.
I've created a new bug report for that: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=204471 All the relevant logs should be attached to it.
As for the IPC message, you could try increasing the size of the IPC queue in exec/main.h:

#define SIZEQUEUE 256
*** Bug 204411 has been marked as a duplicate of this bug. ***
Hit something similar today while also doing recovery testing.

[root@link-08 ~]# group_tool -v
type             level name     id       state                 node id local_done
fence            0     default  00010008 FAIL_ALL_STOPPED 1 100020003 -1
[1 2 8]
dlm              1     clvmd    00020008 FAIL_ALL_STOPPED 1 100020003 -1
[1 2 8]
dlm              1     gfs      00010008 none
[]
gfs              2     gfs      00030008 FAIL_ALL_STOPPED 2 200010003 -1
[2 8]
Another variation... hung attempting to start clvmd on link-01 after recovery:

[root@link-08 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010002 none
[2 8]
dlm              1     clvmd    00020008 none
[2 8]
dlm              1     gfs      00020002 none
[2 8]
gfs              2     gfs      00010002 FAIL_START_WAIT
[2 8]

[root@link-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010002 none
[2 8]
dlm              1     clvmd    00020008 none
[2 8]
dlm              1     gfs      00020002 none
[2 8]
gfs              2     gfs      00010002 FAIL_START_WAIT
[2 8]

[root@link-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    152   2006-09-01 07:27:25  link-01
   2   M    128   2006-09-01 06:10:35  link-02
   8   M    144   2006-09-01 06:19:32  link-08
The warnings about ckpt returning error 12 are serious problems which have already been submitted as another bug and are being addressed. I'm not certain about the rest of the bug, but I suspect Dave will need an operational checkpoint synchronization to further debug.
In assigning the duplicate above, I was declaring that this bug represents the "lib queue is full" error. gfs_controld never blocks the sending or receiving of cpg messages, so that doesn't appear to be a cause of the full queue. I'd like to narrow in on reproducing this full-queue error specifically, to figure out what gfs_controld can do to avoid causing it.

The ckpt error 12 messages are pretty common, but AFAIK they have no relation to gfs_controld exiting or recovery stalling.

The group_tool output in comments 11 and 12 may have nothing to do with the topic of this bug (lib queue full); there is no evidence one way or the other from the information. Each instance of a problem should probably get its own bug entry with an associated collection of all debug logs. After looking through the logs for each instance, we can usually determine whether it's a duplicate of an existing bug or a new one, and usually assign blame to one component or daemon.
I've reproduced this myself. I had 8 nodes running two instances of:

writeread -L fcntl -m 100 a b c d e f g h

I then killed two nodes, recovery happened, and then a while after recovery had completed, the IPC queue full appeared and gfs_controld exited. Here are the syslog messages from the node with the queue full:

Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] The token was lost in the OPERATIONAL state.
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] entering GATHER state.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering GATHER state.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Saving state aru dbd0d high seq received dbd0d
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Storing new sequence id for ring 3af490
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering COMMIT state.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering RECOVERY state.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [0] member 10.15.84.40:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [1] member 10.15.84.41:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [2] member 10.15.84.42:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [3] member 10.15.84.43:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [4] member 10.15.84.44:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [5] member 10.15.84.45:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d received flag 0
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] copying all old ring messages from dbd0e-dbd0d.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Originated 0 messages in RECOVERY.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Originated for recovery:
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Not Originated for recovery:
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] New Configuration:
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.40)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.41)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.42)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.43)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.44)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.45)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Left:
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] no interface found for nodeid
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] no interface found for nodeid
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Joined:
Sep  8 03:33:43 bench-21 openais[4117]: [SYNC ] This node is within the primary component and will provide service.
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] New Configuration:
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.40)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.41)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.42)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.43)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.44)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] r(0) ip(10.15.84.45)
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Left:
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Joined:
Sep  8 03:33:43 bench-21 openais[4117]: [SYNC ] This node is within the primary component and will provide service.
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering OPERATIONAL state.
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.40
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.41
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.42
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.43
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.44
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.45
Sep  8 03:33:43 bench-21 openais[4117]: [CKPT ] message_handler_req_exec_ckpt_synchronize_section
Sep  8 03:33:43 bench-21 openais[4117]: [CKPT ] recovery_section_write returned 12
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 20
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 21
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 22
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 23
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 24
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 25
Sep  8 03:33:44 bench-21 fenced[4199]: fencing deferred to bench-20
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156d
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156c
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156d
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156c
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4059
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4058
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4059
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4058
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd97
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd96
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd97
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16254
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16253
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16254
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1d
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1c
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1d 1fd1e
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288c
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288b
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288c
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288b
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3f
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3e
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3f
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3e
Sep  8 03:35:11 bench-21 openais[4117]: [TOTEM] Retransmit List: 3054c
Sep  8 03:35:11 bench-21 openais[4117]: [TOTEM] Retransmit List: 3054b
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37635
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37634
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37635
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37634
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ae
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ad
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ae
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ad
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4f
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4e
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4f
Sep  8 03:35:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4e
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a18
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a17
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a18
Sep  8 03:35:56 bench-21 openais[4117]: [TOTEM] Retransmit List: 49300
Sep  8 03:35:56 bench-21 openais[4117]: [TOTEM] Retransmit List: 492ff
Sep  8 03:35:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 49300
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cd
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cc
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cd
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fd
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fc
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fd
Sep  8 03:36:22 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fc
Sep  8 03:36:28 bench-21 udevsend[4399]: starting udevd daemon
Sep  8 03:36:28 bench-21 udevsend[4404]: starting udevd daemon
Sep  8 03:36:30 bench-21 udevsend[4399]: unable to connect to event daemon, try to call udev directly
Sep  8 03:36:30 bench-21 udevsend[4404]: unable to connect to event daemon, try to call udev directly
Sep  8 03:36:31 bench-21 openais[4117]: [TOTEM] Retransmit List: 5c688
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6c
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6b
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6f
Sep  8 03:36:53 bench-21 openais[4117]: [TOTEM] Retransmit List: 683a9
Sep  8 03:36:53 bench-21 openais[4117]: [TOTEM] Retransmit List: 683a8
Sep  8 03:36:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 6a2ac
Sep  8 03:36:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 6a2ab 6a2ac 6a2ae
Sep  8 03:37:48 bench-21 openais[4117]: [IPC  ] Library queue is full, disconnecting library connection.
Sep  8 03:37:48 bench-21 gfs_controld: cpg connection died
Sep  8 03:37:48 bench-21 gfs_controld: cluster is down, exiting
Reassigned bug to sdake. Steve is working on flow control for the CPG and EVS services to ensure this type of problem never happens in the field.
Suzanne,

We found an intermittent failure with the fix during heavy integration testing. We should have a new build today.

Thanks
Fixed in openais-0.80.1-3. Changed status to MODIFIED.
No longer seeing this; marking verified in openais-0.80.1-17.el5.
openais-0.80.2-1.el5 included in 20061218.1 trees.