Bug 203805 - gfs_controld gets IPC "library queue is full" from cpg_mcast
Summary: gfs_controld gets IPC "library queue is full" from cpg_mcast
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 204411
Depends On:
Blocks:
 
Reported: 2006-08-23 19:57 UTC by Corey Marthaler
Modified: 2016-04-26 15:16 UTC (History)
CC List: 4 users

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-12-19 14:20:33 UTC
Target Upstream Version:
Embargoed:



Description Corey Marthaler 2006-08-23 19:57:54 UTC
Description of problem:
I had a three-node cluster (link-01, link-02, link-08) up and running I/O to one
GFS filesystem. Revolver then shot link-01 and brought it back into the
cluster. The mount of the GFS filesystem on link-01, however, hung.

mounting /dev/vg/myvol on /mnt/gfs on link-01
[HANG]


[root@link-01 ~]# ps -elf | grep mount
1 S root      1998     1  0  85   0 -  7390 rt_sig 07:55 ?        00:00:00 automount
0 S root      2466  2465  0  80   0 - 17703 wait   07:57 ?        00:00:00 mount
-t gfs -o debug /dev/vg/myvol /mnt/gfs
4 S root      2467  2466  0  78   0 -   906 unix_s 07:57 ?        00:00:00
/sbin/mount.gfs /dev/vg/myvol /mnt/gfs -o rw,debug
0 S root      2853  2487  0  78   0 - 15051 pipe_w 09:42 pts/0    00:00:00 grep
mount
[root@link-01 ~]# strace -p 2467
Process 2467 attached - interrupt to quit
read(3,  <unfinished ...>
Process 2467 detached
[root@link-01 ~]# strace -p 2466
Process 2466 attached - interrupt to quit
wait4(-1,  <unfinished ...>
Process 2466 detached
 

[root@link-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 07:56:35  link-01
   2   M    324   2006-08-23 07:56:35  link-02
   8   M    324   2006-08-23 07:56:35  link-08
[root@link-01 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]


[root@link-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 07:57:16  link-01
   2   M    316   2006-08-23 07:45:52  link-02
   8   M    316   2006-08-23 07:45:52  link-08
[root@link-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
dlm              1     gfs      00090001 none
[2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]


[root@link-08 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    324   2006-08-23 08:03:04  link-01
   2   M    316   2006-08-23 07:51:39  link-02
   8   M    280   2006-08-23 06:03:39  link-08
[root@link-08 tmp]# cman_tool services
type             level name     id       state
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00070001 none
[1 2 8]
dlm              1     gfs      00090001 none
[2 8]
gfs              2     gfs      00080001 JOIN_STOP_WAIT
[1 2 8]



Version-Release number of selected component (if applicable):
[root@link-01 ~]# uname -ar
Linux link-01 2.6.17-1.2519.4.5.el5 #1 SMP Thu Aug 17 19:10:18 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
[root@link-01 ~]# rpm -q cman
cman-2.0.11-1.fc6

Comment 1 Corey Marthaler 2006-08-23 20:04:01 UTC
Here are the logs from each of the machines:


LINK-01:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] New Configuration:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Left:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Joined:
Aug 23 07:56:35 link-01 openais[2375]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] New Configuration:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Left:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] Members Joined:
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 07:56:35 link-01 openais[2375]: [CMAN ] quorum regained, resuming activity
Aug 23 07:56:35 link-01 openais[2375]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 07:56:35 link-01 openais[2375]: [TOTEM] entering OPERATIONAL state.
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 07:56:35 link-01 openais[2375]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:56:35 link-01 openais[2375]: [CKPT ] recovery_section_write returned 12
Aug 23 07:56:35 link-01 openais[2375]: [CPG  ] got joinlist message from node 2
Aug 23 07:56:35 link-01 openais[2375]: [CPG  ] got joinlist message from node 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: recover 1
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 2
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: add member 1
Aug 23 08:03:35 link-01 kernel: dlm: Initiating association with node 2
Aug 23 08:03:35 link-01 kernel: dlm: got new/restarted association 1 nodeid 2
Aug 23 08:03:35 link-01 kernel: dlm: Initiating association with node 8
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: total members 3 error 0
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: dlm_recover_directory
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: dlm_recover_directory 0 entries
Aug 23 08:03:35 link-01 kernel: dlm: clvmd: recover 1 done: 40 ms



LINK-02:
Aug 23 07:57:29 link-02 openais[2545]: [CLM  ] New Configuration:
Aug 23 07:57:27 link-02 xinetd[2194]: START: qarsh pid=2764 from=10.15.80.47
Aug 23 07:57:30 link-02 qarshd[2764]: Running cmdline: cman_tool services
Aug 23 07:57:30 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 07:57:32 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2761 duration=14(sec)
Aug 23 07:57:32 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 07:57:32 link-02 qarshd[2767]: Talking to peer 10.15.80.47:54296
Aug 23 07:57:32 link-02 xinetd[2194]: START: qarsh pid=2767 from=10.15.80.47
Aug 23 07:57:34 link-02 openais[2545]: [CLM  ] Members Left:
Aug 23 07:57:34 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2764 duration=7(sec)
Aug 23 07:57:34 link-02 qarshd[2767]: Running cmdline: cman_tool services
Aug 23 07:57:34 link-02 openais[2545]: [CLM  ] Members Joined:
Aug 23 07:57:36 link-02 openais[2545]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 07:57:36 link-02 xinetd[2194]: EXIT: qarsh status=0 pid=2767 duration=4(sec)
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ] New Configuration:
Aug 23 07:57:37 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 07:57:39 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 07:57:40 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] Members Left:
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] Members Joined:
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 07:57:41 link-02 openais[2545]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 07:57:41 link-02 openais[2545]: [TOTEM] entering OPERATIONAL state.
Aug 23 07:57:41 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 07:57:42 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 07:57:42 link-02 openais[2545]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:43 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:45 link-02 openais[2545]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 07:57:45 link-02 openais[2545]: [CKPT ] recovery_section_write returned 12
Aug 23 07:57:45 link-02 openais[2545]: [CPG  ] got joinlist message from node 2
Aug 23 07:57:46 link-02 openais[2545]: [CPG  ] got joinlist message from node 8


LINK-08:
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] position [2] member 10.15.89.158:
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] previous ring seq 320 rep
10.15.89.152
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] aru 961db high delivered 961d9
received flag 0
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] Did not need to originate any
messages in recovery.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] New Configuration:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Left:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Joined:
Aug 23 08:03:04 link-08 openais[2567]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] New Configuration:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.152)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.158)
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Left:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] Members Joined:
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ]  r(0) ip(10.15.89.151)
Aug 23 08:03:04 link-08 openais[2567]: [SYNC ] This node is within the primary
component and will provide service.
Aug 23 08:03:04 link-08 openais[2567]: [TOTEM] entering OPERATIONAL state.
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.151
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.152
Aug 23 08:03:04 link-08 openais[2567]: [CLM  ] got nodejoin message 10.15.89.158
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section
Aug 23 08:03:04 link-08 openais[2567]: [CKPT ] recovery_section_write returned 12
Aug 23 08:03:04 link-08 openais[2567]: [CPG  ] got joinlist message from node 2
Aug 23 08:03:04 link-08 openais[2567]: [CPG  ] got joinlist message from node 8
Aug 23 08:03:05 link-08 openais[2567]: [IPC  ] Library queue is full,
disconnecting library connection.
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle
6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle
6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg connection died
Aug 23 08:03:05 link-08 gfs_controld[2595]: cluster is down, exiting


Comment 2 Corey Marthaler 2006-08-23 20:35:52 UTC
Reproduced this.

[root@link-01 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
dlm              1     gfs      00020001 none
[1 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 1
[1 2 8]


[root@link-02 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 1
[1 2 8]


[root@link-08 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010008 none
[1 2 8]
dlm              1     clvmd    00020008 none
[1 2 8]
dlm              1     gfs      00020001 none
[1 8]
gfs              2     gfs      00010001 JOIN_STOP_WAIT 2 200030001 0
[1 2 8]




Comment 3 Corey Marthaler 2006-08-23 20:42:32 UTC
Marking this as a beta blocker since I can't get beyond this during x86_64
recovery testing.

Comment 4 Corey Marthaler 2006-08-23 21:08:43 UTC
Recovery is no longer a part of the beta criteria.

Comment 6 Christine Caulfield 2006-08-25 09:12:33 UTC
I tried this and got a different error:
[root@bench-18 pcaulfie]# mount -tgfs2 /dev/vg/gfs2 /mnt/gfs
/sbin/mount.gfs2: error: jid is -2, try rw
/sbin/mount.gfs2: error mounting lockproto lock_dlm
[root@bench-18 pcaulfie]# mount -tgfs2 /dev/vg/gfs2 /mnt/gfs
/sbin/mount.gfs2: lock_dlm_join: gfs_controld join error: -17
/sbin/mount.gfs2: error mounting lockproto lock_dlm


The gfs_controld log looks like this:
1156497285 gfs2a send_options len 1296 "rw"
1156497285 gfs2a start_done 3
1156497285 gfs2a receive_options from 18 len 1296 last_cb 2
1156497285 gfs2a receive_journals from 12 to 18 len 356 count 7 cb 2
1156497285 receive nodeid 12 jid 0 opts 1
1156497285 receive nodeid 13 jid 1 opts 1
1156497285 receive nodeid 14 jid 2 opts 1
1156497285 receive nodeid 15 jid 3 opts 1
1156497285 receive nodeid 17 jid 5 opts 1
1156497285 receive nodeid 18 jid -2 opts 1
1156497285 receive nodeid 19 jid 7 opts 1
1156497285 gfs2a start_participant_init_2 our_jid=-2
1156497285 notify_mount_client: error: jid is -2, try rw
1156497285 gfs2a leaving due to mount error: error: jid is -2, try rw
1156497285 groupd callback: finish gfs2a
1156497285 gfs2a set /sys/fs/gfs2/Patrick2:gfs2a/lock_module/block to 0
1156497285 gfs2a set open /sys/fs/gfs2/Patrick2:gfs2a/lock_module/block error -1
 2
1156497296 client 6: join /mnt/gfs gfs2 lock_dlm Patrick2:gfs2a rw
1156497296 mount: /mnt/gfs gfs2 lock_dlm Patrick2:gfs2a rw
1156497296 mount: failed -17
1156497296 client 6: 
1156497296 client 6 fd 13 read error -1 104



Comment 7 David Teigland 2006-08-25 15:09:44 UTC
Patrick, the -2 that comes back as the jid is a "magic number"
that means the other nodes don't think this new node should be
allowed to mount the fs.  We need to see the gfs_controld logs
from the other nodes to know why they think this new mounter
shouldn't be allowed.  (One reason for not allowing it is that
it's trying to mount readonly in certain cases, but that's
not what's happening here.)  I've never seen any errors from this
section of the code before (which means we're due for one!).
I think a different bug should be opened for this.

The big problem in the original bug report is that openais
IPC and/or CPG died:

Aug 23 08:03:05 link-08 openais[2567]: [IPC  ] Library queue is full,
disconnecting library connection.
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle
6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg_mcast_joined error 2 handle
6b8b456700000000
Aug 23 08:03:05 link-08 gfs_controld[2595]: send plock error -1
Aug 23 08:03:05 link-08 gfs_controld[2595]: cpg connection died
Aug 23 08:03:05 link-08 gfs_controld[2595]: cluster is down, exiting

I'm guessing that the test was doing a lot of plock operations, which
results in a really high volume of cpg messages going through openais.
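
For illustration only, this is the kind of plock-heavy loop such a test presumably
runs (the file name and iteration count are made up).  On GFS each fcntl() POSIX
lock request is forwarded by gfs_controld as a plock message over cpg, so a tight
lock/unlock loop like this generates a steady stream of cpg_mcast_joined() traffic
on every node:

/* plock_storm.c -- hypothetical load generator approximating the test's
 * fcntl activity; each F_SETLKW/F_SETLK on a GFS file becomes a plock
 * message that gfs_controld multicasts via cpg. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    int fd = open("/mnt/gfs/lockfile", O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (int i = 0; i < 100000; i++) {
        fl.l_type = F_WRLCK;
        if (fcntl(fd, F_SETLKW, &fl) < 0)   /* lock   -> plock/cpg message */
            perror("lock");
        fl.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &fl) < 0)    /* unlock -> plock/cpg message */
            perror("unlock");
    }
    close(fd);
    return 0;
}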


Comment 8 Christine Caulfield 2006-08-29 12:38:43 UTC
I've created a new bug report for that:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=204471

All the relevant logs should be attached to it.

Comment 9 Christine Caulfield 2006-08-29 12:43:59 UTC
As for the IPC message, you could try increasing the size of the IPC queue in
exec/main.h

#define SIZEQUEUE 256
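
For reference, a minimal sketch of what that experiment amounts to (a hypothetical
local rebuild, not an official fix; the new value is an arbitrary guess, and a deeper
queue only buys headroom under bursts of plock/cpg traffic rather than removing the
overload):

/* exec/main.h -- hypothetical local change for testing only */
#define SIZEQUEUE 1024    /* was 256: deeper per-connection library (IPC) queue */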


Comment 10 David Teigland 2006-08-29 13:28:45 UTC
*** Bug 204411 has been marked as a duplicate of this bug. ***

Comment 11 Corey Marthaler 2006-08-31 16:39:21 UTC
Hit something similar today while also doing recovery testing.


[root@link-08 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010008 FAIL_ALL_STOPPED 1 100020003 -1
[1 2 8]
dlm              1     clvmd    00020008 FAIL_ALL_STOPPED 1 100020003 -1
[1 2 8]
dlm              1     gfs      00010008 none
[]
gfs              2     gfs      00030008 FAIL_ALL_STOPPED 2 200010003 -1
[2 8]

Comment 12 Corey Marthaler 2006-09-01 18:31:47 UTC
Another variation... hung attempting to start clvmd on link-01 after recovery:

[root@link-08 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010002 none
[2 8]
dlm              1     clvmd    00020008 none
[2 8]
dlm              1     gfs      00020002 none
[2 8]
gfs              2     gfs      00010002 FAIL_START_WAIT
[2 8]

[root@link-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010002 none
[2 8]
dlm              1     clvmd    00020008 none
[2 8]
dlm              1     gfs      00020002 none
[2 8]
gfs              2     gfs      00010002 FAIL_START_WAIT
[2 8]


[root@link-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    152   2006-09-01 07:27:25  link-01
   2   M    128   2006-09-01 06:10:35  link-02
   8   M    144   2006-09-01 06:19:32  link-08


Comment 13 Steven Dake 2006-09-05 06:49:35 UTC
The warnings about ckpt returning error 12 are serious problems which have
already been submitted as another bug and are being addressed.  I'm not certain
about the rest of the bug, but I suspect Dave will need an operational
checkpoint synchronization to further debug.

Comment 14 David Teigland 2006-09-06 20:40:33 UTC
In assigning the duplicate above, I was declaring that this bug
represents the "lib queue is full" error.  gfs_controld never
blocks the sending or receiving of cpg messages, so that doesn't
appear to be a cause of the full queue.  I'd like to try to narrow
in on reproducing this full-queue error specifically to figure
out what gfs_controld can do to avoid causing it.

The ckpt error 12 messages are pretty common but don't have any
relation to gfs_controld exiting or recovery stalling AFAIK.

The group_tool output in comments 11 and 12 may have nothing to
do with the topic of this bug (lib queue full); there is no
evidence one way or the other from the information.  Each instance
of a problem should probably get its own bug entry with an associated
collection of all debug logs.  After looking through the logs for each
instance we can usually determine whether it's a duplicate of an existing
bug or a new one and usually assign blame to one component or daemon.


Comment 15 David Teigland 2006-09-08 16:13:37 UTC
I've reproduced this myself.  I had 8 nodes running two instances of:
writeread -L fcntl -m 100 a b c d e f g h

I then killed two nodes, recovery happened, and then
a while after recovery had completed, the IPC queue
full appeared and gfs_controld exited.  Here are the syslog
messages from the node with the queue full:

Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] The token was lost in the
OPERATIONAL state. 
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes). 
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] Transmit multicast socket send
buffer size (288000 bytes). 
Sep  8 03:33:42 bench-21 openais[4117]: [TOTEM] entering GATHER state. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering GATHER state. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Saving state aru dbd0d high seq
received dbd0d 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Storing new sequence id for ring
3af490 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering COMMIT state. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering RECOVERY state. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [0] member 10.15.84.40: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [1] member 10.15.84.41: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [2] member 10.15.84.42: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [3] member 10.15.84.43: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [4] member 10.15.84.44: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] position [5] member 10.15.84.45: 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] previous ring seq 3863692 rep
10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] aru dbd0d high delivered dbd0d
received flag 0 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] copying all old ring messages
from dbd0e-dbd0d. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Originated 0 messages in RECOVERY. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Originated for recovery:  
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] Not Originated for recovery:  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] CLM CONFIGURATION CHANGE 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] New Configuration: 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.40)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.41)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.42)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.43)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.44)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.45)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Left: 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         no interface found for
nodeid 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         no interface found for
nodeid 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Joined: 
Sep  8 03:33:43 bench-21 openais[4117]: [SYNC ] This node is within the primary
component and will provide se
rvice. 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] CLM CONFIGURATION CHANGE 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] New Configuration: 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.40)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.41)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.42)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.43)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.44)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ]         r(0) ip(10.15.84.45)  
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Left: 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] Members Joined: 
Sep  8 03:33:43 bench-21 openais[4117]: [SYNC ] This node is within the primary
component and will provide se
rvice. 
Sep  8 03:33:43 bench-21 openais[4117]: [TOTEM] entering OPERATIONAL state. 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.40 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.41 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.42 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.43 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.44 
Sep  8 03:33:43 bench-21 openais[4117]: [CLM  ] got nodejoin message 10.15.84.45 
Sep  8 03:33:43 bench-21 openais[4117]: [CKPT ]
message_handler_req_exec_ckpt_synchronize_section 
Sep  8 03:33:43 bench-21 openais[4117]: [CKPT ] recovery_section_write returned 12 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 20 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 21 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 22 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 23 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 24 
Sep  8 03:33:43 bench-21 openais[4117]: [CPG  ] got joinlist message from node 25 
Sep  8 03:33:44 bench-21 fenced[4199]: fencing deferred to bench-20
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156d  
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156c  
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156d  
Sep  8 03:33:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 156c  
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4059  
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4058  
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4059  
Sep  8 03:33:51 bench-21 openais[4117]: [TOTEM] Retransmit List: 4058  
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd97  
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd96  
Sep  8 03:34:09 bench-21 openais[4117]: [TOTEM] Retransmit List: dd97  
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16254  
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16253  
Sep  8 03:34:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 16254  
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1d  
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1c  
Sep  8 03:34:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 1fd1d 1fd1e  
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288c  
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288b  
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288c  
Sep  8 03:34:46 bench-21 openais[4117]: [TOTEM] Retransmit List: 2288b  
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3f  
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3e  
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3f  
Sep  8 03:35:00 bench-21 openais[4117]: [TOTEM] Retransmit List: 29f3e  
Sep  8 03:35:11 bench-21 openais[4117]: [TOTEM] Retransmit List: 3054c  
Sep  8 03:35:11 bench-21 openais[4117]: [TOTEM] Retransmit List: 3054b  
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37635  
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37634  
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37635  
Sep  8 03:35:24 bench-21 openais[4117]: [TOTEM] Retransmit List: 37634  
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ae  
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ad  
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ae  
Sep  8 03:35:26 bench-21 openais[4117]: [TOTEM] Retransmit List: 389ad  
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4f  
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4e  
Sep  8 03:35:41 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4f  
Sep  8 03:35:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 40e4e  
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a18  
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a17  
Sep  8 03:35:45 bench-21 openais[4117]: [TOTEM] Retransmit List: 42a18  
Sep  8 03:35:56 bench-21 openais[4117]: [TOTEM] Retransmit List: 49300  
Sep  8 03:35:56 bench-21 openais[4117]: [TOTEM] Retransmit List: 492ff  
Sep  8 03:35:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 49300  
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cd  
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cc  
Sep  8 03:36:10 bench-21 openais[4117]: [TOTEM] Retransmit List: 507cd  
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fd  
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fc  
Sep  8 03:36:21 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fd  
Sep  8 03:36:22 bench-21 openais[4117]: [TOTEM] Retransmit List: 570fc  
Sep  8 03:36:28 bench-21 udevsend[4399]: starting udevd daemon
Sep  8 03:36:28 bench-21 udevsend[4404]: starting udevd daemon
Sep  8 03:36:30 bench-21 udevsend[4399]: unable to connect to event daemon, try
to call udev directly
Sep  8 03:36:30 bench-21 udevsend[4404]: unable to connect to event daemon, try
to call udev directly
Sep  8 03:36:31 bench-21 openais[4117]: [TOTEM] Retransmit List: 5c688  
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6c  
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6b  
Sep  8 03:36:42 bench-21 openais[4117]: [TOTEM] Retransmit List: 61f6f  
Sep  8 03:36:53 bench-21 openais[4117]: [TOTEM] Retransmit List: 683a9  
Sep  8 03:36:53 bench-21 openais[4117]: [TOTEM] Retransmit List: 683a8  
Sep  8 03:36:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 6a2ac  
Sep  8 03:36:57 bench-21 openais[4117]: [TOTEM] Retransmit List: 6a2ab 6a2ac 6a2ae  
Sep  8 03:37:48 bench-21 openais[4117]: [IPC  ] Library queue is full,
disconnecting library connection. 
Sep  8 03:37:48 bench-21 gfs_controld: cpg connection died
Sep  8 03:37:48 bench-21 gfs_controld: cluster is down, exiting


Comment 16 Steven Dake 2006-09-14 20:28:12 UTC
Reassigned bug to sdake.  Steve is working on flow control for the CPG and EVS
services to ensure this type of problem never happens in the field.
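
For context only (the actual fix is this server-side flow control in openais, not a
client change), a sketch of how a cpg client such as gfs_controld could back off once
the library reports back-pressure instead of having its connection dropped.  The
error-code names follow openais cpg.h; the helper name, retry count, sleep interval,
and header path are assumptions:

/* send_with_backoff() -- hypothetical sender-side retry around
 * cpg_mcast_joined().  CPG_ERR_TRY_AGAIN means "congested, retry later";
 * any other failure (e.g. CPG_ERR_LIBRARY, the "error 2" seen in the logs
 * above) is treated as fatal. */
#include <unistd.h>
#include <sys/uio.h>
#include <openais/cpg.h>   /* assumed install path for the whitetank headers */

static int send_with_backoff(cpg_handle_t handle, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int retries = 100;
    cpg_error_t err;

    do {
        err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        if (err == CPG_OK)
            return 0;
        if (err != CPG_ERR_TRY_AGAIN)
            return -1;          /* library/IPC failure: give up */
        usleep(1000);           /* back off and let openais drain its queue */
    } while (--retries);

    return -1;                  /* still congested after all retries */
}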

Comment 19 Steven Dake 2006-09-27 22:10:44 UTC
Suzanne,

We found an intermittent failure with the fix during heavy integration testing.
 We should have a new build today.

Thanks


Comment 20 Steven Dake 2006-09-29 20:35:07 UTC
Fixed in openais-0.80.1-3.  Changed to MODIFIED.

Comment 21 Corey Marthaler 2006-11-22 18:38:33 UTC
No longer seeing this; marking verified with openais-0.80.1-17.el5.

Comment 22 Jay Turner 2006-12-19 14:20:33 UTC
openais-0.80.2-1.el5 included in 20061218.1 trees.

