Bug 444909

Summary: aisexec died when another node left cluster
Product: Red Hat Enterprise Linux 5
Component: openais
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Reporter: Corey Marthaler <cmarthal>
Assignee: Steven Dake <sdake>
CC: bstevens, cluster-maint, edamato, lhh, sghosh
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:46:52 UTC
Bug Blocks: 509889
Attachments: core file (flags: none)

Description Corey Marthaler 2008-05-01 18:26:24 UTC
Description of problem:
I had 3 nodes doing mount/umount operations on GFS filesystems when I rebooted
one of the nodes (hayes-03). Instead of recovery taking place as I had
expected, the remaining 2 nodes each just left the cluster, leaving the umount
cmds hung.

I'll attach the core left behind on hayes-01.

May  1 11:16:41 hayes-01 openais[4171]: [TOTEM] Did not need to originate any messages in recovery.
May  1 11:16:41 hayes-01 openais[4171]: [TOTEM] Sending initial ORF token
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] CLM CONFIGURATION CHANGE
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] New Configuration:
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ]         r(0) ip(10.15.89.135)
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ]         r(0) ip(10.15.89.136)
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] Members Left:
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ]         r(0) ip(10.15.89.137)
May  1 11:16:41 hayes-01 kernel: dlm: closing connection to node 3
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] Members Joined:
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] CLM CONFIGURATION CHANGE
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] New Configuration:
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ]         r(0) ip(10.15.89.135)
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ]         r(0) ip(10.15.89.136)
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] Members Left:
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] Members Joined:
May  1 11:16:41 hayes-01 openais[4171]: [SYNC ] This node is within the primary component and will provide service.
May  1 11:16:41 hayes-01 openais[4171]: [TOTEM] entering OPERATIONAL state.
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] got nodejoin message 10.15.89.135
May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] got nodejoin message 10.15.89.136
May  1 11:16:42 hayes-01 dlm_controld[4194]: cluster is down, exiting
May  1 11:16:42 hayes-01 groupd[4180]: cpg_mcast_joined error 2 handle 6b8b456700000000
May  1 11:16:42 hayes-01 gfs_controld[4200]: groupd_dispatch error -1 errno 104
May  1 11:16:42 hayes-01 gfs_controld[4200]: groupd connection died
May  1 11:16:42 hayes-01 gfs_controld[4200]: cluster is down, exiting
May  1 11:16:42 hayes-01 fenced[4188]: cluster is down, exiting
May  1 11:16:42 hayes-01 kernel: dlm: closing connection to node 2
May  1 11:16:42 hayes-01 kernel: dlm: closing connection to node 1
May  1 11:17:10 hayes-01 ccsd[4164]: Unable to connect to cluster infrastructure after 30 seconds.


May  1 11:16:37 hayes-02 openais[4553]: [TOTEM] aru 87563 high delivered 87563 received flag 1
May  1 11:16:37 hayes-02 openais[4553]: [TOTEM] Did not need to originate any messages in recovery.
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] CLM CONFIGURATION CHANGE
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] New Configuration:
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ]         r(0) ip(10.15.89.135)
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ]         r(0) ip(10.15.89.136)
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] Members Left:
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ]         r(0) ip(10.15.89.137)
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] Members Joined:
May  1 11:16:37 hayes-02 kernel: dlm: closing connection to node 3
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] CLM CONFIGURATION CHANGE
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] New Configuration:
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ]         r(0) ip(10.15.89.135)
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ]         r(0) ip(10.15.89.136)
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] Members Left:
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] Members Joined:
May  1 11:16:37 hayes-02 openais[4553]: [SYNC ] This node is within the primary component and will provide service.
May  1 11:16:37 hayes-02 openais[4553]: [TOTEM] entering OPERATIONAL state.
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] got nodejoin message 10.15.89.135
May  1 11:16:37 hayes-02 openais[4553]: [CLM  ] got nodejoin message 10.15.89.136
May  1 11:16:37 hayes-02 gfs_controld[4581]: cluster is down, exiting
May  1 11:16:37 hayes-02 groupd[4564]: cpg_mcast_joined error 2 handle 6b8b456700000000
May  1 11:16:37 hayes-02 dlm_controld[4576]: cluster is down, exiting
May  1 11:16:37 hayes-02 fenced[4571]: cluster is down, exiting
May  1 11:16:37 hayes-02 kernel: dlm: closing connection to node 1
May  1 11:16:37 hayes-02 kernel: dlm: closing connection to node 2
May  1 11:17:02 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 30 seconds.
May  1 11:17:33 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 60 seconds.
May  1 11:18:03 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 90 seconds.
May  1 11:18:33 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 120 seconds.
May  1 11:19:03 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 150 seconds.
May  1 11:19:33 hayes-02 ccsd[4095]: Unable to connect to cluster infrastructure after 180 seconds.
May  1 11:19:57 hayes-02 kernel: dlm: 24: remove fr 0 ID 2
May  1 11:19:57 hayes-02 kernel: dlm: 16: remove fr 0 ID 2
May  1 11:19:57 hayes-02 kernel: dlm: 16: remove fr 0 ID 2



Version-Release number of selected component (if applicable):
2.6.18-90.el5
openais-0.80.3-15.el5
cman-2.0.84-2.el5
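
When comparing logs like the ones above across nodes, it can help to pull out which member each node reported as having left. The following is a minimal sketch in Python, assuming the syslog line format shown in this report. The function name `members_left` and both regular expressions are illustrative only and are not part of openais or cman.

```python
import re

# Matches CLM lines as they appear in this report, e.g.
# "May  1 11:16:41 hayes-01 openais[4171]: [CLM  ] Members Left:"
# (assumed format; adjust if your syslog prefix differs)
CLM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) "
    r"openais\[\d+\]: \[CLM\s*\] (?P<msg>.*)$"
)
IP_RE = re.compile(r"ip\(([\d.]+)\)")


def members_left(lines):
    """Return (timestamp, host, ip) for each address listed under 'Members Left:'."""
    found = []
    in_left = False
    for line in lines:
        m = CLM_RE.match(line.rstrip("\n"))
        if not m:
            # Interleaved non-CLM lines (kernel, TOTEM, ...) do not end a section.
            continue
        msg = m.group("msg").strip()
        if msg == "Members Left:":
            in_left = True
        elif in_left and (ip := IP_RE.search(msg)):
            found.append((m.group("ts"), m.group("host"), ip.group(1)))
        else:
            # Any other CLM message (e.g. "Members Joined:") closes the section.
            in_left = False
    return found
```

It could be run as `members_left(open("/var/log/messages"))`; the log path is an assumption and depends on the local syslog configuration.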

Comment 1 Steven Dake 2008-05-01 18:32:22 UTC
Corey,
was there a core file in /var/lib/openais?

If so, what was its backtrace?

Thanks

Comment 2 Corey Marthaler 2008-05-01 18:38:44 UTC
(gdb) bt
#0  0x0000003f3e830155 in raise () from /lib64/libc.so.6
#1  0x0000003f3e831bf0 in abort () from /lib64/libc.so.6
#2  0x0000003f3e8295d6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabcf72a3 in ckpt_checkpoint_close () from /usr/libexec/lcrso/service_ckpt.lcrso
#4  0x0000000000414885 in totempg_groups_initialize ()
#5  0x0000000000414b98 in totempg_groups_initialize ()
#6  0x000000000040fd28 in totem_callback_token_type ()
#7  0x000000000041194c in totem_callback_token_type ()
#8  0x0000000000409a03 in rrp_deliver_fn ()
#9  0x0000000000407e96 in totemnet_net_mtu_adjust ()
#10 0x0000000000405a12 in poll_run ()
#11 0x0000000000418860 in main ()


Comment 3 Corey Marthaler 2008-05-01 18:48:19 UTC
Created attachment 304333 [details]
core file

Comment 8 errata-xmlrpc 2009-01-20 20:46:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0074.html