Bug 450161

Summary:

node mysteriously leaves cluster right after joining

Product:

Red Hat Enterprise Linux 5

Reporter:

Corey Marthaler <cmarthal>

Component:

openais

Assignee:

Steven Dake <sdake>

Status:

CLOSED DUPLICATE

QA Contact:

Cluster QE <mspqa-list>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.3

CC:

cluster-maint, edamato, sdake

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-01-20 20:49:55 UTC

Attachments:

Description	Flags
log from hayes-01 (the shot node)	none
log from hayes-02	none
log from hayes-03	none

Description Corey Marthaler 2008-06-05 16:23:01 UTC

Description of problem:
I was running revolver and saw this issue on my 3 node cluster (hayes-0[123])
after hayes-01 was shot and then brought back in.

Here's the revolver output:
================================================================================
Senario iteration 0.1 started at Thu Jun  5 10:38:47 CDT 2008
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill lowest nodeid (cmirror server)

Those picked to face the revolver... hayes-01
Feeling lucky hayes-01? Well do ya? Go'head make my day...

Verify that hayes-01 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
<ignore name="hayes-01_1" pid="2226" time="Thu Jun  5 10:41:59 2008" type="cmd"
duration="1212680519" ec="127" />
<ignore name="hayes-01_0" pid="2224" time="Thu Jun  5 10:41:59 2008" type="cmd"
duration="1212680519" ec="127" />
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
All killed nodes are now qarshable

Starting cluster on the dueler(s)
Load mods on hayes-01:
scsi_mod...sd_mod...dm-mod...aoe...configfs...dlm...lock_dlm...gfs...lock_nolock...gnbd...OK
Mounting configfs on all nodes
Mounting configfs on hayes-01...pass
Starting ccsd on cluster
Starting ccsd on hayes-01...pass
nodes joining cluster...
cman joining on hayes-01
waiting for all nodes to join cluster...
cman_tool: Cannot open connection to cman, is it running ?
cman_tool: Cannot open connection to cman, is it running ?
cman_tool: Cannot open connection to cman, is it running ?
[...]

At this point it appears that hayes-01 had left the cluster shortly after joining.

[root@hayes-02 sbin]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    504                        hayes-01
   2   M    488   2008-06-05 10:35:19  hayes-02
   3   M    496   2008-06-05 10:35:20  hayes-03
[root@hayes-02 sbin]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[2 3]
dlm              1     clvmd    00020001 none
[2 3]
dlm              1     HAYES0   00040001 none
[2 3]
dlm              1     HAYES1   00060001 none
[2 3]
gfs              2     HAYES0   00030001 none
[2 3]
gfs              2     HAYES1   00050001 none
[2 3]
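
For anyone re-checking membership while reproducing this, the `cman_tool nodes` table above can be parsed mechanically. This is only a minimal sketch: the helper names `parse_cman_nodes` and `non_members` are hypothetical, the sample text is copied from the output above, and it assumes that any status other than `M` means the node is not a current member.

```python
def parse_cman_nodes(text):
    """Parse `cman_tool nodes` output into a list of dicts (hypothetical helper)."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that does not start with a node id.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        nodes.append({
            "id": int(fields[0]),
            "status": fields[1],
            "inc": int(fields[2]),
            # The join timestamp column is empty for nodes that are not members.
            "joined": " ".join(fields[3:-1]) or None,
            "name": fields[-1],
        })
    return nodes


def non_members(nodes):
    """Return names of nodes whose status is not 'M' (assumed to mean member)."""
    return [n["name"] for n in nodes if n["status"] != "M"]


# Sample copied from the hayes-02 output in this report.
SAMPLE = """\
Node  Sts   Inc   Joined               Name
   1   X    504                        hayes-01
   2   M    488   2008-06-05 10:35:19  hayes-02
   3   M    496   2008-06-05 10:35:20  hayes-03
"""

if __name__ == "__main__":
    print(non_members(parse_cman_nodes(SAMPLE)))
```

Run against the sample, this reports hayes-01 as the only node that is not a member, which matches the empty `Joined` column in the table.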

I'll attach the logs of all three nodes in the cluster.

Version-Release number of selected component (if applicable):
2.6.18-92.el5
cman-2.0.84-2.el5
openais-0.80.3-15.el5

Comment 1 Corey Marthaler 2008-06-05 16:28:34 UTC

Created attachment 308449
log from hayes-01 (the shot node)

Comment 2 Corey Marthaler 2008-06-05 16:29:13 UTC

Created attachment 308450
log from hayes-02

Comment 3 Corey Marthaler 2008-06-05 16:30:46 UTC

Created attachment 308452
log from hayes-03

Comment 4 RHEL Program Management 2008-06-05 16:34:11 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Christine Caulfield 2008-06-06 07:09:11 UTC

Hmm, it just looks like aisexec has crashed. Did it leave a core file?

Comment 6 Christine Caulfield 2008-06-06 08:06:12 UTC

Yes, there is. There are no debuginfo packages installed on there, though (I won't
install them; it's not my machine), but here is the backtrace:

Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
#0  0x0000003f5a030155 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003f5a030155 in raise () from /lib64/libc.so.6
#1  0x0000003f5a031bf0 in abort () from /lib64/libc.so.6
#2  0x0000003f5a0295d6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabcf3eef in internal_log_printf2@plt ()
   from /usr/libexec/lcrso/service_ckpt.lcrso
#4  0x0000000000414885 in totempg_groups_initialize ()
#5  0x0000000000414b98 in totempg_groups_initialize ()
#6  0x000000000040fbf8 in totem_callback_token_type ()
#7  0x0000000000410087 in totem_callback_token_type ()
#8  0x0000000000409a2e in rrp_deliver_fn ()
#9  0x0000000000407e96 in totemnet_net_mtu_adjust ()
#10 0x0000000000405a12 in poll_run ()
#11 0x0000000000418860 in main ()

Comment 7 Corey Marthaler 2008-06-06 15:29:55 UTC

I reproduced this and saw that aisexec was in fact not running any longer. I'll
install the debug packages and attempt to gather more info.

Comment 8 Steven Dake 2008-08-26 15:45:21 UTC

This backtrace is invalid because there is no debuginfo package installed on the machine.

If you still have the core file, a backtrace taken with debuginfo installed would tell us whether this is likely a known duplicate issue.
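
As a rough illustration of why the backtrace in comment 6 looks unreliable, the sketch below flags frames whose names hint at symbol misattribution: a name ending in `@plt`, or the same function name as the previous frame at a different address. This is only a heuristic of my own for illustration, not a gdb feature and not how the duplicate was identified; genuine recursion would also trip the second check. The names `parse_frames` and `suspicious_frames` are hypothetical, and the sample is an excerpt of the frames posted in comment 6.

```python
import re

# Matches gdb frame lines such as "#4  0x0000000000414885 in func ()".
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:(?P<addr>0x[0-9a-f]+) in )?(?P<func>\S+) \("
)


def parse_frames(backtrace):
    """Return (frame number, address, function name) tuples from gdb `bt` text."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group("num")), m.group("addr"), m.group("func")))
    return frames


def suspicious_frames(frames):
    """Flag frames that may be misattributed (heuristic, see lead-in)."""
    flagged = []
    prev = None
    for num, addr, func in frames:
        if func.endswith("@plt"):
            flagged.append((num, func, "PLT stub"))
        elif prev is not None and prev[2] == func and prev[1] != addr:
            flagged.append((num, func, "same name as previous frame"))
        prev = (num, addr, func)
    return flagged


# Excerpt of the backtrace from comment 6.
SAMPLE_BT = """\
#2  0x0000003f5a0295d6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabcf3eef in internal_log_printf2@plt ()
   from /usr/libexec/lcrso/service_ckpt.lcrso
#4  0x0000000000414885 in totempg_groups_initialize ()
#5  0x0000000000414b98 in totempg_groups_initialize ()
#6  0x000000000040fbf8 in totem_callback_token_type ()
#7  0x0000000000410087 in totem_callback_token_type ()
#8  0x0000000000409a2e in rrp_deliver_fn ()
"""

if __name__ == "__main__":
    for num, func, reason in suspicious_frames(parse_frames(SAMPLE_BT)):
        print(f"#{num} {func}: {reason}")
```

On this excerpt the sketch flags frames #3, #5 and #7, which are the ones a reader would want re-resolved once debuginfo is available.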

Comment 9 Steven Dake 2008-08-26 16:13:50 UTC

May be a duplicate of bug 444909.

Comment 10 Corey Marthaler 2008-09-02 20:59:54 UTC

This hasn't been reproduced in almost 3 months; taking it off the 5.3 list until it is reproduced.

Comment 11 Steven Dake 2009-01-20 20:49:55 UTC


*** This bug has been marked as a duplicate of bug 261381 ***