Bug 234410 - starting cmirror on ppc64 causes loss of heartbeat
Summary: starting cmirror on ppc64 causes loss of heartbeat
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: ppc64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-03-28 22:27 UTC by Nate Straz
Modified: 2009-04-16 20:01 UTC (History)
7 users

Fixed In Version: 4.5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-30 18:19:54 UTC
Embargoed:


Attachments (Terms of Use)
/var/log/messages and console output from ppc cluster (5.82 KB, text/plain)
2007-03-29 15:07 UTC, Nate Straz

Description Nate Straz 2007-03-28 22:27:26 UTC
Description of problem:

While trying to start clustered LVM testing on a cluster of pSeries servers,
I found that when `service cmirror start` is run on one of the nodes,
all the other nodes report that it has missed too many heartbeats (20 seconds)
and force it out of the cluster.

Version-Release number of selected component (if applicable):
cmirror-1.0.1-1
lvm2-cluster-2.02.21-3.el4
kernel-2.6.9-50.EL


How reproducible:
100%

Steps to Reproduce:
1. start cluster
2. service cmirror start on one node
3. fireworks
  
Actual results:

Node that was fenced:

Mar 28 17:10:31 basic qarshd[12204]: Running cmdline: service cmirror start 
Mar 28 17:10:47 basic kernel: CMAN: Being told to leave the cluster by node 3
Mar 28 17:10:47 basic kernel: dm-cmirror: dm-cmirror 0.2.0 (built Mar 14 2007 17:07:04) installed
Mar 28 17:10:47 basic kernel: CMAN: we are leaving the cluster. 
Mar 28 17:10:47 basic kernel: WARNING: dlm_emergency_shutdown
Mar 28 17:10:47 basic kernel: WARNING: dlm_emergency_shutdown
Mar 28 17:10:47 basic kernel: SM: 00000002 sm_stop: SG still joined
Mar 28 17:10:47 basic ccsd[9777]: Cluster manager shutdown.  Attemping to reconnect... 
Mar 28 17:10:47 basic hald[3053]: Timed out waiting for hotplug event 1297. Rebasing to 1288
Mar 28 17:10:47 basic cmirror: startup succeeded
Mar 28 17:10:47 basic qarshd[12220]: Talking to peer 10.15.89.98:52146
Mar 28 17:10:47 basic qarshd[12220]: Running cmdline: pidof clvmd 2>&1 
Mar 28 17:10:47 basic qarshd[12223]: Talking to peer 10.15.89.98:52147
Mar 28 17:10:48 basic qarshd[12223]: Running cmdline: clvmd 2>&1 
Mar 28 17:10:48 basic clvmd: Can't open cluster manager socket: Network is down
Mar 28 17:11:13 basic ccsd[9777]: Unable to connect to cluster infrastructure after 30 seconds. 
Mar 28 17:18:35 basic syslogd 1.4.1: restart.

Fencing node:

Mar 28 17:10:48 kent kernel: CMAN: node basic has been removed from the cluster : Missed too many heartbeats
Mar 28 17:11:09 kent fenced[11819]: basic not a cluster member after 20 sec post_fail_delay
Mar 28 17:11:09 kent fenced[11819]: fencing node "basic"
Mar 28 17:11:30 kent fenced[11819]: fence "basic" success

Expected results:
Starting cmirror should not cause a node to be fenced.

Additional info:

Comment 1 Nate Straz 2007-03-28 22:58:05 UTC
This may be a problem with CMAN on ppc.  If I comment out the cmirror start code,
just restarting clvmd causes things to go bad.

Comment 2 Christine Caulfield 2007-03-29 09:07:29 UTC
That's interesting. Does anything work on CMAN? I notice you said "restarting"
there; the implication is that it started once OK. Is that the case?

It's probably worth ramping this up slowly to see what does and doesn't work.
Leave cman running with no services for a minute or two, then start fencing,
then clvmd, then cmirror, then GFS (in order of cman complexity). Give each
service a minute or two to settle down on both nodes and make sure they are
speaking to each other.
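
The ramp-up order above can be sketched as a dry-run loop (service names are the usual RHEL4 Cluster Suite init scripts; on a real cluster you would replace the `echo` with direct execution and actually wait between steps):

```shell
# Dry-run sketch of the suggested ramp-up order: print each start command
# in increasing order of cman complexity. On a real node, run the command,
# wait a minute or two, and check `cman_tool nodes` on every node before
# starting the next service.
for svc in cman fenced clvmd cmirror gfs; do
    echo "service $svc start   # then wait and check 'cman_tool nodes'"
done
```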

Comment 3 Nate Straz 2007-03-29 15:07:01 UTC
Created attachment 151204 [details]
/var/log/messages and console output from ppc cluster

I tried starting things up slowly.  Right after clvmd started the whole cluster
fell apart.

Comment 4 Nate Straz 2007-03-29 15:33:25 UTC
I tried running clvmd in the foreground and got this on one node.  None of the
other nodes output anything, but they are still "running."

# clvmd -d
CLVMD[29540]: Mar 29 10:19:09 CLVMD started
CLVMD[29540]: Mar 29 10:19:21 Cluster ready, doing some more initialisation
CLVMD[29540]: Mar 29 10:19:21 starting LVM thread
CLVMD[142a280]: Mar 29 10:19:21 LVM thread function started
File descriptor 4 left open
  No volume groups found
CLVMD[142a280]: Mar 29 10:19:21 LVM thread waiting for work
CLVMD[29540]: Mar 29 10:19:36 clvmd ready for work
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted


Comment 5 Christine Caulfield 2007-03-29 15:47:57 UTC
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted

I'm going to ignore that error for the moment... but you might want to revisit
it later!

In the meantime I'll assign it to cman as there certainly seems to be a problem
here.

Comment 6 Christine Caulfield 2007-03-30 08:01:04 UTC
It's a missing byte-swap.

The NOACK flag in the ACK packet was not being byte-swapped so the nodes were
acking each other to death.

With this fix I can get clvmd & fenced up. I tried starting cmirror but nothing
seemed to happen (apart from a module being loaded) but there was certainly no
cman trouble!

I've checked this into the RHEL4 branch. I'll need authorisation to put it into
RHEL45 but I doubt that will be hard to get.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.29; previous revision: 1.42.2.28
done



It's worth keeping an eye out for other odd behaviours on these machines. It
certainly looks like very little big-endian testing has been done on Cluster
Suite until now, so if you see anything slightly odd it might be worth
investigating further.

I'm unsure about the libgcc error. I did get clvmd up and running quite happily,
but I was not able to compile any userspace code that used threads because of
libgcc and libpthread errors. I didn't pursue these as they might just be
missing packages, and I'm really not familiar with this architecture and its
gcc foibles.


Comment 7 Christine Caulfield 2007-04-02 08:03:35 UTC
RHEL45 checkin:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.28.2.1; previous revision: 1.42.2.28
done



Comment 8 Nate Straz 2007-04-03 14:34:08 UTC
I had Chris build me scratch packages prior to the RHEL45 checkin and they are
working great.  I'll move this to verified once I get packages from the normal
builds.

Comment 10 Nate Straz 2008-05-30 18:19:54 UTC
Closing this out since it missed the errata process.

