Description of problem:
While trying to start clustered lvm testing on a cluster of pSeries servers, I found that when `service cmirror start` is run on one of the nodes, all other nodes will say that it missed too many heartbeats (20 seconds) and force it out of the cluster.

Version-Release number of selected component (if applicable):
cmirror-1.0.1-1
lvm2-cluster-2.02.21-3.el4
kernel-2.6.9-50.EL

How reproducible:
100%

Steps to Reproduce:
1. start cluster
2. service cmirror start on one node
3. fireworks

Actual results:

Node that was fenced:
Mar 28 17:10:31 basic qarshd[12204]: Running cmdline: service cmirror start
Mar 28 17:10:47 basic kernel: CMAN: Being told to leave the cluster by node 3
Mar 28 17:10:47 basic kernel: dm-cmirror: dm-cmirror 0.2.0 (built Mar 14 2007 17:07:04) installed
Mar 28 17:10:47 basic kernel: CMAN: we are leaving the cluster.
Mar 28 17:10:47 basic kernel: WARNING: dlm_emergency_shutdown
Mar 28 17:10:47 basic kernel: WARNING: dlm_emergency_shutdown
Mar 28 17:10:47 basic kernel: SM: 00000002 sm_stop: SG still joined
Mar 28 17:10:47 basic ccsd[9777]: Cluster manager shutdown. Attempting to reconnect...
Mar 28 17:10:47 basic hald[3053]: Timed out waiting for hotplug event 1297. Rebasing to 1288
Mar 28 17:10:47 basic cmirror: startup succeeded
Mar 28 17:10:47 basic qarshd[12220]: Talking to peer 10.15.89.98:52146
Mar 28 17:10:47 basic qarshd[12220]: Running cmdline: pidof clvmd 2>&1
Mar 28 17:10:47 basic qarshd[12223]: Talking to peer 10.15.89.98:52147
Mar 28 17:10:48 basic qarshd[12223]: Running cmdline: clvmd 2>&1
Mar 28 17:10:48 basic clvmd: Can't open cluster manager socket: Network is down
Mar 28 17:11:13 basic ccsd[9777]: Unable to connect to cluster infrastructure after 30 seconds.
Mar 28 17:18:35 basic syslogd 1.4.1: restart.
Fencing node:
Mar 28 17:10:48 kent kernel: CMAN: node basic has been removed from the cluster: Missed too many heartbeats
Mar 28 17:11:09 kent fenced[11819]: basic not a cluster member after 20 sec post_fail_delay
Mar 28 17:11:09 kent fenced[11819]: fencing node "basic"
Mar 28 17:11:30 kent fenced[11819]: fence "basic" success

Expected results:
Starting cmirror should not cause a node to be fenced.

Additional info:
This may be a problem with CMAN on ppc. If I comment out the cmirror startup code, simply restarting clvmd causes the same failure.
That's interesting. Does anything work on CMAN? I notice you said "restarting" there; the implication is that it started OK once. Is that the case? It's probably worth ramping this up slowly to see what does and doesn't work. Leave cman running with no services for a minute or two, then start fencing, then clvmd, then cmirror, then GFS (in order of cman complexity). Give each service a minute or two to settle down on both nodes and make sure they are speaking to each other.
Created attachment 151204 [details] /var/log/messages and console output from ppc cluster

I tried starting things up slowly. Right after clvmd started, the whole cluster fell apart.
I tried running clvmd in the foreground and got this on one node. All other nodes didn't output anything, but are still "running."

# clvmd -d
CLVMD[29540]: Mar 29 10:19:09 CLVMD started
CLVMD[29540]: Mar 29 10:19:21 Cluster ready, doing some more initialisation
CLVMD[29540]: Mar 29 10:19:21 starting LVM thread
CLVMD[142a280]: Mar 29 10:19:21 LVM thread function started
File descriptor 4 left open
No volume groups found
CLVMD[142a280]: Mar 29 10:19:21 LVM thread waiting for work
CLVMD[29540]: Mar 29 10:19:36 clvmd ready for work
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted

I'm going to ignore that error for the moment, but you might want to revisit it later! In the meantime I'll assign this to cman, as there certainly seems to be a problem here.
It's a missing byte-swap. The NOACK flag in the ACK packet was not being byte-swapped, so the nodes were acking each other to death. With this fix I can get clvmd & fenced up. I tried starting cmirror but nothing seemed to happen (apart from a module being loaded), and there was certainly no cman trouble!

I've checked this into the RHEL4 branch. I'll need authorisation to put it into RHEL45, but I doubt that will be hard to get.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.29; previous revision: 1.42.2.28
done

It's worth keeping an eye out for other odd behaviours on these machines. It certainly looks like very little big-endian testing has been done on cluster suite until now, so if you see anything slightly odd it might be worth investigating further.

I'm unsure about the libgcc error. I did get clvmd up and running quite happily, but I was not able to compile any userspace code that used threads because of libgcc and libpthread errors. I didn't pursue these as they might just be missing packages, and I'm really not familiar with this architecture and its gcc foibles.
RHEL45 checkin:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.28.2.1; previous revision: 1.42.2.28
done
I had Chris build me scratch packages prior to the RHEL45 checkin, and they are working great. I'll move this to VERIFIED once I get packages from the normal builds.
Closing this out since it missed the errata process.