Bug 234410
| Summary: | starting cmirror on ppc64 causes loss of heartbeat | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Nate Straz <nstraz> | ||||
| Component: | cman | Assignee: | Christine Caulfield <ccaulfie> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4 | CC: | agk, cluster-maint, dwysocha, jbrassow, mbroz, prockai, teigland | ||||
| Target Milestone: | --- | Keywords: | TestBlocker | ||||
| Target Release: | --- | ||||||
| Hardware: | ppc64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | 4.5 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2008-05-30 18:19:54 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Nate Straz
2007-03-28 22:27:26 UTC
This may be a problem with CMAN on ppc. If I comment out the cmirror start code, just restarting clvmd causes things to go bad.

That's interesting. Does anything work on CMAN? I notice you said "restarting" there; the implication is that it started once OK. Is that the case?

It's probably worth ramping this up slowly to see what does and doesn't work. Leave cman running with no services for a minute or two, then start fencing, then clvmd, then cmirror, then GFS (in order of cman complexity). Give each service a minute or two to settle down on both nodes and make sure they are speaking to each other.

Created attachment 151204 [details]
/var/log/messages and console output from ppc cluster
I tried starting things up slowly. Right after clvmd started the whole cluster
fell apart.
I tried running clvmd in the foreground and got this on one node. All other nodes didn't output anything, but are still "running."

# clvmd -d
CLVMD[29540]: Mar 29 10:19:09 CLVMD started
CLVMD[29540]: Mar 29 10:19:21 Cluster ready, doing some more initialisation
CLVMD[29540]: Mar 29 10:19:21 starting LVM thread
CLVMD[142a280]: Mar 29 10:19:21 LVM thread function started
File descriptor 4 left open
No volume groups found
CLVMD[142a280]: Mar 29 10:19:21 LVM thread waiting for work
CLVMD[29540]: Mar 29 10:19:36 clvmd ready for work
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted
CLVMD[29540]: Mar 29 10:19:36 Using timeout of 60 seconds
libgcc_s.so.1 must be installed for pthread_cancel to work
Aborted

I'm going to ignore that error for the moment... but you might want to revisit it later! In the meantime I'll assign it to cman, as there certainly seems to be a problem here.

It's a missing byte-swap. The NOACK flag in the ACK packet was not being byte-swapped, so the nodes were acking each other to death.

With this fix I can get clvmd & fenced up. I tried starting cmirror but nothing seemed to happen (apart from a module being loaded), but there was certainly no cman trouble!

I've checked this into the RHEL4 branch. I'll need authorisation to put it into RHEL45, but I doubt that will be hard to get.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.29; previous revision: 1.42.2.28
done

It's worth keeping an eye out for other odd behaviours on these machines. It certainly looks like very little big-endian testing has been done on cluster suite until now, so if you see anything slightly odd it might be worth investigating further.

I'm unsure about the libgcc error. I did get clvmd up and running quite happily, but I was not able to compile any userspace code that used threads because of libgcc and libpthread errors. I didn't pursue these as they might just be missing packages, and I'm really not familiar with this architecture and its gcc foibles.

RHEL45 checkin:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.28.2.1; previous revision: 1.42.2.28
done

I had Chris build me scratch packages prior to the RHEL45 checkin and they are working great. I'll move this to verified once I get packages from the normal builds.

Closing this out since it missed the errata process.
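
For anyone reading this later, here is a rough userspace sketch of how a missing byte-swap on a flags field can produce the "acking each other to death" behaviour described above. It is not the actual cnxman.c change. The flag value MSG_NOACK, the 16-bit field width, the little-endian wire order, and all function names are assumptions made for illustration only. The big-endian load is simulated with explicit byte arithmetic so the sketch behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag value; the real cman constant may differ. */
#define MSG_NOACK 0x0010u

/* Assumption for this sketch: the flags field is little-endian on the wire. */
void put_le16(uint8_t *buf, uint16_t v)
{
    buf[0] = (uint8_t)(v & 0xffu);
    buf[1] = (uint8_t)(v >> 8);
}

/* Correct decode: assemble the value from wire order, whatever the host is. */
static uint16_t get_le16(const uint8_t *buf)
{
    return (uint16_t)(buf[0] | (buf[1] << 8));
}

/* What a big-endian CPU gets from a plain 16-bit load with no swap. */
static uint16_t load_unswapped_be(const uint8_t *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}

/* Buggy path: NOACK is tested against the unswapped value and never matches. */
int needs_ack_unswapped_be(const uint8_t *flags_field)
{
    return !(load_unswapped_be(flags_field) & MSG_NOACK);
}

/* Fixed path: the field is converted to host order before the flag test. */
int needs_ack_swapped(const uint8_t *flags_field)
{
    return !(get_le16(flags_field) & MSG_NOACK);
}

/* Prints both decisions for an ACK packet that carries NOACK. */
void ack_demo(void)
{
    uint8_t wire[2];

    put_le16(wire, MSG_NOACK);
    printf("unswapped on big-endian: needs_ack=%d\n",
           needs_ack_unswapped_be(wire));
    printf("swapped:                 needs_ack=%d\n",
           needs_ack_swapped(wire));
    assert(needs_ack_swapped(wire) == 0);
}
```

In the unswapped path the receiver reads 0x1000 instead of 0x0010, decides the ACK itself needs an ACK, and replies. If both ends do this, every ACK triggers another one, which would match the heartbeat loss seen here.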