Bug 457107
| Summary: | Killing node X because it has rejoined the cluster with existing state |
|---|---|
| Product: | Red Hat Enterprise Linux 5 |
| Component: | cman |
| Version: | 5.3 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | urgent |
| Reporter: | Nate Straz <nstraz> |
| Assignee: | Christine Caulfield <ccaulfie> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | bstevens, ccaulfie, cluster-maint, edamato, jplans, mkarg, riek, sdake, tao, teigland, zxvdr.au |
| Keywords: | Regression, TestBlocker |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | All |
| OS: | Linux |
| Doc Type: | Bug Fix |
| Last Closed: | 2009-01-20 21:50:02 UTC |
Description
Nate Straz
2008-07-29 17:03:10 UTC
I reproduced this today with a simple recovery attempt. I started cman on all 3 of my hayes machines (hayes-0[123]) and then rebooted hayes-01. This caused hayes-03 to get kicked out of the cluster.

```
[root@hayes-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined                Name
   1   X    668                         hayes-01
   2   M    664   2008-07-29 13:37:20   hayes-02
   3   X    672                         hayes-03
[root@hayes-02 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 FAIL_ALL_STOPPED
[1 2 3]
dlm              1     clvmd    00020002 FAIL_ALL_STOPPED
[1 2 3]
[root@hayes-03 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
[root@hayes-03 ~]# cman_tool services
Unable to connect to groupd. Is it running?
```

```
Jul 29 13:44:17 hayes-02 openais[3135]: [MAIN ] Killing node hayes-03 because it has rejoined the cluster with existing state
Jul 29 13:45:29 hayes-03 openais[3132]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Jul 29 13:45:29 hayes-03 groupd[3144]: cpg_mcast_joined error 2 handle 6b8b456700000000
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: groupd_dispatch error -1 errno 11
Jul 29 13:45:29 hayes-03 fenced[3152]: groupd is down, exiting
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: groupd connection died
Jul 29 13:45:29 hayes-03 gfs_controld[3164]: cluster is down, exiting
Jul 29 13:45:29 hayes-03 clogd[3257]: cpg_dispatch failed: SA_AIS_ERR_LIBRARY
Jul 29 13:45:29 hayes-03 last message repeated 2 times
Jul 29 13:45:29 hayes-03 clogd[3257]: Bad callback on cluster/10
Jul 29 13:45:29 hayes-03 dlm_controld[3158]: cluster is down, exiting
Jul 29 13:45:29 hayes-03 clogd[3257]: cpg_dispatch failed: SA_AIS_ERR_LIBRARY
```

I'll post the full logs from this. I was running openais-0.80.3-17.el5 (Built: Mon 19 May 2008 02:38:39 PM CDT)

Created attachment 312922 [details]
log from hayes-02
Created attachment 312923 [details]
log from hayes-03
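The `Sts` column in the `cman_tool nodes` output above is the quickest way to spot this condition: `M` is a cluster member, `X` is dead, and `d` is the "disallowed" state a node lands in after rejoining with existing state. As a rough sketch (the awk one-liner is illustrative, not part of cman, and the sample input is copied from this report), the unhealthy nodes can be filtered out like this:

```shell
# Illustrative sketch: list nodes whose cman state is "X" (dead) or
# "d" (disallowed). The sample input below is copied from this report;
# on a live cluster you would pipe `cman_tool nodes` in instead.
cman_output=$(cat <<'EOF'
Node  Sts   Inc   Joined                Name
   1   X    668                         hayes-01
   2   M    664   2008-07-29 13:37:20   hayes-02
   3   X    672                         hayes-03
EOF
)
# Skip the header row, match on the Sts column, print the node name.
echo "$cman_output" | awk 'NR > 1 && ($2 == "X" || $2 == "d") { print $NF }'
```

For the sample above this prints `hayes-01` and `hayes-03`, the two nodes that were kicked out in the transcript.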
I'm able to easily repro this on my taft cluster as well; it's running openais-0.80.3-15.el5, but both have the 2.6.18-98.el5 kernel.

I was able to reproduce this on -92.1.10.el5 with the following steps:

1. Mount a gfs file system
2. Start d_io across the cluster
3. `tail -f /var/log/messages` on all nodes
4. Power off node 2

I was able to reproduce this with 2.6.18-98.el5 + openais-0.80.3-15.el5 on ia64. Following the steps in comment #5 left the cluster in this state (I rebooted link-16; link-13 and link-14 were left running, and link-14 was disallowed):

```
[root@link-13 ~]# cman_tool nodes
NOTE: There are 1 disallowed nodes,
members list may seem inconsistent across the cluster
Node  Sts   Inc   Joined                Name
   1   M     92   2008-07-29 15:16:42   link-13
   2   d     96   2008-07-29 15:16:42   link-14
   4   X     96                         link-16
[root@link-14 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
```

Changing the summary to reflect the key log message for this bug.

Chrissie, can you look into this? It might be related to the latest cman changes for 5.3.

I started attacking this problem from a clean RHEL 5.2 install going forward:

- RHEL 5.2 - PASS: kernel-2.6.18-92.el5 + openais-0.80.3-15.el5 + cman-2.0.84-2.el5
- RHEL 5.2.Z - PASS: kernel-2.6.18-92.1.10.el5 + openais-0.80.3-15.el5 + cman-2.0.84-2.el5
- RHEL 5.2.Z w/ new cman - FAIL: kernel-2.6.18-92.1.10.el5 + openais-0.80.3-15.el5 + cman-2.0.86-1.el5.test.plock.1

Since I've proven that the new cman package introduces the bug, I'm adding the Regression keyword. Also, since the new cman is required for testing newer kernels, I'm adding the TestBlocker keyword.

The fact that cluster mirrors and aoe are being used here, both of which put a bunch of additional traffic on the network, makes me wonder if excessive network load could be an issue. I also wonder about openais ipc problems, given some of the errors from aisexec, and the fact that both plocks and cluster mirrors have been known to create ipc problems in the past, and both are being used together here.
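The pass/fail matrix above brackets the regression between cman 2.0.84 (passes) and 2.0.86 (fails), and a later comment reports the bug gone in a 2.0.87 test build. As a hypothetical triage sketch, and not anything shipped with cman, one could classify an installed cman version against that window using GNU `sort -V` for version ordering (the `[2.0.86, 2.0.87)` range, the function names, and the helper itself are all assumptions drawn only from this report):

```shell
# Hypothetical triage helper, not part of cman. Per this report,
# cman-2.0.84-2 passed, cman-2.0.86-1 failed, and a 2.0.87 test build
# is reported clean, so treat [2.0.86, 2.0.87) as the affected range.
first_bad="2.0.86"
first_fixed="2.0.87"

version_le() {
    # True if $1 <= $2 under GNU version ordering (sort -V).
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

is_affected() {
    version_le "$first_bad" "$1" && ! version_le "$first_fixed" "$1"
}

for v in 2.0.84 2.0.86 2.0.87; do
    if is_affected "$v"; then echo "cman-$v: affected"; else echo "cman-$v: ok"; fi
done
```

For the three builds in the matrix this flags only 2.0.86 as affected, matching the PASS/FAIL results above.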
Neither cmirrors nor aoe are required to reproduce this issue. All it takes is having a quorate cman cluster, then rebooting one of the nodes in that cluster.

Looking at the syslogs on the two machines makes me pretty convinced it's the patch contributed in bz#443358 that's at fault. This bug isn't in cman-2.0.87-1.el5.test.plock.3. (Build Date: Thu 31 Jul 2008 01:37:30 PM CDT)

Yes, it's that patch. It's totally broken and based on at least two misunderstandings.

If the DLM is starting regardless of the fencing state then that might need to be checked separately, but it might be that the lockspace is starting up with no fencing (i.e. before it has started at all), which I believe is correct.
```
commit 4bb7b3ff842ac40a3705d82e74f266b64f8e7609
Author: Christine Caulfield <ccaulfie>
Date:   Fri Aug 1 10:38:00 2008 +0100

    cman: Revert dirty patch
```
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html

Created attachment 636951 [details]
cman killed by node 2 because we rejoined the cluster without a full restart