Bug 157094
Summary: | No indication for result of config file propagation with "Send to Cluster" button | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Paul Kennedy <pkennedy>
Component: | redhat-config-cluster | Assignee: | Jonathan Earl Brassow <jbrassow>
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | adstrong, cluster-maint, jha, jparsons
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHBA-2005-736 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2005-10-07 16:45:06 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 159783 | |
Description  Paul Kennedy  2005-05-06 19:25:38 UTC
fixed in 0.9.51-1.0

Three problems:

1. I get told that the file was successfully propagated to the cluster, and it wasn't, because one of the nodes was dead (in fact, the GUI even showed that it was dead).
2. This new information window has YES/NO click boxes. Is it asking me to check whether it got propagated?
3. You spelled "propagate" wrong.

This is all good to go now in 0.9.54-1.0

*** Bug 158416 has been marked as a duplicate of this bug. ***

fix verified in -57.

I'm seeing this issue again: I'm told that the file was successfully propagated to the cluster and it wasn't, because one of the nodes was dead. When this node attempts to rejoin, it fails due to the config version mismatch. I think the GUI should still attempt to copy out to whoever is still up, but it also needs to let the user know who it wasn't able to copy to.

I may be wrong, but I do not believe that there is a way for the UI to get this information; ccsd would have to be modified to return it for display. Reassigning to jbrassow for his comments and opinions.
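For illustration only, here is a minimal Python sketch of the per-node "copy and report" behaviour suggested above. The function name, the node list, and the use of scp are assumptions made for this sketch; they are not how redhat-config-cluster or ccsd actually propagates the file.

```python
import subprocess


def propagate_config(nodes, conf_path="/etc/cluster/cluster.conf"):
    """Copy cluster.conf to each node and report the nodes that failed.

    Hypothetical sketch only: real propagation goes through ccsd, not scp.
    """
    failed = []
    for node in nodes:
        # Attempt the copy; remember every node where it did not succeed.
        result = subprocess.run(
            ["scp", conf_path, "{0}:{1}".format(node, conf_path)],
            capture_output=True,
        )
        if result.returncode != 0:
            failed.append(node)

    if failed:
        print("cluster.conf was NOT propagated to: " + ", ".join(failed))
    else:
        print("cluster.conf was propagated to all cluster nodes.")
    return failed


# Example (hypothetical node names):
# propagate_config(["morph-01", "morph-02", "morph-03", "morph-04"])
```

Reporting the nodes that could not be updated, rather than a blanket success message, is the behaviour the comment above is asking for.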
The update is successful if everyone in the cluster has been updated. If a node is down, it is not part of the cluster. The behaviour that should be expected when a failed node comes back after an update is:
1) ccsd starts on that node
2) cman_tool join is issued
3) ccs looks to the other systems to ensure it has the latest copy and retrieves it as necessary
4) cman_tool completes
5) the cluster infrastructure should be ready

Looking at comment #6, the expected behaviour does not seem to be happening. Can you reproduce this?

I was able to reproduce this. It only happens with the initscripts turned on; without them (starting everything by hand), everything appears to work as outlined above. I shot morph-01, made some changes, and propagated a few times with the GUI. When coming back up with init, morph-01 ran into problems:

    Starting ccsd:                                  [  OK  ]
    Starting cman: CMAN 2.6.9-34.5 (built May 23 2005 11:49:47) installed
    CMAN: Cluster membership rejected
    CMAN: Cluster membership rejected
    [the same message repeated many more times]
    CMAN: Cluster membership rejected               [FAILED]
    Starting lock_gulmd:                            [WARNING]
    Starting fenced: CMAN: Cluster membership rejected
    Starting fence domain:                          [FAILED]
                                                    [FAILED]
    Starting clvmd: clvmd could not connect to cluster manager
    Consult syslog for more information             [FAILED]

On morph-02:

    Jun 7 10:52:47 morph-02 kernel: CMAN: removing node morph-01 from the cluster : Missed too many heartbeats
    CMAN: Join request from morph-01 rejected, config version local 6 remote 3
    Jun 7 10:54:52 morph-02 kernel: CMAN: Join request from morph-01 rejected, config version local 6 remote 3
    Jun 7 10:55:27 morph-02 last message repeated 7 times
    Jun 7 10:56:32 morph-02 last message repeated 13 times
    Jun 7 10:56:52 morph-02 last message repeated 4 times
    [many more identical "Join request from morph-01 rejected" lines omitted]

After all this, I then tried everything on morph-01 by hand, and it did work.

A key piece of evidence:

    Jun 14 07:17:54 morph-04 rc: Starting ccsd: succeeded
    Jun 14 07:17:54 morph-04 kernel: CMAN 2.6.9-36.0 (built May 31 2005 12:15:02) installed
    Jun 14 07:17:54 morph-04 kernel: NET: Registered protocol family 30
    Jun 14 07:17:54 morph-04 ccsd[1703]: cluster.conf (cluster name = morph-cluster, version = 1) found.
    Jun 14 07:17:54 morph-04 ccsd[1703]: Unable to perform sendto: Cannot assign requested address

The broadcast for the doc is bombing out on that error, and so the node fails to bring in the latest copy.

- fix for bug 157094

A mysterious error was being generated when trying to do a broadcast (sendto):

    ccsd[1704]: Unable to perform sendto: Cannot assign requested address

On certain clusters (it seems to happen when ccs tries using IPv6), this error could show up 9 times out of 10. When the error was received, the broadcast attempt would fail, which caused the attempt to grab any possibly updated cluster.conf files to abort.

Waiting a moment, closing the socket, reopening the socket, and retrying the broadcast seems to solve the issue. (It has worked 100+ times so far.) I'm not entirely certain what is causing the initial try to fail - perhaps the underlying subsystem is not quite ready... In any case, I have never seen a second attempt fail. Holding off on the RHEL4U1 branch commit for an ok... all other branches committed.
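For illustration, a minimal sketch of that retry approach (wait a moment, close the socket, reopen it, and resend). The actual fix lives in ccsd's C code; the Python below, including the port, address, and payload, is only an assumed example of the pattern.

```python
import socket
import time


def broadcast_with_retry(payload, port, retries=1, delay=1.0):
    """Send a UDP broadcast; on failure, wait, reopen a fresh socket, retry.

    Illustrative sketch of the approach described above, not ccsd code.
    """
    attempt = 0
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        try:
            sock.sendto(payload, ("255.255.255.255", port))
            return
        except OSError:
            # e.g. "Cannot assign requested address" on the first attempt
            if attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay)  # wait a moment before reopening and retrying
        finally:
            sock.close()       # always close; the next attempt opens a new socket
```

The key point, matching the comment above, is that the retry goes out on a newly opened socket rather than the one on which sendto() first failed.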
The fix for this will not appear in RHEL4U1, so here are the workarounds:

Option #1 (the "if it hurts, don't do it" option): Don't do updates when you have nodes down that you plan on bringing back into the cluster.

Option #2: When a machine comes back up, it may not grab the latest copy of the CCS information; therefore, it will not be able to join the cluster and cluster services will fail to start. Rebooting the node will solve the problem.

Option #3: When a machine comes back up, it may not grab the latest copy of the CCS information; therefore, it will not be able to join the cluster and cluster services will fail to start. Run 'cman_tool join; modprobe dlm' and then 'service <fenced|clvmd|gfs|rgmanager> start'.

fix verified in 1.0.15

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-736.html