Description of problem: From time to time I see the following errors during the cman start up on my taft cluster: Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service: started and ready to provide service. Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] local node name "taft-01" not found in cluster.conf The clustering is all done on a private bonded network, so there's no reason to be looking for the local node name in the cluster.conf when it should be looking for "taft-01-bond". This typically works like it should, it's rare that I see this issue. [root@taft-01 ~]# cman_tool nodes Node Sts Inc Joined Name 1 M 29024 2009-03-04 14:27:42 taft-01-bond 2 M 29028 2009-03-04 14:27:42 taft-02-bond 3 M 29028 2009-03-04 14:27:42 taft-03-bond 4 M 29028 2009-03-04 14:27:42 taft-04-bond Here's the cluster.conf: <?xml version="1.0"?> <cluster config_version="4" name="TAFT"> <dlm log_debug="1"/> <cman></cman> <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="30"/> <clusternodes> <clusternode name="taft-01-bond" nodeid="1"> <fence> <method name="1"> <device name="qawti-01" option="off" port="1"/> <device name="qawti-01" option="off" port="5"/> <device name="qawti-01" option="on" port="1"/> <device name="qawti-01" option="on" port="5"/> </method> </fence> </clusternode> <clusternode name="taft-02-bond" nodeid="2"> <fence> <method name="1"> <device name="qawti-01" option="off" port="2"/> <device name="qawti-01" option="off" port="6"/> <device name="qawti-01" option="on" port="2"/> <device name="qawti-01" option="on" port="6"/> </method> </fence> </clusternode> <clusternode name="taft-03-bond" nodeid="3"> <fence> <method name="1"> <device name="qawti-01" option="off" port="3"/> <device name="qawti-01" option="off" port="7"/> <device name="qawti-01" option="on" port="3"/> <device name="qawti-01" option="on" port="7"/> </method> </fence> </clusternode> <clusternode name="taft-04-bond" nodeid="4"> <fence> <method name="1"> <device name="qawti-01" option="off" port="4"/> <device name="qawti-01" option="off" port="8"/> <device name="qawti-01" option="on" port="4"/> <device name="qawti-01" option="on" port="8"/> </method> </fence> </clusternode> </clusternodes> <fencedevices> <fencedevice agent="fence_wti" ipaddr="10.15.89.2" name="qawti-01" passwd=" "/> </fencedevices> </cluster> Here's the rest of the messages: Mar 4 14:09:38 taft-01 kernel: Lock_DLM (built Dec 17 2008 11:44:32) installed Mar 4 14:09:38 taft-01 ccsd[6558]: Starting ccsd 2.0.98: Mar 4 14:09:38 taft-01 ccsd[6558]: Built: Dec 3 2008 16:32:30 Mar 4 14:09:38 taft-01 ccsd[6558]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Mar 4 14:09:38 taft-01 ccsd[6558]: cluster.conf (cluster name = TAFT, version = 4) found. Mar 4 14:09:38 taft-01 ccsd[6558]: Unable to perform sendto: Cannot assign requested address Mar 4 14:10:06 taft-01 last message repeated 16 times Mar 4 14:10:07 taft-01 ccsd[6558]: Unable to connect to cluster infrastructure after 30 seconds. Mar 4 14:10:08 taft-01 ccsd[6558]: Unable to perform sendto: Cannot assign requested address Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service: started and ready to provide service. Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] local node name "taft-01" not found in cluster.conf Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] Error reading CCS info, cannot start Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] Error reading config from CCS Mar 4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file). Mar 4 14:10:13 taft-01 kernel: device-mapper: dm-log-clustered: (built Jan 26 2009 14:26:21) installed Mar 4 14:10:13 taft-01 qdiskd[6609]: <info> Waiting for CMAN to start Mar 4 14:10:13 taft-01 clvmd: Can't open cluster manager socket: No such file or directory Version: openais-0.80.3-22.el5
cman plugin issue, reassinging.
actually might be an issue with cman's behavior when ccs isn't present.
The message 'local node name "taft-01" not found in cluster.conf' is slightly misleading. The node name printed will always be the one that cman got from uname - it's not necessarily the name that it was actually looking for at the time. In fact it's quite hard to establish what node cman was looking for at the time because it goes through quite a list of things before giving up. This is the list taken from chapter 2 of my CSNetworking.pdf document: 1. It looks up $HOSTNAME in cluster.conf 2. If this fails it strips the domain name from $HOSTNAME and looks up that in cluster.conf 3. If this fails it looks in cluster.conf for a fully-qualified name whose short version matches the short version of $HOSTNAME 4. If all this fails then it will search the interfaces list for an (ipv4 only) address that matches a name in cluster.conf So you're expecting cman to find the <name>-bond address in step 4 above but that one is failing sometimes. As you might be able to see, printing a useful nodename in this situation is not possible because its looping through both an interface list and a node list to find a correlation. There are two possible reasons I can see that might cause an intermittent failure here. One is that the bonding interface isn't set up in time for it to show up in the list of active interfaces, or perhaps the name resolution for the IP address has failed for some reason. It's also possible that the IP address resolves to several names and the -bond one isn't the first in the list ... cman (rather stupidly) only checks the first in that case. How likely do any of those sound ?
Created attachment 334135 [details] Patch to fix a potential cause I've found a possible cause for this. If getnameinfo() fails then cman gives up the search in 4 totally, when it should really move on to the next interface. I've committed a fix for this to STABLE3/3 & master. This patch is the RHEL5 equivalent of those fixes.
Bumping the priority of this bug. I hit tend to hit this issue quite a bit. Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive Service: started and ready to provide service. Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] local node name "taft-02" not found in cluster.conf Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Error reading CCS info, cannot start Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Error reading config from CCS Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration cman-2.0.99-1.el5
FWIW, I added a 25 sec sleep to the start of the cman init script (just in case the bonded net interface wasn't setting it's self up in time) and I still hit this bug.
Thus bug doesn't need a priority boost, it needs testing.
This raises a serious problem with the system here, I think. You have a bug, I have a patch I'd like you to test. But we can't seem to get together. I can't guarantee the patch solves the problem because I can't reproduce your error, but I really need to know whether it does or not before doing more work on it.
Just a note... I usually see this bug after 2-5 iterations during the "Kill Quorum + 1" scenario of revolver. Yesterday I commented out that test case, and it ran for 32 iterations (almost 15 hours) before eventually hitting this issue while killing less than quorum each iteration.
I suspect that's just a co-incidence. All the name checking code is way before anything to do with quorum has happened. There might be an odd side-effect such as more nodes booting up and generating network traffic but I don't think it's conclusive.
Just a note that I'm currently testing the above rpms. I'm running into issues however hitting 496130.
Appear to have hit this bug with the above rpms containing the patch : Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive Service: started and ready to provide service. Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] local node name "taft-04" not found in cluster.conf Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Error reading CCS info, cannot start Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Error reading config from CCS Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file). [root@taft-04 ~]# rpm -q cman cman-2.0.99-1cc.el5
That's very strange. I wonder if the bonded interface is delayed in starting up? I think the best thing I can do now is to add some debugging to the cman startup to display the interfaces it sees, and the names it resolves for them.
Here is the output for a normal run: Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'localhost.localdomain' on lo Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'localhost' on lo Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name '127.0.0.1' on lo Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01' on eth0 Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name '10.15.89.67' on eth0 Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01-bond.lab.msp.redhat.com' on bond0 Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01-bond' on bond0 I hope to grab the output for a failed one in just a bit...
Here is the output for a failed run: Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'localhost.localdomain' on lo Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'localhost' on lo Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name '127.0.0.1' on lo Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'taft-03' on eth0 Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name '10.15.89.69' on eth0 Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10.15.91.134' on bond0 Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10' on bond0 Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10.15.91.134' on bond0 [...] Apr 20 12:15:25 taft-03 openais[6630]: [MAIN ] local node name "taft-03" not found in cluster.conf
Thanks, it does seem to be name resolution that's failing then. That list should show pairs of name/IP address. But taft-03-bond has resolved to 10.15.91.134 first instead. (the '10' afterwards is cman assuming the multiple dots make it a FQDN!) I can put some code into getnameinfo that will force it to look for a name. That /might/ force it to return a name rather than IP but I suspect it will more likely just return an error code and cman will move onto the IP address lookup anyway. RPMS for this are in the same place as before. But if the name is not resolving, I'm not hopeful it will fix things ...
Revolver ran much longer (3+ hours) with the latest rpms from yesterday, but did eventually end up hitting the same issue. Apr 21 15:15:59 taft-02 ccsd[6619]: Starting ccsd 2.0.99: Apr 21 15:15:59 taft-02 ccsd[6619]: Built: Apr 21 2009 03:24:54 Apr 21 15:15:59 taft-02 ccsd[6619]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Apr 21 15:15:59 taft-02 ccsd[6619]: cluster.conf (cluster name = TAFT, version = 4) found. Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for name 'localhost.localdomain' on lo Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for short name 'localhost' on lo Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for IP '127.0.0.1' on lo Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for name 'taft-02' on eth0 Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for IP '10.15.89.68' on eth0 Apr 21 15:16:05 taft-02 openais[6629]: CC: Looking for IP '10.15.91.133' on bond0 Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive Service: started and ready to provide service. Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] local node name "taft-02" not found in cluster.conf Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Error reading CCS info, cannot start Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Error reading config from CCS Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Thanks, So it seems that patch is worth having but doesn't solve the real issue ... which is name resolution. If you look at the log above, there is no name that can be resolved for the IP address on bond0. If that's the case then cman has no chance of matching it with anything in cluster.conf of course. I'll commit this patch (less the extra syslog calls) to the RHEL5.4 branch but I can't see what else can be done inside cman to alleviate this. Doing a sleep & retry would add a very long delay to startup as it tries to match against other interfaces on the system that are not relevant.
I've committed the patch to git. If getnameinfo is not resolving the name then there's nothing more in cman that can be done I don't think. commit ee8d1b5915282b144ff5b74a0c2736b38c172274 Author: Christine Caulfield <ccaulfie> Date: Wed May 6 10:52:02 2009 +0100 cman: Improve node name search
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: cman iterates the interface list to try and find a match between active network interfaces and names in cluster.conf. If a name lookup fails then cman abandons the whole operation, rather than just that name lookup. Consequence: If a name lookup fails for some reason (DNS error) then cman can fail to start. The error message it prints gives the impression that $HOSTNAME was not found in cluster.conf Fix: The interface iterator code was fixed to use 'continue' rather than 'break' if a name lookup failed. Result: This bug CAN still occur. If the name that cman is looking for is subject to a DNS failure then this message will be printed and cman will not start. fixing this is outside the scope of cman however.
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1341.html