Bug 488565 - cman uses local node name for lookup during start up
Summary: cman uses local node name for lookup during start up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.4
Hardware: All
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-03-04 20:36 UTC by Corey Marthaler
Modified: 2018-11-14 13:17 UTC (History)
7 users (show)

Fixed In Version: cman-2.0.100-1.el5
Doc Type: Bug Fix
Doc Text:
Cause: cman iterates the interface list to try and find a match between active network interfaces and names in cluster.conf. If a name lookup fails then cman abandons the whole operation, rather than just that name lookup. Consequence: If a name lookup fails for some reason (DNS error) then cman can fail to start. The error message it prints gives the impression that $HOSTNAME was not found in cluster.conf Fix: The interface iterator code was fixed to use 'continue' rather than 'break' if a name lookup failed. Result: This bug CAN still occur. If the name that cman is looking for is subject to a DNS failure then this message will be printed and cman will not start. fixing this is outside the scope of cman however.
Clone Of:
Environment:
Last Closed: 2009-09-02 11:07:15 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch to fix a potential cause (1.67 KB, patch)
2009-03-05 14:53 UTC, Christine Caulfield
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1341 0 normal SHIPPED_LIVE Low: cman security, bug fix, and enhancement update 2009-09-01 10:43:16 UTC

Description Corey Marthaler 2009-03-04 20:36:04 UTC
Description of problem:
From time to time I see the following errors during the cman start up on my taft cluster:

Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service: started and ready to provide service.
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] local node name "taft-01" not found in cluster.conf

The clustering is all done on a private bonded network, so there's no reason to be looking for the local node name in the cluster.conf when it should be looking for "taft-01-bond".

This typically works like it should, it's rare that I see this issue.

[root@taft-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  29024   2009-03-04 14:27:42  taft-01-bond
   2   M  29028   2009-03-04 14:27:42  taft-02-bond
   3   M  29028   2009-03-04 14:27:42  taft-03-bond
   4   M  29028   2009-03-04 14:27:42  taft-04-bond

Here's the cluster.conf:
<?xml version="1.0"?>
<cluster config_version="4" name="TAFT">
  <dlm log_debug="1"/>
  <cman></cman>
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="taft-01-bond" nodeid="1">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="1"/>
          <device name="qawti-01" option="off" port="5"/>
          <device name="qawti-01" option="on" port="1"/>
          <device name="qawti-01" option="on" port="5"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-02-bond" nodeid="2">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="2"/>
          <device name="qawti-01" option="off" port="6"/>
          <device name="qawti-01" option="on" port="2"/>
          <device name="qawti-01" option="on" port="6"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-03-bond" nodeid="3">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="3"/>
          <device name="qawti-01" option="off" port="7"/>
          <device name="qawti-01" option="on" port="3"/>
          <device name="qawti-01" option="on" port="7"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-04-bond" nodeid="4">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="4"/>
          <device name="qawti-01" option="off" port="8"/>
          <device name="qawti-01" option="on" port="4"/>
          <device name="qawti-01" option="on" port="8"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_wti" ipaddr="10.15.89.2" name="qawti-01" passwd=" "/>
  </fencedevices>
</cluster>

Here's the rest of the messages:
Mar  4 14:09:38 taft-01 kernel: Lock_DLM (built Dec 17 2008 11:44:32) installed
Mar  4 14:09:38 taft-01 ccsd[6558]: Starting ccsd 2.0.98:
Mar  4 14:09:38 taft-01 ccsd[6558]:  Built: Dec  3 2008 16:32:30
Mar  4 14:09:38 taft-01 ccsd[6558]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Mar  4 14:09:38 taft-01 ccsd[6558]: cluster.conf (cluster name = TAFT, version = 4) found.
Mar  4 14:09:38 taft-01 ccsd[6558]: Unable to perform sendto: Cannot assign requested address
Mar  4 14:10:06 taft-01 last message repeated 16 times
Mar  4 14:10:07 taft-01 ccsd[6558]: Unable to connect to cluster infrastructure after 30 seconds.
Mar  4 14:10:08 taft-01 ccsd[6558]: Unable to perform sendto: Cannot assign requested address
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive Service: started and ready to provide service.
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] local node name "taft-01" not found in cluster.conf
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] Error reading CCS info, cannot start
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] Error reading config from CCS
Mar  4 14:10:13 taft-01 openais[6568]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).
Mar  4 14:10:13 taft-01 kernel: device-mapper: dm-log-clustered: (built Jan 26 2009 14:26:21) installed
Mar  4 14:10:13 taft-01 qdiskd[6609]: <info> Waiting for CMAN to start
Mar  4 14:10:13 taft-01 clvmd: Can't open cluster manager socket: No such file or directory


Version:
openais-0.80.3-22.el5

Comment 1 Steven Dake 2009-03-05 02:00:40 UTC
cman plugin issue, reassinging.

Comment 2 Steven Dake 2009-03-05 02:03:00 UTC
actually might be an issue with cman's behavior when ccs isn't present.

Comment 3 Christine Caulfield 2009-03-05 09:09:52 UTC
The message 'local node name "taft-01" not found in cluster.conf' is slightly misleading. The node name printed will always be the one that cman got from uname - it's not necessarily the name that it was actually looking for at the time.

In fact it's quite hard to establish what node cman was looking for at the time because it goes through quite a list of things before giving up. This is the list taken from chapter 2 of my CSNetworking.pdf document:

  1. It looks up $HOSTNAME in cluster.conf
  2. If this fails it strips the domain name from $HOSTNAME and looks up that 
     in cluster.conf
  3. If this fails it looks in cluster.conf for a fully-qualified name whose  
     short version matches the short version of $HOSTNAME
  4. If all this fails then it will search the interfaces list for an (ipv4 
     only) address that matches a name in cluster.conf 

So you're expecting cman to find the <name>-bond address in step 4 above but that one is failing sometimes. As you might be able to see, printing a useful nodename in this situation is not possible because its looping through both an interface list and a node list to find a correlation.

There are two possible reasons I can see that might cause an intermittent failure here. One is that the bonding interface isn't set up in time for it to show up in the list of active interfaces, or perhaps the name resolution for the IP address has failed for some reason. It's also possible that the IP address resolves to several names and the -bond one isn't the first in the list ... cman (rather stupidly) only checks the first in that case.

How likely do any of those sound ?

Comment 4 Christine Caulfield 2009-03-05 14:53:15 UTC
Created attachment 334135 [details]
Patch to fix a potential cause

I've found a possible cause for this. If getnameinfo() fails then cman gives up the search in 4 totally, when it should really move on to the next interface.

I've committed a fix for this to STABLE3/3 & master.

This patch is the RHEL5 equivalent of those fixes.

Comment 5 Corey Marthaler 2009-03-16 19:21:00 UTC
Bumping the priority of this bug. I hit tend to hit this issue quite a bit.

Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive Service: started and ready to provide service.
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] local node name "taft-02" not found in cluster.conf
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Error reading CCS info, cannot start
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] Error reading config from CCS
Mar 16 11:45:04 taft-02 openais[6557]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration 

cman-2.0.99-1.el5

Comment 6 Corey Marthaler 2009-03-16 22:07:09 UTC
FWIW, I added a 25 sec sleep to the start of the cman init script (just in case the bonded net interface wasn't setting it's self up in time) and I still hit this bug.

Comment 7 Christine Caulfield 2009-03-17 08:23:00 UTC
Thus bug doesn't need a priority boost, it needs testing.

Comment 8 Christine Caulfield 2009-03-17 08:32:52 UTC
This raises a serious problem with the system here, I think.

You have a bug, I have a patch I'd like you to test. But we can't seem to get together. 

I can't guarantee the patch solves the problem because I can't reproduce your error, but I really need to know whether it does or not before doing more work on it.

Comment 9 Corey Marthaler 2009-03-17 15:02:59 UTC
Just a note... 

I usually see this bug after 2-5 iterations during the "Kill Quorum + 1" scenario of revolver. Yesterday I commented out that test case, and it ran for 32 iterations (almost 15 hours) before eventually hitting this issue while killing less than quorum each iteration.

Comment 10 Christine Caulfield 2009-03-17 16:57:26 UTC
I suspect that's just a co-incidence. All the name checking code is way before anything to do with quorum has happened. There might be an odd side-effect such as more nodes booting up and generating network traffic but I don't think it's conclusive.

Comment 12 Corey Marthaler 2009-04-17 17:01:08 UTC
Just a note that I'm currently testing the above rpms. I'm running into issues however hitting 496130.

Comment 13 Corey Marthaler 2009-04-17 17:04:04 UTC
Appear to have hit this bug with the above rpms containing the patch :

Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive Service: started and ready to provide service.
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] local node name "taft-04" not found in cluster.conf
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Error reading CCS info, cannot start
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] Error reading config from CCS
Apr 17 12:00:04 taft-04 openais[6632]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).

[root@taft-04 ~]# rpm -q cman
cman-2.0.99-1cc.el5

Comment 14 Christine Caulfield 2009-04-20 10:19:17 UTC
That's very strange. I wonder if the bonded interface is delayed in starting up?

I think the best thing I can do now is to add some debugging to the cman startup to display the interfaces it sees, and the names it resolves for them.

Comment 16 Corey Marthaler 2009-04-20 16:36:32 UTC
Here is the output for a normal run:

Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'localhost.localdomain' on lo
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'localhost' on lo
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name '127.0.0.1' on lo
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01' on eth0
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name '10.15.89.67' on eth0
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01-bond.lab.msp.redhat.com' on bond0
Apr 20 11:34:54 taft-01 openais[7292]: CC: Looking for name 'taft-01-bond' on bond0

I hope to grab the output for a failed one in just a bit...

Comment 17 Corey Marthaler 2009-04-20 18:30:04 UTC
Here is the output for a failed run:

Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'localhost.localdomain' on lo
Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'localhost' on lo
Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name '127.0.0.1' on lo 
Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name 'taft-03' on eth0
Apr 20 12:15:20 taft-03 openais[6630]: CC: Looking for name '10.15.89.69' on eth0
Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10.15.91.134' on bond0
Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10' on bond0
Apr 20 12:15:25 taft-03 openais[6630]: CC: Looking for name '10.15.91.134' on bond0

[...]

Apr 20 12:15:25 taft-03 openais[6630]: [MAIN ] local node name "taft-03" not found in cluster.conf

Comment 18 Christine Caulfield 2009-04-21 07:40:03 UTC
Thanks, it does seem to be name resolution that's failing then.

That list should show pairs of name/IP address. But taft-03-bond has resolved to 10.15.91.134 first instead. (the '10' afterwards is cman assuming the multiple dots make it a FQDN!)

I can put some code into getnameinfo that will force it to look for a name. That /might/ force it to return a name rather than IP but I suspect it will more likely just return an error code and cman will move onto the IP address lookup anyway. RPMS for this are in the same place as before.

But if the name is not resolving, I'm not hopeful it will fix things ...

Comment 19 Corey Marthaler 2009-04-22 16:27:40 UTC
Revolver ran much longer (3+ hours) with the latest rpms from yesterday, but did eventually end up hitting the same issue.

Apr 21 15:15:59 taft-02 ccsd[6619]: Starting ccsd 2.0.99:
Apr 21 15:15:59 taft-02 ccsd[6619]:  Built: Apr 21 2009 03:24:54
Apr 21 15:15:59 taft-02 ccsd[6619]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
Apr 21 15:15:59 taft-02 ccsd[6619]: cluster.conf (cluster name = TAFT, version = 4) found.
Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for name 'localhost.localdomain' on lo
Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for short name 'localhost' on lo
Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for IP '127.0.0.1' on lo
Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for name 'taft-02' on eth0
Apr 21 15:16:00 taft-02 openais[6629]: CC: Looking for IP '10.15.89.68' on eth0
Apr 21 15:16:05 taft-02 openais[6629]: CC: Looking for IP '10.15.91.133' on bond0
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive Service: started and ready to provide service.
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] local node name "taft-02" not found in cluster.conf
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Error reading CCS info, cannot start
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] Error reading config from CCS
Apr 21 15:16:05 taft-02 openais[6629]: [MAIN ] AIS Executive exiting (reason: could not read the main configuration file).

Comment 20 Christine Caulfield 2009-04-23 08:29:11 UTC
Thanks,

So it seems that patch is worth having but doesn't solve the real issue ... which is name resolution. If you look at the log above, there is no name that can be resolved for the IP address on bond0. If that's the case then cman has no chance of matching it with anything in cluster.conf of course.

I'll commit this patch (less the extra syslog calls) to the RHEL5.4 branch but I can't see what else can be done inside cman to alleviate this. Doing a sleep & retry would add a very long delay to startup as it tries to match against other interfaces on the system that are not relevant.

Comment 21 Christine Caulfield 2009-05-06 09:57:59 UTC
I've committed the patch to git. If getnameinfo is not resolving the name then there's nothing more in cman that can be done I don't think.

commit ee8d1b5915282b144ff5b74a0c2736b38c172274
Author: Christine Caulfield <ccaulfie>
Date:   Wed May 6 10:52:02 2009 +0100

cman: Improve node name search

Comment 24 Christine Caulfield 2009-05-19 07:57:04 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:   cman iterates the interface list to try and find a match between active network interfaces and names in cluster.conf. If a name lookup fails then cman abandons the whole operation, rather than just that name lookup.

Consequence:  If a name lookup fails for some reason (DNS error) then cman can fail to start. The error message it prints gives the impression that $HOSTNAME was not found in cluster.conf

Fix:   The interface iterator code was fixed to use 'continue' rather than 'break' if a name lookup failed.

Result:  This bug CAN still occur. If the name that cman is looking for is subject to a DNS failure then this message will be printed and cman will not start. fixing this is outside the scope of cman however.

Comment 26 Chris Ward 2009-07-03 18:26:24 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 28 errata-xmlrpc 2009-09-02 11:07:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html


Note You need to log in before you can comment on or make changes to this bug.