Description of problem:

I first noticed this problem late yesterday and have been unable to determine
if the cause is a misconfiguration - but it is reproducible 100% of the time.
We've had some bug fixes lately - are these the current/correct versions of
all of the packages listed below?

Version-Release number of selected component (if applicable):

RHEL5-Server-20061027.0 and:

luci-0.8-23.el5
ricci-0.8-23.el5
selinux-policy-devel-2.4.2-8
selinux-policy-2.4.2-8
selinux-policy-targeted-2.4.2-8
openais-0.80.1-11.el5
cman-2.0.30-3.el5

How reproducible:
100%

Steps to Reproduce:
1. Create a new cluster via the luci web interface.

Before the cluster is created via luci:
------------------------------------------
[root@tng3-3 ~]# service cman status
ccsd is stopped
[root@tng3-3 ~]# chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@tng3-3 ~]# cat /etc/cluster/cluster.conf
cat: /etc/cluster/cluster.conf: No such file or directory
------------------------------------------

After the cluster is created via luci (the cluster.conf file contents are
correct):
-----------------------------------------
[root@tng3-3 ~]# service cman status
groupd is stopped
[root@tng3-3 ~]# chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@tng3-3 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="node3" config_version="1" name="node3">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
-----------------------------------------

It looks like aisexec is crashing and cannot write a core file:

type=AVC msg=audit(1162589191.313:70): avc: denied { add_name } for pid=2071
comm="aisexec" name="core.2071" scontext=system_u:system_r:ricci_modcluster_t:s0
tcontext=system_u:object_r:sbin_t:s0 tclass=dir

Actual results:
The errors listed above.

Expected results:
No errors.

Additional info:
See the attached /var/log/audit/audit.log file.
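The denied { add_name } on a core.2071 file shows that aisexec is dying and that SELinux is then blocking the core dump, but it does not by itself say which component is at fault. A generic triage sketch for a denial like this - the module name "aisexeclocal" is illustrative, and this only confirms whether policy is the blocker rather than proposing a fix:

------------------------------------------------
# Pull all AVC denials recorded for aisexec out of the audit log:
ausearch -m avc -c aisexec

# Optionally build (without loading) a local policy module from them,
# to see exactly which allow rules the current policy is missing:
ausearch -m avc -c aisexec | audit2allow -M aisexeclocal
------------------------------------------------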
Created attachment 140325 [details] Audit log - 20061103
Are you able to start cman as root, e.g. `service cman start` (with cluster.conf in place)? If so, this is an SELinux policy bug; otherwise, this is an OpenAIS bug.
This looks like an OpenAIS bug - changed the component from Conga to openais.

With SELinux = Enforcing:
------------------------------------------------
[root@tng3-3 ~]# !get
getenforce
Enforcing
[root@tng3-3 ~]# service cman status
ccsd is stopped
[root@tng3-3 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... failed
                                                           [FAILED]
[root@tng3-3 ~]# tail /var/log/debug.log
Nov  6 17:27:29 tng3-3 ccsd[11184]: Starting ccsd 2.0.30:
Nov  6 17:27:29 tng3-3 ccsd[11184]:  Built: Oct 27 2006 15:13:22
Nov  6 17:27:29 tng3-3 ccsd[11184]:  Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Nov  6 17:27:29 tng3-3 ccsd[11184]: Unable to bind socket: Permission denied
------------------------------------------------

And - with SELinux = Permissive:
------------------------------------------------
Nov  6 18:30:56 tng3-3 kernel: Lock_DLM (built Oct 26 2006 16:00:06) installed
Nov  6 18:30:57 tng3-3 ccsd[2064]: Starting ccsd 2.0.30:
Nov  6 18:30:57 tng3-3 ccsd[2064]:  Built: Oct 27 2006 15:13:22
Nov  6 18:30:57 tng3-3 ccsd[2064]:  Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Nov  6 18:30:57 tng3-3 ccsd[2064]: cluster.conf (cluster name = Node3, version = 1) found.

[root@tng3-3 queue]# service cman status
groupd is stopped
[root@tng3-3 queue]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: CCS does not have a nodeid for this node, run 'ccs_tool addnodeids' to fix
/usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]
------------------------------------------------
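For the record, the follow-up that the Permissive-mode error text itself suggests would be the following (assuming the default /etc/cluster/cluster.conf path; as the later comments show, the nodeid was in fact already present and the real culprit was elsewhere):

------------------------------------------------
# Add nodeids to cluster.conf, as the cman error message suggests:
ccs_tool addnodeids

# Verify that each <clusternode> entry now carries a nodeid attribute:
grep nodeid /etc/cluster/cluster.conf
------------------------------------------------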
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.
Actually, this looks like a Conga defect. It is supposed to be putting the nodeid into the cluster.conf file. If the nodeid is missing, none of the cluster code will start.
Just thought of another possibility: at initial startup, ccsd may be finding a different cluster with a RHEL4 cluster.conf file and pulling that one instead of the local one. It could do this if the local version number is less than the one found on the network. Either way, the message about the nodeid is correct. Please include the current cluster.conf file that is on the machine in the bugzilla.
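A quick way to check the version comparison described above (ccsd prefers whichever copy of the configuration, local or network, carries the higher config_version):

------------------------------------------------
# Show the local copy's version number:
grep config_version /etc/cluster/cluster.conf
------------------------------------------------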
Here's the cluster file - what should the value of nodeid be?
===================================
<?xml version="1.0"?>
<cluster alias="node3" config_version="1" name="node3">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
==========================================
Nodeid looks okay. Weird. Do you have tng3-3 defined in the /etc/hosts file, or do you just use DHCP? I wonder if there is a difference due to the fully qualified name in the cluster.conf file.
Just looked, hosts file looks goofed up:

[root@tng3-3 etc]# cat hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost610.15.89.176 tng3-3
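This explains the failures: with the newline missing, both "localhost610.15.89.176" and "tng3-3" become aliases on the ::1 line, so the node name resolves to the IPv6 loopback instead of 10.15.89.176, and the FQDN used in cluster.conf is not covered by /etc/hosts at all. A quick check with getent (part of glibc):

------------------------------------------------
# With the broken hosts file this returns ::1, not 10.15.89.176:
getent hosts tng3-3

# Compare against the name cluster.conf actually uses:
getent hosts tng3-3.lab.msp.redhat.com
------------------------------------------------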
Just saw that too - thanks! The machines were reloaded last Thursday - I didn't edit /etc/hosts, but I also did not look at it. I'll correct it and retry the tests. Very strange.
Re: Comment #9 -- See BZ 210050 -- should be fixed in later trees than the one installed on the tngs.
Correcting the /etc/hosts file - inserting the missing <CR> so that it reads (correctly):

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.15.89.176    tng3-3

solved the problem. Marking this defect as a dup of 210050.

*** This bug has been marked as a duplicate of 210050 ***
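For completeness, a verification sketch that would confirm the fix (not from the original report - just the obvious check, assuming the corrected hosts file above):

------------------------------------------------
# tng3-3 should now resolve to 10.15.89.176 rather than ::1:
getent hosts tng3-3

# ... and the cluster stack should start cleanly:
service cman start
------------------------------------------------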