Description of problem:

I first noticed this problem late yesterday and have been unable to determine
if the cause is a misconfiguration - but it is reproducible 100% of the time.
We've had some bug fixes lately - are these the current/correct versions of
all of the packages listed below?

Version-Release number of selected component (if applicable):

RHEL5-Server-20061027.0 and:

luci-0.8-23.el5
ricci-0.8-23.el5
selinux-policy-devel-2.4.2-8
selinux-policy-2.4.2-8
selinux-policy-targeted-2.4.2-8
openais-0.80.1-11.el5
cman-2.0.30-3.el5

How reproducible:
100%

Steps to Reproduce:
1. Create a new cluster via the luci web interface.

Before the cluster is created via luci:
------------------------------------------
[root@tng3-3 ~]# service cman status
ccsd is stopped
[root@tng3-3 ~]# chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@tng3-3 ~]# cat /etc/cluster/cluster.conf
cat: /etc/cluster/cluster.conf: No such file or directory
------------------------------------------

After the cluster is created via luci (the cluster.conf file contents are
correct):
-----------------------------------------
[root@tng3-3 ~]# service cman status
groupd is stopped
[root@tng3-3 ~]# chkconfig --list cman
cman            0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@tng3-3 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="node3" config_version="1" name="node3">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
-----------------------------------------

It looks like aisexec is crashing and cannot write a core file:

type=AVC msg=audit(1162589191.313:70): avc: denied { add_name } for pid=2071
comm="aisexec" name="core.2071" scontext=system_u:system_r:ricci_modcluster_t:s0
tcontext=system_u:object_r:sbin_t:s0 tclass=dir

Actual results:
The errors listed above.

Expected results:
No errors.

Additional info:
See the attached /var/log/audit/audit.log file.
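The denied { add_name } on a core.2071 file shows that aisexec is dying and that SELinux is then blocking the core dump, but it does not by itself say which component is at fault. A generic triage sketch for a denial like this - the module name "aisexeclocal" is illustrative, and this only confirms whether policy is the blocker rather than proposing a fix:

------------------------------------------------
# Pull all AVC denials recorded for aisexec out of the audit log:
ausearch -m avc -c aisexec

# Optionally build (without loading) a local policy module from them,
# to see exactly which allow rules the current policy is missing:
ausearch -m avc -c aisexec | audit2allow -M aisexeclocal
------------------------------------------------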
Created attachment 140325 [details] Audit log - 20061103
Are you able to start cman as root, e.g. `service cman start` (with cluster.conf in place)? If so, this is an SELinux policy bug; otherwise, this is an OpenAIS bug.
This looks like an OpenAIS bug - changed the component from Conga to openais.

With SELinux = Enforcing:
------------------------------------------------
[root@tng3-3 ~]# !get
getenforce
Enforcing
[root@tng3-3 ~]# service cman status
ccsd is stopped
[root@tng3-3 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... failed
                                                           [FAILED]
[root@tng3-3 ~]# tail /var/log/debug.log
Nov  6 17:27:29 tng3-3 ccsd[11184]: Starting ccsd 2.0.30:
Nov  6 17:27:29 tng3-3 ccsd[11184]:  Built: Oct 27 2006 15:13:22
Nov  6 17:27:29 tng3-3 ccsd[11184]:  Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Nov  6 17:27:29 tng3-3 ccsd[11184]: Unable to bind socket: Permission denied
------------------------------------------------

And - with SELinux = Permissive:
------------------------------------------------
Nov  6 18:30:56 tng3-3 kernel: Lock_DLM (built Oct 26 2006 16:00:06) installed
Nov  6 18:30:57 tng3-3 ccsd[2064]: Starting ccsd 2.0.30:
Nov  6 18:30:57 tng3-3 ccsd[2064]:  Built: Oct 27 2006 15:13:22
Nov  6 18:30:57 tng3-3 ccsd[2064]:  Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Nov  6 18:30:57 tng3-3 ccsd[2064]: cluster.conf (cluster name = Node3, version = 1) found.

[root@tng3-3 queue]# service cman status
groupd is stopped
[root@tng3-3 queue]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: CCS does not have a nodeid for this node, run 'ccs_tool addnodeids' to fix
/usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]
------------------------------------------------
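For the record, the follow-up that the Permissive-mode error text itself suggests would be the following (assuming the default /etc/cluster/cluster.conf path; as the later comments show, the nodeid was in fact already present and the real culprit was elsewhere):

------------------------------------------------
# Add nodeids to cluster.conf, as the cman error message suggests:
ccs_tool addnodeids

# Verify that each <clusternode> entry now carries a nodeid attribute:
grep nodeid /etc/cluster/cluster.conf
------------------------------------------------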
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.
Actually, this looks like a Conga defect. It is supposed to be putting the nodeid into the cluster.conf file. If the nodeid is missing, none of the cluster code will start.
Just thought of another possibility: at initial startup, ccsd may be finding a different cluster with a RHEL4 cluster.conf file and pulling that one instead of the local one. It could do this if the local version number is less than the one found on the network. Either way, the message about the nodeid is correct. Please include the current cluster.conf file that is on the machine in the bugzilla.
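A quick way to check the version comparison described above (ccsd prefers whichever copy of the configuration, local or network, carries the higher config_version):

------------------------------------------------
# Show the local copy's version number:
grep config_version /etc/cluster/cluster.conf
------------------------------------------------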
Here's the cluster file - what should the value of nodeid be?
===================================
<?xml version="1.0"?>
<cluster alias="node3" config_version="1" name="node3">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="tng3-3.lab.msp.redhat.com" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
</cluster>
==========================================
Nodeid looks okay. Weird. Do you have tng3-3 defined in the /etc/hosts file, or do you just use DHCP? I wonder if there is a difference due to the fully qualified name in the cluster.conf file.
Just looked, hosts file looks goofed up:

[root@tng3-3 etc]# cat hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost610.15.89.176 tng3-3
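This explains the failures: with the newline missing, both "localhost610.15.89.176" and "tng3-3" become aliases on the ::1 line, so the node name resolves to the IPv6 loopback instead of 10.15.89.176, and the FQDN used in cluster.conf is not covered by /etc/hosts at all. A quick check with getent (part of glibc):

------------------------------------------------
# With the broken hosts file this returns ::1, not 10.15.89.176:
getent hosts tng3-3

# Compare against the name cluster.conf actually uses:
getent hosts tng3-3.lab.msp.redhat.com
------------------------------------------------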
Just saw that too - thanks! The machines were reloaded last Thursday - I didn't edit /etc/hosts, but I also did not look at it. I'll correct it and retry the tests. Very strange.
Re: Comment #9 -- See BZ 210050 -- should be fixed in later trees than the one installed on the tngs.
Correcting the /etc/hosts file - inserting the missing <CR> so that it reads (correctly):

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
10.15.89.176    tng3-3

solved the problem. Marking this defect as a dup of 210050.

*** This bug has been marked as a duplicate of 210050 ***
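For completeness, a verification sketch that would confirm the fix (not from the original report - just the obvious check, assuming the corrected hosts file above):

------------------------------------------------
# tng3-3 should now resolve to 10.15.89.176 rather than ::1:
getent hosts tng3-3

# ... and the cluster stack should start cleanly:
service cman start
------------------------------------------------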