Description of problem:
If, in creating a new cluster, a node's password is entered incorrectly, that node is still rebooted and the cluster creation is attempted - but it fails.

Version-Release number of selected component (if applicable):
luci-0.8-21.el5 and ricci-0.8-21.el5

How reproducible:
100%

Steps to Reproduce:
1. Create a new cluster
2. Mistype the password on one of the nodes
3. When the user tries to access the cluster or view the cluster list, luci displays this error: "An error occurred when trying to contact any of the nodes in the <name> cluster."

Actual results:
luci displays this error: "An error occurred when trying to contact any of the nodes in the <name> cluster."

Expected results:
The bad password should be caught before the cluster creation/node rebooting is performed.

Additional info:
This is the /var/lib/ricci/queue entry that is created on the node with the bad password:
-----------------------------------------------
<?xml version="1.0"?>
<batch batch_id="2022661279" status="4">
  <module name="rpm" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="install">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="reboot" status="0">
    <response API_version="1.0" sequence="">
      <function_response function_name="reboot_now">
        <var mutable="false" name="success" type="boolean" value="true"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="4">
    <response API_version="1.0" sequence="">
      <function_response function_name="set_cluster.conf">
        <var mutable="false" name="success" type="boolean" value="false"/>
        <var mutable="false" name="error_code" type="int" value="-1"/>
        <var mutable="false" name="error_description" type="string" value="failed to create /etc/cluster/"/>
      </function_response>
    </response>
  </module>
  <module name="cluster" status="5">
    <request API_version="1.0">
      <function_call name="start_node">
        <var mutable="false" name="cluster_startup" type="boolean" value="true"/>
      </function_call>
    </request>
  </module>
</batch>
-----------------------------------------------

Here's what shows up in the debug log on the luci server. The cluster name is "badpassword":
-----------------------------------------------
Oct 27 08:27:47 tng3-1 luci[26626]: received from tng3-2.lab.msp.redhat.com XML "<?xml version="1.0" ?><ricci authenticated="true" success="0" version="1.0"> <batch batch_id="820899843" status="0"> <module name="cluster" status="0"> <response API_version="1.0" sequence=""> <function_response function_name="status"> <var mutable="false" name="status" type="xml"> <cluster alias="nodes_2_3" cluster_version="5" minQuorum="1" name="nodes_2_3" quorate="true" votes="2"> <node clustered="true" name="tng3-2.lab.msp.redhat.com" online="true" uptime="3746" votes="1"/> <node clustered="true" name="tng3-3.lab.msp.redhat.com" online="true" uptime="3746" votes="1"/> </cluster> </var> <var mutable="false" name="success" type="boolean" value="true"/> </function_response> </response> </module> </batch> </ricci>"
Oct 27 08:27:48 tng3-1 luci[26626]: Connected to tng3-4.lab.msp.redhat.com:11111
Oct 27 08:27:48 tng3-1 luci[26626]: Received XML "<?xml version="1.0"?> <ricci authenticated="true" hostname="tng3-4.lab.msp.redhat.com" os="Red Hat Enterprise Linux Server release 4.91 (Tikanga)" version="1.0" xen_host="false"/> " from host tng3-4.lab.msp.redhat.com
Oct 27 08:27:48 tng3-1 luci[26626]: Received header from tng3-4.lab.msp.redhat.com: "<?xml version="1.0" ?><ricci authenticated="true" hostname="tng3-4.lab.msp.redhat.com" os="Red Hat Enterprise Linux Server release 4.91 (Tikanga)" version="1.0" xen_host="false"/>"
Oct 27 08:27:48 tng3-1 luci[26626]: [auth 1] reported cluster_info = (,) for tng3-4.lab.msp.redhat.com
Oct 27 08:27:48 tng3-1 luci[26626]: tng3-4.lab.msp.redhat.com reports it's in cluster :; we expect badpassword
Oct 27 08:27:48 tng3-1 luci[26626]: no ricci agent could be found for cluster badpassword
-----------------------------------------------
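For anyone triaging similar queue entries: the failure is visible directly in the batch XML, where one module's function_response carries success="false". A minimal sketch (not luci's actual code) of pulling that out with the standard library follows; the status-code reading (0 = done, 4 = failed, 5 = not yet run) is an assumption inferred from the excerpt above, not from ricci documentation.

```python
# Sketch only: scan a ricci batch queue entry and report the first module
# whose function_response carries success="false", plus its error text.
# Hypothetical helper, not part of luci/ricci.
import xml.etree.ElementTree as ET

def first_failed_module(batch_xml: str):
    root = ET.fromstring(batch_xml)
    for module in root.iter("module"):
        for var in module.iter("var"):
            if var.get("name") == "success" and var.get("value") == "false":
                desc = next((v.get("value") for v in module.iter("var")
                             if v.get("name") == "error_description"), None)
                return module.get("name"), desc
    return None  # no failed module found

BATCH = """<?xml version="1.0"?>
<batch batch_id="2022661279" status="4">
  <module name="rpm" status="0">
    <response><function_response function_name="install">
      <var name="success" type="boolean" value="true"/>
    </function_response></response>
  </module>
  <module name="cluster" status="4">
    <response><function_response function_name="set_cluster.conf">
      <var name="success" type="boolean" value="false"/>
      <var name="error_description" type="string" value="failed to create /etc/cluster/"/>
    </function_response></response>
  </module>
</batch>"""

print(first_failed_module(BATCH))
# → ('cluster', 'failed to create /etc/cluster/')
```

Run against the full queue entry above, this points at the cluster module's set_cluster.conf failure ("failed to create /etc/cluster/"), which is what the later comments dig into.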
What I believe happened here is that the cluster node has an associated storage system. When you add a cluster, luci creates storage system nodes as a convenience so that storage can be probed in the storage area. When you remove a cluster, luci will only unauthenticate from the ricci agent on a host if that host is neither a member of a managed cluster nor registered as a storage system. If luci does not unauthenticate, the password provided in the dialog is ignored, as authentication is not needed. If there is neither a cluster node nor a storage system for the host where you saw this, could you please change the state of the bug back to NEW?
The nodes in question had first been defined as storage nodes, via the homebase 'add a system' selection. I was adding them as cluster nodes after that and mistyped a password.
OK, this is the expected behavior, then. Passwords are checked only if the nodes are not already authenticated.
I don't think we can close out this bug - the net result for the user is that a bad/mistyped password results in a node being rebooted and an unusable cluster. If a bad password is entered, the node in question should not be rebooted.
I think you're seeing two different issues here: one relating to authentication, and a second one that's causing a cluster node to misbehave during cluster deployment. I think the correlation is incidental. Currently, we can't check whether a password is correct without unauthenticating first, and we've decided we don't want to do that if the node has been entered as a storage system. From this excerpt of the queue file:

<module name="cluster" status="4">
  <response API_version="1.0" sequence="">
    <function_response function_name="set_cluster.conf">
      <var mutable="false" name="success" type="boolean" value="false"/>
      <var mutable="false" name="error_code" type="int" value="-1"/>
      <var mutable="false" name="error_description" type="string" value="failed to create /etc/cluster/"/>
    </function_response>
  </response>
</module>

it looks like what's causing the cluster creation to fail is the problem described in bz# 212582 (SELinux enforcing prevents creation of /etc/cluster/cluster.conf). Are you able to reproduce this with the latest SELinux policy and the latest conga build?
Good point! I think that you're correct on the 2nd point - the failure to create the cluster - I'll retest that with the new build (22.el5) today. On the other point, maybe it's a GUI/usability issue. If a user first defines a node as a storage system, and that node is successfully authenticated, then maybe we need a different approach for having users select those storage nodes for inclusion in a cluster. Right now, the user enters the node name and password, and the password is not checked. I'm thinking that discarding any user input may cause confusion. Maybe we should list the authenticated nodes in a pull-down list and only require the user to enter the names of nodes that are not already authenticated?
Changed the summary to reflect the actual problem: "luci and ricci - if node is already authenticated, user-supplied node password is ignored". I haven't been able to find any real functional impact to the user.
- If the node has already been authenticated - by being added to the system list before it is added to a cluster - then whatever the user enters as a password is ignored and the node is added to the cluster.
- If the node is not yet authenticated when it is added to a cluster, then the password is verified before it is rebooted as part of the cluster creation process.
So I guess that it's not an issue for beta2, but we should look at it for GA - whenever we require a user to fill in a field, we should do something with the data.
In order to reliably verify that the user is authenticated to hosts, we'd have to make one extra connection to each host to be added. If you're attempting to add a large cluster, this overhead may be unacceptable. I have code in -HEAD that should print a warning message informing the user that, because they're already authenticated, the password they supplied was ignored, if this case arises. Do you think that's a sufficient fix for the problem?
With:
ricci-0.8-30.el5
luci-0.8-30.el5
No error is displayed to the user, and the cluster creation is successful. Marking the bz as verified.