Description of problem:

I spent a few hours trying to figure out why certain nodes were not connecting and my cluster could not start. It turned out that rhq-storage-auth.conf was missing nodes that I needed; the file had been maintained manually. The authenticator should output a WARN message when it fails to authenticate a node, perhaps at most 2 or 3 times, to keep the log from flooding.

Version-Release number of selected component (if applicable): 4.9

How reproducible: Always

Steps to Reproduce:
1. Remove a host from rhq-storage-auth.conf
2. Attempt to start the cluster

Actual results: The cluster cannot start.

Expected results: The cluster still fails to start, but the log explains why.

Additional info:
RHQ updates the rhq-storage-auth.conf file when nodes are added to or removed from the cluster. The only time a user directly edits the file is when multiple storage nodes are deployed before the RHQ server is installed. With that said, it is entirely possible for the file to be incorrect. We could certainly see about adding some logging, but we would want to keep it light and fast since the authenticator executes at the bottom of the C* stack, in the messaging layer. More importantly though, we need a comprehensive solution for when new nodes fail to join the cluster or when existing nodes cannot communicate with it. The cluster status column in the storage node UI already addresses deployment scenarios: if a node's cluster status is DOWN, it can be assumed that the node is not part of the cluster. It does not, however, address post-deployment scenarios.
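For illustration, here is a minimal sketch of what bounded, rate-limited WARN logging could look like on the rejection path. The class name, the authorizedAddresses parameter, and the MAX_WARNINGS cap are assumptions made for this example and are not RHQ's actual authenticator implementation:

import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Illustrative sketch only -- not RHQ's actual authenticator. Shows how a
 * WARN could be emitted a bounded number of times when a peer is rejected
 * because its address is missing from rhq-storage-auth.conf.
 */
public class RateLimitedAuthSketch {

    private static final Logger log = LoggerFactory.getLogger(RateLimitedAuthSketch.class);

    // Cap the WARNs so a misconfigured peer that retries on every
    // gossip round cannot flood the log.
    private static final int MAX_WARNINGS = 3;

    private final AtomicInteger warningsLogged = new AtomicInteger();

    public boolean authenticate(InetAddress remoteAddress, Set<InetAddress> authorizedAddresses) {
        if (authorizedAddresses.contains(remoteAddress)) {
            return true;
        }
        // Single atomic increment on the failure path; the success path
        // pays nothing, keeping the authenticator light and fast.
        int count = warningsLogged.incrementAndGet();
        if (count <= MAX_WARNINGS) {
            log.warn("Rejecting internode connection from {} because it is not listed in "
                    + "rhq-storage-auth.conf. Add the address and restart the node if this "
                    + "host should be part of the cluster.{}",
                    remoteAddress,
                    count == MAX_WARNINGS ? " Suppressing further authentication warnings." : "");
        }
        return false;
    }
}

The success path stays unchanged and the failure path adds only one atomic increment, so the overhead in the messaging layer should be negligible while still telling the operator exactly which address to add to rhq-storage-auth.conf.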