Description of problem:

There was the following HA setup:
  server1 with agent1
  server2 with agent2
  agent3 connected to server1
  agent4 connected to server2

An operation scheduled on a group of all agents failed with "AuthenticationException: Command failed to be authenticated! This command will be ignored and not processed". This exception is visible in server2.log and in agent1.log, which means that server2 is complaining about agent1 even though agent1 should be connected to server1.

Version-Release number of selected component (if applicable):
Version: 4.10.0-SNAPSHOT
Build Number: 9088ac7

How reproducible:
Only by automation; there are no definite steps to reproduce this manually.

Steps to Reproduce:
1. Install server1 (the database is located here).
2. Install server2.
3. Install agents 3 and 4 (may not be necessary).
4. Check that HA works: go to Administration -> Agents and verify that the agents are distributed evenly across the servers (two agents on each server).
5. Create a resource group containing all agents.
6. Schedule the 'Get Current Date/Time' operation on the group.

Actual results:
The operation fails with "AuthenticationException: Command failed to be authenticated! This command will be ignored and not processed".

Expected results:
No errors.

Additional info:
Failover works: I stopped server1 and agents 1 and 3 successfully reconnected to server2. In this state the operation failed with the same error. Then I started server1 again and used the agent operation 'Switch to server' to move all agents to server1. The same operation on the group of agents then worked correctly. Then I moved all agents to server2 and launched the same operation again, and it was successful. Since then, every other combination I have tried has worked correctly.
I noticed the following exceptions in agent1.log, which could be connected to this issue:

2014-03-03 03:04:33,287 WARN [InventoryManager.discovery-1] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)]

2014-03-03 03:04:33,287 ERROR [InventoryManager.discovery-1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, rhq.externalizable-strategy=AGENT, rhq.security-token=cCOI30VeTPEh8IATwdGIcmSXfKur29C8r1oK8pThnyFNx/FqyFIbFO06jH8XHr330+w=, rhq.timeout=0, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeInventoryReport], targetInterfaceName=org.rhq.core.clientapi.server.discovery.DiscoveryServerService}]]. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s) -> java.util.ConcurrentModificationException:null. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)

2014-03-03 03:04:33,289 WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Exception caught while executing runtime discovery scan rooted at [Resource[id=10003, uuid=aaea5f5a-6dda-4003-b221-5959dcf08a9f, type={Postgres}Postgres Server, key=jdbc:postgresql://127.0.0.1:5432/postgres, name=rhq, parent=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, version=8.4.13]].
org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.makeInvocation(HTTPClientInvoker.java:271)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.transport(HTTPClientInvoker.java:176)
    at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:169)
    at org.jboss.remoting.Client.invoke(Client.java:2084)
    at org.jboss.remoting.Client.invoke(Client.java:879)
    ...
Caused by: java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextEntry(HashMap.java:894)
    at java.util.HashMap$KeyIterator.next(HashMap.java:928)
    at java.util.HashSet.writeObject(HashSet.java:283)
    at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
    ...

The only agent that threw this exception was agent1.
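For context on the `Caused by` above: `HashSet.writeObject` iterates the set's backing `HashMap` while serializing it, and that iterator is fail-fast, so a structural modification from another thread mid-serialization produces exactly this `ConcurrentModificationException`. The following minimal sketch (standalone, not RHQ code; the property names are just illustrative) reproduces the failure mode deterministically by mutating a set between obtaining its iterator and advancing it:

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class CmeDemo {
    // Returns true if mutating the set mid-iteration triggers a CME,
    // mirroring what happens when HashSet.writeObject walks a set that
    // another thread modifies concurrently.
    static boolean mutationDuringIterationThrows() {
        Set<String> props = new HashSet<>();
        props.add("rhq.agent-name");
        props.add("rhq.security-token");

        Iterator<String> it = props.iterator(); // serialization iterates like this
        props.add("rhq.timeout");               // stands in for a concurrent mutation
        try {
            it.next();                          // fail-fast iterator checks modCount here
            return false;
        } catch (ConcurrentModificationException e) {
            return true;                        // structural change was detected
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + mutationDuringIterationThrows()); // prints "CME thrown: true"
    }
}
```

In the real failure the mutation would come from another agent thread touching a shared collection while the remoting layer serializes the command, which is why the CME surfaces intermittently rather than on every send.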
If this happens again please attach the relevant server and agent log. Or, do you still have them?
(In reply to Jay Shaughnessy from comment #1)
> If this happens again please attach the relevant server and agent log. Or,
> do you still have them?

I do. The attachment contains the complete server1.log, server2.log, and agent1.log. Those logs cover the complete automation run, but the steps described in the Additional info section (failover, the 'Switch to server' operation) were performed manually later and are not visible in the logs. Use the following string to find the relevant lines: cCOI30VeTPEh8IATwdGIcmSXfKur29C8r1oK8pThnyFNx
Created attachment 871242 [details] logs
So this basically indicates that the agent (agent2) had a token in its config (i.e., Java prefs) that did not match the token sent from the server (and presumably stored in the DB for agent2). The token in the error message should be the token sent from the server. If this happens again, it may be useful to verify that the token is in fact the expected token stored in the DB for agent2. Also, looking at agent2's config should indicate the token it expects. Unless the automation env somehow has server1 and server2 pointing at *different* databases, it's not yet clear how the wrong token could be sent. Still investigating...
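The check described above boils down to a straight comparison: the incoming command carries an `rhq.security-token`, and the receiver rejects the command when it does not equal the token on record. A minimal sketch of that idea (hypothetical class, not the actual RHQ authenticator):

```java
// Hypothetical illustration of the token check implied by the comment above:
// a command is authenticated by comparing its rhq.security-token against the
// token the receiver has on record (agent prefs on one side, server DB on the
// other). A stale record on either side makes the comparison fail and the
// command gets rejected with an AuthenticationException-style error.
class CommandAuthenticator {
    private final String expectedToken; // token on record for this agent

    CommandAuthenticator(String expectedToken) {
        this.expectedToken = expectedToken;
    }

    boolean authenticate(String commandToken) {
        // Mismatch means one side holds an outdated token.
        return expectedToken.equals(commandToken);
    }
}
```

This is why the diagnosis focuses on which side holds the stale value: the token in the error message, the agent's Java prefs, and the DB row for the agent should all agree.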
assigning to jay - he's gonna commit some fix for this. Hopefully, it will also fix the issue that I just set as "blocked" on this one. see bug #1074470
master commit 50a9e9a33df90dae9699b481886f5b93e25bd12d
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Mar 10 12:11:43 2014 -0400

    [1071994] Agent's operation failed with AuthenticationException in HA set up

    The issue is that each HA node holds an AgentClient cache. If the agent
    gets deleted (e.g. platform uninventoried) the cache is only cleaned on
    the one HA node handling the delete. Ensure dirty cache members are
    detected and updated on other HA nodes, as needed.

    This is only an issue if a deleted platform then gets re-inventoried
    with the same name/address/port.
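The pattern the commit describes can be sketched as a per-node cache that revalidates each entry against the authoritative token in the DB on every lookup, so an entry left behind by a deleted-and-reinventoried agent is detected as dirty and rebuilt rather than reused with a stale token. This is only an illustration of the idea under assumed names (`AgentClientCache`, `Client`, a `dbTokenLookup` function), not the actual RHQ fix:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a per-HA-node client cache keyed by agent name. Each lookup
// compares the cached client's security token against the current token in
// the shared DB; a mismatch marks the entry dirty and it is rebuilt.
class AgentClientCache {
    record Client(String agentName, String token) {}

    private final Map<String, Client> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbTokenLookup; // agentName -> current DB token

    AgentClientCache(Function<String, String> dbTokenLookup) {
        this.dbTokenLookup = dbTokenLookup;
    }

    Client getClient(String agentName) {
        String currentToken = dbTokenLookup.apply(agentName);
        // compute() revalidates atomically: keep the cached client only if
        // its token still matches the DB; otherwise build a fresh one.
        return cache.compute(agentName, (name, cached) ->
            (cached != null && cached.token().equals(currentToken))
                ? cached
                : new Client(name, currentToken));
    }
}
```

With this shape, the node that handled the delete and the nodes that did not converge on the same client state, because the DB token, not the local cache, is the source of truth.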
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.