Description of problem:

There was the following HA setup:
  server1 with agent1
  server2 with agent2
  agent3 connected to server1
  agent4 connected to server2

An operation scheduled on a group of all agents failed with "AuthenticationException: Command failed to be authenticated! This command will be ignored and not processed". This exception is visible in server2.log and in agent1.log, which means that server2 is complaining about agent1 even though agent1 should be connected to server1.

Version-Release number of selected component (if applicable):
Version: 4.10.0-SNAPSHOT
Build Number: 9088ac7

How reproducible:
Only by automation; there are no definite steps to reproduce this manually.

Steps to Reproduce:
1. Install server1 (the database is located here).
2. Install server2.
3. Install agents 3 and 4 (may not be necessary).
4. Check that HA works: go to Administration -> Agents and verify that the agents are distributed evenly across the servers (two agents on each server).
5. Create a resource group containing all agents.
6. Schedule the 'Get Current Date/Time' operation on the group.

Actual results:
The operation fails with "AuthenticationException: Command failed to be authenticated! This command will be ignored and not processed".

Expected results:
No errors.

Additional info:
Failover works: I stopped server1 and agents 1 and 3 successfully reconnected to server2. In this state the operation failed with the same error. Then I started server1 again and used the agent operation 'Switch to server' to move all agents to server1. The same operation on the group of agents then worked correctly. Then I moved all agents to server2 and launched the same operation again, and it was successful. Since then, every other combination I have tried has worked correctly.
I noticed the following exceptions in agent1.log, which could be connected to this issue:

2014-03-03 03:04:33,287 WARN [InventoryManager.discovery-1] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [1]. Exception that triggered the failover: [org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)]

2014-03-03 03:04:33,287 ERROR [InventoryManager.discovery-1] (enterprise.communications.command.client.ClientCommandSenderTask)- {ClientCommandSenderTask.send-failed}Failed to send command [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, rhq.externalizable-strategy=AGENT, rhq.security-token=cCOI30VeTPEh8IATwdGIcmSXfKur29C8r1oK8pThnyFNx/FqyFIbFO06jH8XHr330+w=, rhq.timeout=0, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeInventoryReport], targetInterfaceName=org.rhq.core.clientapi.server.discovery.DiscoveryServerService}]]. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s) -> java.util.ConcurrentModificationException:null. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)

2014-03-03 03:04:33,289 WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Exception caught while executing runtime discovery scan rooted at [Resource[id=10003, uuid=aaea5f5a-6dda-4003-b221-5959dcf08a9f, type={Postgres}Postgres Server, key=jdbc:postgresql://127.0.0.1:5432/postgres, name=rhq, parent=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, version=8.4.13]].
org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.makeInvocation(HTTPClientInvoker.java:271)
    at org.jboss.remoting.transport.http.HTTPClientInvoker.transport(HTTPClientInvoker.java:176)
    at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:169)
    at org.jboss.remoting.Client.invoke(Client.java:2084)
    at org.jboss.remoting.Client.invoke(Client.java:879)
    ...
Caused by: java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextEntry(HashMap.java:894)
    at java.util.HashMap$KeyIterator.next(HashMap.java:928)
    at java.util.HashSet.writeObject(HashSet.java:283)
    at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
    ...

The only agent that threw this exception was agent1.
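For context on the `Caused by` above: `HashSet.writeObject` iterates the set's backing `HashMap` while serializing it, and that iterator is fail-fast, so a structural modification from another thread mid-serialization produces exactly this `ConcurrentModificationException`. The following minimal sketch (standalone, not RHQ code; the property names are just illustrative) reproduces the failure mode deterministically by mutating a set between obtaining its iterator and advancing it:

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class CmeDemo {
    // Returns true if mutating the set mid-iteration triggers a CME,
    // mirroring what happens when HashSet.writeObject walks a set that
    // another thread modifies concurrently.
    static boolean mutationDuringIterationThrows() {
        Set<String> props = new HashSet<>();
        props.add("rhq.agent-name");
        props.add("rhq.security-token");

        Iterator<String> it = props.iterator(); // serialization iterates like this
        props.add("rhq.timeout");               // stands in for a concurrent mutation
        try {
            it.next();                          // fail-fast iterator checks modCount here
            return false;
        } catch (ConcurrentModificationException e) {
            return true;                        // structural change was detected
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + mutationDuringIterationThrows()); // prints "CME thrown: true"
    }
}
```

In the real failure the mutation would come from another agent thread touching a shared collection while the remoting layer serializes the command, which is why the CME surfaces intermittently rather than on every send.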
If this happens again please attach the relevant server and agent log. Or, do you still have them?
(In reply to Jay Shaughnessy from comment #1)
> If this happens again please attach the relevant server and agent log. Or,
> do you still have them?

I do. The attachment contains the complete server1.log, server2.log, and agent1.log. Those logs cover the complete automation run, but the steps described in the Additional info section (failover, the 'Switch to server' operation) were performed manually later and are not visible in the logs. Use the following string to find the relevant lines: cCOI30VeTPEh8IATwdGIcmSXfKur29C8r1oK8pThnyFNx
Created attachment 871242 [details] logs
So this basically indicates that the agent (agent2) had a token in its config (i.e., Java prefs) that did not match the token sent from the server (and presumably stored in the DB for agent2). The token in the error message should be the token sent from the server. If this happens again, it may be useful to verify that the token is in fact the expected token stored in the DB for agent2. Also, looking at agent2's config should indicate the token it expects. Unless the automation env somehow has server1 and server2 pointing at *different* databases, it's not yet clear how the wrong token could be sent. Still investigating...
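The check described above boils down to a straight comparison: the incoming command carries an `rhq.security-token`, and the receiver rejects the command when it does not equal the token on record. A minimal sketch of that idea (hypothetical class, not the actual RHQ authenticator):

```java
// Hypothetical illustration of the token check implied by the comment above:
// a command is authenticated by comparing its rhq.security-token against the
// token the receiver has on record (agent prefs on one side, server DB on the
// other). A stale record on either side makes the comparison fail and the
// command gets rejected with an AuthenticationException-style error.
class CommandAuthenticator {
    private final String expectedToken; // token on record for this agent

    CommandAuthenticator(String expectedToken) {
        this.expectedToken = expectedToken;
    }

    boolean authenticate(String commandToken) {
        // Mismatch means one side holds an outdated token.
        return expectedToken.equals(commandToken);
    }
}
```

This is why the diagnosis focuses on which side holds the stale value: the token in the error message, the agent's Java prefs, and the DB row for the agent should all agree.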
assigning to jay - he's gonna commit some fix for this. Hopefully, it will also fix the issue that I just set as "blocked" on this one. see bug #1074470
master commit 50a9e9a33df90dae9699b481886f5b93e25bd12d
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Mar 10 12:11:43 2014 -0400

    [1071994] Agent's operation failed with AuthenticationException in HA set up

    The issue is that each HA node holds an AgentClient cache. If the agent
    gets deleted (e.g. platform uninventoried) the cache is only cleaned on
    the one HA node handling the delete. Ensure dirty cache members are
    detected and updated on other HA nodes, as needed.

    This is only an issue if a deleted platform then gets re-inventoried
    with the same name/address/port.
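The pattern the commit describes can be sketched as a per-node cache that revalidates each entry against the authoritative token in the DB on every lookup, so an entry left behind by a deleted-and-reinventoried agent is detected as dirty and rebuilt rather than reused with a stale token. This is only an illustration of the idea under assumed names (`AgentClientCache`, `Client`, a `dbTokenLookup` function), not the actual RHQ fix:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of a per-HA-node client cache keyed by agent name. Each lookup
// compares the cached client's security token against the current token in
// the shared DB; a mismatch marks the entry dirty and it is rebuilt.
class AgentClientCache {
    record Client(String agentName, String token) {}

    private final Map<String, Client> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbTokenLookup; // agentName -> current DB token

    AgentClientCache(Function<String, String> dbTokenLookup) {
        this.dbTokenLookup = dbTokenLookup;
    }

    Client getClient(String agentName) {
        String currentToken = dbTokenLookup.apply(agentName);
        // compute() revalidates atomically: keep the cached client only if
        // its token still matches the DB; otherwise build a fresh one.
        return cache.compute(agentName, (name, cached) ->
            (cached != null && cached.token().equals(currentToken))
                ? cached
                : new Client(name, currentToken));
    }
}
```

With this shape, the node that handled the delete and the nodes that did not converge on the same client state, because the DB token, not the local cache, is the source of truth.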
Bulk closing of 4.10 issues. If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.