Bug 995424

Summary: Agent is not able to communicate with server
Product: [Other] RHQ Project Reporter: Filip Brychta <fbrychta>
Component: Agent, Communications Subsystem, Core ServerAssignee: Nobody <nobody>
Status: ON_DEV --- QA Contact:
Severity: urgent Docs Contact:
Priority: medium    
Version: 4.8CC: hrupp, mazz
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
agent log
none
server log
none
retestedLogs.tar.gz none

Description Filip Brychta 2013-08-09 10:23:20 UTC
Created attachment 784789 [details]
agent log

Version-Release number of selected component (if applicable):
Version: 4.9.0-SNAPSHOT
Build Number: 9632b96

How reproducible:
3/3

Steps to Reproduce:
1. install rhq: ./rhqctl install


Actual results:
Local agent is registered but not available. I can see following error in agent log:
2013-08-09 05:57:34,948 ERROR [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.rhq.enterprise.communications.util.NotProcessedException:null. Cause: org.rhq.enterprise.communications.util.NotProcessedException
2013-08-09 05:57:34,948 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Could not transmit availability report to server
org.rhq.enterprise.communications.util.NotProcessedException
        at org.rhq.enterprise.communications.GlobalSuspendCommandListener.receivedCommand(GlobalSuspendCommandListener.java:55)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.notifyListenersOfReceivedCommand(CommandProcessor.java:402)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:242)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184)
        at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:967)
        at org.jboss.remoting.transport.servlet.ServletServerInvoker.processRequest(ServletServerInvoker.java:416)
        at sun.reflect.GeneratedMethodAccessor141.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)



Additional info:
complete agent.log attached

Comment 1 Filip Brychta 2013-08-09 10:24:31 UTC
Created attachment 784790 [details]
server log

Comment 2 Filip Brychta 2013-08-09 13:52:50 UTC
After more testing the issue seems to be nondeterministic -> unable to provide reliable repro steps -> lowering severity

Comment 3 John Mazzitelli 2013-08-09 14:02:23 UTC
this is because the server is in maintenance mode. probably due to the new code stefan added - server goes in MM if storage node is not accessible.

Comment 4 John Mazzitelli 2013-08-09 15:18:07 UTC
stefan is making changes in the code around here (has to do with the server startup/initialization code and when we flip the server to Maintenance Mode if it needs to be.

I think he is going to fix this, so I'm assigning this to him.

Comment 5 Stefan Negrea 2013-08-09 21:41:16 UTC
Please retest, the startup sequence was updated to avoid going into maintenance mode by initializing the storage cluster connection first.

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=d1aa8e0eaf0cc2ac72b09599ddc406f9928cbdd9

Comment 6 Filip Brychta 2013-08-12 12:26:18 UTC
Retested on 
Version: 4.9.0-SNAPSHOT
Build Number: 34e1c7b

Scenario:
1 - i have a clean machine (rm -rf .java/.userPrefs/rhq-*, rm -rf /var/lib/rhq/*, no rhq agent or server, rhq database is clean)
2- ./rhqctl install

Result:
2 out of 3 tries had following result:
- agent was registered
- platform resource was NOT available -> Storage Node resource not available as well
- server was in maintenance mode (Administration->Topology->Servers-> my server)
- exceptions in the agent log 
2013-08-12 06:32:20,653 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.rhq.enterprise.communications.util.NotProcessedException:null. Cause: org.rhq.enterprise.communications.util.NotProcessedException
- after some time (don't know exact time but it was more then 10 minutes) the server went to normal mode (visible in GUI administration) but platform resource was still NOT available and the agent was still throwing exceptions


See attached retestedLogs.tar.gz with full agent and server logs (server with debug level).

Comment 7 Filip Brychta 2013-08-12 12:27:10 UTC
Created attachment 785668 [details]
retestedLogs.tar.gz

Comment 8 Filip Brychta 2013-08-12 14:22:11 UTC
Would it be possible to add some informative agent logging about server being in  maintenance mode? Something like 'Unable to connect to the server - server may be in maintenance mode'