Bug 995424 - Agent is not able to communicate with server
Agent is not able to communicate with server
Status: ON_DEV
Product: RHQ Project
Classification: Other
Component: Agent, Communications Subsystem, Core Server (Show other bugs)
4.8
Unspecified Unspecified
medium Severity urgent (vote)
: ---
: ---
Assigned To: Simeon Pinder
Mike Foley
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-08-09 06:23 EDT by Filip Brychta
Modified: 2018-01-05 11:49 EST (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
agent log (212.25 KB, text/x-log)
2013-08-09 06:23 EDT, Filip Brychta
no flags Details
server log (714.11 KB, text/x-log)
2013-08-09 06:24 EDT, Filip Brychta
no flags Details
retestedLogs.tar.gz (440.96 KB, application/x-gzip)
2013-08-12 08:27 EDT, Filip Brychta
no flags Details

  None (edit)
Description Filip Brychta 2013-08-09 06:23:20 EDT
Created attachment 784789 [details]
agent log

Version-Release number of selected component (if applicable):
Version: 4.9.0-SNAPSHOT
Build Number: 9632b96

How reproducible:
3/3

Steps to Reproduce:
1. install rhq: ./rhqctl install


Actual results:
Local agent is registered but not available. I can see following error in agent log:
2013-08-09 05:57:34,948 ERROR [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.rhq.enterprise.communications.util.NotProcessedException:null. Cause: org.rhq.enterprise.communications.util.NotProcessedException
2013-08-09 05:57:34,948 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Could not transmit availability report to server
org.rhq.enterprise.communications.util.NotProcessedException
        at org.rhq.enterprise.communications.GlobalSuspendCommandListener.receivedCommand(GlobalSuspendCommandListener.java:55)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.notifyListenersOfReceivedCommand(CommandProcessor.java:402)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:242)
        at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184)
        at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:967)
        at org.jboss.remoting.transport.servlet.ServletServerInvoker.processRequest(ServletServerInvoker.java:416)
        at sun.reflect.GeneratedMethodAccessor141.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)



Additional info:
complete agent.log attached
Comment 1 Filip Brychta 2013-08-09 06:24:31 EDT
Created attachment 784790 [details]
server log
Comment 2 Filip Brychta 2013-08-09 09:52:50 EDT
After more testing the issue seems to be nondeterministic -> unable to provide reliable repro steps -> lowering severity
Comment 3 John Mazzitelli 2013-08-09 10:02:23 EDT
this is because the server is in maintenance mode. probably due to the new code stefan added - server goes in MM if storage node is not accessible.
Comment 4 John Mazzitelli 2013-08-09 11:18:07 EDT
stefan is making changes in the code around here (has to do with the server startup/initialization code and when we flip the server to Maintenance Mode if it needs to be.

I think he is going to fix this, so I'm assigning this to him.
Comment 5 Stefan Negrea 2013-08-09 17:41:16 EDT
Please retest, the startup sequence was updated to avoid going into maintenance mode by initializing the storage cluster connection first.

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=d1aa8e0eaf0cc2ac72b09599ddc406f9928cbdd9
Comment 6 Filip Brychta 2013-08-12 08:26:18 EDT
Retested on 
Version: 4.9.0-SNAPSHOT
Build Number: 34e1c7b

Scenario:
1 - i have a clean machine (rm -rf .java/.userPrefs/rhq-*, rm -rf /var/lib/rhq/*, no rhq agent or server, rhq database is clean)
2- ./rhqctl install

Result:
2 out of 3 tries had following result:
- agent was registered
- platform resource was NOT available -> Storage Node resource not available as well
- server was in maintenance mode (Administration->Topology->Servers-> my server)
- exceptions in the agent log 
2013-08-12 06:32:20,653 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.rhq.enterprise.communications.util.NotProcessedException:null. Cause: org.rhq.enterprise.communications.util.NotProcessedException
- after some time (don't know exact time but it was more then 10 minutes) the server went to normal mode (visible in GUI administration) but platform resource was still NOT available and the agent was still throwing exceptions


See attached retestedLogs.tar.gz with full agent and server logs (server with debug level).
Comment 7 Filip Brychta 2013-08-12 08:27:10 EDT
Created attachment 785668 [details]
retestedLogs.tar.gz
Comment 8 Filip Brychta 2013-08-12 10:22:11 EDT
Would it be possible to add some informative agent logging about server being in  maintenance mode? Something like 'Unable to connect to the server - server may be in maintenance mode'

Note You need to log in before you can comment on or make changes to this bug.