Bug 1052390

Summary: Agent NullPointerException (NPE) in org.jboss.remoting.Client.invoke method
Product: [Other] RHQ Project
Component: Agent
Version: 4.9
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Elias Ross <genman>
Assignee: Jay Shaughnessy <jshaughn>
QA Contact: Mike Foley <mfoley>
CC: hrupp, jshaughn
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-04-23 12:30:29 UTC
Attachments: Patch for RHQ_4_9_0 (flags: none)

Description Elias Ross 2014-01-13 18:29:18 UTC
Description of problem:

There appears to be a NPE in the agent communications.

2014-01-13 12:57:59,746 ERROR [ClientCommandSenderTask Timer Thread #89264] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: java.lang.NullPointerException:null. Cause: java.lang.NullPointerException

Also:

2014-01-13 12:57:59,746 WARN  [InventoryManager.availability-1] (InventoryManager)- Could not transmit availability report to server
java.lang.NullPointerException
        at org.jboss.remoting.Client.invoke(Client.java:2084)
        at org.jboss.remoting.Client.invoke(Client.java:879)
        at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.rawSend(JBossRemotingRemoteCommunicator.java:514)
        at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutCallbacks(JBossRemotingRemoteCommunicator.java:456)
        at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:475)
        at org.rhq.enterprise.agent.AgentMain.sendConnectRequestToServer(AgentMain.java:2112)
        at org.rhq.enterprise.agent.ConnectAgentInitializeCallback.sendingInitialCommand(ConnectAgentInitializeCallback.java:43)
        at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(JBossRemotingRemoteCommunicator.java:579)
        at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:491)
        at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1084)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
        at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Code:

public class Client implements Externalizable {
...
   private Object invoke(Object param, Map metadata, InvokerLocator callbackServerLocator)
         throws Throwable
   {
      if (isConnected())
      {
         return invoker.invoke(new InvocationRequest(sessionId, subsystem, param,
                                                     metadata, null, callbackServerLocator));

^^^ It seems that 'invoker' is null at this point: the client thinks it is connected, yet 'invoker' is somehow null.
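
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not the JBoss Remoting source: SketchClient, SketchInvoker, dropInvokerOnly and the separate 'connected' flag are hypothetical stand-ins. The assumption is only that the connected state and the invoker reference are tracked separately, so they can disagree.

```java
// Hypothetical stand-ins, not the real org.jboss.remoting classes.
interface SketchInvoker {
    Object invoke(Object param);
}

class SketchClient {
    private boolean connected;      // assumed to be tracked separately from 'invoker'
    private SketchInvoker invoker;

    void connect(SketchInvoker newInvoker) {
        invoker = newInvoker;
        connected = true;
    }

    // Simulates a partial teardown: the invoker is dropped, the flag is not.
    void dropInvokerOnly() {
        invoker = null;
    }

    boolean isConnected() {
        return connected;
    }

    Object invoke(Object param) {
        if (isConnected()) {
            return invoker.invoke(param); // throws NPE if invoker is null
        }
        throw new IllegalStateException("not connected");
    }
}

class StaleInvokerDemo {
    // Returns true if invoke() fails with a NullPointerException.
    static boolean invokeThrowsNpe(SketchClient client) {
        try {
            client.invoke("ping");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        SketchClient client = new SketchClient();
        client.connect(param -> "pong");
        client.dropInvokerOnly();
        System.out.println("connected=" + client.isConnected()
                + ", npe=" + invokeThrowsNpe(client));
    }
}
```

Once an object is in this state, every later invoke() fails the same way until the object is replaced, which would be consistent with an error that repeats on every retry.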

Version-Release number of selected component (if applicable): 4.9


How reproducible: Unclear

Comment 1 Elias Ross 2014-01-18 01:20:10 UTC
The biggest problem is that the errors repeat over and over again and the agent never connects. The only way to recover is to restart the agent.

A couple of ideas:
1. Client isn't really thread-safe (member variables aren't volatile, etc.), so it could be a thread visibility issue. However, since the error repeats indefinitely, it looks more like a persistent bad-state condition.
2. Disconnect is happening while the invoker is being used. I'm not sure this would result in repeating errors.
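
Both ideas point at the same defensive pattern, sketched below. This is not the attached patch and not the JBoss Remoting API: SafeClient and Transport are hypothetical names. The sketch assumes a single volatile field holds the connection, derives the connected state from that field, and reads the field once into a local variable before using it.

```java
// Hypothetical wrapper, not the actual RHQ patch or the JBoss Remoting API.
interface Transport {
    Object invoke(Object param);
}

class SafeClient {
    // volatile: a write by the disconnecting thread is visible to invoking threads
    private volatile Transport transport;

    void connect(Transport t) {
        transport = t;
    }

    void disconnect() {
        transport = null;
    }

    // The state is derived from the single field, so it cannot disagree with it.
    boolean isConnected() {
        return transport != null;
    }

    Object invoke(Object param) {
        Transport local = transport; // read the field exactly once
        if (local == null) {
            throw new IllegalStateException("client is not connected");
        }
        return local.invoke(param); // 'local' cannot become null after the check
    }
}

class SafeClientDemo {
    public static void main(String[] args) {
        SafeClient client = new SafeClient();
        client.connect(param -> "ok:" + param);
        System.out.println(client.invoke("ping"));
        client.disconnect();
        try {
            client.invoke("ping");
        } catch (IllegalStateException e) {
            System.out.println("recoverable failure: " + e.getMessage());
        }
    }
}
```

This only closes the gap between the check and the use. A disconnect can still land while an invoke is in flight on the old transport; the caller then gets whatever error the transport raises rather than a NullPointerException.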

Looking at the code, it's clear some refactoring is in order. I'll post my patch when tested and approved.

Comment 2 Elias Ross 2014-01-28 19:22:51 UTC
Created attachment 856790 [details]
Patch for RHQ_4_9_0

I've tested this fix with about 1500 agents in two different environments.

At the very least, I haven't seen the above NullPointerException since. I'm not sure whether there are potential regressions, but the NPE issue went away.

Comment 3 Jay Shaughnessy 2014-01-29 23:06:22 UTC
reviewing...

Comment 4 Jay Shaughnessy 2014-01-30 17:00:25 UTC
master commit 37263f7ece17f28541702666009fe057a28452c1
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Jan 30 11:11:46 2014 -0500

    BZ 1052390 - Clean up remoting wrapper to avoid race conditions if possible
    
    Some unused or rarely used constructors and methods were dropped.
    
    The biggest change is in the client caching. The cache code is guaranteed to
    call disconnect when a client is 'thrown away'. There is still a possibility
    disconnect can happen in the middle of an invoke.
    
    Original Author: Elias Ross <elias_ross>
    Signed-off-by: Jay Shaughnessy <jshaughn>
    Applying this patch as-is, I see no issues with it and it cleans some



QA Test Notes:
This is not directly testable and, barring identified runtime regressions, can be set to Verified. It is covered by unit testing.
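
The caching guarantee described in the commit message (a client that is 'thrown away' is always disconnected) can be illustrated with a small sketch. The names ClientCache and CachedClient are hypothetical and this is not the committed RHQ code; it assumes a single cached client that is swapped atomically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names; not the RHQ JBossRemotingRemoteCommunicator code.
class CachedClient {
    private final String name;
    private final List<String> log;

    CachedClient(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    // Records the call so the demo can show which clients were disconnected.
    void disconnect() {
        log.add("disconnect:" + name);
    }
}

class ClientCache {
    private final AtomicReference<CachedClient> current = new AtomicReference<>();

    // Installs a new client (or null) and disconnects whichever client it replaced.
    void replace(CachedClient fresh) {
        CachedClient old = current.getAndSet(fresh);
        if (old != null) {
            old.disconnect();
        }
    }

    CachedClient get() {
        return current.get();
    }
}

class ClientCacheDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        ClientCache cache = new ClientCache();
        cache.replace(new CachedClient("a", log));
        cache.replace(new CachedClient("b", log)); // "a" is thrown away here
        cache.replace(null);                       // "b" is thrown away here
        System.out.println(log);
    }
}
```

Because the swap is atomic, each discarded client is disconnected exactly once. A thread that fetched the old client just before the swap can still be mid-invoke when disconnect runs, which matches the remaining caveat in the commit message.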

Comment 5 Elias Ross 2014-02-19 03:24:38 UTC
*** Bug 1024145 has been marked as a duplicate of this bug. ***

Comment 6 Heiko W. Rupp 2014-04-23 12:30:29 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.