Bug 677349

Summary: Premature locking during plugin container initialization can cause NPE
Product: [Other] RHQ Project Reporter: John Sanda <jsanda>
Component: Plugin ContainerAssignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: high    
Version: 3.0.0CC: ccrouch, loleary
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 769971 (view as bug list) Environment:
Last Closed: 2013-09-03 12:57:31 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 725852    

Description John Sanda 2011-02-14 08:10:57 EST
Description of problem:
This is a concurrency issue that has manifested itself during PC initialization when the PC starts a plugin component and the plugin component tries to register its event pollers. In the support case where I came across this, the exception was occurring in the apache plugin; however, upon further review I see that the exception could occur with the jboss-as plugin and possibly others. Here is the exception from the customer's agent log:

[main] (rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler)- Call to [org.rhq.plugins.apache.ApacheServerComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@75144026]] failed.
java.util.concurrent.ExecutionException: java.lang.NullPointerException
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
	at java.util.concurrent.FutureTask.get(FutureTask.java:91)
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:446)
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:434)
	at $Proxy40.start(Unknown Source)
	at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1374)
	at org.rhq.core.pc.inventory.InventoryManager.activateFromDisk(InventoryManager.java:1527)
	at org.rhq.core.pc.inventory.InventoryManager.activateFromDisk(InventoryManager.java:1534)
	at org.rhq.core.pc.inventory.InventoryManager.loadFromDisk(InventoryManager.java:1562)
	at org.rhq.core.pc.inventory.InventoryManager.initialize(InventoryManager.java:205)
	at org.rhq.core.pc.PluginContainer.startContainerService(PluginContainer.java:372)
	at org.rhq.core.pc.PluginContainer.initialize(PluginContainer.java:247)
	at org.rhq.enterprise.agent.AgentMain.startPluginContainer(AgentMain.java:1748)
	at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:643)
	at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:413)
Caused by: java.lang.NullPointerException
	at org.rhq.core.pc.event.EventManager.registerEventPoller(EventManager.java:154)
	at org.rhq.core.pc.event.EventContextImpl.registerEventPollerInternal(EventContextImpl.java:107)
	at org.rhq.core.pc.event.EventContextImpl.registerEventPoller(EventContextImpl.java:76)
	at org.rhq.plugins.apache.ApacheServerComponent.startEventPollers(ApacheServerComponent.java:840)
	at org.rhq.plugins.apache.ApacheServerComponent.start(ApacheServerComponent.java:213)
	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:525)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
2011-02-11 10:07:27,239 DEBUG [main] (rhq.core.pc.inventory.InventoryManager)- Failed to activate from disk [Resource[id=12329, type=Apache HTTP Server, key=/hosting/configs/apache22/idcs01/web2, name=0.0.0.0:10025, parent=crxecsluwb01, version=2.2.11]]

Here is a detailed description of what happens. PluginContainer.initialize() is called in thread T1. It starts off by obtaining its write lock. Then move down into InventoryManager.initialize() which is called from PluginContainer.initialize(). InventoryManager.loadFromDisk() is called. Eventually InventoryManager.activateResource() is invoked for the apache server component. Note that the PC's read lock has to be obtained in the following line,

component = PluginContainer.getInstance().getPluginComponentFactory().buildResourceComponent(resource);

the call to getPluginComponentFactory() requires obtaining a read lock. According to the javadocs for ReentrantReadWriteLock, this should be possible even though the PC already is holding its write lock. This is all ok because PluginComponentFactory is initialized before InventoryManager. Next we create the ResourceContext. I will come back to this momentarily. Then we create the resource component proxy and invoke its start() method. The component's start method will run in some thread Tn where n > 1. The exception propagates out of Tn. Go back to line where we create the ResourceContext. In particular, look at the call to getEventContext(). Look at the constructor for EventContextImpl(). On line 46 you should see,

this.eventManager = PluginContainer.getInstance().getEventManager()

which requires a read lock (from the PC), and we obtain that read lock because we are still in thread T1. And here in lies the problem. EventContextImpl has obtained the read lock prematurely, or I suppose another way of looking at it is that the reference to EventManager is obtained prematurely. When the offending event poller thread runs in thread Tn, it already has a cached EventManager reference and will not have to wait to obtain any locks. If PluginContainer.initialize() in T1 has not finished executing, or more precisely, if it has not yet finished initializing EventManager, then we will get the above exception in Tn.

Because this is a concurrency/timing issue, it is not easily reproducable or easily testable; however, the exception has been reported by a customer.

Fortunately I think there is any easy solution to this. EventContextImpl obtains the read lock prematurely. It does not need to cache a reference to EventManager. EventContextImpl should instead call PluginManager.getInstance().getEventManager() in each method where it it is needed. If we put that into the registerEventPoller() method, then Tn will block until T1 releases the write lock, ensuring the PC is able to initialize EventManager before any plugin component registers event pollers.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 John Sanda 2011-02-14 09:37:49 EST
I have pushed a fix for EventContextImpl to master so that it does not prematurely obtain a PluginContainer lock.

commit hash: 3850db08f742740deb9cf6a2c1ef18315a5cfcf3
Comment 2 John Sanda 2011-02-14 09:59:11 EST
I want to come back to the exception in the description and discuss what happens as a result of it getting thrown. The exception is thrown while trying to activate the resource, effectively making the resource dead in the water. In other words, its availability will be reported as DOWN, and discovery scans for that component will not happen. To resolve this the PC has to be restarted.
Comment 4 John Sanda 2011-03-14 22:22:54 EDT
It was brought to my attention earlier today that this issue may not be resolved. The issue initially surfaced with the apache plugin. Now, my changes may cause a TimeoutException with the platform plugin when event logging is enabled. I have reproduced the exception both on Windows and on a Fedora box. It should be noted that the exception is less likely and may not happen at all on a single processor machine. My Windows virtual machine initially had a single processor allocated to it, and I did not hit the TimeoutException. Then I allocated a second processor to it, and I hit the exception. Since there are different execution paths for Windows and for Linux, I will list both exceptions.

Windows exception:
ERROR [main] (rhq.core.pc.inventory.InventoryManager)- Failed to start component for Resource[id=11961, type=Windows, key=192.168.1.110, name=192.168.1.110, parent=<null>, version=Win32 5.2] from synchronized merge.
org.rhq.core.clientapi.agent.PluginContainerException: Failed to start component for resource Resource[id=11961, type=Windows, key=192.168.1.110, name=192.168.1.110, parent=<null>, version=Win32 5.2].
	at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1552)
	at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:2640)
	at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2423)
	at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1022)
	at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:996)
	at org.rhq.core.pc.inventory.InventoryManager.upgradeResources(InventoryManager.java:2665)
	at org.rhq.core.pc.inventory.InventoryManager.initialize(InventoryManager.java:228)
	at org.rhq.core.pc.PluginContainer.startContainerService(PluginContainer.java:399)
	at org.rhq.core.pc.PluginContainer.initialize(PluginContainer.java:262)
	at org.rhq.enterprise.agent.AgentMain.startPluginContainer(AgentMain.java:1867)
	at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:648)
	at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:418)
Caused by: org.rhq.core.pc.inventory.TimeoutException: [Warning] Call to [org.rhq.plugins.platform.WindowsPlatformComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@aad0b]] timed out. Invocation thread will be interrupted
	at org.rhq.core.clientapi.agent.PluginContainerException.wrapIfNecessary(PluginContainerException.java:69)
	at org.rhq.core.clientapi.agent.PluginContainerException.<init>(PluginContainerException.java:96)
	... 12 more


Linux exception:
[ResourceContainer.invoker.daemon-1] (org.rhq.core.pc.PluginContainer)- There may be a deadlock in the plugin container.
java.lang.Throwable: There may be a deadlock in the plugin container.
        at org.rhq.core.pc.PluginContainer.obtainReadLock(PluginContainer.java:417)
        at org.rhq.core.pc.PluginContainer.getEventManager(PluginContainer.java:528)
        at org.rhq.core.pc.event.EventContextImpl.getEventManager(EventContextImpl.java:51)
        at org.rhq.core.pc.event.EventContextImpl.registerEventPollerInternal(EventContextImpl.java:112)
        at org.rhq.core.pc.event.EventContextImpl.registerEventPoller(EventContextImpl.java:81)
        at org.rhq.plugins.platform.SyslogFileEventLogDelegate.<init>(SyslogFileEventLogDelegate.java:56)
        at org.rhq.plugins.platform.PosixPlatformComponent.start(PosixPlatformComponent.java:89)
        at org.rhq.plugins.platform.LinuxPlatformComponent.start(LinuxPlatformComponent.java:80)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:525)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-03-14 21:05:26,758 ERROR [ResourceContainer.invoker.daemon-1] (org.rhq.plugins.platform.SyslogFileEventLogDelegate)- Failed attempt to setup syslog file poller. Cannot process syslog messages.
java.lang.NullPointerException
        at org.rhq.core.pc.event.EventManager.registerEventPoller(EventManager.java:161)
        at org.rhq.core.pc.event.EventContextImpl.registerEventPollerInternal(EventContextImpl.java:112)
        at org.rhq.core.pc.event.EventContextImpl.registerEventPoller(EventContextImpl.java:81)
        at org.rhq.plugins.platform.SyslogFileEventLogDelegate.<init>(SyslogFileEventLogDelegate.java:56)
        at org.rhq.plugins.platform.PosixPlatformComponent.start(PosixPlatformComponent.java:89)
        at org.rhq.plugins.platform.LinuxPlatformComponent.start(LinuxPlatformComponent.java:80)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:525)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-03-14 21:05:26,759 INFO  [ResourceContainer.invoker.daemon-1] (org.rhq.plugins.platform.LinuxPlatformComponent)- Internal yum server is disabled.
2011-03-14 21:05:28,812 ERROR [main] (rhq.core.pc.inventory.InventoryManager)- Failed to start component for Resource[id=10001, type=Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.34.7-66.fc13.x86_64] from synchronized merge.
org.rhq.core.clientapi.agent.PluginContainerException: Failed to start component for resource Resource[id=10001, type=Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.34.7-66.fc13.x86_64].
        at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1552)
        at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:2640)
        at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2423)
        at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1022)
        at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:996)
        at org.rhq.core.pc.inventory.InventoryManager.upgradeResources(InventoryManager.java:2665)
        at org.rhq.core.pc.inventory.InventoryManager.initialize(InventoryManager.java:228)
        at org.rhq.core.pc.PluginContainer.startContainerService(PluginContainer.java:399)
        at org.rhq.core.pc.PluginContainer.initialize(PluginContainer.java:262)
        at org.rhq.enterprise.agent.AgentMain.startPluginContainer(AgentMain.java:1867)
        at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:648)
        at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:418)
Caused by: org.rhq.core.pc.inventory.TimeoutException: [Warning] Call to [org.rhq.plugins.platform.LinuxPlatformComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@612438f1]] timed out. Invocation thread will be interrupted
        at org.rhq.core.clientapi.agent.PluginContainerException.wrapIfNecessary(PluginContainerException.java:69)
        at org.rhq.core.clientapi.agent.PluginContainerException.<init>(PluginContainerException.java:96)
        ... 12 more

Here is what is happening. PluginContainer.initialize executes in thread T1. This method initializes all of the XXXManager classes that make up the PC. When InventoryManager initializes, it activates/starts the platform resource. On Windows, this corresponds to calling WindowsPlatformComponent.start, and on Linux it corresponds to calling LinuxPlatformComponent.start. The start() method is called in a separate thread T2. The start() method in T2 tries to register event pollers. 

The TimeoutException is happening when EventContextImpl is waiting on a read lock so it can obtain a reference to EventManager. T1 obtained the write lock at the start of PluginContainer.initialize. This situation previously did not occur prior to my changes because EventContextImpl obtained the read lock in thread T1. The locking mechanism used is re-entrant in that the same thread that already holds the write lock can also obtain a read lock. This is why the code prior to my changes worked.

But now the plugin component is executing in thread T2, and it has to block until the write lock is released before it can obtain the read lock. The plugin component is wrapped in a proxy such that call to its start() method goes through that proxy. It is the proxy which schedules the execution of start() in T2. And the proxy which is running in T1 blocks on the call to start() which executes in T2. The proxy blocks for at most 60 seconds. It eventually times out, and we get our TimeoutException.

In case you did not follow the past couple paragraphs, we have deadlock. The proxy in T1 is waiting for the start() method in T2 to finish. T1 holds the write lock. T2 attempts to obtain a read lock, but it has to block because T1 holds the write lock.

If we revert my previous change in EventContextImpl, we are back to square one which is undoubtedly better since we would not be caught in a dead lock while trying to initialize the platform component; however, the problem will still manifest itself in the apache plugin. It could also manifest itself in any other plugins that support events and follow a similar patter for registering event pollers. This includes the jboss-as and jboss-as-5 plugins.
Comment 5 John Sanda 2011-03-16 16:13:06 EDT
Committed and pushed fix to resolve deadlock issue. This commit adds an  initializationListener api to PluginContainer. When plugin components start up while the PC is still initializing and register event pollers, they now do so by way of this InitializationListener api. EventContextImpl now registers a listener,
which does not require a lock and therefore eliminates the possibility for deadlock. When the listener is invoked, the PC has already released the write lock, EventManager is already up and running, and there should be no problem registering the event poller at this point. Also note the listener runs in the same thread that executes PluginContainer.initialize.

commit hash: a12a99c6806199323f2917fdc9f26603a76c7849
Comment 6 John Sanda 2011-03-29 21:08:44 EDT
Cherry picked fix over to release-3.0.1 branch.

commit hash: 97fde8e669a981f393000b8c7c8e52f78e9eda88
Comment 7 Mike Foley 2011-06-29 15:17:38 EDT
defined verification steps with jsanda, as follows:
-on a multi-core/cpu machine
-turn on event logging in RHQ for the platform and Apache plug-in
-restart the agent
-review agent logs for anything that looks like:

java.lang.NullPointerException 
at org.rhq.core.pc.event.EventManager.registerEventPoller(EventManager.java:154)

-repeat several times due to intermittent nature of this issue.
Comment 9 Heiko W. Rupp 2013-09-03 12:57:31 EDT
Bulk closing of old issues that are in VERIFIED state.