| Summary: | Premature locking during plugin container initialization can cause NPE | |||
|---|---|---|---|---|
| Product: | [Other] RHQ Project | Reporter: | John Sanda <jsanda> | |
| Component: | Plugin Container | Assignee: | John Sanda <jsanda> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 3.0.0 | CC: | ccrouch, loleary | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 769971 (view as bug list) | Environment: | ||
| Last Closed: | 2013-09-03 16:57:31 UTC | Type: | --- | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 725852 | |||
|
Description
John Sanda
2011-02-14 13:10:57 UTC
I have pushed a fix for EventContextImpl to master so that it does not prematurely obtain a PluginContainer lock. commit hash: 3850db08f742740deb9cf6a2c1ef18315a5cfcf3 I want to come back to the exception in the description and discuss what happens as a result of it getting thrown. The exception is thrown while trying to activate the resource, effectively making the resource dead in the water. In other words, its availability will be reported as DOWN, and discovery scans for that component will not happen. To resolve this the PC has to be restarted. It was brought to my attention earlier today that this issue may not be resolved. The issue initially surfaced with the apache plugin. Now, my changes may cause a TimeoutException with the platform plugin when event logging is enabled. I have reproduced the exception both on Windows and on a Fedora box. It should be noted that the exception is less likely and may not happen at all on a single processor machine. My Windows virtual machine initially had a single processor allocated to it, and I did not hit the TimeoutException. Then I allocated a second processor to it, and I hit the exception. Since there are different execution paths for Windows and for Linux, I will list both exceptions.
Windows exception:
ERROR [main] (rhq.core.pc.inventory.InventoryManager)- Failed to start component for Resource[id=11961, type=Windows, key=192.168.1.110, name=192.168.1.110, parent=<null>, version=Win32 5.2] from synchronized merge.
org.rhq.core.clientapi.agent.PluginContainerException: Failed to start component for resource Resource[id=11961, type=Windows, key=192.168.1.110, name=192.168.1.110, parent=<null>, version=Win32 5.2].
at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1552)
at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:2640)
at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2423)
at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1022)
at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:996)
at org.rhq.core.pc.inventory.InventoryManager.upgradeResources(InventoryManager.java:2665)
at org.rhq.core.pc.inventory.InventoryManager.initialize(InventoryManager.java:228)
at org.rhq.core.pc.PluginContainer.startContainerService(PluginContainer.java:399)
at org.rhq.core.pc.PluginContainer.initialize(PluginContainer.java:262)
at org.rhq.enterprise.agent.AgentMain.startPluginContainer(AgentMain.java:1867)
at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:648)
at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:418)
Caused by: org.rhq.core.pc.inventory.TimeoutException: [Warning] Call to [org.rhq.plugins.platform.WindowsPlatformComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@aad0b]] timed out. Invocation thread will be interrupted
at org.rhq.core.clientapi.agent.PluginContainerException.wrapIfNecessary(PluginContainerException.java:69)
at org.rhq.core.clientapi.agent.PluginContainerException.<init>(PluginContainerException.java:96)
... 12 more
Linux exception:
[ResourceContainer.invoker.daemon-1] (org.rhq.core.pc.PluginContainer)- There may be a deadlock in the plugin container.
java.lang.Throwable: There may be a deadlock in the plugin container.
at org.rhq.core.pc.PluginContainer.obtainReadLock(PluginContainer.java:417)
at org.rhq.core.pc.PluginContainer.getEventManager(PluginContainer.java:528)
at org.rhq.core.pc.event.EventContextImpl.getEventManager(EventContextImpl.java:51)
at org.rhq.core.pc.event.EventContextImpl.registerEventPollerInternal(EventContextImpl.java:112)
at org.rhq.core.pc.event.EventContextImpl.registerEventPoller(EventContextImpl.java:81)
at org.rhq.plugins.platform.SyslogFileEventLogDelegate.<init>(SyslogFileEventLogDelegate.java:56)
at org.rhq.plugins.platform.PosixPlatformComponent.start(PosixPlatformComponent.java:89)
at org.rhq.plugins.platform.LinuxPlatformComponent.start(LinuxPlatformComponent.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:525)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2011-03-14 21:05:26,758 ERROR [ResourceContainer.invoker.daemon-1] (org.rhq.plugins.platform.SyslogFileEventLogDelegate)- Failed attempt to setup syslog file poller. Cannot process syslog messages.
java.lang.NullPointerException
at org.rhq.core.pc.event.EventManager.registerEventPoller(EventManager.java:161)
at org.rhq.core.pc.event.EventContextImpl.registerEventPollerInternal(EventContextImpl.java:112)
at org.rhq.core.pc.event.EventContextImpl.registerEventPoller(EventContextImpl.java:81)
at org.rhq.plugins.platform.SyslogFileEventLogDelegate.<init>(SyslogFileEventLogDelegate.java:56)
at org.rhq.plugins.platform.PosixPlatformComponent.start(PosixPlatformComponent.java:89)
at org.rhq.plugins.platform.LinuxPlatformComponent.start(LinuxPlatformComponent.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocationThread.call(ResourceContainer.java:525)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2011-03-14 21:05:26,759 INFO [ResourceContainer.invoker.daemon-1] (org.rhq.plugins.platform.LinuxPlatformComponent)- Internal yum server is disabled.
2011-03-14 21:05:28,812 ERROR [main] (rhq.core.pc.inventory.InventoryManager)- Failed to start component for Resource[id=10001, type=Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.34.7-66.fc13.x86_64] from synchronized merge.
org.rhq.core.clientapi.agent.PluginContainerException: Failed to start component for resource Resource[id=10001, type=Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.34.7-66.fc13.x86_64].
at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1552)
at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:2640)
at org.rhq.core.pc.inventory.InventoryManager.processSyncInfo(InventoryManager.java:2423)
at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1022)
at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:996)
at org.rhq.core.pc.inventory.InventoryManager.upgradeResources(InventoryManager.java:2665)
at org.rhq.core.pc.inventory.InventoryManager.initialize(InventoryManager.java:228)
at org.rhq.core.pc.PluginContainer.startContainerService(PluginContainer.java:399)
at org.rhq.core.pc.PluginContainer.initialize(PluginContainer.java:262)
at org.rhq.enterprise.agent.AgentMain.startPluginContainer(AgentMain.java:1867)
at org.rhq.enterprise.agent.AgentMain.start(AgentMain.java:648)
at org.rhq.enterprise.agent.AgentMain.main(AgentMain.java:418)
Caused by: org.rhq.core.pc.inventory.TimeoutException: [Warning] Call to [org.rhq.plugins.platform.LinuxPlatformComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@612438f1]] timed out. Invocation thread will be interrupted
at org.rhq.core.clientapi.agent.PluginContainerException.wrapIfNecessary(PluginContainerException.java:69)
at org.rhq.core.clientapi.agent.PluginContainerException.<init>(PluginContainerException.java:96)
... 12 more
Here is what is happening. PluginContainer.initialize executes in thread T1. This method initializes all of the XXXManager classes that make up the PC. When InventoryManager initializes, it activates/starts the platform resource. On Windows, this corresponds to calling WindowsPlatformComponent.start, and on Linux it corresponds to calling LinuxPlatformComponent.start. The start() method is called in a separate thread T2. The start() method in T2 tries to register event pollers.
The TimeoutException is happening when EventContextImpl is waiting on a read lock so it can obtain a reference to EventManager. T1 obtained the write lock at the start of PluginContainer.initialize. This situation previously did not occur prior to my changes because EventContextImpl obtained the read lock in thread T1. The locking mechanism used is re-entrant in that the same thread that already holds the write lock can also obtain a read lock. This is why the code prior to my changes worked.
But now the plugin component is executing in thread T2, and it has to block until the write lock is released before it can obtain the read lock. The plugin component is wrapped in a proxy such that call to its start() method goes through that proxy. It is the proxy which schedules the execution of start() in T2. And the proxy which is running in T1 blocks on the call to start() which executes in T2. The proxy blocks for at most 60 seconds. It eventually times out, and we get our TimeoutException.
In case you did not follow the past couple paragraphs, we have deadlock. The proxy in T1 is waiting for the start() method in T2 to finish. T1 holds the write lock. T2 attempts to obtain a read lock, but it has to block because T1 holds the write lock.
If we revert my previous change in EventContextImpl, we are back to square one which is undoubtedly better since we would not be caught in a dead lock while trying to initialize the platform component; however, the problem will still manifest itself in the apache plugin. It could also manifest itself in any other plugins that support events and follow a similar patter for registering event pollers. This includes the jboss-as and jboss-as-5 plugins.
Committed and pushed fix to resolve deadlock issue. This commit adds an initializationListener api to PluginContainer. When plugin components start up while the PC is still initializing and register event pollers, they now do so by way of this InitializationListener api. EventContextImpl now registers a listener, which does not require a lock and therefore eliminates the possibility for deadlock. When the listener is invoked, the PC has already released the write lock, EventManager is already up and running, and there should be no problem registering the event poller at this point. Also note the listener runs in the same thread that executes PluginContainer.initialize. commit hash: a12a99c6806199323f2917fdc9f26603a76c7849 Cherry picked fix over to release-3.0.1 branch. commit hash: 97fde8e669a981f393000b8c7c8e52f78e9eda88 defined verification steps with jsanda, as follows: -on a multi-core/cpu machine -turn on event logging in RHQ for the platform and Apache plug-in -restart the agent -review agent logs for anything that looks like: java.lang.NullPointerException at org.rhq.core.pc.event.EventManager.registerEventPoller(EventManager.java:154) -repeat several times due to intermittent nature of this issue. Bulk closing of old issues that are in VERIFIED state. |