Bug 857189 - Plugin JBossAS7 times out in the middle of loading the resource types
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Inventory
Version: JON 3.1.0
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: JON 3.1.2
Assignee: John Sanda
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On: 857208
Blocks:
 
Reported: 2012-09-13 18:30 UTC by dsteigne
Modified: 2018-11-30 20:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 857208 (view as bug list)
Environment:
Last Closed: 2013-09-11 10:58:24 UTC
Type: Bug
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 869063 | 0 | high | CLOSED | Stale resource types due to failed plug-in purge | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla | 877073 | 0 | unspecified | NEW | Failed plugin installation should not install plugin JAR in database | 2022-03-31 04:28:29 UTC
Red Hat Knowledge Base (Solution) | 162653 | 0 | None | None | None | 2012-09-24 22:45:56 UTC

Internal Links: 869063 877073

Description dsteigne 2012-09-13 18:30:59 UTC
Description of problem:
The default transaction timeout is 10 minutes. If the transaction times out during the Resource Type database inserts, the log says the plugin was loaded, but you will receive the following error when a JBossAS7 server is discovered on the agent.

ERROR [org.rhq.enterprise.server.discovery.DiscoveryBossBean] Reported resource [Resource[id=0, uuid=0fc086b0-376f-4741-b24b-0ec92bdcbab2, type={JBossAS7}JBossAS7 Standalone Server, key=/app/jboss/standalone, name=EAP (131.115.209.135:9990), parent=sehan9500astest.han.telia.se, version=EAP ]] has an unknown type [{JBossAS7}JBossAS7 Standalone Server]. The Agent most likely has a plugin named 'JBossAS7' installed that is not installed on the Server. Resource will be ignored...

You have to delete, purge and re-load the JBossAS7 plugin to correct the issue.


Version-Release number of selected component (if applicable):
3.1.0

How reproducible:


Steps to Reproduce:
1. Install JON 3.1.0
2. Edit the jboss-service.xml to lower the timeout to 1 minute 
3. Drop the EAP Plugin Pack into the jon-server/plugins directory
4. Start the server
5. The load of the JBossAS7 plugin should time out and not complete (even though rhq-server-log4j.log states that it loaded).
6. Install EAP 6 and start up in Standalone 
7. Install and start an agent
8. Run Discovery -f
  
Actual results:
The above ERROR message is displayed.


Expected results:
The server would roll back the transaction and attempt to re-load the plugin's Resource Types.

Additional info:

Comment 1 Charles Crouch 2012-09-24 19:45:22 UTC
IIRC we did work prior to jon310 to increase this timeout for exactly this reason: the as7 plugin has a lot of metadata.

Comment 2 Larry O'Leary 2012-09-24 22:45:56 UTC
This appears to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=848390 which identifies a failure to install the plug-ins on the initial startup of the server. However, with Bug 848390 a re-start of the server clears up the issue because the plug-in update failure occurs during the initial read of the plug-in file from the file-system and therefore has not yet been added to the database. Upon restart, an attempt to upload the plug-in to the database is performed again -- which succeeds.

In this case, however, the plug-in was successfully uploaded to the database, but processing the plug-in's types takes too long, which results in the timeout. A restart WILL NOT fix this issue because we assume that the plug-in's types have been installed because we see the plug-in in the database. Therefore, the only solution is to purge the plug-in and try again -- after increasing the timeout of course.

Comment 3 Larry O'Leary 2012-09-24 22:55:07 UTC
After reviewing Bug 848390 more closely, I can see that it is most likely caused by a concurrency issue with how plugins are read from the plugins install directory and the rhq-plugins directory. So, perhaps these issues are not similar at all.

Comment 4 John Sanda 2012-11-07 19:22:08 UTC
It should be noted that the transaction time out was logged in the server log from the support case. The error message was,

2012-09-07 14:51:18,359 WARN  [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_40] - Abort called on already aborted atomic action -7c8cc5bb:a093:5049e9f9:139
2012-09-07 14:51:18,827 ERROR [org.rhq.enterprise.server.core.plugin.ProductPluginDeployer] Failed to register RHQ plugin file [file:/app/jon-server-3.1.0.GA/jbossas/server/default/deploy/rhq.ear/rhq-downloads/rhq-plugins/rhq-jboss-as-7-plugin-4.4.0.JON310GA.jar]
java.lang.IllegalStateException: [com.arjuna.ats.internal.jta.transaction.arjunacore.inactive] [com.arjuna.ats.internal.jta.transaction.arjunacore.inactive] The transaction is not active!
        at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1379)
        at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:135)
        at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:87)
        at org.jboss.aspects.tx.TxPolicy.endTransaction(TxPolicy.java:175)
.
.
.
at $Proxy520.registerPluginTypes(Unknown Source)
        at org.rhq.enterprise.server.resource.metadata.PluginManagerBean.registerPlugin(PluginManagerBean.java:349)

Comment 6 John Sanda 2012-11-07 20:17:18 UTC
I want to provide a little insight into why this transaction time out does/can occur. At the bottom of the stack trace in comment 4, we find that registerPlugin calls registerPluginTypes. The container will start a new transaction for the registerPluginTypes method, assuming one is not already in progress. All resource types for the plugin are processed (i.e., persisted) during this method.

To put things in perspective, the AS 7 plugin contains 406 resource types. The plugin with the next highest number of resource types is the AS 5 plugin and it only comes in with a count of 49 types. I see three potential solutions.

1) Increase the transaction time out globally.

2) Increase the transaction time out for just the registerPluginTypes method.

3) Change the transaction boundaries such that the transaction or transactions are not so coarse-grained.

Option 1 is a bad idea. It would likely lead to masking not so subtle bugs elsewhere in the code base in addition to creating unnecessary, higher load on the database.

Option 2 could be a quick fix, but it is not a robust solution. Whatever setting we choose might not be sufficient for someone running on slow hardware and/or with low bandwidth. Then what about when the AS 8 plugin or some other plugin that contains even more types comes along?

The best, most robust solution is option 3. Inserting/updating each resource type should be done in a separate transaction. This will scale much better with plugins that contain a lot of types and will also reduce contention for database resources.
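
A minimal sketch of what option 3 could look like, assuming a hypothetical TxRunner that runs each unit of work in its own transaction. None of these names come from the RHQ code base; the sketch only illustrates that one failing type no longer aborts the others, and that the caller ends up with a list of types that were not installed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustration only: every name here is hypothetical, not RHQ API. */
class PerTypeRegistrationSketch {

    /** Hypothetical stand-in for "run this work in its own transaction". */
    interface TxRunner {
        void inNewTransaction(Runnable work) throws Exception;
    }

    /** Records which resource types were committed and which failed. */
    static final class Result {
        final List<String> committed = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /** Persists each type in a separate transaction and keeps going after a failure. */
    static Result registerTypes(List<String> typeNames, TxRunner tx, Consumer<String> persist) {
        Result result = new Result();
        for (String name : typeNames) {
            try {
                tx.inNewTransaction(() -> persist.accept(name));
                result.committed.add(name);
            } catch (Exception e) {
                // Only this type is lost; earlier commits stay in the database.
                result.failed.add(name);
            }
        }
        return result;
    }
}
```

A non-empty failed list means the plugin is only partially installed, so the caller would still need some way to report or repair that state.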

Comment 7 John Sanda 2012-11-07 22:11:22 UTC
Changes have been pushed to master. I have run the relevant integration tests and have done manual testing of installing all plugins against both oracle and postgresql. Once I see those tests pass on jenkins against postgresql and oracle I will cherry-pick the commit over to the release branch.

master commit hash:
20212484ca

Comment 8 John Sanda 2012-11-12 22:42:39 UTC
After further review, I have reverted commit 20212484ca that changed the transaction boundaries and instead increased the transaction timeout on the PluginManagerBean.registerPluginTypes method to 30 minutes. The default we use is 10 minutes. 

master commit hash: a21997a6624 
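
To put the two timeout values in perspective, here is a rough back-of-the-envelope sketch. It assumes the 406 resource types quoted in comment 6 and that the time is spread evenly across types, which is a simplification; the class and method names are made up for the example.

```java
import java.util.Locale;

/** Illustration only: average time budget per resource type for a given timeout. */
class TimeoutBudgetSketch {

    /** Average seconds available per resource type before the transaction times out. */
    static double secondsPerType(int timeoutMinutes, int typeCount) {
        return (timeoutMinutes * 60.0) / typeCount;
    }

    public static void main(String[] args) {
        int as7Types = 406; // count quoted in comment 6
        // Default timeout of 10 minutes: roughly 1.48 s per type.
        System.out.println(String.format(Locale.ROOT, "10 min: %.2f s per type", secondsPerType(10, as7Types)));
        // Raised timeout of 30 minutes: roughly 4.43 s per type.
        System.out.println(String.format(Locale.ROOT, "30 min: %.2f s per type", secondsPerType(30, as7Types)));
    }
}
```

The budget still shrinks as plugins grow, which is why the larger timeout is a stopgap rather than a structural fix.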

I still firmly believe that more fine-grained transactions are the way to go, but it really is a larger effort that would involve a lot more testing. If we insert/update each resource type in a separate transaction, we could have a partial failure where some but not all of the resource types are installed/updated. Unless the user immediately resolves the problem, he could wind up with problems like those described in bug 869063. Resource types would exist on the agent but not on the server, resulting in the server rejecting inventory reports.

Failed plugin installation would have to be resolved in one of two ways. Either delete/purge the plugin or fix whatever problem(s) occurred and re-install it. Deleting the plugin might not be an option as it could wind up deleting existing resources if it is an existing plugin. Furthermore, because we do not have a way to report plugin installation problems in the UI right now, the user might not be aware of it before the plugin JAR is installed on agent machines. Then the problem gets a lot worse.

If we decide to revisit using more fine-grained transactions, then we need to also look at implementing application level rollback so that we can make a best effort to get back to a consistent state.
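
A minimal sketch of what application-level rollback might look like, assuming a hypothetical TypeStore whose insert and delete calls each commit on their own. These names are invented for illustration and are not RHQ API. The sketch only covers newly inserted types; undoing updates to existing types would need the previous state to be saved as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Illustration only: best-effort compensation after a partial plugin install. */
class CompensatingRegistrationSketch {

    /** Hypothetical store; assume every call commits its own transaction. */
    interface TypeStore {
        void insert(String typeName);
        void delete(String typeName);
    }

    /** Returns true if all types were installed, false if the install was undone. */
    static boolean registerAllOrNothing(List<String> typeNames, TypeStore store) {
        Deque<String> committed = new ArrayDeque<>();
        try {
            for (String name : typeNames) {
                store.insert(name);
                committed.push(name); // remember what must be undone on failure
            }
            return true;
        } catch (RuntimeException e) {
            // Undo in reverse order; a failing delete is skipped (best effort only).
            while (!committed.isEmpty()) {
                String name = committed.pop();
                try {
                    store.delete(name);
                } catch (RuntimeException ignored) {
                    // Left behind; a real implementation would have to report this.
                }
            }
            return false;
        }
    }
}
```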

Comment 9 John Sanda 2012-11-13 02:10:30 UTC
Changes have been pushed to the release/jon3.1.x branch. Moving to MODIFIED.

commit hash: 1eee73702c2

Comment 10 Simeon Pinder 2012-11-21 21:55:49 UTC
Moving to ON_QA as available for test with build : https://brewweb.devel.redhat.com//buildinfo?buildID=244662.

Comment 11 Armine Hovsepyan 2012-12-05 13:51:21 UTC
cannot reproduce the case even with transaction timeout 6 secs - marking as verified (JON 3.1.2)

