Bug 857189
Summary: | Plugin JBossAS7 times out in the middle of loading the resource types | |||
---|---|---|---|---|
Product: | [JBoss] JBoss Operations Network | Reporter: | dsteigne | |
Component: | Inventory | Assignee: | John Sanda <jsanda> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Mike Foley <mfoley> | |
Severity: | high | Docs Contact: | ||
Priority: | urgent | |||
Version: | JON 3.1.0 | CC: | ahovsepy, loleary, myarboro | |
Target Milestone: | --- | |||
Target Release: | JON 3.1.2 | |||
Hardware: | All | |||
OS: | All | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 857208 | Environment: | |
Last Closed: | 2013-09-11 10:58:24 UTC | Type: | Bug | |
Bug Depends On: | 857208 | |||
Bug Blocks: |
Description
dsteigne
2012-09-13 18:30:59 UTC
IIRC we did work prior to JON 3.1.0 to increase this timeout for exactly this reason: the AS7 plugin has a lot of metadata.

This appears to be similar to https://bugzilla.redhat.com/show_bug.cgi?id=848390 which identifies a failure to install the plug-ins on the initial startup of the server. However, with Bug 848390 a restart of the server clears up the issue because the plug-in update failure occurs during the initial read of the plug-in file from the file system, so the plug-in has not yet been added to the database. Upon restart, the upload of the plug-in to the database is attempted again, and it succeeds. In this case, however, the plug-in was successfully uploaded to the database, but processing the plug-in's types takes too long, which results in the timeout. A restart WILL NOT fix this issue because we assume that the plug-in's types have been installed when we see the plug-in in the database. Therefore, the only solution is to purge the plug-in and try again -- after increasing the timeout, of course.

After reviewing Bug 848390 more closely, I can see that it is most likely caused by a concurrency issue with how plugins are read from the plugins install directory and the rhq-plugins directory. So, perhaps these issues are not similar at all.

It should be noted that the transaction timeout was logged in the server log from the support case.
The error message was:

2012-09-07 14:51:18,359 WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_40] - Abort called on already aborted atomic action -7c8cc5bb:a093:5049e9f9:139
2012-09-07 14:51:18,827 ERROR [org.rhq.enterprise.server.core.plugin.ProductPluginDeployer] Failed to register RHQ plugin file [file:/app/jon-server-3.1.0.GA/jbossas/server/default/deploy/rhq.ear/rhq-downloads/rhq-plugins/rhq-jboss-as-7-plugin-4.4.0.JON310GA.jar]
java.lang.IllegalStateException: [com.arjuna.ats.internal.jta.transaction.arjunacore.inactive] [com.arjuna.ats.internal.jta.transaction.arjunacore.inactive] The transaction is not active!
    at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1379)
    at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:135)
    at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:87)
    at org.jboss.aspects.tx.TxPolicy.endTransaction(TxPolicy.java:175)
    . . .
    at $Proxy520.registerPluginTypes(Unknown Source)
    at org.rhq.enterprise.server.resource.metadata.PluginManagerBean.registerPlugin(PluginManagerBean.java:349)

I want to provide a little insight into why this transaction timeout can occur. At the bottom of the stack trace in comment 4, we find that registerPlugin calls registerPluginTypes. The container will start a new transaction for the registerPluginTypes method, assuming one is not already in progress. All resource types for the plugin are processed (i.e., persisted) during this method. To put things in perspective, the AS 7 plugin contains 406 resource types. The plugin with the next highest number of resource types is the AS 5 plugin, with only 49 types.

I see three potential solutions.

1) Increase the transaction timeout globally.
2) Increase the transaction timeout for just the registerPluginTypes method.
3) Change the transaction boundaries such that the transaction or transactions are not so coarse-grained.

Option 1 is a bad idea. It would likely mask not-so-subtle bugs elsewhere in the code base, in addition to creating unnecessary, higher load on the database. Option 2 could be a quick fix, but it is not a robust solution. Whatever setting we choose might not be sufficient for someone running on slow hardware and/or with low bandwidth. Then what about when the AS 8 plugin or some other plugin that contains even more types comes along? The best, most robust solution is option 3. Inserting/updating each resource type should be done in a separate transaction. This will scale much better with plugins that contain a lot of types and will also reduce contention for database resources.

Changes have been pushed to master. I have run the relevant integration tests and have done manual testing of installing all plugins against both oracle and postgresql. Once I see those tests pass on jenkins against postgresql and oracle I will cherry-pick the commit over to the release branch.

master commit hash: 20212484ca

After further review, I have reverted commit 20212484ca that changed the transaction boundaries and instead increased the transaction timeout on the PluginManagerBean.registerPluginTypes method to 30 minutes. The default we use is 10 minutes.

master commit hash: a21997a6624

I still firmly believe that more fine-grained transactions are the way to go, but it really is a larger effort that would involve a lot more testing. If we insert/update each resource type in a separate transaction, we could have a partial failure where some but not all of the resource types are installed/updated. Unless the user immediately resolves the problem, they could wind up with problems like those described in bug 869063. Resource types would exist on the agent but not on the server, resulting in the server rejecting inventory reports.
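To make the trade-off between options 2 and 3 concrete, here is a minimal plain-Java simulation. It is only a sketch: `FakeTx`, `registerCoarse`, `registerFine`, the per-type costs, and the 600 ms timeout are all hypothetical stand-ins, not RHQ code and not the container's real transaction API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TxBoundarySketch {

    /** Hypothetical stand-in for a container-managed transaction with a timeout. */
    static final class FakeTx {
        private final long timeoutMs;
        private long elapsedMs;
        private final List<String> pending = new ArrayList<>();

        FakeTx(long timeoutMs) { this.timeoutMs = timeoutMs; }

        /** Simulates persisting one resource type that takes costMs to write. */
        void persist(String type, long costMs) {
            elapsedMs += costMs;
            if (elapsedMs > timeoutMs) {
                throw new IllegalStateException("The transaction is not active!");
            }
            pending.add(type);
        }

        List<String> commit() { return new ArrayList<>(pending); }
    }

    /** One transaction for the whole plugin: either every type commits or none does. */
    static List<String> registerCoarse(Map<String, Long> typeCosts, long timeoutMs) {
        FakeTx tx = new FakeTx(timeoutMs);
        try {
            for (Map.Entry<String, Long> e : typeCosts.entrySet()) {
                tx.persist(e.getKey(), e.getValue());
            }
            return tx.commit();
        } catch (IllegalStateException timedOut) {
            return new ArrayList<>(); // rolled back: nothing was installed
        }
    }

    /** One transaction per type: a slow type fails alone, leaving a partial install. */
    static List<String> registerFine(Map<String, Long> typeCosts, long timeoutMs) {
        List<String> committed = new ArrayList<>();
        for (Map.Entry<String, Long> e : typeCosts.entrySet()) {
            FakeTx tx = new FakeTx(timeoutMs);
            try {
                tx.persist(e.getKey(), e.getValue());
                committed.addAll(tx.commit());
            } catch (IllegalStateException timedOut) {
                // This type is missing on the "server"; the others remain installed.
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        Map<String, Long> costs = new LinkedHashMap<>();
        for (int i = 0; i < 406; i++) {
            costs.put("type-" + i, 2L); // made-up cost per type
        }
        costs.put("type-7", 700L);      // one made-up slow type
        long timeoutMs = 600L;          // made-up timeout
        System.out.println("coarse: " + registerCoarse(costs, timeoutMs).size());
        System.out.println("fine:   " + registerFine(costs, timeoutMs).size());
    }
}
```

With these made-up numbers the coarse-grained run installs nothing, while the fine-grained run installs 405 of 406 types. The second result is the partial-failure state described above. Raising the timeout passed to `registerCoarse` so it covers the total cost corresponds to option 2.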
Failed plugin installation would have to be resolved in one of two ways: either delete/purge the plugin, or fix whatever problem(s) occurred and re-install it. Deleting the plugin might not be an option, as it could wind up deleting existing resources if it is an existing plugin. Furthermore, because we do not have a way to report plugin installation problems in the UI right now, the user might not be aware of the failure before the plugin JAR is installed on agent machines. Then the problem gets a lot worse. If we decide to revisit using more fine-grained transactions, then we need to also look at implementing application-level rollback so that we can make a best effort to get back to a consistent state.

Changes have been pushed to the release/jon3.1.x branch. Moving to MODIFIED.

commit hash: 1eee73702c2

Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com//buildinfo?buildID=244662.

Cannot reproduce the case even with a transaction timeout of 6 seconds. Marking as verified (JON 3.1.2).
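For reference, the application-level rollback idea raised earlier in this report could be sketched roughly as below. This is an illustration under stated assumptions only: `CompensatingInstallSketch`, `installAll`, the in-memory `installedTypes` set, and the `failOn` failure trigger are hypothetical and do not correspond to anything in the RHQ code base.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CompensatingInstallSketch {

    /** Hypothetical in-memory stand-in for the server's resource type table. */
    final Set<String> installedTypes = new LinkedHashSet<>();

    /**
     * Installs types one at a time (one "transaction" each). If any type fails,
     * removes the types added by this call, newest first, as a best-effort
     * return to the previous state. Types that existed before the call are
     * left untouched. Pass null for failOn to simulate no failure.
     */
    boolean installAll(List<String> types, String failOn) {
        Deque<String> addedByThisCall = new ArrayDeque<>();
        try {
            for (String type : types) {
                if (type.equals(failOn)) {
                    throw new IllegalStateException("simulated failure on " + type);
                }
                // Track only newly added types so compensation never deletes
                // a type that was already present.
                if (installedTypes.add(type)) {
                    addedByThisCall.push(type);
                }
            }
            return true;
        } catch (IllegalStateException failure) {
            while (!addedByThisCall.isEmpty()) {
                installedTypes.remove(addedByThisCall.pop());
            }
            return false;
        }
    }
}
```

The sketch tracks only the types added by the current call, which reflects the concern above that purging an existing plugin could delete existing resources. A real implementation would also have to handle a failure during the compensation step itself, which this sketch ignores.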