Bug 588515

Summary: framework for asynchronous aggregate-level work
Product: [Other] RHQ Project
Reporter: Joseph Marques <jmarques>
Component: Core Server
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED DEFERRED
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: medium
Version: 3.0.0
CC: ccrouch, jshaughn
Keywords: FutureFeature
Hardware: All
OS: Linux
Doc Type: Enhancement
Last Closed: 2014-05-29 17:32:01 UTC

Description Joseph Marques 2010-05-03 20:47:47 UTC
We have a growing number of SLSBs that need to perform more (and more complex) work as the system side-effect(s) of a single user action.  The necessary work may involve any of the following:

1) loading large quantities of entities into the Hibernate session, which requires meticulous use of flush/clear to keep memory consumption under control (see the sketch after this list)
2) interacting with agents, which requires timeouts on communication attempts so that the thread carrying out the work is not blocked indefinitely
3) performing updates against a large number of rows, which increases the likelihood of transaction exceptions at READ_COMMITTED isolation level (for example, not noticing that a row you associated yourself with was deleted until persist time)
4) running very large transactions, which can put significant load on the transaction manager and exceed the recommended DB tuning values (based on the size of the given enterprise), causing further performance bottlenecks as the DB attempts to catch up with its transactional backlog
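
To make item 1 concrete, the flush/clear discipline looks roughly like the sketch below.  This illustrates the pattern only; the entity is a stand-in, not real RHQ code:

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;

// Stand-in entity; any large RHQ entity set would do.
@Entity
class MeasurementScheduleStandIn {
    @Id Long id;
    long intervalMillis;
}

public class BatchedUpdateExample {
    private static final int BATCH_SIZE = 100;

    // Applies an update to many rows while keeping the persistence context
    // small: flush() pushes pending SQL to the database, clear() detaches the
    // managed entities so the session does not grow without bound.
    public void updateInBatches(EntityManager em, List<Long> ids, long newInterval) {
        int processed = 0;
        for (Long id : ids) {
            MeasurementScheduleStandIn schedule = em.find(MeasurementScheduleStandIn.class, id);
            schedule.intervalMillis = newInterval;
            if (++processed % BATCH_SIZE == 0) {
                em.flush();
                em.clear();
            }
        }
        em.flush();
        em.clear();
    }
}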

Here are just some of the more recent bugs we've seen from these issues:

* https://bugzilla.redhat.com/show_bug.cgi?id=RHQ-2479 -- transaction timeout occurs when updating metric template and applying it to existing Resources, when there are a large number of Resources spread across many Agents
* https://bugzilla.redhat.com/show_bug.cgi?id=RHQ-2257 -- Alert Templates take hours to be applied to existing resources

Realize, though, that both of those bugs refer to subsystems that have *already* received varying degrees of performance tuning.  However, the tuning to date has focused only on keeping the size of the Hibernate session reasonable so that the application server does not run into memory issues while communicating with the database.

Unfortunately, what both of these bugs indicate is that Hibernate session tuning alone was insufficient.  Even though the act of "updating a metric template" or "updating an alert template" seems simple to the user, it might require thousands if not tens of thousands of row updates, talking to dozens if not hundreds of agents, loading megabytes of data across the wire between the server and database, etc.

My proposal to fix these problems would be to use smaller transactions.  Instead of trying to update all metric templates in a single transaction, which requires talking to every agent in the system, the work could be refactored so that the updates required for any one box (and all communication between the agent and the db) are done in their own transaction.  Likewise, instead of trying to update all alert definitions under an alert template in a single transaction, each individual alert definition could be updated in its own transaction.  The problem here, however, is that the logical update at the aggregate / group / template level is no longer atomic.  Instead, each individual transaction carried out as part of the larger one can succeed or fail on its own.
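
A minimal sketch of that per-agent split, assuming hypothetical bean and method names (not RHQ's actual SLSB interfaces):

import java.util.List;
import javax.annotation.Resource;
import javax.ejb.Local;
import javax.ejb.SessionContext;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;

// Hypothetical local interface; the names are stand-ins for RHQ's real SLSBs.
@Local
interface TemplateUpdateLocal {
    void updateTemplateForAllAgents(int templateId, List<Integer> agentIds);
    void updateTemplateForOneAgent(int templateId, int agentId);
}

@Stateless
public class TemplateUpdateBean implements TemplateUpdateLocal {

    @Resource
    private SessionContext sessionContext;

    // The outer call deliberately runs outside any transaction: each
    // per-agent chunk commits (or rolls back) on its own.
    @TransactionAttribute(TransactionAttributeType.NEVER)
    public void updateTemplateForAllAgents(int templateId, List<Integer> agentIds) {
        // transaction attributes only apply across the container proxy, so
        // the bean must call itself through its business interface, not 'this'
        TemplateUpdateLocal self = sessionContext.getBusinessObject(TemplateUpdateLocal.class);
        for (Integer agentId : agentIds) {
            try {
                self.updateTemplateForOneAgent(templateId, agentId);
            } catch (Exception e) {
                // one agent's failure no longer rolls back the whole
                // aggregate; record it here so it can be reported/retried
            }
        }
    }

    // Each per-agent unit of work gets its own, much smaller transaction.
    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void updateTemplateForOneAgent(int templateId, int agentId) {
        // load only this agent's resources, update their rows, and push the
        // change down to the agent, all within one bounded transaction
    }
}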

Although this proposal may seem like it has overlap with what the quartz scheduler provides today, it is actually much more than that.  Take for example group operations.  A group operation creates one cluster-aware quartz job for the group as well as N cluster-aware quartz jobs, one for each of the N children in that group.  The child jobs can be executed either all at the same time or serially.  The jobs can independently succeed or fail, and they have their own timeouts.  The group job cannot complete until all child jobs have completed or timed out, or until the group-level timeout is reached, which then forces the cancellation of the remaining child jobs.

Notice, however, that nearly the entire last paragraph talked about semantics that were *layered on top* of the quartz scheduler: the relationship between individual jobs and the larger aggregate controller, the serial or concurrent nature of job execution, and the timeout logic.  In other words, quartz provides the concept of a "job"; it does not provide aggregate-level semantics like what we need for many of the template/group features of the product.
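
To make the "layered on top" point concrete, a hand-rolled aggregate controller against the plain quartz Job API might look like the sketch below; the JobStatusStore bookkeeping is hypothetical, because quartz itself provides nothing like it:

import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Hypothetical bookkeeping the aggregate layer has to supply itself.
interface JobStatusStore {
    boolean hasRunningChildren(String groupJobId);
    void cancelRemainingChildren(String groupJobId);
    void markGroupComplete(String groupJobId);
}

// Everything in here besides the Job interface itself (child bookkeeping,
// group timeout, forced cancellation) is the boilerplate described above.
public class AggregateControllerJob implements Job {

    // assumed to be wired in at scheduler startup
    static JobStatusStore statusStore;

    public void execute(JobExecutionContext context) throws JobExecutionException {
        String groupJobId = context.getMergedJobDataMap().getString("groupJobId");
        long timeoutMillis = context.getMergedJobDataMap().getLong("groupTimeoutMillis");
        long deadline = System.currentTimeMillis() + timeoutMillis;

        while (statusStore.hasRunningChildren(groupJobId)) {
            if (System.currentTimeMillis() > deadline) {
                // group-level timeout reached: force-cancel the stragglers
                statusStore.cancelRemainingChildren(groupJobId);
                break;
            }
            try {
                Thread.sleep(5000); // poll child status periodically
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new JobExecutionException(ie);
            }
        }
        statusStore.markGroupComplete(groupJobId);
    }
}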

As a result, I suggest that we introduce a mechanism that would ease the implementation and maintenance of template/group-level features.  Instead of having to rewrite boilerplate code for quartz job creation and submission, job/state management, timeout tracking, retry mechanisms, boilerplate exception handling, etc...all of these features would be available in the framework.

All you would have to do is provide a little bit of metadata to describe the job, tell it what code you want to execute at the child layer, provide some mechanism to pass the contextually relevant data to each child instance, and submit it.  We could then provide a job tracker that monitored the progress of each job, reported percent completion as well as percent failure (those that couldn't complete even with retry), provided an audit trail through which failures could be manually retried or removed from the queue entirely, etc.
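
As a rough illustration only, the surface area of such a framework might look something like this; every name here is hypothetical, since the framework does not exist yet:

import java.io.Serializable;
import java.util.List;

// Hypothetical API sketch for the proposed aggregate-job framework.
public interface AggregateJobFramework {

    /** Describes the aggregate job: timeouts, retry policy, concurrency. */
    class AggregateJobSpec implements Serializable {
        public long childTimeoutMillis;
        public long groupTimeoutMillis;
        public int maxRetriesPerChild;
        public boolean runChildrenConcurrently;
    }

    /** The only code a caller has to write: the per-child unit of work. */
    interface ChildWork<T extends Serializable> {
        void execute(T childContext) throws Exception;
    }

    /** Submits one aggregate job; each context element becomes one child job. */
    <T extends Serializable> String submit(AggregateJobSpec spec,
                                           ChildWork<T> work,
                                           List<T> childContexts);

    /** Tracker hooks: percent complete, percent failed, manual retry/removal. */
    double percentComplete(String aggregateJobId);
    double percentFailed(String aggregateJobId);
    void retryFailedChild(String aggregateJobId, int childIndex);
    void removeFailedChild(String aggregateJobId, int childIndex);
}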

A couple options:

1) write a custom framework tailored to our application backend (EJB3 SLSBs) and scheduler library (quartz) - looping constructs and error handling written in Java code as part of the aggregate job framework itself

* pros - no new 3rd party dependencies, no integration work necessary, we own the code so we can fix and enhance as necessary to provide the exact semantics desired, easier to understand than a general framework as it will only provide features we absolutely need

* cons - capital investment and maintenance cost, error handling may be limited over what a full workflow engine can provide

2) use a workflow engine, such as jbpm, where we would write an abstract workflow definition that is tailored to our application backend (EJB3 SLSBs) and scheduler library (quartz) - looping constructs and error handling would be written as state transitions of the workflow definition in generic jPDL

* pros - extremely flexible error-handling because compensating activities are at the heart of the workflow world, whether they are automated or require human intervention; we may eventually want to orchestrate different RHQ features together into higher-level workflows for system administrators to have greater control of their environments across subsystems
* cons - another 3rd party library to integrate and maintain; overriding the default error-handling mechanism may be so rare that it's not worth it; custom error-handling logic, although possible, would require writing jPDL (jbpm process definition language), which increases the complexity of that job, which in turn increases its maintenance burden and makes it more difficult to immediately grok

3) use an open source MapReduce framework where we write an extension to the Map job, tailored to our application backend (EJB3 SLSBs) and scheduler library (quartz)

* pros - we will eventually (if I had my druthers) be using the MapReduce paradigm for (at the very least) rendering metric graphs, indexing events, and search, so we might as well have only one job submission / execution / control framework
* cons - could be considered overkill for this problem since work distribution (across nodes and racks) is not required...reduce phase is more or less a no-op...we just want to be able to dissect a large transaction into small fragments while incorporating exception handling, retry mechanisms, and timeout logic...all of which would have to be coded as a thin layer on top of the abstract Map job anyway
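
For a sense of what option 3 would look like in practice, here is a minimal sketch against Hadoop's Mapper API, assuming one resource id per input line; the per-resource update call is a placeholder, and there is no reduce phase, just as the cons above note:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of option 3: each map() call handles one slice of the aggregate
// update; the reduce phase is simply omitted as a no-op.
public class TemplateUpdateMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int resourceId = Integer.parseInt(value.toString().trim());
        try {
            updateOneResource(resourceId); // placeholder for the real SLSB call
        } catch (Exception e) {
            // retry/timeout/audit handling would still have to be layered on
            // top of the framework here, exactly as the cons above point out
            context.getCounter("templateUpdate", "failures").increment(1);
        }
    }

    private void updateOneResource(int resourceId) {
        // hypothetical: invoke the per-resource transactional update
    }
}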

Once the framework is written, I see several candidates for rewrite and consolidation onto it:

* alert template creation / update
* group alert definition creation / update
* group plugin configuration
* group resource configuration
* group operations
* metric template updates
* compatible-group / auto-group / cluster-group metric schedule updates

Comment 1 Corey Welton 2010-12-16 13:20:05 UTC
Triaged 13-Dec

Comment 3 Jay Shaughnessy 2014-05-29 17:32:01 UTC
We have addressed these sorts of issues in various ways: smaller transactions is one, offloading metrics to a NoSQL DB is another, and recursive queries are yet another.  There is more to do, but it is incremental and addressed as needed.  This is still a useful reference, but closing as there is no specific work planned here.