Red Hat Bugzilla – Bug 697001
group operations are broken when one fails
Last modified: 2011-05-23 21:14:09 EDT
Description of problem:
If you have a group operation where one of the resources fails, the other operations never get out of "inprogress".
Create a group op with multiple resources. Ensure one of them fails. Notice that some of the others are still in progress and never complete. Do NOT select "halt on error", so that all operations are invoked.
I used "parallel" execution - did not specify any order to execute each operation.
To me this is a blocker to the rhq4 release. When this happens (and remember, it happens on the DEFAULT tab that you land on when you go to another resource), sometimes the browser just hangs and you have to kill it to make it go away.
shoot - ignore comment #2, that was supposed to be on another issue.
If for some reason (usually due to a bug) a resource op history is in
INPROGRESS for a really long time, we now flip the status to FAILURE after 1
day of being in that inprogress state.
Also, there was a bug in one of our queries that affected group ops that also
is fixed in the above commit.
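A rough sketch of the "flip to FAILURE after 1 day" rule described above. This is not the code from the commit; the class, the History stand-in and the error message text are all hypothetical, and only the one-day threshold and the INPROGRESS/FAILURE statuses come from this bug.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class StaleOperationSweeper {
    enum Status { INPROGRESS, SUCCESS, FAILURE, CANCELED }

    /** Hypothetical stand-in for a resource operation history; not the RHQ entity. */
    static final class History {
        final String resource;
        final Instant startedAt;
        Status status = Status.INPROGRESS;
        String errorMessage;

        History(String resource, Instant startedAt) {
            this.resource = resource;
            this.startedAt = startedAt;
        }
    }

    // One day, as stated in the comment above.
    static final Duration MAX_IN_PROGRESS = Duration.ofDays(1);

    /** Flips histories that have been INPROGRESS for more than one day to FAILURE; returns the count. */
    static int failStale(List<History> histories, Instant now) {
        int flipped = 0;
        for (History h : histories) {
            boolean stale = Duration.between(h.startedAt, now).compareTo(MAX_IN_PROGRESS) > 0;
            if (h.status == Status.INPROGRESS && stale) {
                h.status = Status.FAILURE;
                // Assumed wording; the real message is not quoted in this bug.
                h.errorMessage = "Timed out after " + MAX_IN_PROGRESS.toHours() + " hours in progress";
                flipped++;
            }
        }
        return flipped;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<History> histories = new ArrayList<>();
        histories.add(new History("agent1-platform", now.minus(Duration.ofHours(30))));
        histories.add(new History("agent2-platform", now.minus(Duration.ofMinutes(5))));

        System.out.println(failStale(histories, now) + " stale history flipped");
        for (History h : histories) {
            System.out.println(h.resource + " -> " + h.status);
        }
    }
}
```

Passing "now" in as a parameter rather than reading the clock inside the method keeps the check easy to test with fixed timestamps.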
There is still one remaining bug here to be fixed. In OperationManagerBean (which is called when the GroupOperationJob tries to update a history), this hibernate exception happens during an entityManager.merge here:
Caused by: java.lang.IllegalStateException: org.hibernate.TransientObjectException: object references an unsaved transient instance - save the transient instance before flushing: org.rhq.core.domain.operation.ResourceOperationHistory
This happens in GroupOperationJob here:
// failed to even send to the agent, immediately mark the job as failed
groupHistory = (GroupOperationHistory) operationManager.updateOperationHistory(
getUserWithSession(user, true), groupHistory);
Need to also test whether this bug occurs with serial execution (i.e. not parallel, where you specify the order in which you want the resources to be invoked).
BTW: a simple way to test this is to start two agents on the same box, get them both registered, then import both their platforms, then shut down the first one but keep the second one up. Then create a compatible group consisting of both platform resources. Then go to the group's operation tab and execute one of the platform's operations.
This is how you can start two agents on the same box. In one cmdline window:
./rhq-agent.sh -l -p agent1
(when the setup prompts appear, you can answer them as you would normally, except set the agent name to "agent1" and the agent port to 26163)
now in a separate cmdline window:
./rhq-agent.sh -l -p agent2
(when the setup prompts appear, you can answer them as you would normally, except set the agent name to "agent2" and the agent port to 36163)
commit a68b7c0 fixes the problem where the histories remain in inprogress state. the problem was due to a hibernate issue - we weren't attaching entities before we attempted a merge.
there is one more issue I just found - if we set "halt on error", this same problem happens - the individual resource histories after the error remain inprogress - we should set them all to "canceled".
fixes a few things:
1) the status icon was missing in the details view
2) if the operation never started, the details view was showing dec 31, 1970 - now it shows "never"
3) if halt-on-error is set and one resource failed, the subsequent resource operations will be canceled and an error message will be set on the history item
4) the error message will be shown for cancelled history item details, not just for errors
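To illustrate items 2) and 3) above, here is a minimal sketch of a serial run with halt-on-error, plus a "never" label for operations that never started. It is not the committed RHQ code: the class, the History stand-in, the invoke predicate and the message strings are hypothetical, and a start time of 0 is assumed to be what produced the "dec 31, 1970" display.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.function.Predicate;

public class HaltOnErrorSketch {
    enum Status { INPROGRESS, SUCCESS, FAILURE, CANCELED }

    /** Hypothetical stand-in for a resource operation history; not the RHQ entity. */
    static final class History {
        final String resource;
        Status status = Status.INPROGRESS;
        long startedTime = 0L; // 0 means the operation never started (assumption)
        String errorMessage;

        History(String resource) {
            this.resource = resource;
        }
    }

    /**
     * Invokes each resource in order. The invoke predicate returns true on success.
     * With haltOnError set, everything after the first failure is canceled, not left in progress.
     */
    static void runSerially(List<History> histories, Predicate<String> invoke,
                            boolean haltOnError, long now) {
        boolean halted = false;
        for (History h : histories) {
            if (halted) {
                h.status = Status.CANCELED;
                // Assumed wording, for illustration only.
                h.errorMessage = "Canceled: an earlier resource operation failed and halt-on-error is set";
                continue;
            }
            h.startedTime = now;
            if (invoke.test(h.resource)) {
                h.status = Status.SUCCESS;
            } else {
                h.status = Status.FAILURE;
                h.errorMessage = "Invocation failed";
                halted = haltOnError;
            }
        }
    }

    /** Shows "never" instead of the epoch date when the operation did not start. */
    static String formatStarted(long startedTime) {
        return startedTime <= 0 ? "never" : new Date(startedTime).toString();
    }

    public static void main(String[] args) {
        List<History> histories = new ArrayList<>();
        histories.add(new History("agent2-platform"));
        histories.add(new History("agent1-platform")); // pretend this agent is down
        histories.add(new History("agent3-platform"));

        runSerially(histories, r -> !r.equals("agent1-platform"), true, System.currentTimeMillis());
        for (History h : histories) {
            System.out.println(h.resource + " -> " + h.status + ", started: " + formatStarted(h.startedTime));
        }
    }
}
```

With haltOnError set to false the loop keeps going after a failure, which matches the "invoke all operations" case from the original description.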
Verified on build#39 (Version: 4.0.0-SNAPSHOT Build Number: 15a53e5)
Verified for both serial and parallel group operation execution. Followed the steps as per comment 6.
The status of the group operation shows failed.
It displays the status icon and the hover information (Ex: failed) in the details view.
The details view shows "Not yet started" when the operation never started.
When halt-on-error is set and one resource failed, the subsequent resource operations are cancelled and the error message is displayed on the history item.
The error message is shown for cancelled history item details: "This has been cancelled due to halt-on-error being set on the parent group operation schedule. A previous resource operation that executed prior to this resource operation failed, thus causing this resource operation to be cancelled."
Bookkeeping - closing bug - fixed in recent release.