697001 – group operations are broken when one fails

Summary: group operations are broken when one fails

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core UI
Sub Component:
Version:	4.0.0.Beta1
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	John Mazzitelli
QA Contact:	Corey Welton
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	699386 RHQ-2100 rhq4

Reported:	2011-04-15 14:49 UTC by John Mazzitelli
Modified:	2011-05-24 01:14 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Clones:	699386
Environment:
Last Closed:
Embargoed:

Description John Mazzitelli 2011-04-15 14:49:34 UTC

Description of problem:

If you have a group operation where one of the resources fails, the other operations never get out of "inprogress".

How reproducible:

Create a group op with multiple resources. Ensure one of them fails. Notice that some of the others are still in progress and never complete. Do NOT select "halt on error", so that all operations are invoked.

I used "parallel" execution - did not specify any order to execute each operation.

Comment 2 John Mazzitelli 2011-04-20 22:14:34 UTC

To me this is a blocker for the rhq4 release. When this happens (and remember, it happens on the DEFAULT tab that you land on when you go to another resource), sometimes the browser just hangs and you have to kill it to make it go away.

Comment 3 John Mazzitelli 2011-04-21 17:35:36 UTC

shoot - ignore comment #2, that was supposed to be on another issue.

Comment 4 John Mazzitelli 2011-04-21 17:38:22 UTC

commit 7f4f22446b9d692e191415f612a40f0f23d42b89

if for some reason (usually due to a bug) a resource op history is in
INPROGRESS for a really long time, we now flip the status to FAILURE after 1
day of being in that inprogress state.

Also, there was a bug in one of our queries that affected group ops that also
is fixed in the above commit.
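The one-day timeout described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `History` class, the `Status` enum, and `failIfStale` are hypothetical stand-ins, and only the one-day threshold and the INPROGRESS to FAILURE flip come from the comment above.

```java
import java.time.Duration;
import java.time.Instant;

public class StaleOperationSketch {
    // Hypothetical status values; the names mirror the states mentioned in this bug.
    public enum Status { INPROGRESS, SUCCESS, FAILURE, CANCELED }

    // Hypothetical stand-in for a resource operation history record.
    public static final class History {
        public Status status;
        public final Instant startedAt;
        public String errorMessage;

        public History(Status status, Instant startedAt) {
            this.status = status;
            this.startedAt = startedAt;
        }
    }

    // One day, matching the threshold described in the commit message.
    public static final Duration MAX_IN_PROGRESS = Duration.ofDays(1);

    // Flips a history to FAILURE if it has been INPROGRESS for longer than
    // the threshold. Returns true only when the status was changed.
    public static boolean failIfStale(History h, Instant now) {
        if (h.status != Status.INPROGRESS) {
            return false; // finished histories are left alone
        }
        if (Duration.between(h.startedAt, now).compareTo(MAX_IN_PROGRESS) <= 0) {
            return false; // still within the allowed window
        }
        h.status = Status.FAILURE;
        h.errorMessage = "Timed out: still in progress after "
            + MAX_IN_PROGRESS.toHours() + " hours";
        return true;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        History stuck = new History(Status.INPROGRESS, now.minus(Duration.ofHours(25)));
        History recent = new History(Status.INPROGRESS, now.minus(Duration.ofHours(2)));
        failIfStale(stuck, now);
        failIfStale(recent, now);
        System.out.println(stuck.status + " " + recent.status);
    }
}
```

A check like this only masks a stuck history after the fact; it does not address why the history got stuck, which is what the later commits in this bug deal with.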

There is still one remaining bug in here that is to be fixed. In OperationManagerBean (which is called when the GroupOperationJob tries to update a history), this Hibernate exception happens during an entityManager.merge here:

Caused by: java.lang.IllegalStateException: org.hibernate.TransientObjectException: object references an unsaved transient instance - save the transient instance before flushing: org.rhq.core.domain.operation.ResourceOperationHistory
...
	at org.rhq.enterprise.server.operation.OperationManagerBean.updateOperationHistory(OperationManagerBean.java:851)

This happens in GroupOperationJob here:

// failed to even send to the agent, immediately mark the job as failed
groupHistory.setErrorMessage(ThrowableUtil.getStackAsString(e));
groupHistory = (GroupOperationHistory) operationManager.updateOperationHistory(
    getUserWithSession(user, true), groupHistory);

Comment 5 John Mazzitelli 2011-04-21 19:10:22 UTC

need to also test if this bug occurs if you use serial execution (i.e. not parallel, where you specify the order in which you want the resources to be invoked)

Comment 6 John Mazzitelli 2011-04-21 19:14:05 UTC

BTW: a simple way to test this is to start two agents on the same box, get them both registered, then import both their platforms, then shutdown the first one but keep the second one up. Then create a compatible group consisting of both platform resources. Then go to the group's operation tab and execute one of the platform's operations.

This is how you can start two agents on the same box. In one cmdline window:

./rhq-agent.sh -l -p agent1

(when the setup prompts appear, you can answer them as you would normally except make the agent name "agent1" and the agent port 26163)

now in a separate cmdline window:

./rhq-agent.sh -l -p agent2

(when the setup prompts appear, you can answer them as you would normally except make the agent name "agent2" and the agent port 36163)

Comment 7 John Mazzitelli 2011-04-21 20:19:15 UTC

commit a68b7c0 fixes the problem where the histories remain in inprogress state. the problem was due to a hibernate issue - we weren't attaching entities before we attempted a merge.

there is one more issue I just found - if we set "halt on error", this same problem happens - the individual resource histories after the error remain inprogress - we should set them all to "canceled".

Comment 8 John Mazzitelli 2011-04-21 21:29:38 UTC

commit 697001

fixes a few things:

1) the status icon was missing in the details view
2) if the operation never started, the details view was showing dec 31, 1970 - now it shows "never"
3) if halt-on-error is set and one resource failed, the subsequent resource operations will be canceled and an error message will be set on the history item
4) the error message will be shown for cancelled history item details, not just for errors
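The halt-on-error behavior in item 3 can be sketched as below for serial execution. This is a simplified model under stated assumptions, not the GroupOperationJob code: `History`, `Status`, `runSerially`, and the `invoke` predicate (true means the resource operation succeeded) are hypothetical names, and the assumption that every history starts out INPROGRESS is made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class HaltOnErrorSketch {
    // Hypothetical status values; the names mirror the states mentioned in this bug.
    public enum Status { INPROGRESS, SUCCESS, FAILURE, CANCELED }

    // Hypothetical stand-in for one resource's operation history.
    public static final class History {
        public final String resource;
        public Status status = Status.INPROGRESS; // assumed initial state
        public String errorMessage;

        public History(String resource) {
            this.resource = resource;
        }
    }

    // Invokes the operation on each resource in order. After a failure with
    // haltOnError set, the remaining histories are marked CANCELED with an
    // error message instead of being left INPROGRESS.
    public static List<History> runSerially(List<String> resources,
                                            Predicate<String> invoke,
                                            boolean haltOnError) {
        List<History> histories = new ArrayList<>();
        for (String r : resources) {
            histories.add(new History(r));
        }

        boolean halted = false;
        for (History h : histories) {
            if (halted) {
                h.status = Status.CANCELED;
                h.errorMessage = "Cancelled: halt-on-error is set and an earlier "
                    + "resource operation failed";
                continue;
            }
            boolean ok;
            try {
                ok = invoke.test(h.resource);
            } catch (RuntimeException e) {
                ok = false;
                h.errorMessage = String.valueOf(e.getMessage());
            }
            h.status = ok ? Status.SUCCESS : Status.FAILURE;
            if (!ok) {
                if (h.errorMessage == null) {
                    h.errorMessage = "Operation failed";
                }
                halted = haltOnError; // only stop the rest when requested
            }
        }
        return histories;
    }

    public static void main(String[] args) {
        List<String> resources = Arrays.asList("agent1", "agent2", "agent3");
        // Pretend agent2 is shut down, so its operation fails.
        for (History h : runSerially(resources, r -> !r.equals("agent2"), true)) {
            System.out.println(h.resource + " " + h.status);
        }
    }
}
```

With `haltOnError` set to false the loop keeps invoking the remaining resources after a failure, so in either mode no history is left in INPROGRESS once the job finishes.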

Comment 9 Sunil Kondkar 2011-04-28 09:56:53 UTC

Verified on build#39 (Version: 4.0.0-SNAPSHOT Build Number: 15a53e5)

Verified for both serial and parallel group operation execution. Followed the steps in comment 6.

The status of the group operation shows failed.

It displays the status icon and the hover information (Ex: failed) in the details view.

The details view shows "Not yet started" when operation never started.

When halt-on-error is set and one resource fails, the subsequent resource operations are cancelled and the error message is displayed on the history item.

The error message is shown for cancelled history item details: "This has been cancelled due to halt-on-error being set on the parent group operation schedule. A previous resource operation that executed prior to this resource operation failed, thus causing this resource operation to be cancelled."

Comment 10 Corey Welton 2011-05-24 01:14:07 UTC

Bookkeeping - closing bug - fixed in recent release.
