738427 – when an operation is executed on a large group, some of the child Resource operations are never started, and group operation history remains in-progress indefinitely

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	rhq-perf jon3

Reported:	2011-09-14 18:35 UTC by Ian Springer
Modified:	2024-03-04 13:35 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Embargoed:

Description Ian Springer 2011-09-14 18:35:17 UTC

I executed an operation on a group of 1010 PerfTest:service-a Resources. An hour or two later, the group operation history still has a status of IN PROGRESS, and 274 of the member Resource operation histories have a status of IN_PROGRESS and a started time of "not yet started". 

At some point soon after I executed the group operation, my server ran out of heap and I had to restart it. This may be why the member operations never got kicked off. However, I still think this is a bug. There should be a Quartz job that marks such member operations as either timed out or failed, so that the group operation history can in turn be marked as completed/failed.
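A rough sketch of the sweep such a job might perform, in plain Java with no Quartz dependency. All class, field, and method names here (OrphanSweep, MemberHistory, failNeverStarted, groupStatus) are hypothetical stand-ins and not the RHQ domain classes or API; the timeout value and the error text are assumptions as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the RHQ domain classes.
final class OrphanSweep {
    enum Status { IN_PROGRESS, SUCCESS, FAILURE }

    static final class MemberHistory {
        final int id;
        final long createdTime;  // epoch millis when the history row was created
        final Long startedTime;  // null means the operation was never started
        Status status;
        String errorMessage;

        MemberHistory(int id, long createdTime, Long startedTime, Status status) {
            this.id = id;
            this.createdTime = createdTime;
            this.startedTime = startedTime;
            this.status = status;
        }
    }

    private OrphanSweep() {}

    /**
     * Marks as FAILURE every member that is still IN_PROGRESS, was never
     * started, and was created at least timeoutMillis ago.
     * Returns the ids of the members that were changed.
     */
    static List<Integer> failNeverStarted(List<MemberHistory> members, long now, long timeoutMillis) {
        List<Integer> failed = new ArrayList<>();
        for (MemberHistory m : members) {
            boolean stale = now - m.createdTime >= timeoutMillis;
            if (m.status == Status.IN_PROGRESS && m.startedTime == null && stale) {
                m.status = Status.FAILURE;
                m.errorMessage = "Operation was never started";
                failed.add(m.id);
            }
        }
        return failed;
    }

    /**
     * The group stays IN_PROGRESS while any member is in progress; otherwise
     * it is FAILURE if any member failed, and SUCCESS if none did.
     */
    static Status groupStatus(List<MemberHistory> members) {
        boolean anyFailed = false;
        for (MemberHistory m : members) {
            if (m.status == Status.IN_PROGRESS) {
                return Status.IN_PROGRESS;
            }
            if (m.status == Status.FAILURE) {
                anyFailed = true;
            }
        }
        return anyFailed ? Status.FAILURE : Status.SUCCESS;
    }
}
```

A scheduled job could call failNeverStarted for each in-progress group and then use groupStatus to decide whether the group history can be closed out.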

Comment 1 Ian Springer 2011-09-14 20:58:29 UTC

The group operation did eventually complete, because it timed out. The following message was displayed in the group history's Error field:

This group operation timed out before all child resource operations could complete normally, those still in progress will attempt to be canceled.

However, I think there may still be an issue here, because it's not that those 274 member operations never completed - they were never even started. Such operations should probably just be marked as failed with an error message explaining that RHQ failed to start them for unknown reasons.

Comment 2 Ian Springer 2011-09-14 21:18:23 UTC

Note, I noticed the following message in the Server log for each of the member operations:

16:51:25,105 INFO  [OperationManagerBean] Operation execution seems to have been orphaned - timing it out: ResourceOperationHistory: resource=[Resource[id=11656, type=service-a, key=service-a-8, name=service-a-8, parent=server-a-3, version=1.0]], group-history=[GroupOperationHistory: group=[ResourceGroup[id=10055, name=DynaGroup - compats ( PerfTest,service-a ), category=COMPATIBLE, type=service-a, isDynaGroup=true, isClusterGroup=false]], id=[11013], job-name=[rhq-group-10055-889910757-1316031145548], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:12:26 EDT 2011], mtime=[Wed Sep 14 16:12:26 EDT 2011], duration-millis=[2338531], error-message=[null]], id=[11804], job-name=[rhq-resource-11656-889910757-1316031194522], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:13:14 EDT 2011], mtime=[Wed Sep 14 16:40:05 EDT 2011], duration-millis=[679874], error-message=[null]

Comment 3 John Mazzitelli 2011-09-28 15:00:01 UTC

This is to be expected. If for some reason an operation gets "stuck" in the in-progress state (say, because a catastrophic error in the server prevents the job from ever being kicked off, or because the agent dies in the middle and never reports back), we'll eventually time it out using our server-side async cleanup job, as comment #2 shows.

Comment 4 Ian Springer 2011-09-28 15:07:48 UTC

Yeah, they eventually get timed out, but ideally we should add a note to the operation result's errorMessage that informs the user the operation was never even run, e.g.:

"Note, the operation was never invoked. This is probably because either the RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable."

We could ascertain that the operation was never executed by looking for a null startedTime.
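As a minimal sketch of that check: the helper below is hypothetical (NeverStartedNote and timeoutMessage are not RHQ classes or methods), and it assumes the started time is available as a nullable Long at the point where the cleanup job times the operation out.

```java
// Hypothetical helper for illustration only; not part of the RHQ code base.
final class NeverStartedNote {
    static final String NOTE = "Note, the operation was never invoked. This is probably because either "
        + "the RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable.";

    private NeverStartedNote() {}

    /**
     * Returns the error message to store when an operation is timed out.
     * A null startedTime is taken to mean the operation was never run,
     * in which case the note is appended to any existing message.
     */
    static String timeoutMessage(Long startedTime, String existingError) {
        if (startedTime != null) {
            return existingError; // the operation did start; leave the message unchanged
        }
        if (existingError == null || existingError.isEmpty()) {
            return NOTE;
        }
        return existingError + "\n" + NOTE;
    }
}
```

Appending rather than replacing keeps whatever timeout text the cleanup job already writes.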
