I executed an operation on a group of 1010 PerfTest:service-a Resources. An hour or two later, the group operation history still has a status of IN_PROGRESS, and 274 of the member Resource operation histories have a status of IN_PROGRESS and a started time of "not yet started".
At some point soon after I executed the group operation, my server ran out of heap and I had to restart it. This may be why the member operations never got kicked off. However, I think this is a bug. There should be a Quartz job that either marks such member operations as timed out or failed, so that the group operation history can be marked as completed/failed.
The group operation did eventually complete, because it timed out. The following message was displayed in the group history's Error field:
This group operation timed out before all child resource operations could complete normally, those still in progress will attempt to be canceled.
However, I think there may still be an issue here, because it's not that those 274 member operations never completed - they were never even started. Such operations should probably just be marked as failed with an error message explaining that RHQ failed to start them for unknown reasons.
Note, I noticed the following message in the Server log for each of the member operations:
16:51:25,105 INFO [OperationManagerBean] Operation execution seems to have been orphaned - timing it out: ResourceOperationHistory: resource=[Resource[id=11656, type=service-a, key=service-a-8, name=service-a-8, parent=server-a-3, version=1.0]], group-history=[GroupOperationHistory: group=[ResourceGroup[id=10055, name=DynaGroup - compats ( PerfTest,service-a ), category=COMPATIBLE, type=service-a, isDynaGroup=true, isClusterGroup=false]], id=, job-name=[rhq-group-10055-889910757-1316031145548], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:12:26 EDT 2011], mtime=[Wed Sep 14 16:12:26 EDT 2011], duration-millis=, error-message=[null]], id=, job-name=[rhq-resource-11656-889910757-1316031194522], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:13:14 EDT 2011], mtime=[Wed Sep 14 16:40:05 EDT 2011], duration-millis=, error-message=[null]
This is to be expected. If for some reason an operation gets "stuck" in the in-progress state (say, because a catastrophic error in the server keeps the job from ever being kicked off, or because the agent dies in the middle and never reports back), we'll eventually time it out using our server-side async cleanup job, as comment #2 shows.
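For anyone following along, the cleanup behavior described above can be pictured roughly like this. This is only a sketch under assumptions, not the actual OperationManagerBean code: the `History` class, the `Status` enum, the `timeOutOrphans` method, the use of creation time as the age reference, and the choice of FAILURE as the resulting status are all hypothetical stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OrphanedOperationSweep {
    // Hypothetical status values; RHQ's real enum may differ.
    enum Status { IN_PROGRESS, SUCCESS, FAILURE }

    // Hypothetical, simplified stand-in for an operation history record.
    static final class History {
        final int id;
        final long createdTime; // epoch millis
        Status status = Status.IN_PROGRESS;
        String errorMessage;

        History(int id, long createdTime) {
            this.id = id;
            this.createdTime = createdTime;
        }
    }

    /**
     * Marks every IN_PROGRESS history older than timeoutMillis as FAILURE
     * and returns the histories that were changed.
     */
    static List<History> timeOutOrphans(List<History> histories, long nowMillis, long timeoutMillis) {
        List<History> timedOut = new ArrayList<>();
        for (History h : histories) {
            boolean tooOld = nowMillis - h.createdTime > timeoutMillis;
            if (h.status == Status.IN_PROGRESS && tooOld) {
                h.status = Status.FAILURE;
                h.errorMessage = "Timed out";
                timedOut.add(h);
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        List<History> histories = new ArrayList<>();
        histories.add(new History(1, 0L));      // old, still in progress
        histories.add(new History(2, 9_000L));  // recent, still in progress
        List<History> timedOut = timeOutOrphans(histories, 10_000L, 5_000L);
        System.out.println("Timed out " + timedOut.size() + " orphaned operation(s)");
    }
}
```

The point of the sketch is only that the sweep looks at age and status, so an operation that never started and one that started but never reported back end up handled the same way.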
Yeah, they eventually get timed out, but ideally we should add a note to the operation result's errorMessage that informs the user the operation was never even run, e.g.:
"Note, the operation was never invoked. This is probably because either the RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable."
We could ascertain that the operation was never executed by looking for a null startedTime.
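A minimal sketch of that check, assuming the timeout code has the history's started time available as a nullable `Long`. The class and method names here (`NeverInvokedNote`, `timeoutErrorMessage`) are made up for illustration and are not existing RHQ APIs; only the note text comes from the suggestion above.

```java
public class NeverInvokedNote {
    // Note text as proposed in this comment.
    static final String NEVER_INVOKED_NOTE =
        "Note, the operation was never invoked. This is probably because either the "
            + "RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable.";

    /**
     * Builds the error message to store when an operation is timed out.
     * A null startedTime is taken to mean the operation was never invoked,
     * in which case the note is appended to the base message.
     */
    static String timeoutErrorMessage(Long startedTime, String baseMessage) {
        if (startedTime == null) {
            return baseMessage + " " + NEVER_INVOKED_NOTE;
        }
        return baseMessage;
    }

    public static void main(String[] args) {
        // Never started: the note is appended.
        System.out.println(timeoutErrorMessage(null, "Operation timed out."));
        // Started but never finished: the base message is left alone.
        System.out.println(timeoutErrorMessage(1316031194522L, "Operation timed out."));
    }
}
```

Keeping the base timeout message and appending the note means operations that did start but never reported back would still get the existing message unchanged.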