Bug 738427
Summary: | when an operation is executed on a large group, some of the child Resource operations are never started, and group operation history remains in-progress indefinitely | | |
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Ian Springer <ian.springer> |
Component: | Core Server | Assignee: | Nobody <nobody> |
Status: | NEW --- | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | low | | |
Version: | 4.1 | CC: | hrupp, mazz |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Bug Blocks: | 620933, 678340 |
Description
Ian Springer
2011-09-14 18:35:17 UTC
The group operation did eventually complete, because it timed out. The following message was displayed in the group history's Error field: "This group operation timed out before all child resource operations could complete normally, those still in progress will attempt to be canceled." However, I think there may still be an issue here, because it's not that those 274 member operations never completed - they were never even started. Such operations should probably just be marked as failed, with an error message explaining that RHQ failed to start them for unknown reasons.

Note, I noticed the following message in the Server log for each of the member operations:

16:51:25,105 INFO [OperationManagerBean] Operation execution seems to have been orphaned - timing it out: ResourceOperationHistory: resource=[Resource[id=11656, type=service-a, key=service-a-8, name=service-a-8, parent=server-a-3, version=1.0]], group-history=[GroupOperationHistory: group=[ResourceGroup[id=10055, name=DynaGroup - compats ( PerfTest,service-a ), category=COMPATIBLE, type=service-a, isDynaGroup=true, isClusterGroup=false]], id=[11013], job-name=[rhq-group-10055-889910757-1316031145548], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:12:26 EDT 2011], mtime=[Wed Sep 14 16:12:26 EDT 2011], duration-millis=[2338531], error-message=[null]], id=[11804], job-name=[rhq-resource-11656-889910757-1316031194522], job-group=[rhq-group-10055], status=[In Progress], subject-name=[rhqadmin], ctime=[Wed Sep 14 16:13:14 EDT 2011], mtime=[Wed Sep 14 16:40:05 EDT 2011], duration-millis=[679874], error-message=[null]

This is to be expected. If for some reason an operation gets "stuck" in the in-progress state (say, because a catastrophic error in the server means the job never gets kicked off, or because the agent dies in the middle and never reports back), we'll eventually time it out using our server-side async cleanup job, as comment #2 shows.
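For illustration only, a cleanup pass of the kind described above might look roughly like the sketch below. This is not the actual OperationManagerBean code: SketchStatus, SketchHistory, OrphanSweep, the timeout parameter, and the error text are all hypothetical stand-ins, and the sketch assumes "stuck" simply means "still in progress longer than the timeout after creation".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the real RHQ domain classes.
enum SketchStatus { IN_PROGRESS, SUCCESS, FAILURE }

class SketchHistory {
    final int id;
    final long createdMillis;                        // when the history row was created
    SketchStatus status = SketchStatus.IN_PROGRESS;
    String errorMessage;                             // stays null until something goes wrong

    SketchHistory(int id, long createdMillis) {
        this.id = id;
        this.createdMillis = createdMillis;
    }
}

class OrphanSweep {
    // Fails every history that is still in progress more than timeoutMillis
    // after its creation, and returns the histories that were changed.
    static List<SketchHistory> timeOutOrphans(List<SketchHistory> histories,
                                              long nowMillis, long timeoutMillis) {
        List<SketchHistory> timedOut = new ArrayList<>();
        for (SketchHistory h : histories) {
            boolean stuck = h.status == SketchStatus.IN_PROGRESS;
            boolean expired = nowMillis - h.createdMillis > timeoutMillis;
            if (stuck && expired) {
                h.status = SketchStatus.FAILURE;
                h.errorMessage = "Timed out by server-side cleanup (assumed wording)";
                timedOut.add(h);
            }
        }
        return timedOut;
    }
}
```

A sweep like this cannot tell an operation that ran and hung from one that was never dispatched; both simply look "in progress" past the deadline.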
Yeah, they eventually get timed out, but ideally we should add a note to the operation result's errorMessage that informs the user the operation was never even run, e.g.: "Note, the operation was never invoked. This is probably because either the RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable." We could ascertain that the operation was never executed by looking for a null startedTime.
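A minimal sketch of that check, assuming the timeout code has the history's startedTime and current errorMessage in hand. NeverInvokedNote and withNote are hypothetical names, not existing RHQ API; only the note text is taken from the suggestion above.

```java
// Hypothetical helper for illustration; not part of the RHQ code base.
class NeverInvokedNote {
    static final String NOTE =
        "Note, the operation was never invoked. This is probably because either "
        + "the RHQ Server was in a bad state or the corresponding RHQ Agent was not reachable.";

    // Returns the error message to store when an operation is timed out.
    // A null startedTime is taken to mean the operation was never started.
    static String withNote(Long startedTime, String errorMessage) {
        if (startedTime != null) {
            return errorMessage;              // it did start, so leave the message alone
        }
        if (errorMessage == null || errorMessage.isEmpty()) {
            return NOTE;                      // nothing to append to
        }
        return errorMessage + " " + NOTE;     // keep the original text, then add the note
    }
}
```

Appending rather than replacing keeps whatever timeout text is already in the field, so the user sees both that the operation timed out and that it was never run.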