Bug 974352
Summary: | Log implicit XML-RPC retries on lab controller | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Nick Coghlan <ncoghlan> |
Component: | lab controller | Assignee: | Nick Coghlan <ncoghlan> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 0.12 | CC: | asaha, dcallagh, llim, qwan, rglasz, rmancy, xjia |
Target Milestone: | 0.13.x | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | 974319 | Environment: | |
Last Closed: | 2013-07-11 02:44:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
Description
Nick Coghlan
2013-06-14 04:13:07 UTC
Analysis of the available logs suggests that this failure mode occurs when the "mark_command_running" operation succeeds on the server, and then one of the following happens:

1. The lab controller fails to receive the reply and thus assumes the request failed and either retries (giving a "Command X already run" failure) or else abandons the operation.
2. The lab controller receives the reply and handles the command, but the subsequent call to "mark_command_running" or "mark_command_failed" doesn't work.

Thus, this appears to be a secondary failure that arises only in the presence of network stability issues for the main server, the lab controller or both.

The available logs are currently confused by the "implicit retry" behaviour in the lab controller's XML-RPC client code. Since those retries are unlikely to work, and make analysis of the underlying networking misbehaviour more difficult, we will disable them for the next 0.13 maintenance release.

On Gerrit: http://gerrit.beaker-project.org/#/c/2057

I updated the title to reflect the planned changes, as we're not going to do anything more sophisticated at this point. The only major surgery we're likely to consider for the command queue system is replacing it with a point-to-point messaging system like fedmsg, and that's well beyond the scope of this bug report.

These retries are actually needed to allow the lab controllers to handle deliberate restarts of the main web server and genuine network glitches for more remote labs. So, the current fix isn't appropriate, and needs to be replaced with one that keeps the retry mechanism, and just ensures the retries are logged appropriately.

Completed review of new and improved version on Gerrit: http://gerrit.beaker-project.org/#/c/2066/

Beaker 0.13.2 has been released. (http://beaker-project.org/docs/whats-new/release-0.13.html#beaker-0-13-2).
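For readers unfamiliar with the pattern, the approach settled on above (keep the retries, but make each one visible in the logs) can be illustrated with a minimal Python sketch. This is not the Beaker patch from Gerrit; the function name `call_with_logged_retries`, the logger name, and the `retries`, `delay` and `retry_on` parameters are all hypothetical and chosen only for illustration.

```python
import logging
import time

# Hypothetical logger name; the real lab controller uses its own logging setup.
logger = logging.getLogger("retry_sketch")


def call_with_logged_retries(func, *args, retries=3, delay=0.0, retry_on=(OSError,)):
    """Call func(*args), retrying on the given errors and logging every retry.

    Illustrative only: the names and defaults here are assumptions,
    not Beaker's actual XML-RPC client code.
    """
    name = getattr(func, "__name__", repr(func))
    attempt = 0
    while True:
        try:
            return func(*args)
        except retry_on as exc:
            attempt += 1
            if attempt > retries:
                # Out of retries: record the final failure and re-raise.
                logger.error("Giving up on %s after %d retries: %s",
                             name, retries, exc)
                raise
            # The point of the change: retries are no longer silent.
            logger.warning("Retrying %s (retry %d of %d) after error: %s",
                           name, attempt, retries, exc)
            time.sleep(delay)
```

With something like this in place, a retried call that triggers a "Command X already run" failure on the server could be matched against a preceding retry warning in the lab controller's log, rather than appearing as an unexplained duplicate request.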