974352 – Log implicit XML-RPC retries on lab controller

Summary: Log implicit XML-RPC retries on lab controller

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	lab controller
Sub Component:
Version:	0.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	0.13.x
Assignee:	Nick Coghlan
QA Contact:	tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-06-14 04:13 UTC by Nick Coghlan
Modified:	2018-02-06 00:41 UTC
CC List:	7 users
Fixed In Version:
Clone Of:	974319
Environment:
Last Closed:	2013-07-11 02:44:42 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	974319	0	unspecified	CLOSED	clear_running_commands XML-RPC call fails with MemoryError	2021-02-22 00:41:40 UTC

Internal Links: 974319

Description Nick Coghlan 2013-06-14 04:13:07 UTC

+++ This bug was initially created as a clone of Bug #974319 +++

Description of problem:

beaker-provision was restarted with 51 Running operations in the command queue. 
It failed to restart because the "clear_running_commands" call back to the main server was failing with MemoryError.

Further investigation showed that the supposedly "Running" commands were up to 3 weeks old. This is almost certainly because their status was not updated correctly when a failure occurred while updating the command status in mark_command_completed or mark_command_failed.

Version-Release number of selected component (if applicable):

0.12.1

How reproducible:

Timing related (requires the callback or command status update to fail, rolling back the transaction)

Steps to Reproduce:
1.
2.
3.

Actual results:

Commands are left in "Running" state (unless/until cleared out by clear_running_commands)

Expected results:

Running commands are marked as Completed or Failed as appropriate.

Additional info:

Comment 1 Nick Coghlan 2013-06-20 04:56:42 UTC

Analysis of the available logs suggests that this failure mode occurs when the "mark_command_running" operation succeeds on the server, and then one of the following happens:

1. The lab controller fails to receive the reply and thus assumes the request failed and either retries (giving a "Command X already run" failure) or else abandons the operation.
2. The lab controller receives the reply and handles the command, but the subsequent call to "mark_command_completed" or "mark_command_failed" doesn't work.

Thus, this appears to be a secondary failure that arises only in the presence of network stability issues for the main server, the lab controller or both.
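
For illustration only, the first scenario above can be reproduced with a small in-memory simulation. This is a sketch, not Beaker code: FakeServer, call_with_lost_reply and the "Queued" status name are hypothetical stand-ins, and only the "mark_command_running" name and the "Command X already run" message come from the analysis above.

```python
class FakeServer:
    """Hypothetical stand-in for the main server's command queue."""

    def __init__(self):
        self.status = {}  # command id -> status string

    def mark_command_running(self, command_id):
        # Only a queued command may move to Running (assumed rule).
        if self.status.get(command_id) != "Queued":
            raise ValueError("Command %s already run" % command_id)
        self.status[command_id] = "Running"
        return True


def call_with_lost_reply(server, command_id):
    """Apply the call on the server, then lose the reply on the way back."""
    server.mark_command_running(command_id)
    raise ConnectionError("reply lost")


server = FakeServer()
server.status[42] = "Queued"

try:
    call_with_lost_reply(server, 42)
except ConnectionError:
    # The client cannot tell whether the server processed the request,
    # so it retries the same call.
    try:
        server.mark_command_running(42)
        retry_error = None
    except ValueError as exc:
        retry_error = str(exc)

print(retry_error)
print(server.status[42])
```

The retry fails even though the original request succeeded, and if the client then abandons the operation the command stays in "Running".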

The available logs are currently confused by the "implicit retry" behaviour in the lab controller's XML-RPC client code. Since those retries are unlikely to work, and make analysis of the underlying networking misbehaviour more difficult, we will disable them for the next 0.13 maintenance release.

Comment 4 Nick Coghlan 2013-06-25 04:40:00 UTC

On Gerrit: http://gerrit.beaker-project.org/#/c/2057

I updated the title to reflect the planned changes, as we're not going to do anything more sophisticated at this point. The only major surgery we're likely to consider for the command queue system is replacing it with a point-to-point messaging system like fedmsg, and that's well beyond the scope of this bug report.

Comment 6 Nick Coghlan 2013-06-27 03:23:39 UTC

These retries are actually needed to allow the lab controllers to handle deliberate restarts of the main web server and genuine network glitches for more remote labs.

So the current fix isn't appropriate. It needs to be replaced with one that keeps the retry mechanism and just ensures the retries are logged appropriately.
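
As a rough sketch of the idea (keep retrying, but log every retry), something along these lines would do. This is not the patch on Gerrit: call_with_logged_retries, its parameters, the logger name and the flaky_call test function are all hypothetical, and the exception types worth retrying on are an assumption.

```python
import logging
import time

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("xmlrpc_retry_sketch")


def call_with_logged_retries(func, *args, attempts=3, delay=0.0,
                             retry_on=(OSError,)):
    """Call func(*args), retrying on the given errors and logging each retry."""
    name = getattr(func, "__name__", repr(func))
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except retry_on as exc:
            if attempt == attempts:
                # Out of attempts: record the final failure and re-raise.
                logger.error("%s failed after %d attempts: %s",
                             name, attempts, exc)
                raise
            # The retry is no longer silent; it shows up in the log.
            logger.warning("%s failed on attempt %d of %d (%s), retrying",
                           name, attempt, attempts, exc)
            time.sleep(delay)


# Simulated call that fails twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network glitch")
    return "ok"


result = call_with_logged_retries(flaky_call)
```

With a handler configured, the two failed attempts each produce a warning, so a later reader of the logs can distinguish a call that worked first time from one that only worked after retries.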

Comment 7 Nick Coghlan 2013-07-01 09:25:29 UTC

Completed review of new and improved version on Gerrit:
http://gerrit.beaker-project.org/#/c/2066/

Comment 10 Amit Saha 2013-07-11 02:44:42 UTC

Beaker 0.13.2 has been released. (http://beaker-project.org/docs/whats-new/release-0.13.html#beaker-0-13-2).
