Bug 1366098

Summary: beaker-provision leaks Running commands if it retries a mark_command_running call which already succeeded
Product: [Community] Beaker
Component: lab controller
Version: 23
Reporter: Dan Callaghan <dcallagh>
Assignee: Jon Orris <jorris>
QA Contact: tools-bugs <tools-bugs>
CC: bmcivor, dcallagh, dowang, jorris, mjia, rjoost
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Milestone: 24.0
Keywords: Patch
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-02-21 18:48:43 UTC

Description Dan Callaghan 2016-08-11 03:50:35 UTC
Sigh, distributed systems are hard...

Like all Beaker code, beaker-provision does a delay-and-retry loop when making XMLRPC requests, in case they fail with any kind of socket error (including timeouts). However, one possible failure mode is that the server handles the request successfully but a network glitch causes the client to time out while reading the response. The client treats this as a socket.timeout and retries, but since the request was already handled successfully, repeating it fails the server's consistency checks.
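For illustration, the failure mode can be reproduced with a generic delay-and-retry wrapper like the following sketch (the wrapper, its parameters, and the helper name are illustrative, not Beaker's actual implementation):

```python
import socket
import time

def retry_on_socket_error(func, attempts=3, delay=1):
    """Generic retry wrapper, oblivious to the call's semantics.

    If the server handled the request but the response was lost, the
    client sees socket.timeout and re-sends the same request, which can
    then fail the server's consistency checks.
    """
    for attempt in range(attempts):
        try:
            return func()
        except socket.error:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

With a call like mark_command_running, whose first attempt succeeds server-side but whose response is lost, the retry hits the consistency check and the wrapper surfaces the server's "already run" error instead of success.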

I observed this happening in beaker-provision logs. When the mark_command_complete call succeeds but is retried, you get:

Aug  8 07:39:38 lab-devel-02.example.com beaker-provision[27312]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x1821f50: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x1746990>>({'quiescent_period': 60, 'power': {'passwd': <repr, [<Greenlet at 0x1821af0: <bound method CommandQueu)> had unhandled exception: <Fault 1: "<type 'exceptions.ValueError'>:Command 386310 not running">

This error is in itself harmless, though: mark_command_complete is the very last step in handling a power command, so if the server already considers the command complete there is no problem.

However the same issue can affect the mark_command_running call before beaker-provision runs the command, which results in:

Aug  8 07:21:42 lab-devel-02.example.com beaker-provision[27312]: bkr.labcontroller.provision ERROR Command handler <Greenlet at 0x19417d0: <bound method CommandQueuePoller.handle of <bkr.labcontroller.provision.CommandQueuePoller object at 0x1746990>>({'quiescent_period': 5, 'power': {'passwd': <repr , [<Greenlet at 0x18215f0>, <Greenlet at 0x18214b0>,)> had unhandled exception: <Fault 1: "<type 'exceptions.ValueError'>:Command 386291 already run">

In this case, the greenlet handling the command dies, but the command is left in Running state on the server. So beaker-provision will not pick it up again (since the command is no longer Queued) but nothing will ever run it either. The command is effectively leaked, at least until the next time beaker-provision is restarted and it clears stale commands.

This is the same problem we hit in bug 974352, bug 974319, and earlier... but I think we can actually do a better job of detecting and aborting these leaked commands.

Comment 1 Dan Callaghan 2016-08-11 03:53:26 UTC
It also doesn't help that the retries are not being shown in logs anymore: bug 1366100.

Comment 2 Dan Callaghan 2016-08-11 03:57:34 UTC
(In reply to Dan Callaghan from comment #0)
> but I think we can actually do a better job of detecting and aborting these
> leaked commands.

So the problem is that the leaked Running commands are cleared when beaker-provision restarts, but that is often weeks or months apart.

We could make beaker-provision check for Running commands on each polling loop (the same way it checks for Queued commands that it can start running), and if the server has any Running commands which beaker-provision does not know about, mark them as Aborted. That way, even though the command doesn't run properly because of the network glitch, it (and its associated recipe, if any) is Aborted straight away instead of sitting there for weeks.
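A minimal sketch of that extra check, assuming hypothetical helper names (the abort callback and the ID sets would come from the real poller and server API):

```python
def abort_unknown_running_commands(server_running_ids, locally_handled_ids,
                                   mark_command_aborted):
    """On each polling loop, abort any command the server considers
    Running but that this beaker-provision instance is not handling.

    Such commands were leaked (e.g. by a retried mark_command_running
    call that had already succeeded) and would otherwise stay Running
    until the next daemon restart.
    """
    leaked = set(server_running_ids) - set(locally_handled_ids)
    for command_id in leaked:
        mark_command_aborted(command_id,
                             'Leaked command aborted by polling loop')
    return leaked
```

The important property is that the check runs every poll interval, so a leaked command is aborted within seconds rather than weeks.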

We also need to double-check that we properly distinguish between Failed and Aborted: a Failed command should mark the system as broken, but an Aborted one shouldn't, since Aborted represents a failure inside Beaker itself rather than an inability to power the system.

Comment 3 Dan Callaghan 2016-08-11 03:59:57 UTC
See also bug 1362371 and bug 1362369. This is part of my war on leaked commands, to get the graph of Queued/Running commands down to zero where it belongs, as per bug 1362371 comment 2.

Comment 4 Jon Orris 2016-10-18 22:26:14 UTC
So in addition to purging leaked commands, we can handle these cases better on the lab controller side.

How about instead of throwing an exception, the server responds with something like:

False, 'Command 386291 already run'

That would make it cleaner for the client side to handle. We could also parse the returned exception message, but that's a lot messier IMO, and it isn't really an 'exception', unlike an authorization error.

Comment 5 Dan Callaghan 2016-10-19 07:32:56 UTC
I would be very wary of indicating success in the return value and making the caller check it. That's really what XMLRPC faults are for. Because this really *is* an error condition -- it's not supposed to happen.

It sucks that the XMLRPC protocol doesn't give the client anything better than a string message for programmatically detecting certain error conditions... but that is just one of the reasons why we are trying to abandon XMLRPC.
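To illustrate the point: the only machine-readable data in an XMLRPC fault is the numeric code and the human-oriented fault string, so detecting a specific condition on the client comes down to fragile string matching. A sketch (not Beaker's actual client code; the helper name is made up):

```python
from xmlrpc.client import Fault

def is_already_run_fault(fault):
    # Fragile by design: the match breaks the moment the server
    # rewords its error message.
    return 'already run' in fault.faultString
```

This is exactly the kind of parsing Jon called "a lot messier" in comment 4.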

But anyway, I don't think we actually want to handle that case, do we? If beaker-provision sees that error, it means things are already in a messed up state and a command has been leaked, so having the greenlet crash with an exception in the logs seems reasonable.

This bug is about what happens *after* that (namely that the command stays Running forever). We just want beaker-provision to detect those leaked commands sooner, instead of when it gets restarted.

Comment 6 Jon Orris 2016-10-20 21:56:03 UTC
(In reply to Dan Callaghan from comment #5)
> I would be very wary of indicating success in the return value and making
> the caller check it. That's really what XMLRPC faults are for. Because this
> really *is* an error condition -- it's not supposed to happen.

It is an error condition, but isn't it one we're essentially expecting and designing for? If I understand correctly, the basic pattern, unwrapped from automatic retries, is:

if command_is_queued:
  mark_command_running(id)
  # Timeout, try again
  mark_command_running(id)
  # Timeout, try again
  mark_command_running(id)
  # Succeeds, but raises ValueError since command is now running

In the third mark_command_running, it turns out the command has already been marked running by one of the previous tries. Shouldn't we just be continuing at this point, instead of crashing the greenlet?

> But anyway, I don't think we actually want to handle that case, do we? If
> beaker-provision sees that error, it means things are already in a messed up
> state and a command has been leaked, so having the greenlet crash with an
> exception in the logs seems reasonable.

Maybe I'm not understanding this properly. How is the state messed up? Isn't it just in the state we want it to be in after a network error prevented us from verifying it on the first call? If we handled it, wouldn't it help prevent leaked commands?

Obviously we still want to handle the possibility of other leaked commands more frequently than at provision restart.

Comment 7 Dan Callaghan 2016-10-21 00:27:40 UTC
(In reply to Jon Orris from comment #6)
> It is an error condition, but isn't it one we're essentially expecting and
> designing for? If I understand correctly, the basic pattern, unwrapped from
> automatic retries, is:
> 
> if command_is_queued:
>   mark_command_running(id)
>   # Timeout, try again
>   mark_command_running(id)
>   # Timeout, try again
>   mark_command_running(id)
>   # Succeeds, but raises ValueError since command is now running
> 
> In the third mark_command_running, it turns out the command has already been
> marked running by one of the previous tries. Shouldn't we just be continuing
> at this point, instead of crashing the greenlet?

No I don't think we should.

If we have a queued command, and it turns out the queued command is already running, it really is an error (the command is not in the state we thought it was in). We *could* assume that it's caused by this particular scenario (the server received an earlier mark_command_running call and successfully processed it, but the response was lost due to a network glitch and then we retried), but there are other possible causes, like someone running two beaker-provisions in the same lab by mistake, or some other failure condition we have not anticipated.

So I *don't* think it's safe to assume that the bad state is due to this (very rare) network glitch and then just run the command anyway -- even though the server has told us that someone else is already running it. I think it's better for the greenlet to just crash with an error message in the logs (as it does now).

The idea here is just to do a bit of extra book-keeping to catch leaked commands faster and give us a greater chance of figuring out when and why they are getting leaked.

Comment 8 Dan Callaghan 2016-10-21 00:33:12 UTC
There is an underlying issue here too, which is that the XMLRPC retry logic is implemented in a generic way that is totally oblivious to the actual semantics and behaviour of the call it's making. That logic therefore cannot decide, on subsequent tries, whether a particular response means the call already succeeded on a previous try or is a completely unrelated error.

Personally I think that's a bad approach: explicit retries are more work, but they mean the retry logic can account for the semantics of each call (whether it's idempotent, whether it might have already succeeded).
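A sketch of what an explicit, semantics-aware retry might look like (the names and the ValueError mapping mirror the log messages above; this is illustrative, not the proposed fix, and per the discussion it still refuses to assume ownership):

```python
import socket
import time

def mark_running_with_retry(mark_command_running, command_id,
                            attempts=3, delay=1):
    """Per-call retry that knows mark_command_running is NOT idempotent.

    After a lost response, an "already run" error from the server might
    mean our own earlier try succeeded, but it might also mean another
    daemon owns the command, so the ambiguity is surfaced rather than
    silently assuming ownership.
    """
    saw_timeout = False
    for attempt in range(attempts):
        try:
            return mark_command_running(command_id)
        except socket.error:
            saw_timeout = True
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
        except ValueError:
            if saw_timeout:
                raise RuntimeError(
                    'Command %s reported already running after a lost '
                    'response; possibly a duplicate of our own request'
                    % command_id)
            raise
```

Unlike the generic wrapper, this version at least records that a response was lost, so the error it raises distinguishes "probably our own duplicate" from an out-of-the-blue conflict.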

Although the problem is still the same in the end -- beaker-provision is trying to transition a command from Queued to Running (with the implication that it is taking responsibility for handling that command) so if that state transition fails because the command is already Running it's not really safe to assume that it already owns the command.

Comment 9 Jon Orris 2016-10-21 16:26:58 UTC
(In reply to Dan Callaghan from comment #7)
> If we have a queued command, and it turns out the queued command is already
> running, it really is an error (command is not in the state we thought it
> was in). We *could* assume that it's caused by this particular scenario
> (server received an earlier mark_command_running call and successfully
> processed it by the response was lost due to a network glitch and then we
> re-tried) but there are other possible causes, like someone is running two
> beaker-provisions in the same lab by mistake, or some other failure
> condition we have not anticipated.

Ok, this is a point that I wasn't thinking of. If there's a possibility of a race condition for ownership, then 'let it crash' is the right course of action here.

Comment 10 Jon Orris 2016-11-15 21:36:45 UTC
https://gerrit.beaker-project.org/#/c/5348/

Comment 11 Roman Joost 2016-12-13 07:06:28 UTC
This bug fix is included in beaker-server-24.0.git.210.01bf4bf which is currently available for testing here:

https://beaker-devel.app.eng.bos.redhat.com/

Comment 13 Dan Callaghan 2017-02-21 18:48:43 UTC
Beaker 24.0 has been released.