Bug 1366098
Summary: | beaker-provision leaks Running commands if it retries a mark_command_running call which already succeeded | | |
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
Component: | lab controller | Assignee: | Jon Orris <jorris> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 23 | CC: | bmcivor, dcallagh, dowang, jorris, mjia, rjoost |
Target Milestone: | 24.0 | Keywords: | Patch |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2017-02-21 18:48:43 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Dan Callaghan
2016-08-11 03:50:35 UTC
It also doesn't help that the retries are not being shown in the logs anymore: bug 1366100.

---

(In reply to Dan Callaghan from comment #0)
> but I think we can actually do a better job of detecting and aborting these
> leaked commands.

So the problem is that the leaked Running commands are cleared when beaker-provision restarts, but that is often weeks or months apart.

We could make beaker-provision check for Running commands on each polling loop (the same way it checks for Queued commands that it can start to run), and if the server has any Running commands which beaker-provision does not know about, mark them as Aborted. That way, even though the command doesn't run properly because of the network glitch, it is Aborted straight away (along with its associated recipe, if any) instead of sitting there for weeks.

We also need to double-check that we are properly distinguishing between Failed and Aborted -- a Failed command should mark the system as broken but an Aborted one shouldn't, since an Aborted command represents a failure inside Beaker itself rather than an inability to power the system.

See also bug 1362371 and bug 1362369.
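A minimal sketch of the check described above. Every name here (FakeHub, get_running_command_ids, mark_command_aborted, abort_leaked_commands) is a hypothetical stand-in rather than the real beaker-provision or lab controller API:

    # Sketch only: all names are hypothetical stand-ins for
    # beaker-provision internals, not the real Beaker API.

    class FakeHub:
        """Stand-in for the XMLRPC proxy to the Beaker server."""
        def __init__(self, running_ids):
            self.running_ids = set(running_ids)
            self.aborted = []

        def get_running_command_ids(self):
            # The server's view: commands it believes are Running in this lab.
            return sorted(self.running_ids)

        def mark_command_aborted(self, command_id, message):
            self.running_ids.discard(command_id)
            self.aborted.append((command_id, message))

    def abort_leaked_commands(hub, owned_command_ids):
        """Run on every polling loop, alongside the existing check for
        Queued commands: abort any command the server thinks is Running
        but which this beaker-provision process does not actually own."""
        for command_id in hub.get_running_command_ids():
            if command_id not in owned_command_ids:
                # Aborted, not Failed: a leak is a failure inside Beaker,
                # so it must not mark the system as broken.
                hub.mark_command_aborted(
                    command_id, message='leaked: not owned by this process')

    hub = FakeHub(running_ids=[101, 102, 103])
    abort_leaked_commands(hub, owned_command_ids={101})
    print(hub.aborted)  # -> [(102, 'leaked: ...'), (103, 'leaked: ...')]

In the real daemon such a check would sit next to the existing poll for Queued commands, so a leaked command would be Aborted within one polling interval instead of waiting weeks for a restart.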
---

This is part of my war on leaked commands, to get the graph of Queued/Running commands down to zero where it belongs, as per bug 1362371 comment 2.

---

So in addition to purging leaked commands, we can handle these cases better on the lab controller side. How about instead of throwing an exception, the server responds with something like:

    False, 'Command 386291 already run'

That would make it cleaner for the client side to handle. We could also parse the returned exception message, but that's a lot messier IMO, and it isn't really an 'exception', unlike an authorization error.

---

I would be very wary of indicating success in the return value and making the caller check it. That's really what XMLRPC faults are for. Because this really *is* an error condition -- it's not supposed to happen. It sucks that the XMLRPC protocol doesn't give anything better than a string message to help the client programmatically detect certain error conditions... but that is just one of the reasons why we are trying to abandon XMLRPC.

But anyway, I don't think we actually want to handle that case, do we? If beaker-provision sees that error, it means things are already in a messed-up state and a command has been leaked, so having the greenlet crash with an exception in the logs seems reasonable. This bug is about what happens *after* that (namely, that the command stays Running forever). We just want beaker-provision to detect those leaked commands sooner, instead of when it gets restarted.

---

(In reply to Dan Callaghan from comment #5)
> I would be very wary of indicating success in the return value and making
> the caller check it. That's really what XMLRPC faults are for. Because this
> really *is* an error condition -- it's not supposed to happen.

It is an error condition, but isn't it one we're essentially expecting and designing for? If I understand correctly, the basic pattern, unwrapped from automatic retries, is:

    if command_is_queued:
        mark_command_running(id)
        # Timeout, try again
        mark_command_running(id)
        # Timeout, try again
        mark_command_running(id)
        # Succeeds, but raises ValueError since command is now running

In the third mark_command_running, it turns out the command has already been marked running by one of the previous tries. Shouldn't we just be continuing at this point, instead of crashing the greenlet?

> But anyway, I don't think we actually want to handle that case, do we? If
> beaker-provision sees that error, it means things are already in a messed-up
> state and a command has been leaked, so having the greenlet crash with an
> exception in the logs seems reasonable.

Maybe I'm not understanding this properly. How is the state messed up? Isn't it just in the state we want it to be in, after a network error prevented us from verifying it on the first call? If we handled it, wouldn't that help prevent leaked commands? Obviously we still want to handle the possibility of other leaked commands more frequently than provision restarts.

---

(In reply to Jon Orris from comment #6)
> It is an error condition, but isn't it one we're essentially expecting and
> designing for? If I understand correctly, the basic pattern, unwrapped from
> automatic retries, is:
>
>     if command_is_queued:
>         mark_command_running(id)
>         # Timeout, try again
>         mark_command_running(id)
>         # Timeout, try again
>         mark_command_running(id)
>         # Succeeds, but raises ValueError since command is now running
>
> In the third mark_command_running, it turns out the command has already been
> marked running by one of the previous tries. Shouldn't we just be continuing
> at this point, instead of crashing the greenlet?

No, I don't think we should. If we have a queued command, and it turns out the queued command is already running, it really is an error (the command is not in the state we thought it was in). We *could* assume that it's caused by this particular scenario (the server received an earlier mark_command_running call and successfully processed it, but the response was lost due to a network glitch and then we re-tried), but there are other possible causes, like someone running two beaker-provisions in the same lab by mistake, or some other failure condition we have not anticipated.

So I *don't* think it's safe to assume that the bad state is due to this (very rare) network glitch and then just run the command anyway -- even though the server has told us that someone else is already running it. I think it's better for the greenlet to just crash with an error message in the logs (as it does now). The idea here is just to do a bit of extra book-keeping to catch leaked commands faster, and give us a greater chance of figuring out when and why they are getting leaked.

There is an underlying issue here too, which is that the XMLRPC retry logic is implemented in a generic way that is totally oblivious to the actual semantics and behaviour of the call it's making -- so that logic is not capable of deciding, on subsequent tries, whether some particular response indicates that it already succeeded on a previous try, or whether it's just a completely unrelated error. Personally I think that's a bad approach: having explicit retries is more work, but it means that the retry logic can account for the semantics of each call (whether it's idempotent or not, whether it might have already succeeded). Although the problem is still the same in the end -- beaker-provision is trying to transition a command from Queued to Running (with the implication that it is taking responsibility for handling that command), so if that state transition fails because the command is already Running, it's not really safe to assume that it already owns the command.
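The distinction between generic and explicit retries can be made concrete with a sketch. This is illustrative only, not Beaker's actual code: retry_generic, retry_explicit, and may_have_succeeded are hypothetical names, and the fault-string matching shows exactly the fragility complained about above -- XMLRPC gives the client nothing better than a string to detect the condition programmatically:

    import socket
    import xmlrpc.client

    # Sketch of the two retry styles under discussion; all names are
    # hypothetical, not Beaker's actual retry code.

    def retry_generic(call, attempts=3):
        """Semantics-oblivious retry: on a timeout, just call again.
        If the first try actually succeeded on the server, a later try
        gets an 'already running' fault and the greenlet crashes."""
        last_error = None
        for _ in range(attempts):
            try:
                return call()
            except socket.timeout as e:
                last_error = e  # blind retry: no idea whether we already won
        raise last_error

    def retry_explicit(call, may_have_succeeded, attempts=3):
        """Per-call retry that knows the call's semantics. For the
        Queued -> Running transition, which means taking ownership of
        the command, may_have_succeeded must stay False: the fault
        could equally mean another beaker-provision owns the command."""
        timed_out = False
        last_error = None
        for _ in range(attempts):
            try:
                return call()
            except socket.timeout as e:
                timed_out, last_error = True, e
            except xmlrpc.client.Fault as fault:
                # Matching the fault string is the only way the client
                # can recognise this condition over XMLRPC.
                if ('already running' in fault.faultString
                        and timed_out and may_have_succeeded):
                    return  # assume our own earlier try took effect
                raise
        raise last_error

Even the explicit version cannot safely pass may_have_succeeded=True for mark_command_running, for the reason given above: the same fault could mean another process owns the command.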
---

(In reply to Dan Callaghan from comment #7)
> If we have a queued command, and it turns out the queued command is already
> running, it really is an error (the command is not in the state we thought
> it was in). We *could* assume that it's caused by this particular scenario
> (the server received an earlier mark_command_running call and successfully
> processed it, but the response was lost due to a network glitch and then we
> re-tried), but there are other possible causes, like someone running two
> beaker-provisions in the same lab by mistake, or some other failure
> condition we have not anticipated.

Ok, this is a point that I wasn't thinking of. If there's a possibility of a race condition for ownership, then 'let it crash' is the right course of action here.

---

This bug fix is included in beaker-server-24.0.git.210.01bf4bf, which is currently available for testing here: https://beaker-devel.app.eng.bos.redhat.com/

---

Beaker 24.0 has been released.