Bug 807237

Summary: Job still running when all of its parts are completed
Product: [Retired] Beaker
Reporter: Jiri Pospisil <jpospisi>
Component: scheduler
Assignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Version: 0.8
CC: asaha, azelinka, bpeck, dahorak, dcallagh, ebenes, jbrier, jhutar, jkortus, jmikulka, jpazdziora, llim, mcsontos, psklenar, qwan, rglasz, rmancy, stl
Target Milestone: 0.12
Hardware: Unspecified
OS: Unspecified
Whiteboard: Scheduler
Doc Type: Bug Fix
Last Closed: 2013-04-11 04:57:19 UTC
Bug Blocks: 593663, 903935
Attachments:
uncancelled screenshot of job
screenshot of webui showing unfinished job with all finished tasks

Description Jiri Pospisil 2012-03-27 10:44:42 UTC
Description of problem:
My job ID 209811 consists of four recipe sets. All of those sets finished yesterday - their progress is 100% and their state is "Completed". Nevertheless, the state of the whole job remains "Running" with a progress of 87%, which is not the arithmetic mean of the recipe sets' progress.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Go to https://beaker.engineering.redhat.com/jobs/209811
  
Actual results:
Job is running, progress of job is 87%.

Expected results:
Job is completed, progress of job is 100%.

Additional info:

Comment 1 John Brier 2012-05-18 13:41:52 UTC
I have run into this issue as well. Before I cancel the job to remove the hold on the machines, I have taken a screenshot of the browser so you can see exactly what I mean. I'll attach it to the ticket.

The job in question:

https://beaker.engineering.redhat.com/jobs/232734

Comment 2 John Brier 2012-05-18 13:42:32 UTC
Created attachment 585409 [details]
uncancelled screenshot of job

Comment 3 John Brier 2012-10-04 14:27:35 UTC
Created attachment 621676 [details]
screenshot of webui showing unfinished job with all finished tasks

Comment 4 John Brier 2012-10-04 14:28:51 UTC
I think this bug might be a duplicate of bug 741201

Comment 5 Dan Callaghan 2012-10-25 05:19:41 UTC
*** Bug 853005 has been marked as a duplicate of this bug. ***

Comment 6 Dan Callaghan 2012-10-25 05:19:45 UTC
*** Bug 864136 has been marked as a duplicate of this bug. ***

Comment 7 Dan Callaghan 2012-10-25 05:19:49 UTC
*** Bug 822887 has been marked as a duplicate of this bug. ***

Comment 8 Dan Callaghan 2012-10-25 05:20:56 UTC
(In reply to comment #4)

I'm not completely convinced it's the same issue as bug 741201, but it might be related. Either way, I think we have a problem with the update_status and _bubble_up/_bubble_down methods...

Comment 9 Dan Callaghan 2012-10-29 05:14:28 UTC
*** Bug 800780 has been marked as a duplicate of this bug. ***

Comment 10 Nick Coghlan 2012-11-26 01:29:26 UTC
Capturing a discussion dcallagh and I just had regarding this: there's definitely a race condition at least in cases where the final stage of a job has two or more recipes running in parallel in the same recipe set.

This wouldn't be an issue except that, for performance reasons, we don't calculate the completion status of the overall recipe, recipe set and job at read time, but only when the completion status of one of the underlying recipe tasks changes.

The race condition is that the status change at the task level and the updating of the cached information at the higher levels happen in the *same transaction*. This means that if the final two recipe tasks in the job complete at roughly the same time, the status update transactions may run in parallel and both come to the conclusion that there is at least one other task still running, and thus leave the higher level components marked as incomplete.

A more robust approach would be to split the "recipe task status update" and "recipe task status propagation to higher level components" steps into *separate* transactions. That way, at least one of the status propagation transactions will correctly see that all the recipe tasks have been completed, and thus propagate the correct status to higher levels.

Comment 11 Nick Coghlan 2012-11-26 01:31:28 UTC
*** Bug 741201 has been marked as a duplicate of this bug. ***

Comment 12 Raymond Mancy 2012-11-26 03:37:42 UTC
(In reply to comment #10)
> A more robust approach would be to split the "recipe task status update" and
> "recipe task status propagation to higher level components" steps into
> *separate* transactions. That way, at least one of the status propagation
> transactions will correctly see that all the recipe tasks have been
> completed, and thus propagate the correct status to higher levels.

So this means you would be relying on every recipe checking every other recipe's task's status?

Comment 13 Nick Coghlan 2012-11-26 05:03:28 UTC
Interesting point - it looks like we need three transactions rather than two to ensure we don't miss an update to the cached status at the higher levels.

At the task -> recipe level, the tasks within the recipe execute in sequence, so there's no race condition. Thus it's OK if the first transaction handles the whole process of updating both the task and recipe status, as well as releasing the system back to the pool. If releasing the system back to the pool fails for some reason, then both the task and the recipe will remain flagged as incomplete.

The second transaction would handle the recipe set status update, to avoid the race condition between multiple recipes finishing at the same time. It would only check the recipe level statuses - the serialisation of the recipe level transactions ensures that one of the racing updates will go second and see that the other one has also finished.

The recipe set -> job propagation would then need to be done in a third transaction to handle the case of jobs that contain multiple recipe sets. If the recipe -> recipe set -> job updates were all handled in a single transaction, we'd still have a race condition at this layer when two recipe sets finished at the same time.
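
To make the proposed split concrete, here is a minimal sketch of the three-transaction flow, written against a SQLAlchemy 1.4+ style session factory. RecipeTask/RecipeSet/Job and the recalculate_status()/release_system() helpers are illustrative stand-ins, not the actual Beaker model API.

from sqlalchemy.orm import sessionmaker

Session = sessionmaker()  # assumed to be bound to the Beaker database engine

def finish_recipe_task(task_id, new_status):
    # Transaction 1: task -> recipe. Tasks within a recipe run in sequence,
    # so there is no race at this level; updating the recipe and releasing
    # the system in the same transaction is safe.
    with Session.begin() as session:
        task = session.query(RecipeTask).get(task_id)
        task.status = new_status
        task.recipe.recalculate_status()
        task.recipe.release_system()
        recipe_set_id = task.recipe.recipe_set_id
        job_id = task.recipe.recipe_set.job_id

    # Transaction 2: recipe -> recipe set. This runs against *committed*
    # recipe statuses, so whichever of two racing recipe updates commits
    # second will see that the other recipe has finished.
    with Session.begin() as session:
        session.query(RecipeSet).get(recipe_set_id).recalculate_status()

    # Transaction 3: recipe set -> job, for the same reason one level up.
    with Session.begin() as session:
        session.query(Job).get(job_id).recalculate_status()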

Comment 17 Raymond Mancy 2013-01-23 09:47:48 UTC
I was thinking how changing our isolation levels (or locking) might be the simplest solution to this problem. ncoghlan confirmed he was thinking along the same lines.


My initial thoughts are:

(In reply to comment #13)
> Interesting point - it looks like we need three transactions rather than two
> to ensure we don't miss an update to the cached status at the higher levels.
> 
> At the task -> recipe level, the tasks within the recipe execute in
> sequence, so there's no race condition. Thus it's OK if the first
> transaction handles the whole process of updating both the task and recipe
> status, as well as releasing the system back to the pool. If releasing the
> system back to the pool fails for some reason, then both the task and the
> recipe will remain flagged as incomplete.
> 
> The second transaction would handle the recipe set status update, to avoid
> the race condition between multiple recipes finishing at the same time. It
> would only check the recipe level statuses - the serialisation of the recipe
> level transactions ensures that one of the racing updates will go second and
> see that the other one has also finished.
> 
> The recipe set -> job propagation would then need to be done in a third
> transaction to handle the case of jobs that contain multiple recipe sets. If
> the recipe -> recipe set -> job updates were all handled in a single
> transaction, we'd still have a race condition at this layer when two recipe
> sets finished at the same time.

This problem has popped up again. I was thinking today of a potential solution:

1. Transaction for updating the status of the recipe, with a write-lock select on the recipe to be updated. This should mean that other transactions will block waiting for this lock to be released.

2. Transaction for updating the status of the recipeset, with shared-lock selects on its member recipes and a write lock on itself. This means that if any of its member recipes are being updated it will block waiting for them to release the lock; once the lock is released (and the transaction committed), the shared lock ensures we get the latest data despite the fact that we are at the REPEATABLE READ isolation level.

3. Transaction for updating the job, with shared-lock selects on its recipesets, same principle as above.


I haven't really looked into how easily sqlalchemy will let us do this, although I see there is a Query.with_lockmode() that may help.
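
As a rough illustration of step 2, something along these lines might work with the Query.with_lockmode() API mentioned above. Recipe/RecipeSet and the status values are stand-ins for the real model, and treating "Completed" as the only finished state is a simplification.

def update_recipe_set_status(session, recipe_set_id):
    # Write lock on the recipe set itself: SELECT ... FOR UPDATE
    recipe_set = (session.query(RecipeSet)
                  .filter(RecipeSet.id == recipe_set_id)
                  .with_lockmode('update')
                  .one())
    # Shared locks on the member recipes: LOCK IN SHARE MODE on MySQL,
    # which forces a fresh read even under REPEATABLE READ
    recipes = (session.query(Recipe)
               .filter(Recipe.recipe_set_id == recipe_set_id)
               .with_lockmode('read')
               .all())
    if all(recipe.status == 'Completed' for recipe in recipes):
        recipe_set.status = 'Completed'
    session.commit()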

Comment 18 Dan Callaghan 2013-01-24 01:47:36 UTC
I think all this stuff about multiple transactions and fiddling with isolation modes is overkill and unnecessary. SELECT ... FOR UPDATE gives us everything we need to do these updates in a race-free way.

The problem is that we are updating the state of the job and all its subcomponents, and the new state value depends on the current state of all those subcomponents (plus whatever change we are trying to make). So, the standard race-free way to do that is to SELECT ... FOR UPDATE all the values we need to know, then compute the new values, then issue UPDATEs for them. To avoid deadlocks it is also important to SELECT ... FOR UPDATE the same set of rows *in the same order* every time. So a simplified version is something like:

SELECT ...
FROM recipe_task
INNER JOIN recipe ON recipe_task.recipe_id = recipe.id
INNER JOIN recipe_set ON recipe.recipe_set_id = recipe_set.id
WHERE recipe_set.job_id = 1234
FOR UPDATE;
-- do some computation in Python land,
-- then update the rows we have already locked:
UPDATE recipe_task
SET status = 'Whatever'...

The problem we have is how to make this happen in our code. Right now the whole _bubble_up/_bubble_down business makes this very hard to achieve, particularly the part about always locking the same rows in the same order which is crucial to avoid deadlocks.

Comment 19 Dan Callaghan 2013-01-24 01:49:15 UTC
(In reply to comment #18)

I already wrote a similar thing in bug 715226, which is a different problem but with the same underlying cause (racey status updates with _bubble_up/_bubble_down).

Comment 20 Nick Coghlan 2013-01-24 01:56:44 UTC
The problem isn't really SQLAlchemy, it's the current structure of our own "bubble up" and "bubble down" code in the job/recipeset/recipe/task data model that makes it tricky to isolate the levels properly when interacting with the database. Since that code also has to handle the case of aborts correctly, there's some risk of introducing *new* problems if we alter it too much without correctly accounting for *all* of the cases that it needs to handle. (And, as we've seen with this issue, the potential for race conditions makes this part of the code hard to test comprehensively.)

Dan's suggestion above sounds like a promising approach to investigate, but we also have an inelegant-but-simple fallback option we should keep in mind.

Specifically, we already use the main scheduling loop to abort queued recipes that rely on a distro tree that has since been removed from Beaker [1]. We could add a similar check that looked for incomplete recipesets where all recipes were marked complete, as well as incomplete jobs where all recipesets were marked complete, and updated the status of the parent recipeset or job appropriately.

[1] http://git.beaker-project.org/cgit/beaker/tree/Server/bkr/server/tools/beakerd.py?h=release-0.11#n277
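
A very rough sketch of what that fallback check might look like; the model attributes, status values and recalculate_status() helper are assumed names, not taken from beakerd:

def fix_stuck_recipesets(session):
    # Recipe sets still flagged as incomplete although every member recipe
    # has reached a finished state.
    unfinished = session.query(RecipeSet).filter(
            RecipeSet.status.in_(['Running', 'Waiting'])).all()
    for recipe_set in unfinished:
        if all(recipe.status in ('Completed', 'Cancelled', 'Aborted')
               for recipe in recipe_set.recipes):
            recipe_set.recalculate_status()  # re-derive the cached status
    session.commit()
    # An analogous check would cover incomplete jobs whose recipe sets
    # have all finished.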

Comment 22 Nick Coghlan 2013-01-24 05:01:18 UTC
I think I've figured out a way to *force* the race condition to be encountered in an automated test, which should make this much easier to resolve (as well as test other similar cases).

Specifically:

1. Create shared threading.Event objects:
- one for "enter transaction"
- one for "exit transaction"
- plus an "updated" event per worker, so the test can tell when each worker is ready to commit

2. Define a function that updates the state of an object inside a DB transaction along the following lines:

    def parallel_update(update_state, enter, updated, finish):
        enter.wait()        # wait for the shared "enter transaction" event
        with transaction:   # i.e. whatever opens a DB transaction here
            update_state()  # do our state change
            updated.set()   # signal that we are ready to commit
            finish.wait()   # wait for the shared "exit transaction" event before committing

3. Create a test case that creates a job with one recipe set and two recipes, and then uses the above construct to force both recipes to "finish" at the same time. Check both the recipe set and job are correctly marked as completed.

4. Create a test case that creates a job with two recipe sets and one recipe each, and then uses the above construct to force both recipes to "finish" at the same time. Check both the recipe sets and the overall job are correctly marked as completed.
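
For example, the driver for point 3 could look roughly like this; finish_recipe(), recipe_a and recipe_b are hypothetical test fixtures:

import threading

enter, finish = threading.Event(), threading.Event()
updated_a, updated_b = threading.Event(), threading.Event()

worker_a = threading.Thread(target=parallel_update,
        args=(lambda: finish_recipe(recipe_a), enter, updated_a, finish))
worker_b = threading.Thread(target=parallel_update,
        args=(lambda: finish_recipe(recipe_b), enter, updated_b, finish))
worker_a.start()
worker_b.start()

enter.set()        # let both transactions begin
updated_a.wait()   # wait until both workers have applied their state change
updated_b.wait()
finish.set()       # now let both transactions commit at (roughly) the same time
worker_a.join()
worker_b.join()
# ...then assert that the recipe set and the job are marked Completed.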

Comment 23 Dan Callaghan 2013-01-24 05:45:47 UTC
After some discussion offline: the first step should be to try my suggestion in comment 18 (SELECT FOR UPDATE the entire job every time). Ray's idea of updating each layer of the hierarchy in a separate transaction would work too, but if a subsequent transaction fails it leaves the db in an inconsistent state so this approach also requires a cleanup loop in beakerd (or elsewhere) to catch and fix jobs that are inconsistent. Ray's approach might be necessary though, to avoid bottlenecks on very large jobs.

Either way, _bubble_up/_bubble_down is problematic and needs to be changed.

Comment 24 Nick Coghlan 2013-01-24 05:48:12 UTC
+1, sounds like a plan.

Either way, the threading.Event based testing scheme I describe above should let us break the current _bubble_up/down implementation reliably in the test suite, so we can have a lot more confidence that we've resolved the original problem.

Comment 25 Nick Coghlan 2013-01-31 08:19:51 UTC
Ray figured out that you don't actually need to use threads to trigger the race condition - you can use non-scope-controlled sessions in order to have two different transactions open in a single thread (with a bit of monkey-patching to get the Beaker model code to use the custom sessions instead of the default thread-local one).
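
The core of that trick is just creating two plain, non-scoped SQLAlchemy sessions from the same engine, something like the following (the engine name is assumed):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)  # engine: the test database engine
session_one = Session()
session_two = Session()
# Each session has its own independent transaction, so the model code can be
# pointed at session_one for one recipe's status update and session_two for
# the other, and the two commits can then be interleaved to reproduce the race.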

Comment 26 Jaroslav Kortus 2013-02-06 11:24:20 UTC
From my experience, solving the race was very problematic. I ended up with SELECT FOR UPDATE and then realized that I needed subtransactions, and the whole code started to look a bit messy (and did not solve it in all cases after all).

What I did was change it so that the recipe returns the system when it considers itself finished. That was the only serious problem I had, because the machines were blocked forever. They were originally returned from the recipeset, which already suffers from the race (VERY easy to hit in multihost tests).

With that update the machines were returned, and even though the UI was showing a <100% score, it was still processing recipes in the queue and was no longer blocking. Occasionally I'd run a short Python script to sync it.

All that the syncing script does (https://bugzilla.redhat.com/show_bug.cgi?id=903442#c4) is effectively following the bubble_down chain down from job level (correct me if I'm wrong here).

So my idea of approaching this racy condition was that there would be no bubble_up, but updates would always be triggered from the top level (the job). There could be another loop in beakerd (like the ones that process, schedule and queue the recipes) that would update the scores.

Would you see any drawbacks in this approach?

Comment 27 Raymond Mancy 2013-02-07 23:02:06 UTC
(In reply to comment #26)
> From my experience solving the race was very problematic. I ended up with
> SELECT FOR UPDATE and then realized that I needed subtransactions and the
> whole code started to look a bit messy (and did not solve it in all cases
> after all).
> 
> What I did was a change where recipe returns the system when it considers
> itself finished.

We hold onto recipe systems until the recipeset is completed, for the host/guest scenario. Otherwise the host being released and reprovisioned would of course cause trouble for the current recipe on the guest.

> That was the only serious problem I had, because the
> machines were blocked forever. They were originally returned from recipeset
> which already suffers by the race (VERY easy to hit in multihost tests).
> 
> With that update the machines were returned and even though the UI was
> showing some <100% score, it was still processing recipes in the queue and
> was no longer blocking. Occasionaly I'd run a short python script to sync it.
> 
> All that the syncing script does
> (https://bugzilla.redhat.com/show_bug.cgi?id=903442#c4) is effectively
> following the bubble_down chain down from job level (correct me if I'm wrong
> here).
> 

That's correct. It _should_ work, but although simple, it would be a bit of overkill for us. In 903442 we only target those specific recipesets that have the problem.

> So my idea of approaching this racy condition was that there would be no
> bubble_up, but it would always be triggered from top level (job). There
> could be another loop in beakerd (like the ones that process, schedule and
> queue the recipes) that would update the scores.
> 
> Would you see any drawbacks in this approach?

On the surface it seems like a good idea, and better yet, a simple solution. However, there are a couple of important drawbacks with this approach.

The only time you would actually be triggering it from the top level is if we are performing an action at the top level. I think the only time we would be doing this is if we 'Cancel' a job manually. So for the most part status would be updated via the job by the beakerd thread, and this would mean a lot of jobs going through the bubble_down process regardless of whether they have statuses to update. It also means locking a lot of rows for potentially quite a long time (say if we are releasing systems, although we really need to separate that action out from the update logic).

More noticeably, it also means we could be waiting quite a while between when a task has finished and when the GUI actually tells us that it's finished.

Only doing bubble_up would seem to make more sense, as recipe tasks are where status updates originate anyway. The only problem is that we need to allow people to cancel jobs etc., which means that there would still have to be a bubble_down (and, as Dan points out, this could cause deadlock issues).

Comment 28 Jaroslav Kortus 2013-02-08 09:38:22 UTC
> > All that the syncing script does
> > (https://bugzilla.redhat.com/show_bug.cgi?id=903442#c4) is effectively
> > following the bubble_down chain down from job level (correct me if I'm wrong
> > here).
> > 
> 
> That's correct. It _should_ work, and although simple, for us it would be a
> bit of overkill. In 903442 we only target those specific recipesets that
> have the problem.

It's overkill now as it goes through all jobs. That works well in my case, where there are just thousands of them :). In your instances this might become a problem, but that could be easily solved by an additional flag on the job on which we could select (finished = 0 or state = running; I don't have the model in front of me right now).

> 
> > So my idea of approaching this racy condition was that there would be no
> > bubble_up, but it would always be triggered from top level (job). There
> > could be another loop in beakerd (like the ones that process, schedule and
> > queue the recipes) that would update the scores.
> > 
> > Would you see any drawbacks in this approach?
> 
> On the surface it seems like a good idea, and better yet, a simple solution.
> However, there are a couple of important drawbacks with this approach.
> 
> The only time you would actually be triggering it from the top level is if
> we are performing an action at the top level. I think the only time we would
> be doing this is if we 'Cancel' a job manually. So for the most part status
> would be updated via the job by the beakerd thread and this would mean a lot
> of jobs are going through bubble_down process regardless of whether they
> have statuses to update. It also means locking a lot of rows for potentially
> quite a long time (say if we are releasing systems, although we really need
> to seperate that action out from the update logic). 
> 
> More noticeably it also means we could be waiting quite a while between when
> a task has finished, to when the GUI actually tells us that its finished.

Currently there are hundreds of jobs running that would have to be regularly examined. IMHO the locks required (automatic, right? I did not notice any explicit locking in the code) are more acceptable than having a scheduling deadlock.

> 
> Only doing bubble_up would seem to make more sense. As recipe tasks are
> where status updates originate from anyway. The only problem is that we need
> to allow people to cancel jobs etc, which means that there would still have
> to be a bubble_down (and as Dan points out this could cause deadlock issues).

bubble_up is the racy one. If we get rid of it, all problems are gone (well, at least the old problems :)). It's racy because two recipes will try updating one recipeset (and from there it goes up to the job). Having one place to start the update would be better. We can make the process less "hungry" by making the right choices (and maybe extending the model to allow us to make them better).

Comment 29 Jaroslav Kortus 2013-02-08 09:40:45 UTC
> We hold onto recipe systems until the recipeset is completed for the
> host/guest
> scenario. Otherwise the host being released and reprovisioned would of
> course cause trouble for current recipe on the guest.

What about a check so that if there are guest recipes in the set, returning the systems is left to the recipeset, and if not, the recipe returns its system as soon as it's finished?

It would also require a check in the recipeset whether the systems have already been returned.

Comment 31 Dan Callaghan 2013-03-20 05:31:40 UTC
*** Bug 715226 has been marked as a duplicate of this bug. ***

Comment 32 Dan Callaghan 2013-03-20 05:50:49 UTC
My patch for this bug is almost done, I just need to finish testing it fully and write some release notes.

The first thing I did was to remove the _bubble_up/_bubble_down complexity, and changed everything so that status updates *always* happen at the first unfinished task in each recipe (regardless of where the update started from). I also made the status updates execute conditional UPDATE statements (UPDATE recipe_task SET status = 'Completed' WHERE id = 123 AND status = 'Running') so that we can detect stale data in the transaction. That is enough to solve the related bug 715226, since it prevents concurrent transactions against the same recipe from trampling on each other.

However it still doesn't solve this bug, which is when two *different* recipes are both completed in two different transactions. Then each transaction uses a stale view of the other recipe status when updating the job status.

I experimented with various combinations of SELECT FOR UPDATE but they all ended up either prone to deadlocks, or completely killed performance for task status updates.

After discussing it with rmancy, the approach I have taken (one which has been floated before but never implemented) is to make status updates happen *only* at the task level, and also mark the job as "dirty". A new thread in beakerd will loop over dirty jobs and update their state based on the current task statuses. So we get good throughput for task updates (which is important for large jobs) while still ensuring the overall job state will eventually be consistent without races or deadlocks. The trade-off is that the status, result, and progress bars for jobs/recipes will now always lag behind their "true" state. Hopefully the lag is always < 20 seconds.
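
In outline, the new loop would look something like this; the is_dirty column and update_status() helper are illustrative names here, not the actual patch:

def update_dirty_jobs(session):
    # Jobs whose cached status may be stale because a task status changed.
    dirty_job_ids = [job_id for (job_id,) in
                     session.query(Job.id).filter(Job.is_dirty == True)]
    for job_id in dirty_job_ids:
        job = session.query(Job).get(job_id)
        job.update_status()   # recompute recipe, recipe set and job status from the tasks
        job.is_dirty = False  # mark clean once the cached status is consistent again
        session.commit()      # one transaction per job keeps the lock scope small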

Comment 33 Dan Callaghan 2013-03-20 05:53:27 UTC
(In reply to comment #32)

In other words, more or less the solution suggested by Jaroslav in comment 26. :-)

Comment 34 Dan Callaghan 2013-03-20 07:41:36 UTC
On Gerrit: http://gerrit.beaker-project.org/1808

Jaroslav, I'd appreciate your comments on this patch too, since you have been giving this problem a lot of thought. You can comment in Gerrit, just sign in using any OpenID provider (e.g. the Fedora one).

Comment 37 Raymond Mancy 2013-04-04 06:04:36 UTC
Can't verify this fix, too hard.

Comment 38 Jaroslav Kortus 2013-04-04 12:12:25 UTC
Reproducing it is pretty easy, ideally with virtual machines. If you set up the environment as described on beaker-project.org and schedule a multihost test, you are very, very likely to hit it (as the machines will be returned all at once). The more, the better :).

Comment 39 Dan Callaghan 2013-04-11 04:57:19 UTC
Beaker 0.12 has been released.