Bug 1375035

Summary: Machine is not reserved if a task is finished too quickly
Product: [Retired] Beaker
Reporter: Roman Joost <rjoost>
Component: scheduler
Assignee: Roman Joost <rjoost>
Status: CLOSED CURRENTRELEASE
QA Contact: tools-bugs <tools-bugs>
Severity: high
Priority: high
Version: 22
CC: dcallagh, dowang, mjia, rjoost
Target Milestone: 23.3
Keywords: Patch, Triaged
Last Closed: 2016-11-07 06:44:30 UTC
Type: Bug

Description Roman Joost 2016-09-12 01:37:35 UTC
Description of problem:

For a few users, jobs with reserve requests go straight to finished instead of actually reserving the machine.


Version-Release number of selected component (if applicable):

23.2


How reproducible:

100%

Steps to Reproduce:
(Take this with a grain of salt, since I don't fully understand the big picture here)

1. Create a job with a task (e.g. /distribution/command) 
2. Make sure the command finishes within a couple of seconds
3. Kill or pause beaker-watchdog in order to keep the recipe in a state of TaskStatus.waiting
4. The system should end up in a state where the recipe status is not installing or running. (see Server/bkr/server/model/scheduler.py:2509)


Actual results:
System goes straight to finished and is not reserved.

Expected results:
System is reserved.

Additional info:

Task completing in 24s:

server-debug.log.3:Sep  8 13:13:55 beaker-server beaker-server[52352]: bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.011501 recipes.tasks.start ('45404156', 0)
server-debug.log.3:Sep  8 13:14:19 beaker-server beaker-server[52510]: bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.008961 recipes.tasks.stop ('45404156', 'stop', 'OK')

On the lab controller:
watchdog.log.1.gz:Sep  8 13:17:13 lab-02 beaker-watchdog[19947]: bkr.labcontroller.proxy INFO Removed Monitor for labcontroller.beaker.example:3049406

Comment 2 Dan Callaghan 2016-09-12 05:47:42 UTC
(In reply to Roman Joost from comment #0)
> 3. Kill/Pause beaker-watchdog in order to keep the Recipe in a state of
> TaskStatus.waiting
> 4. The system should end up in a state where the recipe status in not
> installing or running. (see Server/bkr/server/model/scheduler.py:2509)

Right, so this only happens occasionally for us in production, because normally the sequence is:

* while Anaconda is installing, recipe status is Installing
* then, when Anaconda finishes installing and reboots, the next iteration of update_dirty_jobs will set recipe status to Waiting
* then, when the system has rebooted and beah starts the first task, the next iteration of update_dirty_jobs will set recipe status to Running
* finally, when beah finishes the final task in the recipe, the next iteration of update_dirty_jobs will set recipe status to Completed, or to Reserved if the user requested a reservation

This bug is a regression in 23.0 because the sequence above is new as of 23.0, due to the Installing status. Previously the status became Running as soon as Anaconda started and stayed that way until the end of the recipe.

The problem here is that line of code, which tests the recipe status against the Installing and Running states (but not Waiting). If there is only one task in the recipe and beah finishes it very quickly, there is only a very short window between beah starting the first task and beah stopping the last task (in the above example, 24 seconds). If beakerd doesn't finish a complete loop of update_dirty_jobs in that window, it never sets the recipe to Running, and the recipe hits this bug.
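
To make the race easier to see, here is a minimal Python sketch of it. This is a toy model, not Beaker's actual code: the function names, the `accepted` parameter, and the idea of a fixed `loop_seconds` for one pass of update_dirty_jobs are all assumptions made for illustration. It only captures the point above: the recorded status lags behind reality, and the reservation check does not accept Waiting.

```python
# Toy model of the race described above. All names here are hypothetical
# and do not correspond to Beaker's real scheduler code.
INSTALLING = "Installing"
WAITING = "Waiting"
RUNNING = "Running"
COMPLETED = "Completed"
RESERVED = "Reserved"


def recorded_status_at_finish(task_seconds, loop_seconds):
    """Status stored in the database when the final task stops.

    Simplifying assumption: the update loop records Running only if one
    complete pass (taking loop_seconds) fits between the first task
    starting and the last task stopping. Otherwise the recipe is still
    recorded as Waiting.
    """
    return RUNNING if task_seconds >= loop_seconds else WAITING


def final_status(recorded_status, reserve_requested,
                 accepted=(INSTALLING, RUNNING)):
    """Status chosen once all tasks are finished.

    The default `accepted` tuple mirrors the check described in the
    comment above: Installing or Running, but not Waiting.
    """
    if reserve_requested and recorded_status in accepted:
        return RESERVED
    return COMPLETED


if __name__ == "__main__":
    # 24 s task (as in the logs) with an assumed 60 s loop pass.
    print(final_status(recorded_status_at_finish(24, 60), True))
    # Same recipe with a 300 s task.
    print(final_status(recorded_status_at_finish(300, 60), True))
```

With the 24-second task from the logs and an assumed 60-second loop pass, the model skips the reservation; with a 300-second task it reserves. Passing `accepted=(INSTALLING, WAITING, RUNNING)` makes the short task reserve as well, which is one plausible shape for a fix but is not necessarily what the actual patch does.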

Comment 3 Dan Callaghan 2016-09-12 05:48:36 UTC
A workaround for this bug would be to make the tasks take slightly longer; even 5 minutes should be plenty of time. If the recipe has a single /distribution/command task then simply putting "; sleep 300" at the end of the command would be enough.
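
If the commands are generated by a script, the padding could be applied with a small helper along these lines. This is only a sketch: `pad_command` is a hypothetical name, and it assumes the command is a plain shell command line that does not end in `&`.

```python
def pad_command(command, seconds=300):
    """Append a sleep so the task lasts at least `seconds` seconds.

    Hypothetical helper: assumes `command` is a simple shell command
    line. A trailing ';' is dropped so the result does not contain ';;'.
    """
    base = command.rstrip().rstrip(";").rstrip()
    return f"{base}; sleep {seconds}"


if __name__ == "__main__":
    print(pad_command("echo hello"))
```

For example, `pad_command("echo hello")` returns `echo hello; sleep 300`, which can then be used as the command for the /distribution/command task.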

Comment 4 Roman Joost 2016-09-15 04:27:25 UTC
Patch available:

https://gerrit.beaker-project.org/#/c/5230/

Comment 7 Dan Callaghan 2016-11-07 06:44:30 UTC
Beaker 23.3 has been released.