Description of problem:
If beakerd is taking a particularly long time to process dirty jobs, beaker-watchdog can end up attempting to expire the same watchdog multiple times. Worse still, if any of the expire calls after the first one happens to coincide with beakerd processing the dirty job it will cause a 500 with StaleDataError.

bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.proxy INFO Entering active_watchdogs
bkr.labcontroller.proxy INFO Removed Monitor for example.com:1905310
bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.proxy INFO Entering active_watchdogs
bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.proxy INFO Entering active_watchdogs
bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.proxy INFO Entering active_watchdogs
bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.proxy INFO Entering active_watchdogs
bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
bkr.labcontroller.proxy INFO Entering expire_watchdogs
bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
bkr.labcontroller.proxy DEBUG recipe_stop 1905310
bkr.labcontroller.watchdog ERROR Traceback (most recent call last):
...
ProtocolError: <ProtocolError for beaker.example.com/client/: 500 INTERNAL SERVER ERROR>
Server exception:
...
StaleDataError: UPDATE statement on table 'watchdog' expected to update 1 row(s); 0 were matched.

Version-Release number of selected component (if applicable):
19.3

How reproducible:
The StaleDataError is too timing-sensitive to be reproducible (beakerd's process_dirty_jobs routine needs to handle the recipe at the same moment as the expire call from beaker-watchdog). However the underlying problem is that beaker-watchdog should not attempt to expire the same watchdog multiple times, and that is reproducible...

Steps to Reproduce:
1. Fire off a recipe, wait for it to be scheduled and start running.
2. Stop beakerd on the server. This is to simulate beakerd falling far behind in processing dirty jobs. (In our case it was most likely just because beakerd was running very slow.)
3. Update the kill time so that beaker-watchdog sees the recipe as expired:
    bkr watchdog-extend --by=0 <taskid>

Actual results:
beaker-watchdog will repeatedly try to expire the watchdog over and over again while the dirty job is not processed yet. If beakerd is restarted, the dirty job will be processed and beaker-watchdog will no longer see the watchdog as expired. (There is a small chance of triggering the StaleDataError here if the timing is right.)

Expected results:
beaker-watchdog should only expire the watchdog once.
This can be fixed by excluding watchdogs whose job is dirty from the list of expired watchdogs.

On Gerrit:

https://bugzilla.redhat.com/show_bug.cgi?id=1210540
(In reply to matt jia from comment #1)
> This can be fixed by excluding watchdogs with dirty job in the expired
> watchdogs.
> On Gerrit:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1210540

Sorry, copied wrong link.

http://gerrit.beaker-project.org/#/c/4198/
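
For illustration, the idea proposed in comment #1 (leave out watchdogs whose job is still dirty when collecting expired watchdogs) might look roughly like the sketch below. This is not the code from the Gerrit change: the `Job` and `Watchdog` classes, the `dirty` flag, and the `expired_watchdogs` function are hypothetical stand-ins written in plain Python, and the real query lives in the server's database layer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Job:
    # Hypothetical flag: True while beakerd has not yet processed the job.
    dirty: bool


@dataclass
class Watchdog:
    recipe_id: int
    kill_time: datetime
    job: Job


def expired_watchdogs(watchdogs: List[Watchdog], now: datetime) -> List[Watchdog]:
    """Return watchdogs past their kill time, skipping those whose job is dirty.

    A dirty job means an earlier expire (or some other change) is still
    waiting for beakerd, so reporting the watchdog again would only
    repeat the same expire call.
    """
    return [w for w in watchdogs if w.kill_time < now and not w.job.dirty]


# Small demonstration with made-up data.
now = datetime(2015, 4, 10, 12, 0, 0)
past = now - timedelta(minutes=5)
future = now + timedelta(minutes=5)
sample = [
    Watchdog(recipe_id=1, kill_time=past, job=Job(dirty=False)),   # reported
    Watchdog(recipe_id=2, kill_time=past, job=Job(dirty=True)),    # skipped: dirty
    Watchdog(recipe_id=3, kill_time=future, job=Job(dirty=False)), # skipped: not expired
]
reported_ids = [w.recipe_id for w in expired_watchdogs(sample, now)]
```

With this filter, a watchdog that has already been expired once stays out of the expired list until beakerd has processed the job, at which point it is no longer expired at all.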
Verify Steps:
1. Install beaker-server-21.0-0.git.23.0983f62.el6eng on your dev environment.
2. Fire off a recipe, wait for it to be scheduled and start running.
3. Stop the beakerd service.
4. bkr watchdog-extend --by=0 <taskid>
5. Check whether the recipe is expired only once in /var/log/beaker/watchdog.log
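
The check in the last verification step can also be scripted. The following is a minimal sketch that counts `recipe_stop <id>` entries per recipe, assuming the log lines look like the excerpt in the description; the function name `count_recipe_stops` is made up for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches entries such as "bkr.labcontroller.proxy DEBUG recipe_stop 1905310".
RECIPE_STOP = re.compile(r"recipe_stop (\d+)")


def count_recipe_stops(log_lines: Iterable[str]) -> Counter:
    """Count how many times each recipe id appears in a recipe_stop entry."""
    counts = Counter()
    for line in log_lines:
        match = RECIPE_STOP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Demonstration on lines shaped like the excerpt in the description.
sample_log = [
    "bkr.labcontroller.proxy INFO Entering expire_watchdogs",
    "bkr.labcontroller.proxy INFO External Watchdog Expired for example.com",
    "bkr.labcontroller.proxy DEBUG recipe_stop 1905310",
    "bkr.labcontroller.proxy INFO Entering active_watchdogs",
    "bkr.labcontroller.proxy INFO Entering expire_watchdogs",
    "bkr.labcontroller.proxy DEBUG recipe_stop 1905310",
]
sample_counts = count_recipe_stops(sample_log)
```

To run it against a real log, pass an open file, for example `count_recipe_stops(open("/var/log/beaker/watchdog.log"))`. A count greater than 1 for the recipe under test means the watchdog was expired more than once.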
Beaker 21.0 has been released.