Bug 1210540 - beaker-watchdog can try to expire the same watchdog multiple times, possibly crashing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: lab controller
Version: 19
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 21.0
Assignee: matt jia
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-10 02:29 UTC by Dan Callaghan
Modified: 2018-02-06 00:41 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-26 06:17:49 UTC
Embargoed:



Description Dan Callaghan 2015-04-10 02:29:21 UTC
Description of problem:
If beakerd is taking a particularly long time to process dirty jobs, beaker-watchdog can end up attempting to expire the same watchdog multiple times. Worse still, if any expire call after the first one happens to coincide with beakerd processing the dirty job, it will cause a 500 with StaleDataError.

    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.proxy INFO Entering active_watchdogs
    bkr.labcontroller.proxy INFO Removed Monitor for example.com:1905310
    bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.proxy INFO Entering active_watchdogs
    bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.proxy INFO Entering active_watchdogs
    bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.proxy INFO Entering active_watchdogs
    bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.proxy INFO Entering active_watchdogs
    bkr.labcontroller.watchdog DEBUG --------------------------------------------------------------------------------
    bkr.labcontroller.proxy INFO Entering expire_watchdogs
    bkr.labcontroller.proxy INFO External Watchdog Expired for example.com
    bkr.labcontroller.proxy DEBUG recipe_stop 1905310
    bkr.labcontroller.watchdog ERROR Traceback (most recent call last):
    ...
    ProtocolError: <ProtocolError for beaker.example.com/client/: 500 INTERNAL SERVER ERROR>

Server exception:

    ...
    StaleDataError: UPDATE statement on table 'watchdog' expected to update 1 row(s); 0 were matched.
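
This is SQLAlchemy's rowcount check failing: the expire request's session has already loaded the watchdog row, another transaction (beakerd processing the dirty job) removes or rewrites that row, and the expire request's flush then issues an UPDATE that matches zero rows. A minimal, self-contained sketch of that race, with an invented schema and stand-in sessions rather than Beaker's actual models:

    import os
    import tempfile

    from sqlalchemy import Column, Integer, create_engine
    from sqlalchemy.orm import Session, declarative_base
    from sqlalchemy.orm.exc import StaleDataError

    Base = declarative_base()

    class Watchdog(Base):
        # Table name taken from the traceback; the columns here are invented.
        __tablename__ = 'watchdog'
        id = Column(Integer, primary_key=True)
        kill_time = Column(Integer)

    # A file-backed SQLite database so the two sessions get separate connections.
    db_path = os.path.join(tempfile.mkdtemp(), 'demo.db')
    engine = create_engine('sqlite:///' + db_path)
    Base.metadata.create_all(engine)

    with Session(engine) as setup:
        setup.add(Watchdog(id=1905310, kill_time=100))
        setup.commit()

    session_a = Session(engine)   # the redundant expire_watchdogs request
    session_b = Session(engine)   # beakerd processing the dirty job

    stale = session_a.get(Watchdog, 1905310)    # expire call loads the watchdog

    other = session_b.get(Watchdog, 1905310)    # beakerd finishes the recipe and
    session_b.delete(other)                     # removes its watchdog row first
    session_b.commit()

    stale.kill_time = 0               # the expire call then flushes an UPDATE
    try:                              # against a row that no longer exists
        session_a.commit()
    except StaleDataError as exc:
        session_a.rollback()
        # UPDATE statement on table 'watchdog' expected to update 1 row(s);
        # 0 were matched.
        print(exc)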

Version-Release number of selected component (if applicable):
19.3

How reproducible:
The StaleDataError is too timing-sensitive to be reproducible (beakerd's process_dirty_jobs routine needs to handle the recipe at the same moment as the expire call from beaker-watchdog). However, the underlying problem is that beaker-watchdog should not attempt to expire the same watchdog multiple times, and that part is reproducible (see the steps below).

Steps to Reproduce:
1. Fire off a recipe, wait for it to be scheduled and start running.
2. Stop beakerd on the server. This simulates beakerd falling far behind in processing dirty jobs. (In our case it was most likely just because beakerd was running very slowly.)
3. Update the kill time so that beaker-watchdog sees the recipe as expired:
bkr watchdog-extend --by=0 <taskid>

Actual results:
beaker-watchdog repeatedly tries to expire the watchdog for as long as the dirty job remains unprocessed.
If beakerd is restarted, the dirty job is processed and beaker-watchdog no longer sees the watchdog as expired. (There is a small chance of triggering the StaleDataError at this point if the timing is right.)
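
Roughly what the daemon's poll loop is doing; this is a simplified sketch rather than the real beaker-watchdog code, and the proxy calls are hypothetical stand-ins for the actual XML-RPC interface:

    import time

    POLL_INTERVAL = 60  # seconds; an illustrative value, not the daemon's setting

    def poll_forever(proxy):
        # 'proxy' and its methods are stand-ins for the calls the real daemon
        # makes against the lab controller proxy / server.
        while True:
            for watchdog in proxy.get_expired_watchdogs():
                # Nothing here remembers that this recipe was already aborted on
                # a previous cycle, so the abort is re-sent every POLL_INTERVAL
                # until the server stops listing the watchdog as expired, which
                # only happens once beakerd has processed the dirty job.
                proxy.recipe_stop(watchdog['recipe_id'], 'abort',
                                  'External Watchdog Expired')
            time.sleep(POLL_INTERVAL)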

Expected results:
beaker-watchdog should only expire the watchdog once.

Comment 1 matt jia 2015-05-05 03:01:28 UTC
This can be fixed by excluding watchdogs whose job is still dirty from the set of expired watchdogs.
On Gerrit:

  https://bugzilla.redhat.com/show_bug.cgi?id=1210540
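
A rough sketch of that approach, using illustrative, flattened models and a hypothetical is_dirty flag, not the actual change under review:

    from datetime import datetime, timedelta

    from sqlalchemy import (Boolean, Column, DateTime, ForeignKey, Integer,
                            create_engine)
    from sqlalchemy.orm import Session, declarative_base, relationship

    Base = declarative_base()

    class Job(Base):
        __tablename__ = 'job'
        id = Column(Integer, primary_key=True)
        # Hypothetical flag meaning "beakerd still has unprocessed status
        # changes queued for this job".
        is_dirty = Column(Boolean, nullable=False, default=False)

    class Watchdog(Base):
        __tablename__ = 'watchdog'
        id = Column(Integer, primary_key=True)
        job_id = Column(Integer, ForeignKey('job.id'), nullable=False)
        kill_time = Column(DateTime)
        job = relationship(Job)

    def get_expired_watchdogs(session):
        """Expired watchdogs whose job beakerd has already caught up on."""
        return (session.query(Watchdog)
                .join(Watchdog.job)
                .filter(Watchdog.kill_time <= datetime.utcnow())
                .filter(Job.is_dirty == False)
                .all())

    if __name__ == '__main__':
        engine = create_engine('sqlite://')
        Base.metadata.create_all(engine)
        with Session(engine) as session:
            dirty = Job(id=1, is_dirty=True)
            clean = Job(id=2, is_dirty=False)
            past = datetime.utcnow() - timedelta(hours=1)
            session.add_all([dirty, clean,
                             Watchdog(id=1, job=dirty, kill_time=past),
                             Watchdog(id=2, job=clean, kill_time=past)])
            session.commit()
            # Only the watchdog whose job is clean is reported as expired.
            print([w.id for w in get_expired_watchdogs(session)])   # -> [2]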

Comment 2 matt jia 2015-05-12 01:18:59 UTC
(In reply to matt jia from comment #1)
> This can be fixed by excluding watchdogs whose job is still dirty from the
> set of expired watchdogs.
> On Gerrit:
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1210540

Sorry, I copied the wrong link.

http://gerrit.beaker-project.org/#/c/4198/

Comment 4 matt jia 2015-05-12 03:43:54 UTC
Verify Steps:

1. Install beaker-server-21.0-0.git.23.0983f62.el6eng in your dev environment.

2. Fire off a recipe, wait for it to be scheduled and start running.

3. Stop the beakerd service.

4. bkr watchdog-extend --by=0 <taskid>

5. Check that the recipe is expired only once in
/var/log/beaker/watchdog.log

Comment 6 Dan Callaghan 2015-08-26 06:17:49 UTC
Beaker 21.0 has been released.

