Bug 1129020

Summary: Document the WATCHDOG_SCRIPT feature
Product: [Retired] Beaker
Component: Doc
Reporter: Nick Coghlan <ncoghlan>
Assignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Priority: unspecified
Version: 0.17
CC: aigao, asaha, dcallagh, junichi.nomura, rmancy, xma
Target Milestone: 0.18.1
Keywords: Documentation, FutureFeature, Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Last Closed: 2014-09-12 07:36:22 UTC
Type: Bug

Description Nick Coghlan 2014-08-12 06:22:05 UTC
On the lab controller, a WATCHDOG_SCRIPT can be configured to run whenever an external watchdog timer fires.

It receives the system FQDN(?), recipe ID and currently running task ID as arguments, and should print as its sole output the number of seconds to extend the watchdog.

When it reports successful completion, the external watchdog script becomes responsible for stopping the recipe (the watchdog daemon will log an error and stop the recipe if the script returns a non-zero exit code)
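As an illustration of that contract, here is a minimal sketch of what such a script's logic might look like. The argument order (system FQDN, recipe ID, task ID) follows the description above but is not confirmed, and the function name and the 600-second value are hypothetical choices made for the example.

```python
import sys

# Hypothetical policy value: how long to extend the watchdog, in seconds.
EXTENSION_SECONDS = 600


def decide_extension(argv):
    """Return (exit_code, stdout_text) for the given script arguments.

    Assumed argument order: system FQDN, recipe ID, task ID.
    """
    if len(argv) != 3:
        # Wrong invocation: fail, so the daemon stops the recipe.
        return 1, ""
    fqdn, recipe_id, task_id = argv
    if not recipe_id.isdigit() or not task_id.isdigit():
        # Unexpected IDs: fail rather than guess.
        return 1, ""
    # Success: the number of seconds must be the only thing printed.
    return 0, str(EXTENSION_SECONDS)


def main(argv):
    code, output = decide_extension(argv)
    if output:
        print(output)
    return code
```

Installed as an executable, the script would finish by calling `sys.exit(main(sys.argv[1:]))`. Any diagnostics should go to stderr or a log file, since stdout is reserved for the number of seconds.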

Comment 1 Jun'ichi NOMURA 2014-08-18 05:24:31 UTC
(In reply to Nick Coghlan from comment #0)
> When it reports successful completion, the external watchdog script becomes
> responsible for stopping the recipe (the watchdog daemon will log an error
> and stop the recipe if the script returns a non-zero exit code)

Expected behavior of this feature is:
  - non-zero exit code means either WATCHDOG_SCRIPT has failed or no extension was requested.
    So stop the recipe.
  - zero exit code means WATCHDOG_SCRIPT has requested the timer extension.
    So extend the timeout, where the timeout value is read from the script stdout

The current code assumes that 'self.extend_watchdog()' never fails without raising an exception, which could be an oversight.
If the function can report failure through its return value, the code should be fixed to stop the recipe in that case.

Comment 2 Dan Callaghan 2014-09-02 05:45:38 UTC
(In reply to Nick Coghlan from comment #0)
> When it reports successful completion, the external watchdog script becomes
> responsible for stopping the recipe 

Well, it has to either abort the recipe or else just handle being invoked on the same recipe again, to avoid infinite loops. Assuming WATCHDOG_SCRIPT exits normally, the watchdog is extended (meaning it goes back to active) and it should go through the same expiry process when it expires again.

(In reply to Jun'ichi NOMURA from comment #1)
> The current code assumes that 'self.extend_watchdog()' never fails without
> raising an exception, which could be an oversight.
> If the function can report failure through its return value, the code should
> be fixed to stop the recipe in that case.

I think the code currently handles failures in WATCHDOG_SCRIPT correctly. The check_output() function is where the external script is actually executed, and that function raises an exception if the exit status is non-zero. There is an except: block which will catch that exception (as well as any other exceptions, like a failure to coerce the script output to int, or a failure to extend the watchdog) and fall through to aborting the recipe as normal. So I think the code matches the expected behaviour you are describing.
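The control flow described above can be sketched roughly as follows. This is not Beaker's actual code: the function name and the two callbacks are hypothetical stand-ins for the daemon's own methods, and only `subprocess.check_output()` raising on a non-zero exit status is standard library behaviour.

```python
import subprocess
import sys


def handle_expired_watchdog(script_argv, extend_watchdog, abort_recipe):
    """Run the external script; extend on success, abort on any failure.

    extend_watchdog and abort_recipe are hypothetical callbacks.
    Returns True if the watchdog was extended, False if the recipe was aborted.
    """
    try:
        # Raises CalledProcessError if the script exits non-zero.
        output = subprocess.check_output(script_argv)
        # Raises ValueError if the output is not an integer.
        seconds = int(output.decode().strip())
        # May raise if extending the watchdog fails.
        extend_watchdog(seconds)
    except Exception:
        # Any failure falls through to aborting the recipe.
        abort_recipe()
        return False
    return True
```

Note that a failure signalled only through a return value of `extend_watchdog` would not be caught by this structure, which is the case raised in comment 1.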

Of course, having said all that, we aren't testing WATCHDOG_SCRIPT anywhere currently so I can't prove it...

Comment 3 Dan Callaghan 2014-09-02 06:49:26 UTC
On Gerrit: http://gerrit.beaker-project.org/3305

Comment 4 Dan Callaghan 2014-09-09 01:03:17 UTC
This bug fix is available on the Beaker web site:

https://beaker-project.org/docs-release-0.18/admin-guide/watchdog-script.html

Comment 5 Dan Callaghan 2014-09-12 07:36:22 UTC
Beaker 0.18.1 has been released.