Bug 743819

Summary: Jobs killed by External Watchdog after prolonged server outage
Product: [Retired] Beaker
Reporter: Marian Csontos <mcsontos>
Component: scheduler
Assignee: Bill Peck <bpeck>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 0.7
CC: bpeck, dcallagh, mcsontos, rmancy, stl
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-26 07:16:59 UTC Type: ---

Description Marian Csontos 2011-10-06 08:00:35 UTC
Description of problem:
Some jobs were killed by the external watchdog during the last outage; see Bug 637186#c6 and the following comment(s).

After an outage lasting longer than 30 minutes, we need to make sure all the watchdogs are extended so that machines have time to sync up with the host.

Version-Release number of selected component (if applicable):
0.7.2

How reproducible:
Not easy to reproduce.

Steps to Reproduce:
1. Schedule a task taking X minutes.
2. Perform an outage longer than X + 1 hour.
  
Actual results:
Job killed by external watchdog

Expected results:
Job resumes after the delay, submits all results and continues execution.

Additional info:

Comment 1 Bill Peck 2011-10-06 12:45:17 UTC
How about a command line which would allow the admins to extend current watchdogs by the length of the outage?

bkr watchdogs --add 30m

Comment 2 Bill Peck 2011-10-06 12:45:43 UTC
Of course, they would need to run that before starting the beaker-watchdog services.
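
As a rough illustration of the proposal above, the following sketch shows what "extend current watchdogs by the length of the outage" could amount to: parse a duration such as 30m and push every pending kill time back by that amount. This is not Beaker's code. The function names, the accepted duration format, and the dictionary of kill times are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed duration suffixes; the real option may accept a different format.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Turn a string such as '30m' or '2h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def extend_watchdogs(kill_times, extension):
    """Return a new mapping with every kill time pushed back by `extension`."""
    return {key: kill_time + extension for key, kill_time in kill_times.items()}


# Hypothetical data: recipe id -> time at which the watchdog would fire.
kill_times = {
    101: datetime(2011, 10, 6, 9, 0),
    102: datetime(2011, 10, 6, 9, 45),
}
extended = extend_watchdogs(kill_times, parse_duration("30m"))
```

The extension has to be applied before anything checks the old kill times, which is why the command would need to run before the beaker-watchdog services are started.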

Comment 3 Marian Csontos 2011-10-06 14:06:21 UTC
Yes please. I think that would do.

Comment 4 Bill Peck 2012-01-11 16:10:24 UTC
Moving to 0.8.2.

Comment 5 Bill Peck 2012-03-27 14:44:52 UTC
Pushed to Gerrit for review.