Bug 1303131 - restraint local watchdog expires too soon
restraint local watchdog expires too soon
Status: NEW
Product: Restraint
Classification: Community
Component: general (Show other bugs)
0.1.23
Unspecified Linux
medium Severity medium
: ---
: ---
Assigned To: beaker-dev-list
tools-bugs
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-01-29 11:10 EST by Jeff Bastian
Modified: 2016-06-08 13:52 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Jeff Bastian 2016-01-29 11:10:07 EST
Description of problem:
I'm trying to run a task in standalone mode with a 10 hour run time, and the watchdog expires only 3 minutes into the task.  The client doesn't realize this and it keeps trying to reconnect.

From the client:
========================================================================
~]# restraint --host=1=root@localhost --job=interbench.xml
Using ./interbench.01 for job run
Connecting to http://localhost:8091/run for host: localhost:8081, recipe id:1
*  T:   1 [/kernel/performance/interbench-0.31             ] Running
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Could not parse HTTP response [soup_request_error_quark, 2]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
...
...
...
^C
========================================================================

Meanwhile, on the serial console:
========================================================================
[ 1109.071246] restraintd[1949]: * Parsing recipe
[ 1109.072420] restraintd[1949]: * Running recipe
[ 1109.072992] restraintd[1949]: ** Fetching task: 1 [/mnt/tests/kernel/performance/interbench-0.31]
[ 1141.643588] restraintd[1949]: ** Parsing testinfo.desc
[ 1141.657910] restraintd[1949]: ** Updating env vars
[ 1141.682767] restraintd[1949]: ** Updating external watchdog: 37800 seconds
[ 1141.736770] restraintd[1949]: ** Installing dependencies
[ 1144.798887] restraintd[1949]: ** Running task: 1 [/kernel/performance/interbench-0.31]
[ 1241.960401] restraintd[1949]: *** Current Time: Fri Jan 29 10:50:11 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1303.723430] restraintd[1949]: *** Current Time: Fri Jan 29 10:51:13 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1385.779804] restraintd[1949]: *** Current Time: Fri Jan 29 10:52:35 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1413.734633] restraintd[1949]: ** ERROR: Local watchdog expired!
[ 1413.735283] restraintd[1949]: ** Completed Task : 1
[ 1416.734635] restraintd[1949]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[ 1426.267061] restraintd[1949]: * Finished recipe
========================================================================

The TestTime is 10 hours in the testinfo.desc file:
========================================================================
~]# grep TestTime /mnt/tests/kernel/performance/interbench-0.31/testinfo.desc
TestTime:       10h
========================================================================

Version-Release number of selected component (if applicable):
restraint-0.1.23-2.aa7a.aarch64

How reproducible:
always

Steps to Reproduce:
1. run /kernel/performance/interbench-0.31 (from Red Hat's internal Beaker)
   as a standalone job
~]# cat interbench.xml
<job>
  <recipeSet>
    <recipe>
      <task name="/kernel/performance/interbench-0.31">
        <rpm name="kernel-kernel-performance-interbench-0.31"
             path="/mnt/tests/kernel/performance/interbench-0.31"/>
      </task>
    </recipe>
  </recipeSet>
</job>
~]# restraint --host=1=root@localhost --job=interbench.xml

Actual results:
the local watchdog expires after 3 minutes instead of 10 hours

Expected results:
local watchdog gives the task at least 10 hours to run

Additional info:
Comment 1 Jeff Bastian 2016-01-29 13:11:17 EST
I was running the above on a large system with 96 cores.  I cannot reproduce this on a smaller system with 8 cores.
Comment 2 Bill Peck 2016-01-29 13:23:01 EST
Thanks for the report Jeff.

Is it possible to get access to the system where the problem is shown?

I'm traveling to Brno Tomorrow so it might be next week before I can look at this.
Comment 3 Jeff Bastian 2016-01-29 14:36:28 EST
Yes, I'll send you an email with the details.
Comment 4 Jeff Bastian 2016-01-29 16:51:16 EST
I see the same problem with /kernel/stress/racer, however it makes it longer than 3 minutes.  The local watchdog was set for 17:48, but it killed the job at 16:23, a full 85 minutes shy of the expected time.

[  638.187591] restraintd[1921]: * Parsing recipe
[  638.188717] restraintd[1921]: * Running recipe
[  638.189284] restraintd[1921]: ** Fetching task: 1 [/mnt/tests/kernel/stress/racer]
[  678.296679] restraintd[1921]: ** Parsing testinfo.desc
[  678.320434] restraintd[1921]: ** Updating env vars
[  678.345231] restraintd[1921]: ** Updating external watchdog: 9000 seconds
[  678.388008] restraintd[1921]: ** Installing dependencies
[  680.119938] restraintd[1921]: ** Running task: 1 [/kernel/stress/racer]
[  741.744089] restraintd[1921]: *** Current Time: Fri Jan 29 15:49:26 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[  800.231933] restraintd[1921]: *** Current Time: Fri Jan 29 15:50:24 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[  862.465831] restraintd[1921]: *** Current Time: Fri Jan 29 15:51:27 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
<snip>
[ 2540.250521] restraintd[1921]: *** Current Time: Fri Jan 29 16:19:24 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2633.952363] restraintd[1921]: *** Current Time: Fri Jan 29 16:20:58 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2710.841001] restraintd[1921]: *** Current Time: Fri Jan 29 16:22:15 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2752.812901] restraintd[1921]: ** ERROR: Local watchdog expired!
[ 2752.813763] restraintd[1921]: ** Completed Task : 1
[ 2759.694477] restraintd[1921]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[ 2772.117863] restraintd[1921]: * Finished recipe
Comment 5 Jeff Bastian 2016-02-03 15:41:31 EST
Today I saw it with xfstests:

[  365.431481] run fstests generic/017 at 2016-02-03 15:19:59
[  365.956690] XFS (loop1): Unmounting Filesystem
[  366.610974] XFS (loop1): Mounting V4 Filesystem
[  366.618482] XFS (loop1): Ending clean mount
[  389.047432] restraintd[1916]: *** Current Time: Wed Feb 03 15:20:23 2016 Localwatchdog at: Thu Feb 04 02:36:22 2016
[  399.094982] restraintd[1916]: ** ERROR: Local watchdog expired!
[  399.095636] restraintd[1916]: ** Completed Task : 1
[  399.096212] restraintd[1916]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[  399.134558] restraintd[1916]: * Finished recipe
Comment 6 PaulB 2016-06-08 13:46:35 EDT
All,
This issue is reproducible regularly when running as a client (outside of beaker)
with the following task:
 kernel/stress/racer/

Best,
-pbunyan
Comment 7 Bill Peck 2016-06-08 13:52:31 EDT
I think this issue is related to the heartbeat routine which also writes out to disk the current OUTPUTFILE offset.  On I/O intensive tests this gets blocked and in turn blocks us writing back to the client and we get disconnected.

One idea is to extend the timeout which does help but then makes it slower to respond to reboots.

Another idea is to not write the offset to disk every minute and simply use a sighup routine to write it out when we are being killed.

Note You need to log in before you can comment on or make changes to this bug.