Bug 1303131 - restraint local watchdog expires too soon
Summary: restraint local watchdog expires too soon
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Restraint
Classification: Retired
Component: general
Version: 0.1.23
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Carol Bouchard
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-29 16:10 UTC by Jeff Bastian
Modified: 2020-11-19 21:06 UTC (History)
8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-19 21:06:24 UTC
Embargoed:




Links
Beaker Project Gerrit 4670 | Status: ABANDONED | Last Updated: 2020-07-09 08:11:23 UTC | Summary: Make timeout configurable and default to 5 minutes. We should get a heartbeat every minute but if system is under heavy...
Red Hat Bugzilla 1303941 | Private | Last Updated: 2021-01-20 06:05:38 UTC

Internal Links: 1303941

Description Jeff Bastian 2016-01-29 16:10:07 UTC
Description of problem:
I'm trying to run a task in standalone mode with a 10 hour run time, and the watchdog expires only 3 minutes into the task.  The client doesn't realize this and it keeps trying to reconnect.

From the client:
========================================================================
~]# restraint --host=1=root@localhost --job=interbench.xml
Using ./interbench.01 for job run
Connecting to http://localhost:8091/run for host: localhost:8081, recipe id:1
*  T:   1 [/kernel/performance/interbench-0.31             ] Running
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Could not parse HTTP response [soup_request_error_quark, 2]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
Connecting to http://localhost:8091/continue for host: localhost:8081, recipe id:1
Socket I/O timed out [g-io-error-quark, 24]
Disconnected.. delaying 60 seconds.
...
...
...
^C
========================================================================

Meanwhile, on the serial console:
========================================================================
[ 1109.071246] restraintd[1949]: * Parsing recipe
[ 1109.072420] restraintd[1949]: * Running recipe
[ 1109.072992] restraintd[1949]: ** Fetching task: 1 [/mnt/tests/kernel/performance/interbench-0.31]
[ 1141.643588] restraintd[1949]: ** Parsing testinfo.desc
[ 1141.657910] restraintd[1949]: ** Updating env vars
[ 1141.682767] restraintd[1949]: ** Updating external watchdog: 37800 seconds
[ 1141.736770] restraintd[1949]: ** Installing dependencies
[ 1144.798887] restraintd[1949]: ** Running task: 1 [/kernel/performance/interbench-0.31]
[ 1241.960401] restraintd[1949]: *** Current Time: Fri Jan 29 10:50:11 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1303.723430] restraintd[1949]: *** Current Time: Fri Jan 29 10:51:13 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1385.779804] restraintd[1949]: *** Current Time: Fri Jan 29 10:52:35 2016 Localwatchdog at: Fri Jan 29 20:48:34 2016
[ 1413.734633] restraintd[1949]: ** ERROR: Local watchdog expired!
[ 1413.735283] restraintd[1949]: ** Completed Task : 1
[ 1416.734635] restraintd[1949]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[ 1426.267061] restraintd[1949]: * Finished recipe
========================================================================

The TestTime is 10 hours in the testinfo.desc file:
========================================================================
~]# grep TestTime /mnt/tests/kernel/performance/interbench-0.31/testinfo.desc
TestTime:       10h
========================================================================

Version-Release number of selected component (if applicable):
restraint-0.1.23-2.aa7a.aarch64

How reproducible:
always

Steps to Reproduce:
1. run /kernel/performance/interbench-0.31 (from Red Hat's internal Beaker)
   as a standalone job
~]# cat interbench.xml
<job>
  <recipeSet>
    <recipe>
      <task name="/kernel/performance/interbench-0.31">
        <rpm name="kernel-kernel-performance-interbench-0.31"
             path="/mnt/tests/kernel/performance/interbench-0.31"/>
      </task>
    </recipe>
  </recipeSet>
</job>
~]# restraint --host=1=root@localhost --job=interbench.xml

Actual results:
the local watchdog expires after 3 minutes instead of 10 hours

Expected results:
local watchdog gives the task at least 10 hours to run

Additional info:

Comment 1 Jeff Bastian 2016-01-29 18:11:17 UTC
I was running the above on a large system with 96 cores.  I cannot reproduce this on a smaller system with 8 cores.

Comment 2 Bill Peck 2016-01-29 18:23:01 UTC
Thanks for the report Jeff.

Is it possible to get access to the system where the problem is shown?

I'm traveling to Brno tomorrow, so it might be next week before I can look at this.

Comment 3 Jeff Bastian 2016-01-29 19:36:28 UTC
Yes, I'll send you an email with the details.

Comment 4 Jeff Bastian 2016-01-29 21:51:16 UTC
I see the same problem with /kernel/stress/racer, although it makes it past 3 minutes. The local watchdog was set for 17:48, but it killed the job at 16:23, a full 85 minutes shy of the expected time.

[  638.187591] restraintd[1921]: * Parsing recipe
[  638.188717] restraintd[1921]: * Running recipe
[  638.189284] restraintd[1921]: ** Fetching task: 1 [/mnt/tests/kernel/stress/racer]
[  678.296679] restraintd[1921]: ** Parsing testinfo.desc
[  678.320434] restraintd[1921]: ** Updating env vars
[  678.345231] restraintd[1921]: ** Updating external watchdog: 9000 seconds
[  678.388008] restraintd[1921]: ** Installing dependencies
[  680.119938] restraintd[1921]: ** Running task: 1 [/kernel/stress/racer]
[  741.744089] restraintd[1921]: *** Current Time: Fri Jan 29 15:49:26 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[  800.231933] restraintd[1921]: *** Current Time: Fri Jan 29 15:50:24 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[  862.465831] restraintd[1921]: *** Current Time: Fri Jan 29 15:51:27 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
<snip>
[ 2540.250521] restraintd[1921]: *** Current Time: Fri Jan 29 16:19:24 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2633.952363] restraintd[1921]: *** Current Time: Fri Jan 29 16:20:58 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2710.841001] restraintd[1921]: *** Current Time: Fri Jan 29 16:22:15 2016 Localwatchdog at: Fri Jan 29 17:48:24 2016
[ 2752.812901] restraintd[1921]: ** ERROR: Local watchdog expired!
[ 2752.813763] restraintd[1921]: ** Completed Task : 1
[ 2759.694477] restraintd[1921]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[ 2772.117863] restraintd[1921]: * Finished recipe

Comment 5 Jeff Bastian 2016-02-03 20:41:31 UTC
Today I saw it with xfstests:

[  365.431481] run fstests generic/017 at 2016-02-03 15:19:59
[  365.956690] XFS (loop1): Unmounting Filesystem
[  366.610974] XFS (loop1): Mounting V4 Filesystem
[  366.618482] XFS (loop1): Ending clean mount
[  389.047432] restraintd[1916]: *** Current Time: Wed Feb 03 15:20:23 2016 Localwatchdog at: Thu Feb 04 02:36:22 2016
[  399.094982] restraintd[1916]: ** ERROR: Local watchdog expired!
[  399.095636] restraintd[1916]: ** Completed Task : 1
[  399.096212] restraintd[1916]: ** Message: Aborted task 1 due to error: Local watchdog expired!
[  399.134558] restraintd[1916]: * Finished recipe

Comment 6 PaulB 2016-06-08 17:46:35 UTC
All,
This issue is reproducible regularly when running as a client (outside of beaker)
with the following task:
 kernel/stress/racer/

Best,
-pbunyan

Comment 7 Bill Peck 2016-06-08 17:52:31 UTC
I think this issue is related to the heartbeat routine, which also writes the current OUTPUTFILE offset out to disk. On I/O-intensive tests this write gets blocked, which in turn blocks us from writing back to the client, and we get disconnected.

One idea is to extend the timeout, which does help, but it then makes us slower to respond to reboots.

Another idea is to not write the offset to disk every minute and instead use a SIGHUP handler to write it out when we are being killed.
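
A minimal sketch of the kind of heartbeat routine I mean (this is not restraint's actual code; the names here are hypothetical):

========================================================================
/* Once-a-minute heartbeat that also persists the OUTPUTFILE offset.
 * If the test saturates the disk, the synchronous write below can stall,
 * which also delays the reply to the client and can trip its socket
 * I/O timeout. */
#include <glib.h>

typedef struct {
    goffset      output_offset;  /* how much of OUTPUTFILE we have streamed */
    const gchar *offset_path;    /* state file the offset is written to */
} HeartbeatData;

static gboolean
heartbeat_cb (gpointer user_data)
{
    HeartbeatData *hb = user_data;
    gchar *contents = g_strdup_printf ("%" G_GOFFSET_FORMAT "\n",
                                       hb->output_offset);

    /* Synchronous write; blocks under heavy I/O load. */
    g_file_set_contents (hb->offset_path, contents, -1, NULL);
    g_free (contents);

    /* ...send the heartbeat / current time back to the client here... */

    return G_SOURCE_CONTINUE;    /* re-arm: run again in 60 seconds */
}

/* Registered once per running task:
 *   g_timeout_add_seconds (60, heartbeat_cb, hb);
 */
========================================================================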

Comment 8 Miloš Prchlík 2018-05-24 14:33:37 UTC
I believe we're observing the same issue when using restraint to run tests in a CI environment. It is quite easy to reproduce in our CI environment; however, when running the scenario locally, everything works :/

We've been able to condense the reproducer to a small job XML which basically runs a single task that logs interesting information about the server and then reboots the box - repeat that for 9 consecutive reboots. The whole action can fit into tens of seconds; it's just `env` and a few other commands, followed by rhts-reboot - it can complete and initiate the reboot well before the heartbeat triggers for the first time (if I understand it correctly). Unfortunately, we can rarely get all 9 reboots - after 2 or 3 reboots restraint "realizes" the max_time for the task (86400 seconds, BTW...) has elapsed and kills it by the power vested in restraint's local watchdog. To make things even funnier, after killing the task, restraint often believes it's time to actually run it like it was never running - it fetches the RPM, starts the task, and makes a mess in the logs (several Setup phases, later ones with REBOOTCOUNT set to 0, ...).

We have logs; I'd rather share them via a more private method as they contain RH-private information, but the prematurely expiring max_time seems to be at least one of the major issues.

Comment 9 Miroslav Vadkerti 2018-05-24 15:15:01 UTC
This issue is blocking proper testing of systemd downstream in all RHEL releases.

Róman, Bill, would it be possible to investigate this with us? We have quite a good reproducer and are willing to help deep-dive here.

Comment 10 Matt Tyson 🤬 2018-05-25 03:52:16 UTC
Hi Miroslav, Miloš,

I'll look into this and see if I can trigger it locally.  Would it be possible to get access to your CI environment to see if I can trigger it there?

Any extra logs or information & reproducers you can supply would be helpful.

Cheers,
Matt.

Comment 11 Miloš Prchlík 2018-05-25 10:43:50 UTC
Hi Matt, I've sent you details in an e-mail since it contains internal RH stuff. Maybe setting a comment to private would work just as well, I just wasn't sure (would it be sent out in e-mail notifications?). If it's fine to put it here, after all, let me know.

Comment 12 Miroslav Vadkerti 2018-06-07 08:22:03 UTC
Matt, is the provided data sufficient for you? Thanks

Comment 13 Matt Tyson 🤬 2018-06-08 09:06:55 UTC
It is for now; I'm still having trouble isolating the problem. I'll let you know if I need more.

Cheers,
Matt.

Comment 14 Matt Tyson 🤬 2018-06-12 06:03:58 UTC
I'm having trouble reproducing the problem as described.  I've done it once or twice by heavily loading my workstations (lots of CPU load and disk IO).  It's unreliable and hard to reproduce though.

In this case the client was disconnected and the server reported a watchdog timeout.

What I have found out is that we can kick off a test job, then kill -9 the restraint client.  The server will then die with a watchdog timeout message.

I suspect that the watchdog is not actually expiring and that the timeout message is spurious. I think it has to do with the network going away and the restraint server shutting down as a result.

Comment 19 Miloš Prchlík 2018-06-20 12:18:48 UTC
Ah, I gave the archive the wrong name; see the "restraint logs, comment #17" archive for the restraint logs :)

Comment 23 Matt Tyson 🤬 2018-06-25 04:07:00 UTC
(In reply to PaulB from comment #6)
> All,
> This issue is reproducible regularly when running as a client (outside of
> beaker)
> with the following task:
>  kernel/stress/racer/


I've grabbed a random 48-core machine out of Beaker and run the racer test, and I can reproduce the issue.

My patched version of restraint will, instead of aborting with the watchdog timeout, print the warning ("Process has been cancelled but watchdog has not timed out") and continue execution of the test.

At this time the machine had a load average of approx 62.

The client attempted to reconnect for the remainder of the test, but could not do so.

So my guess is that the callback for the SSH connection going away is calling the watchdog timeout callback, and therefore aborting the test when it shouldn't.
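
To illustrate the conflation I suspect, here is a hypothetical sketch (not restraint's actual code; names are made up), including the guard my patched build adds:

========================================================================
#include <glib.h>

typedef struct {
    gint64 watchdog_deadline;   /* monotonic time (microseconds) when the
                                 * local watchdog is really due to expire */
} TaskRun;

static void
abort_task (TaskRun *run, const gchar *reason)
{
    g_message ("Aborted task due to error: %s", reason);
    /* ...tear down the running task, report completion... */
}

/* Suspected problem: this callback fires when the SSH connection goes
 * away, but it reuses the watchdog-expiry path, so a dropped connection
 * gets reported as "Local watchdog expired!". */
static void
connection_gone_cb (TaskRun *run)
{
    /* Guard from the experimental build: only abort if the watchdog
     * deadline has actually passed. */
    if (g_get_monotonic_time () < run->watchdog_deadline) {
        g_warning ("Process has been cancelled but watchdog has not timed out");
        return;   /* keep the test running; let the client reconnect */
    }
    abort_task (run, "Local watchdog expired!");
}
========================================================================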

Comment 25 Miloš Prchlík 2018-06-25 14:25:21 UTC
Hm, I must have made a mistake then. Never mind, I tried again from scratch, and AFAICT your experimental restraint build seems to work like a charm. I've been able to run our reboot-stress scenario several times and, apart from exceptions caused by OpenStack instability, it finished as expected \o/

Comment 26 Matt Tyson 🤬 2018-07-01 23:18:36 UTC
(In reply to Miloš Prchlík from comment #25)
> Hm, I must have made a mistake then. Nevermind, I tried again, from a
> scratch, and AFAICT, your experimental restraint build seems to work like a
> charm. I've been able to run our reboot-stress scenario several times, with
> exceptions caused by OpenStack instability it finished as expected \o/

I'm somewhat surprised that my patched version fixed your issue.  I wouldn't have expected anything I changed to make a difference for your use case.

Let me know if you still experience the issue and how you reproduced it.

Comment 27 Matt Tyson 🤬 2018-07-03 05:46:36 UTC
So, from what I can see, this bug has nothing to do with the watchdog and everything to do with the network connection being severed.

Doing a kill -9 on the restraint client will cause the timeout message to appear.
Yanking the network cable out will also cause the timeout message to appear.

I also replicated this by running the /kernel/stress/racer test on some large machines in the beaker lab. (comment 6)

I think the reboot issue experienced by Milos is different and unrelated, but I can't prove that yet as I have not been able to replicate it.

What isn't clear to me is why the connection is breaking.  I suspect it's load related, but my regular SSH login to the server never failed.

I think the fix here is to detect that the SSH connection between the client & server has been broken and to allow the client to reconnect.

Restraint has some code to allow the client to reconnect after the test machine has rebooted, but it does not appear to activate in the case that an errant backhoe operator breaks the network connection.

This presents the problem of what to do if you Ctrl-C the restraint client while a test is running.  Currently it terminates the running test on the remote machine, which is probably what the user intended in that particular instance.

Another problem is what to do if the test completes while the client is disconnected.  The server won't have anywhere to upload the results to and would probably need to wait around for the client to come back and get the results.
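
One possible shape for that distinction (a sketch only, with hypothetical names; the actual behaviour still needs to be decided):

========================================================================
#include <glib.h>

typedef enum {
    DISCONNECT_CLIENT_CANCELLED,  /* client sent an explicit cancel (Ctrl-C) */
    DISCONNECT_CONNECTION_LOST    /* socket error, network fault, kill -9, ... */
} DisconnectReason;

static void
on_client_disconnect (DisconnectReason reason)
{
    if (reason == DISCONNECT_CLIENT_CANCELLED) {
        /* Deliberate termination: keep today's behaviour. */
        g_message ("Client cancelled the run; terminating the task");
        /* ...abort the task and discard pending results... */
    } else {
        /* Accidental disconnect: keep the task alive and hold the results
         * so a reconnecting client can fetch them later. */
        g_message ("Connection lost; continuing the task and holding results "
                   "until the client reconnects or the watchdog expires");
        /* ...continue running; buffer results for a later reconnect... */
    }
}
========================================================================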

Comment 28 Dan Callaghan 2018-07-03 05:58:20 UTC
Is this maybe a fundamental limitation in how the restraint-client <-> restraint-server interaction works currently?

Before we go too far down the road of implementing a solution for this, would it be worth considering some alternative strategies (like invoking restraintd from Ansible and collecting the results from the system after the job is complete)?

Comment 29 Matt Tyson 🤬 2018-07-03 06:04:22 UTC
(In reply to Dan Callaghan from comment #28)
> Is this maybe a fundamental limitation in how the restraint-client <->
> restraint-server interaction works currently?

Yes I think so.
 
> Before we go too far down the road of implementing a solution for this,
> would it be worth considering some alternative strategies (like invoking
> restraintd from Ansible and collecting the results from the system after the
> job is complete)?

Yeah, running the restraint client in a tmux/screen session on the remote machine should work around this bug.

Comment 30 Miloš Prchlík 2018-07-03 06:22:43 UTC
(In reply to Matt Tyson from comment #26)
> (In reply to Miloš Prchlík from comment #25)
> > Hm, I must have made a mistake then. Nevermind, I tried again, from a
> > scratch, and AFAICT, your experimental restraint build seems to work like a
> > charm. I've been able to run our reboot-stress scenario several times, with
> > exceptions caused by OpenStack instability it finished as expected \o/
> 
> I'm somewhat surprised that my patched version fixed your issue.  I wouldn't
> expected anything I changed to have made a difference for your use case.
> 
> Let me know if you still experience the issue and how you reproduced it.

Well, I don't claim you fixed it, I just said that your build worked :) Apparently, it seems to be just a coincidence :/ I'll try again this morning - we're quite successful at reproducing the issue; all it takes is an OpenStack instance and running the reboot-stress task. It doesn't finish its 10 reboots; usually it's one or two, then it's killed by the watchdog. Often restraint "forgets" it's already running this test and after two reboots comes "reboot #0" (REBOOTCOUNT is suddenly 0, and restraint logs that it downloaded and installed the test again, just like it never ran before...), and the test is killed by the watchdog after this episode. With your build, it simply worked as expected; the test made it to 10 reboots every time I tried (and I tried a lot...).

Any ideas what I could try to help you reproduce the issue? Maybe I could reserve an OpenStack instance for you to try out. We're not using restraint with Beaker boxes, hence I have no data on that machine pool; we didn't even try to reproduce it with Beaker. The original reproducer I sent to you via e-mail (the CI docker image) uses OpenStack, and as I said, so far it's quite easy to make restraint's watchdog kill this particular task just by running it.

Comment 31 Matt Tyson 🤬 2018-07-03 06:31:13 UTC
(In reply to Miloš Prchlík from comment #30)
> Any ideas what I could try to help you reproduce the issue? Maybe I could
> reserve an OpenStack instance for you to try out.

This would help.  If you can set up an OpenStack machine and leave it running for a couple of days, I could connect to it and do some testing there.

Comment 32 Matt Tyson 🤬 2018-07-04 01:26:13 UTC
So the way restraint works in client/server mode (i.e., outside of Beaker) is that the client connects to the server over SSH and then issues commands to start the test, monitor execution, fetch results, etc.

When this SSH connection is broken, either by the client being terminated with Ctrl-C or by accidental means such as a network fault, the restraint server will abort the test.

Restraint currently does not distinguish between an intentional termination and an unintentional one.

Fixing this is essentially an RFE as we have not specified what the correct behaviour is when the network connection is broken (for whatever reason).

So I have some questions for the users of Restraint.

* What is the expected behaviour when you kill the restraint client with Ctrl-C?
  * Restraint server terminates the test?
  * Something else?
* What is the expected behaviour when the network connection dies?
  * Restraint server continues execution to the end of the job?
  * Restraint server finishes current test and blocks until reconnect?
  * Something else?
* At the end of the test, the client needs to fetch results from the server.
  What should the restraint server do if the client disconnects and does not reconnect?
  * Discard results and wait for new test job if client does not reconnect within job watchdog timeout period?
  * Wait forever.  Ignore watchdog and never discard results?
  * Something else?

I'd very much like to get feedback from anyone who uses restraint in client/server mode.

Comment 33 Matt Tyson 🤬 2018-07-16 00:51:34 UTC
I'm unassigning myself from this as I'm going on PTO shortly and won't be able to complete this bug before I leave.

I think that the correct solution to this bug is what dcallagh referred to in comment 28.  The flaw in the current client/server code is that it assumes that the network is reliable and will never break, except in the case of deliberate termination by the user, or as a reboot as part of a test.

There's also a significant amount of code that is only related to the client/server interactions, authentication, etc., and not part of restraint's core job of running the test and reporting results.  If we rearchitected how restraint works when running outside of Beaker, we could potentially remove lots of code and make it more reliable in the process.

Comment 34 Miloš Prchlík 2018-08-17 08:25:25 UTC
My 2¢ (inlined):

(In reply to Matt Tyson from comment #32)
> So the way restraint works in client/server mode (IE, outside of beaker) is
> that the client will connect to the server over SSH and then issue commands
> to start the test, monitor execution, fetch results, etc.
> 
> When this SSH connection is broken, either via the client being terminated
> with Ctrl-C, or via an accidental means such as a network fault, the
> restraint server will abort the test.
> 
> Restraint currently does not distinguish between a intentional termination
> and an unintentional one.
> 
> Fixing this is essentially an RFE as we have not specified what the correct
> behaviour is when the network connection is broken (for whatever reason).
> 
> So I have some questions for the users of Restraint.
> 
> * What is the expected behaviour when you kill the restraint client with
> Ctrl-C?
>   * Restraint server terminates the test?
>   * Something else?

I'd expect the client to notify the server and terminate the test. Or this could be controlled by an option, should there be more than one viable solution, but since it's the client that drives the testing process by telling the server what to start, terminating the test would be my answer.

> * What is the expected behaviour when the network connection dies?
>   * Restraint server continues execution to the end of the job?
>   * Restraint server finishes current test and blocks until reconnect?
>   * Something else?

Finishing what's left to finish and waiting for the client to reconnect sounds good to me. With a few options to tweak the behavior, e.g. to control timeouts and so on, this would be my choice: when the connection is lost, wait for the client to tell you what to do next - wait patiently, but not forever (unless explicitly told by the user to wait forever...).

> * At the end of the test, the client needs to fetch results from the server.
>   What should the restraint server do if the client disconnects and does not
> reconnect?
>   * Discard results and wait for new test job if client does not reconnect
> within job watchdog timeout period?
>   * Wait forever.  Ignore watchdog and never discard results?
>   * Something else?
> 

Hm, with the wait-until-client-returns approach, the server should keep the results and hand them to the client when possible. Discard them when it's clear the client didn't come back (or never, when asked to wait forever...).

Most of our issues are caused by networking problems, when the connection between server and client is interrupted - the server finishing its tasks, waiting for the client to reconnect, and handing over any results it piled up in the meantime would serve our use case well.


> I'd very much like to get feedback from anyone who uses restraint in
> client/server mode.

Comment 36 Roman Joost 2018-08-21 01:18:51 UTC
The tricky bit with this item is that addressing it properly would involve quite a bit of refactoring on the client side of restraint.

Essentially, the client is very similar to the Ansible CLI. One idea we would like to investigate is to replace the restraint client internals with Ansible. The client would stay the same, but it would use Ansible underneath. How feasible that is, and what the possible gotchas are, we don't know yet.

Next step would be to write up a small design document which we can use to discuss this idea and then perhaps implement it.

I'm currently not aware of any quick fixes we could ship in the meantime.

