Bug 831527 - Tasks aborted without apparent reason
Product: Beaker
Classification: Community
Component: scheduler
Severity: high
Assigned To: Bill Peck
Depends On:
Reported: 2012-06-13 04:51 EDT by Leonid Zhaldybin
Modified: 2014-11-09 17:38 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2012-06-13 14:33:00 EDT
Type: Bug

Description Leonid Zhaldybin 2012-06-13 04:51:20 EDT
Description of problem:
A number of tasks at the end of the job were aborted without any apparent reason. Beaker reports "External Watchdog Expired" for each of these tasks, but their start times are exactly the same, which to me looks like Beaker killed them without even trying to run any tests.

Version-Release number of selected component (if applicable):
Version - 0.8.2 

How reproducible:
I saw it only once so far
Comment 1 Bill Peck 2012-06-13 14:33:00 EDT

This is what happened:

/distribution/MRG/Messaging/qpid_ptest_cluster_failover_soak was running and it failed to complete in the time allotted.  The local watchdog kicked in first, which means the system reboots and then tries to continue with the next test.

When it booted back up and started running this test:

/distribution/MRG/Messaging/qpid_ptest_cluster_perftest it ended up running out of disk space (I looked at the console log).  I'm betting that once we ran out of disk space everything went south.

That's when the external watchdog kicked in.  The external watchdog is tracked by the lab controller, and it's the end of the line for a recipe.  All we do is abort every remaining task and put the machine back in the pool for the next recipe to run on it.

So, yes, we didn't even try to run those remaining tests, and that's by design.  The system is too broken at that point to do anything more.
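
For readers trying to picture the difference between the two watchdogs described above, here is a minimal Python sketch. It is not Beaker's code: the `Task` and `Recipe` classes, the status strings, the "Local Watchdog Expired" note, the `stamped_at` field, the two placeholder task names, and the timestamp are all assumptions made up for illustration. Only the two qpid task names and the "External Watchdog Expired" message come from this report.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical status strings for this sketch; not Beaker's real schema.
PENDING, RUNNING, FINISHED, ABORTED = "pending", "running", "finished", "aborted"


@dataclass
class Task:
    name: str
    status: str = PENDING
    note: str = ""
    stamped_at: Optional[datetime] = None


@dataclass
class Recipe:
    tasks: List[Task] = field(default_factory=list)

    def local_watchdog_expired(self, index: int) -> None:
        """End only the overrunning task; later tasks remain eligible to run."""
        task = self.tasks[index]
        task.status = FINISHED
        task.note = "Local Watchdog Expired"  # assumed wording

    def external_watchdog_expired(self, now: datetime) -> List[Task]:
        """Abort every unfinished task in one pass and return those aborted."""
        aborted = []
        for task in self.tasks:
            if task.status in (PENDING, RUNNING):
                task.status = ABORTED
                task.note = "External Watchdog Expired"
                task.stamped_at = now  # one shared timestamp for the whole pass
                aborted.append(task)
        return aborted


def demo() -> Recipe:
    """Replay the sequence from this comment on a four-task recipe."""
    recipe = Recipe([
        Task("/distribution/MRG/Messaging/qpid_ptest_cluster_failover_soak"),
        Task("/distribution/MRG/Messaging/qpid_ptest_cluster_perftest"),
        Task("remaining_task_1"),  # placeholder name
        Task("remaining_task_2"),  # placeholder name
    ])
    recipe.tasks[0].status = RUNNING
    recipe.local_watchdog_expired(0)   # soak test overran; recipe moves on
    recipe.tasks[1].status = RUNNING   # perftest starts, then the system breaks
    # Arbitrary timestamp, chosen only so the demo is deterministic.
    recipe.external_watchdog_expired(datetime(2012, 6, 13, tzinfo=timezone.utc))
    return recipe
```

In this sketch the external watchdog stamps every remaining task in a single pass, which would be consistent with the identical times the reporter noticed, though the real scheduler may record them differently.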
