831527 – Tasks aborted without apparent reason

Summary: Tasks aborted without apparent reason

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	scheduler
Sub Component:
Version:	0.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Bill Peck
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-06-13 08:51 UTC by Leonid Zhaldybin
Modified:	2014-11-09 22:38 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-06-13 18:33:00 UTC
Embargoed:

Description Leonid Zhaldybin 2012-06-13 08:51:20 UTC

Description of problem:
A number of tasks at the end of the job https://beaker.engineering.redhat.com/jobs/240020 were aborted without any apparent reason. Beaker reports "External Watchdog Expired" for each of these tasks, but their start times are exactly the same, which to me looks like Beaker killed them without even trying to run any tests.

Version-Release number of selected component (if applicable):
Version - 0.8.2 

How reproducible:
I saw it only once so far

Comment 1 Bill Peck 2012-06-13 18:33:00 UTC

Hello,

This is what happened:

/distribution/MRG/Messaging/qpid_ptest_cluster_failover_soak was running and it failed to complete in the time allotted.  The local watchdog kicked in first, which means the system reboots and then tries to continue with the next test.

When it booted back up and started running this test:

/distribution/MRG/Messaging/qpid_ptest_cluster_perftest it ended up running out of disk space (I looked at the console log).  I'm betting that once we ran out of disk space everything went south.

That's when the external watchdog kicked in.  The external watchdog is tracked by the lab controller, and it's the end of the line for a recipe.  All we do is abort every remaining task and put the machine back in the pool for the next recipe to run on it.

So yes, we didn't even try to run those remaining tests, and that's by design.  The system is too broken at that point to do anything more.
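
The abort-everything behaviour described above can be illustrated with a minimal sketch. This is not Beaker's code: the Task class, its fields, and external_watchdog_expired() are hypothetical names chosen for illustration. It also assumes that a task which never ran is stamped with the abort time as its start time, which would be consistent with the identical start times in the report but is not confirmed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for one task in a recipe (not Beaker's model)."""
    name: str
    status: str = "New"  # one of: New, Running, Completed, Aborted
    start_time: Optional[float] = None
    finish_time: Optional[float] = None
    message: str = ""


FINISHED = ("Completed", "Aborted")


def external_watchdog_expired(tasks: List[Task], now: float) -> List[str]:
    """Abort every unfinished task and return the names that were aborted."""
    aborted = []
    for task in tasks:
        if task.status in FINISHED:
            continue  # already-finished tasks keep their results
        if task.start_time is None:
            # Assumption: a task that never ran gets the abort time as its start.
            task.start_time = now
        task.finish_time = now
        task.status = "Aborted"
        task.message = "External Watchdog Expired"
        aborted.append(task.name)
    return aborted


if __name__ == "__main__":
    # Illustrative recipe loosely modelled on the job in this report.
    recipe = [
        Task("soak", status="Aborted", start_time=0.0, finish_time=100.0),
        Task("perftest", status="Running", start_time=110.0),
        Task("later_test_1"),
        Task("later_test_2"),
    ]
    for name in external_watchdog_expired(recipe, now=500.0):
        print("aborted:", name)
```

In this sketch the running task and every task queued behind it are aborted in a single pass, so none of the remaining tests is attempted, and the never-started tasks all end up with the same timestamp.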
