Bug 626331 - Test harness' watchdog does not behave correctly
Summary: Test harness' watchdog does not behave correctly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Beaker
Classification: Retired
Component: beah
Version: 0.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Marian Csontos
QA Contact:
URL:
Whiteboard:
Depends On: 467486 629025
Blocks:
 
Reported: 2010-08-23 08:27 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-09-02 07:59:14 UTC
Embargoed:



Description Frantisek Reznicek 2010-08-23 08:27:58 UTC
Description of problem:

There are occasional problems with the test watchdog.
There are cases where the local watchdog of the first test is triggered and then incorrectly cancels all subsequent tests, as in:
https://beaker.engineering.redhat.com/jobs/13281
or in case:
https://beaker.engineering.redhat.com/recipes/16074#task199644

From the cases above it appears that the watchdog end time is not always properly updated when a test exceeds its maximum time.

Version-Release number of selected component (if applicable):

The issue has persisted since at least July 2010.


How reproducible:
Quite hard to reproduce.

Steps to Reproduce:
1. Repeatedly run two jobs, A and B.
2. A should exceed the maximum time reserved for its run.
3. B then sometimes times out as well, because the watchdog end time was not updated after A's timed-out run.
  
Actual results:
Tests that follow one which timed out may themselves time out with the same watchdog end timestamp.

Expected results:
Every test should have its own watchdog, and the watchdog needs to be properly initialized when a test is started.
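
The expected behaviour can be sketched roughly as below. This is only an illustration and not beah's actual code: the `TaskWatchdog` class and its method names are hypothetical, and the clock is injected so that the example runs deterministically.

```python
import time


class TaskWatchdog:
    """Hypothetical per-task local watchdog (not beah's real implementation)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadline = None  # no task running yet

    def start_task(self, max_seconds):
        # Re-initialise the deadline from the current time on every task start,
        # so a stale deadline from a previous task is never inherited.
        self._deadline = self._clock() + max_seconds

    def expired(self):
        return self._deadline is not None and self._clock() >= self._deadline


# Simulated clock so the example runs instantly.
now = [0.0]
wd = TaskWatchdog(clock=lambda: now[0])

wd.start_task(max_seconds=60)   # task A, 60 s limit
now[0] = 61.0
a_expired = wd.expired()        # A has exceeded its limit

wd.start_task(max_seconds=60)   # task B gets a fresh deadline
b_expired = wd.expired()        # B has only just started
print(a_expired, b_expired)
```

In this sketch A is reported as expired while B is not, because B's deadline is computed from B's own start time.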

Additional info:

Comment 1 Frantisek Reznicek 2010-08-23 10:39:28 UTC
An update to the issue above.

There is one more issue in the test harness watchdog which leads to the situation described above.

Test 'distribution/MRG/Messaging/qpid_ptest_msg_throttling' info:
  https://beaker.engineering.redhat.com/tasks/105

has its maximum test duration set to 3 hours, but as you can see here:
https://beaker.engineering.redhat.com/logs/2010/81/13281/25112/315766///TESTOUT.log
the test started at: 19:37:32
and timed out at: 19:46:35

This does not match the test's maximum duration of 3 hours.
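
For reference, the gap between the two timestamps can be checked with a few lines of Python. This is a quick sketch that assumes both timestamps fall on the same day; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
started = datetime.strptime("19:37:32", FMT)
timed_out = datetime.strptime("19:46:35", FMT)

elapsed = timed_out - started          # assumes same-day timestamps
max_duration = timedelta(hours=3)      # limit configured for the task

print(elapsed)                 # 0:09:03
print(elapsed < max_duration)  # True
```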

i.e.

Beaker's test harness triggers the watchdog much sooner than expected.

After this watchdog action, all subsequent tests failed with the same timestamp, as described above.

Comment 2 Marian Csontos 2010-08-23 13:40:18 UTC
There were some Beaker outages recently and this may be the result of one of them.

Anyway, the test had finished, and it was the External Watchdog that killed the job after 4+ hours. It was killed because it did not manage to upload a file of some 300MB+.

Are we talking about the same thing?

Comment 3 Marian Csontos 2010-08-31 03:53:49 UTC
Could you please confirm that this is a real issue in the test harness and not in the test?

Is the huge file (actually "only" 281MiB were uploaded) really necessary?
Could it be compressed?
Could the repeating lines be collapsed into one?
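
As one possible way to collapse repeating lines, here is a minimal sketch using `itertools.groupby`. The `collapse_repeats` helper and the `[repeated N times]` marker are made up for illustration; they are not part of the test or the harness.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Yield each run of identical consecutive lines once, with a repeat count."""
    for line, run in groupby(lines):
        count = sum(1 for _ in run)
        yield line if count == 1 else f"{line} [repeated {count} times]"


sample = ["connect", "retry", "retry", "retry", "done"]
collapsed = list(collapse_repeats(sample))
print(collapsed)
```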

Comment 4 Marian Csontos 2010-08-31 04:46:16 UTC
Bill, do we have any quotas for stored files? (640KiB ought to be enough for anyone.)

Comment 5 Bill Peck 2010-08-31 20:57:42 UTC
As per our discussion, Marian, we will be limiting uploads.

Comment 6 Marian Csontos 2010-09-01 04:48:56 UTC
And for the record, it is Bug 629025.

Comment 7 Marian Csontos 2010-09-02 07:59:14 UTC
I am closing this as not a bug. If you disagree strongly enough, comment and reopen.

== Rationale ==

Failing to upload a 300MiB file is not an issue we are going to solve, as we will be limiting file size anyway. Use compression or limit the size of the file.
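
A minimal sketch of compressing a log before it is uploaded, using only the Python standard library. The `compress_log` helper is hypothetical, and how the compressed file is then submitted to Beaker is not shown here.

```python
import gzip
import shutil


def compress_log(src_path, dst_path=None):
    """Gzip src_path and return the path of the compressed copy."""
    dst_path = dst_path or src_path + ".gz"
    # Stream the file in chunks so large logs are not loaded into memory.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Logs made up mostly of repeating lines tend to compress very well, so this alone may bring a file of this kind down to a manageable size.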

=== Re: Comment 1 ===

I think I understand now. The log shows the task was running for some 16 minutes.
But it was killed only after 4 hours and 16 minutes, which is evidence that the task had finished: it is the harness that extends the watchdog by 4 hours after a task finishes, so that the logs can be uploaded. This was added to help cope with server performance issues.
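
The arithmetic behind this can be written out as follows. The figures are the approximate ones quoted in this comment, and the variable names are illustrative only.

```python
from datetime import timedelta

task_runtime = timedelta(minutes=16)    # "some 16 minutes", per the log
upload_extension = timedelta(hours=4)   # extension applied after the task finishes

# The job is killed once the extended deadline passes.
killed_after = task_runtime + upload_extension
print(killed_after)  # 4:16:00
```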

=== Re: https://beaker.engineering.redhat.com/recipes/16074#task199644 ===

This was caused by server performance issues and a buggy RPC repeater; see Bug 618123 - Many duplicate test results reported.

Comment 8 Frantisek Reznicek 2010-09-06 07:25:18 UTC
OK, if you see an upload of a ~300MB file there, then something must be wrong with the test; I can agree with that.

If it happened due to outage[s] or an enormous load on the Beaker server, that is also fine with me, because I will not be facing that regularly.

I'm not sure that my case falls into the baskets described above, but let's go with your proposal. I'm keeping CLOSED/NOTABUG and watching the results for regressions.

Comment 9 Frantisek Reznicek 2010-09-06 12:23:09 UTC
Just an update:
after discussing the issue, I fully agree with Marian's view.

The ~300MB broker file was found, and I have an ongoing task to make sure our tests do not upload such a huge amount of data.

