Description of problem:
There are occasional problems with the test watchdog. In some cases the first test's local watchdog fires and then incorrectly cancels all subsequent tests, as in:
https://beaker.engineering.redhat.com/jobs/13281
or:
https://beaker.engineering.redhat.com/recipes/16074#task199644
From the above cases it appears that the watchdog end time is not always properly updated when a test exceeds its maximum time.

Version-Release number of selected component (if applicable):
The issue has been present at least since July 2010.

How reproducible:
Quite hard.

Steps to Reproduce:
1. Repeatedly run two jobs, A and B.
2. A should exceed the maximum time reserved for its run.
3. B then sometimes times out as well, because the watchdog end time was not updated after test A timed out.

Actual results:
Tests following one that timed out may time out with the same watchdog end timestamp.

Expected results:
Every test should have its own watchdog, and when a test starts its watchdog must be properly initialized.

Additional info:
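The expected behaviour above can be sketched as follows. This is a minimal illustration of per-test watchdog initialization, not Beaker's actual harness code; the class and method names are hypothetical:

```python
import time

class LocalWatchdog:
    """Sketch of a per-test local watchdog (hypothetical names).

    The end time is recomputed from the test's own maximum duration
    every time a test starts, so a previous test that timed out cannot
    leak its already-expired end stamp into the next test.
    """

    def __init__(self):
        self.end_time = None

    def start_test(self, max_duration_seconds):
        # Re-initialize on every test start; never reuse the old stamp.
        self.end_time = time.time() + max_duration_seconds

    def expired(self):
        return self.end_time is not None and time.time() >= self.end_time


wd = LocalWatchdog()
wd.start_test(3 * 60 * 60)   # 3-hour budget, as for the qpid test below
assert not wd.expired()      # freshly started, must not fire immediately
```

The key point is that `start_test` overwrites `end_time` unconditionally; the reported bug looks like the overwrite sometimes does not happen.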
An update to the issue above. There is one more problem in the test-harness watchdog which leads to the situation described above. The test 'distribution/MRG/Messaging/qpid_ptest_msg_throttling' (info: https://beaker.engineering.redhat.com/tasks/105) has its maximum duration set to 3 hours, but as you can see here:
https://beaker.engineering.redhat.com/logs/2010/81/13281/25112/315766///TESTOUT.log
the test started at 19:37:32 and timed out at 19:46:35. That does not reflect the 3-hour maximum duration; i.e. Beaker's test harness fires the watchdog much sooner than expected. After this watchdog action, all subsequent tests failed with the same timestamp as described above.
There were some Beaker outages recently and this may be the result of one of them. In any case the test did finish, and it is the external watchdog that killed the job after 4+ hours; it was killed because it did not manage to upload a 300 MB+ file. Are we talking about the same thing?
Could you please confirm this is a real issue in the test harness and not in the test? Is the huge file (actually "only" 281 MiB were uploaded) really necessary? Could it be compressed? Could the repeated lines be collapsed into one?
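For illustration, both suggestions above are cheap to implement on the test side. This is a sketch, not part of Beaker; the function names are made up:

```python
import gzip
from itertools import groupby

def collapse_repeats(lines):
    """Collapse runs of identical log lines into a 'line (xN)' form,
    similar in spirit to `uniq -c`."""
    out = []
    for line, run in groupby(lines):
        n = len(list(run))
        out.append(line if n == 1 else "%s (x%d)" % (line, n))
    return out

def write_compressed(path, lines):
    # gzip typically shrinks highly repetitive test logs dramatically
    with gzip.open(path, "wt") as f:
        f.write("\n".join(lines) + "\n")

print(collapse_repeats(["msg sent", "msg sent", "msg sent", "done"]))
# -> ['msg sent (x3)', 'done']
```

Either step alone would likely bring a throttling-test log well under the upload limit being discussed.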
Bill, do we have any quotas for stored files? (640 KiB ought to be enough for anyone.)
As per our discussion, Marian, we will be limiting uploads.
And for the record, it is Bug 629025.
I am closing this as not a bug; if you disagree strongly enough, comment and reopen.

== Rationale ==
Failing to upload a 300 MiB file is not an issue we are going to solve; we will be limiting file size anyway. Use compression or limit the size of the file.

=== Re: Comment 1 ===
I think I understand now. The log shows the task was running for some 16 minutes, but it was killed only after 4 hours and 16 minutes, which is evidence the task had finished: it is the harness that extends the watchdog by 4 hours after a task finishes so the logs can be uploaded. This was added to help fight server performance issues.

=== Re: https://beaker.engineering.redhat.com/recipes/16074#task199644 ===
This was caused by server performance issues and a buggy RPC repeater; see Bug 618123 - Many duplicate test results reported.
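The extend-after-finish behaviour referred to above can be sketched like this. The class, field names, and constant name are hypothetical; only the 4-hour figure comes from the comment:

```python
UPLOAD_GRACE = 4 * 60 * 60  # 4-hour extension, per the comment above

class WatchdogTimes:
    """Sketch of a watchdog that is extended once the task finishes,
    so that log upload gets its own time budget."""

    def __init__(self, start, max_duration):
        self.start = start
        self.end_time = start + max_duration

    def task_finished(self, now):
        # The harness pushes the end time out so log upload can complete;
        # only if the upload overruns this grace period does the
        # (external) watchdog kill the job.
        self.end_time = now + UPLOAD_GRACE


w = WatchdogTimes(start=0, max_duration=3 * 60 * 60)
w.task_finished(now=16 * 60)   # task finished after ~16 minutes
# job is killable only after 16 min + 4 h, matching the observed 4h16m
```

Under this model, a kill at 4 hours 16 minutes implies the task itself completed at the 16-minute mark, which is the inference drawn in the rationale.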
OK, if you see an upload of a ~300 MB file there, then something is wrong with the test; I can agree on that. If the case happened due to outage[s] or enormous load on the Beaker server, that is also fine by me, because I will not be facing that regularly. I am not sure my case falls into the baskets described above, but let's keep your proposed way. I am keeping CLOSED/NOTABUG and watching the results for regressions.
Just an update: after discussing the issue, I fully agree with Marian's view. The source of the ~300 MB file was found, and there is an ongoing task for me to make sure our tests are not uploading such huge amounts of data.