Bug 683184 - [Beaker][Harness] External Watchdog killing "finished" recipes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: beah
Version: 0.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Marian Csontos
QA Contact:
URL:
Whiteboard:
Duplicates: 679019
Depends On: 700761
Blocks:
 
Reported: 2011-03-08 19:08 UTC by Jeff Burke
Modified: 2019-05-22 13:40 UTC
CC: 10 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2011-10-04 07:44:57 UTC
Embargoed:



Description Jeff Burke 2011-03-08 19:08:15 UTC
Description of problem:
 When running kernel testing I have noticed recipes not completing properly.

How reproducible:
 Intermittent 

Actual results:
 If you look at the console log: 
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/03/597/59725/121746/console.log
Everything looks like it is fine. It looks like it reported the last test as
finished and it should start running the next test. But it never does.

Recipe Link:
https://beaker.engineering.redhat.com/recipes/121746#task1328306

Expected results:
 Testing should continue as expected.

Additional info:

Comment 1 Marian Csontos 2011-03-14 21:06:12 UTC
Trying to reproduce. I have found (and fixed) one case, but it is different from this one.

Comment 2 Marian Csontos 2011-03-15 12:26:44 UTC
I have run a batch of 10+ jobs but could not reproduce it, and I do not want to monopolize Beaker's capacity for myself.

May I ask you to use the following repo, with an experimental harness build, for your jobs? It should capture some info before the external watchdog strikes:

  http://file.brq.redhat.com/~mcsontos/repos/beah-watchdog/el6

Adding extra repo as in https://beaker.engineering.redhat.com/jobs/62295:

  <repo name="beaker-harness-testing" url="http://file.brq.redhat.com/~mcsontos/repos/beah-watchdog/el6"/>

There are also repos for older RHELs.

Comment 3 Marian Csontos 2011-03-15 12:30:29 UTC
*** Bug 679019 has been marked as a duplicate of this bug. ***

Comment 4 Jeff Burke 2011-03-15 12:33:11 UTC
Marian,
 Had kicked off a job last night. https://beaker.engineering.redhat.com/matrix/?whiteboard_filter=&job_ids=62072 the results look good. I did not see those mid-run failures. The two systems that did warn were another issue.

Thanks,
Jeff

Comment 5 Marian Csontos 2011-03-15 12:53:55 UTC
It may be all right now, but I am afraid it will be back... Shall I lower the priority and spend more effort on getting the watchdog catcher upstream?

Could you use the harness from Comment 2 anyway, please?

Comment 6 Jeff Burke 2011-03-28 18:22:50 UTC
This looks like it may be the same issue. https://beaker.engineering.redhat.com/recipes/137309#task1490916 The only thing that is different is that we are also getting a "blocked" message.

Comment 7 Jeff Burke 2011-03-28 20:07:46 UTC
[-- MARK -- Mon Mar 28 03:25:00 2011] 
logger: 2011-03-28 03:26:22 /usr/bin/rhts-test-runner.sh 9193 8220 hearbeat... 
logger: /usr/bin/rhts-test-runner.sh report_finish start... 
jobsl = [1]+ 54955 Done make run >> /mnt/testarea/TESTOUT.log 2>&1 
status = Done 
no machines specified. exiting 
03/28/11 03:26:56  testID:1490913 finish: 
Nothing to restore. 
logger: /usr/bin/rhts-test-runner.sh report_finish stop... 
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from 
 CPU 01. 
cpu: Processor 1 stopped 
INFO: task beah-rhts-task:54847 blocked for more than 120 seconds. 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
beah-rhts-tas D 0000000000118976     0 54847   1633 0x00000200 
000000000238fcb0 000000000238fcb0 00000000008a5e00 00000000021aa9d8 
       0000000000000001 00000000010f4e00 00000000021aa540 000000000013cbae 
       000000000238fcb0 0000000000000000 00000000021aa540 000000000080ee98 
       00000000008a5e00 00000000021aa9d8 000000001f992040 00000000010f4e00 
       00000000004c7c80 00000000004bd7d6 000000000238fd00 000000000238feb8 
Call Trace: 
([<00000000004bd7d6>] schedule+0x5aa/0xf84) 
 [<0000000000118976>] io_reschedule+0xa/0x12 
 [<0000020000868c5e>] 0x20000868c5e 
[-- MARK -- Mon Mar 28 03:30:00 2011]

Comment 8 Jeff Burke 2011-04-13 12:37:39 UTC
We are still hitting this issue. 
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/04/719/71938/147402/console.log

Any updates or thoughts?

Comment 9 Bill Peck 2011-04-13 13:32:39 UTC
Marian,

What debugging can Jeff turn on to help diagnose this?

Comment 10 Jeff Burke 2011-04-13 13:42:03 UTC
Not to pile on but here is another one. Let me know if there is anything I can do.

http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/04/723/72382/148342/console.log

Comment 11 Marian Csontos 2011-04-13 13:47:18 UTC
I tried to reproduce, but without any success.
Approximately how often does it happen? (I do not know whether I need to run 10, 100 or 1000 jobs.)

Turning the watchdog handler on and enabling additional logging would be helpful. Please add the following to the recipe:

			<ks_appends>
				<ks_append>
<![CDATA[
%post
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-srv
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-beaker-backend

chkconfig --add beah-watchdog-backend || echo "BAD!"
chkconfig --level 345 beah-watchdog-backend on || echo "BAD!"

%end
]]>
				</ks_append>
			</ks_appends>

Comment 12 Jeff Burke 2011-04-13 14:02:03 UTC
Marian,
 How much overhead does this logging add? Can we make it the default for all test runs? That way, if we hit it in any test, we should have the data.

 If not I am going to have to put a hack into the cmd_workflow_kernel.py to add
this ks_appends section to the recipes that the maintainers start.

Thoughts?

Comment 13 Bill Peck 2011-04-13 14:14:13 UTC
Jeff,

I was hoping we could just run one kernel workflow with these debug options.

Does it not happen every time?

Comment 14 Marian Csontos 2011-04-13 14:36:40 UTC
>  How much overhead is this logging? Can we make this the default for all tests
> running? That way if we hit it in any test we should have the data.

I have not measured the overhead. It's a good idea and I will run some tests.

Comment 15 Petr Šplíchal 2011-04-22 13:44:10 UTC
Probably the same problem: the task finished and successfully reported
results, but was still killed by the external watchdog:

https://beaker.engineering.redhat.com/recipes/156394#task1710791

Comment 16 Marian Csontos 2011-05-10 09:13:25 UTC
I have redirected logging to the console to help pinpoint the problem. Jeff, any EWDs recently?

Comment 17 Jeff Burke 2011-05-12 15:02:00 UTC
I have not seen the EWD since the addition of the extra output. Perhaps this is a timing issue: the additional logging is slowing things down enough that we are not hitting it.

I will continue to run/monitor the results. Hopefully we will get some data soon.

Comment 18 Jeff Burke 2011-05-13 12:34:45 UTC
I think it may have happened, but with all the logging the console looks a little different. Please look at the following recipes:
 https://beaker.engineering.redhat.com/recipes/173489#task1885697

Looking at the second one, which may have hit this, doesn't help: the console log is broken.
 https://beaker.engineering.redhat.com/recipes/173499#task1885927

Comment 19 Marian Csontos 2011-05-16 13:06:14 UTC
(In reply to comment #18)
> I think it may have happened, But with all the logging the console looks a
> little different. Please look at the following recipes:
>  https://beaker.engineering.redhat.com/recipes/173489#task1885697
> 
> Looking at the second one that may have hit this doesn't help. The console log
> is broken.
>  https://beaker.engineering.redhat.com/recipes/173499#task1885927

Hi Jeff, thanks for reporting. I looked at it, and from the clues in console.log it seems to be a different issue: Bug 705026.

If you have any other EWDs let me know please.

Comment 20 PaulB 2011-06-14 16:25:41 UTC
All,
Here is an instance of the BeahBug:
 https://beaker.engineering.redhat.com/jobs/97136
 Recipe ID 	R:197953
 T:2145230 	/distribution/kernelinstall
 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/971/97136/197953//console.log
<-snip->
 06/13/11 15:15:31  JobID:97136 Test:/distribution/kernelinstall Response:1 
<-snip->
 Initializing cgroup subsys cpu 
 Linux version 2.6.32-71.32.1.el6.i686.debug  
<-snip->
 /mnt/tests/distribution/kernelinstall /mnt/tests/distribution/kernelinstall 
 yum used greatest stack depth: 4680 bytes left 
 logger: /usr/bin/rhts-test-runner.sh report_finish start... 
 jobsl = [1]+ 1679 Done make run >> /mnt/testarea/TESTOUT.log 2>&1 
 status = Done 
 no machines specified. exiting 
 2011-06-13 15:22:35,361 rhts_task checkin_finish: INFO resetting nohup 
 06/13/11 15:22:35  testID:2145230 finish: 
 Nothing to restore. 
 logger: /usr/bin/rhts-test-runner.sh report_finish stop... 
 2011-06-13 15:22:35,506 rhts_task task_exited: INFO task_exited([Failure   
 instance: Traceback (failure with no frames): <class   
 'twisted.internet.error.ProcessTerminated'>: A process has ended with a  
 probable error condition: process ended by signal 15. 
 ]) 
 2011-06-13 15:22:35,508 rhts_task on_exit: INFO quitting... 
 2011-06-13 15:22:36,511 rhts_task task_ended: INFO task_ended([Failure  
 instance: Traceback (failure with no frames): <class 
 'twisted.internet.error.ProcessTerminated'>: A process has ended with a 
 probable error condition: process ended by signal 15. 
 ]) 
 [-- MARK -- Mon Jun 13 15:25:00 2011] 
 readahead-colle used greatest stack depth: 4572 bytes left 
 readahead-collector: sorting 
 readahead-collector: finished 
 [-- MARK -- Mon Jun 13 15:30:00 2011] 
 [-- MARK -- Mon Jun 13 15:35:00 2011] 
 [-- MARK -- Mon Jun 13 15:40:00 2011] 
 [-- MARK -- Mon Jun 13 15:45:00 2011] 
<-snip->

Best, 
-pbunyan

Comment 21 PaulB 2011-06-25 17:29:36 UTC
All,
Here is an instance of the BeahBug:
The /kernel/misc/amtu test completes successfully, yet the testing does not move on to the next requested test:
https://beaker.engineering.redhat.com/recipes/208448
http://tinyurl.com/69ybxmg
http://tinyurl.com/67d5tne

-pbunyan

Comment 22 Marian Csontos 2011-06-26 14:12:33 UTC
Thanks, Paul. The child process dies and the parent does not wait() for it. I have no idea when or why this happens.

I will check twisted's bug tracker and will try to come up with something to prevent this.
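The diagnosis above (a child that exits while the parent never reaps it) can be illustrated with a small, hypothetical Python sketch. This is not beah code, and it assumes a Linux /proc filesystem: until the parent calls wait()/waitpid(), the exited child lingers as a zombie, and a harness that never notices the exit can sit there until the external watchdog fires.

```python
import subprocess
import time

# Hypothetical illustration (not beah code): spawn a child that exits
# almost immediately, but deliberately do not reap it.
child = subprocess.Popen(["true"])
time.sleep(0.5)  # parent is "busy" elsewhere; no poll()/wait() yet

# On Linux, the third field of /proc/<pid>/stat is the process state;
# an exited-but-unreaped child shows "Z" (zombie).
with open("/proc/%d/stat" % child.pid) as f:
    state = f.read().split()[2]
print(state)

child.wait()  # reaping via waitpid() clears the zombie
```

Running this on Linux prints `Z` before the wait() call; the fix direction discussed later in this bug (timing out instead of waiting forever) amounts to never letting the harness block indefinitely on a child that has already gone away.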

Comment 23 Jeff Burke 2011-06-29 02:04:31 UTC
Marian, 

 Looks like I hit another instance of this:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/1031/103199/211446/console.log

Thanks,
Jeff

Comment 24 PaulB 2011-07-08 14:24:05 UTC
All,
Here is another instance of the BeahBug:
The /distribution/sleep test completes successfully, but testing does not move on to complete the recipe set.

https://beaker.engineering.redhat.com/recipes/221630
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/07/1081/108146/221630//console.log

-pbunyan

Comment 25 Marian Csontos 2011-07-08 14:39:01 UTC
(In reply to comment #24)
> All,
> Here is another instance of the BeahBug:
> The /distribution/sleep test completes successfully, but testing does not move
> on to complete recipeset.
> 
> https://beaker.engineering.redhat.com/recipes/221630
> http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/07/1081/108146/221630//console.log

This is a different issue: task.Stop failed

> 2011-07-07 13:09:30,755 backend handle_Stop: ERROR Task 2410626 done. Stop failed: [Failure instance: Traceback: <class 'xmlrpclib.Fault'>: <Fault 1: 'socket.timeout:timed out'>  

...and if there were other tasks it would go on regardless.

Comment 26 PaulB 2011-07-08 14:51:39 UTC
Marian,
Regarding comments #23 and #24:
Is there a BZ for the "task.Stop failed" issue?

The Recipe should have "finished" and reported success, following the success of the sleep test. 

-pbunyan

Comment 27 Marian Csontos 2011-07-11 06:30:37 UTC
Nope, there is not an open bug for the issue. The closest one is Bug 636093.

The calls are retried in case of certain failures; maybe I will need to add this one to the list.
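The retry behaviour alluded to above can be sketched generically. This is a hypothetical helper, not the actual rhts/beah code; the real retry list, counts, and backoff are not shown in this bug, so `call_with_retry` and its parameters are illustrative only.

```python
import socket
import time

def call_with_retry(fn, retries=3, delay=0.1, retriable=(socket.timeout,)):
    """Retry fn() when it raises one of the 'retriable' exceptions.

    Hypothetical sketch: adding an exception type (e.g. the XML-RPC
    timeout fault seen in comment 25) to 'retriable' is what "adding
    this one to the list" would amount to.
    """
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise  # out of attempts; propagate the last failure
            time.sleep(delay)

# Toy usage: a flaky call that times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.timeout("timed out")
    return "ok"

print(call_with_retry(flaky))  # prints "ok" after two retries
```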

Comment 28 Petr Sklenar 2011-08-02 07:46:37 UTC
Hi,
I hit this issue again [1]
Is there some workaround I could try while this bug exists?

For example, setting a higher time limit for tests...

https://beaker.engineering.redhat.com/recipes/238555#task2592325

Comment 29 Marian Csontos 2011-08-02 11:21:29 UTC
(In reply to comment #28)
> Hi,
> I hit this issue again [1]
> Is there some workaround which I would like to try until bug exists
> 
> for ex setting higher time for tests ...
> 
> https://beaker.engineering.redhat.com/recipes/238555#task2592325

This is not the same issue. See job 116394:

  https://beaker.engineering.redhat.com/jobs/116394

Search for following string in console:

  Red Hat Enterprise Linux Server 6.1

Console: http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/08/1163/116394/239063//console.log

Looks like after the task is finished the run-level is changed, and after a few attempts it breaks. I have no idea what may be causing that. If the problem persists, please open a new BZ.

Comment 30 Jeff Burke 2011-08-09 12:47:47 UTC
I spoke with Marian and he confirmed that we have hit the issue of EWD killing finished tasks on the following recipe:
https://beaker.engineering.redhat.com/recipes/242560

Thanks for looking at this
Jeff

Comment 31 Jeff Burke 2011-08-15 12:10:32 UTC
We have hit the issue again here. 
https://beaker.engineering.redhat.com/recipes/246982#task2675663

Regards,
Jeff

Comment 32 Marian Csontos 2011-08-16 11:12:01 UTC
Let's get this one closed.
So far it looks on track for the next release.

Comment 33 Marian Csontos 2011-08-18 09:53:25 UTC
The patches have been submitted for review and the updated harness package is available on stage.

Jeff, could you insert the following snippet into your tasks to verify that the EWD does not happen?

  <repos>
    <repo name="beaker-harness-testing" url="http://beaker-stage.app.eng.bos.redhat.com/harness/*RedHatEnterpriseLinuxServer5*/"/>
  </repos>

(replace *RedHatEnterpriseLinuxServer5* with major version accordingly)

I know it proves nothing, as the issue is not 100% reproducible, but I tested with a similar scenario and the tasks time out instead of hanging indefinitely and causing an EWD.

Comment 34 Jeff Burke 2011-08-22 19:31:27 UTC
Marian,
 I have run a test https://beaker.engineering.redhat.com/matrix/?whiteboard_filter=&job_ids=122072

 I did not see the signature of the failure, but as you know it doesn't always happen. I will run several more jobs.

Thanks,
Jeff

Comment 36 Petr Šplíchal 2011-09-05 13:08:27 UTC
During our latest tier testing we've observed quite a lot of
killed-in-the-middle recipes. Seems this was happening especially
on s390x architecture. Could it be related to this bug?

    https://beaker.engineering.redhat.com/recipes/261913
    https://beaker.engineering.redhat.com/recipes/262153
    https://beaker.engineering.redhat.com/recipes/262363
    https://beaker.engineering.redhat.com/recipes/262981

Furthermore, my colleagues encountered aborts in the following
recipes:

jscotka: R:263424 R:262895 R:261590 R:261398 R:261060 R:260959...

zmraz: R:263554 R:263555 R:263557 R:263558 R:263615 R:263552
R:263553 R:263415 R:263416 R:263418...

I've rescheduled the job with several s390x recipes with the
testing repo mentioned in comment #33:

    https://beaker.engineering.redhat.com/jobs/127524

Comment 37 Marian Csontos 2011-09-05 17:59:43 UTC
The jobs you submitted were all aborted for a reason unknown to me.

My clone with STABLE RHEL6 distro worked fine:

  https://beaker.engineering.redhat.com/jobs/127555

Comment 38 Raymond Mancy 2011-09-07 01:13:51 UTC
Marian, I'm taking your above comment to mean that it's verified.

Comment 39 Marian Csontos 2011-09-07 07:05:00 UTC
Absence of failure does not prove absence of the bug. See #c33 and #c34. But it seems to have helped.

Comment 40 Dan Callaghan 2011-09-08 02:43:27 UTC
Beaker 0.7.1 has been released.

Comment 41 Petr Sklenar 2011-10-04 07:06:09 UTC
(In reply to comment #40)
> Beaker 0.7.1 has been released.

0.7.1 does not fix the issue:
see: 
https://beaker.engineering.redhat.com/jobs/138884
https://beaker.engineering.redhat.com/jobs/138883
https://beaker.engineering.redhat.com/jobs/138855

Comment 42 Marian Csontos 2011-10-04 07:44:57 UTC
That's a different issue. Not sure what's going on, but every test which was killed by the external watchdog was also killed by the local watchdog.

I see a lot of local watchdogs there too, but those sometimes do recover.

One of the tests may be wrong - it has never worked before and always ends up with an external watchdog:

  https://beaker.engineering.redhat.com/tasks/7352

Comment 43 Marian Csontos 2011-10-04 07:47:33 UTC
Also, on one of the consoles I found the following traceback:

Traceback (most recent call last): 
  File "/usr/bin/rhts-submit-log", line 88, in ? 
    main() 
  File "/usr/bin/rhts-submit-log", line 84, in main 
    report_log(recipetestid, logname, start=start) 
  File "/usr/bin/rhts-submit-log", line 34, in report_log 
    rhts.uploadWrapper(session, logname, recipetestid, start=start) 
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 180, in uploadWrapper 
    if callMethod(session, 'results.uploadFile', recipetestid, name, sz, digest, offset, data): 
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 113, in callMethod 
    return _callMethod(session, name, args, opts) 
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 138, in _callMethod 
    raise RetryError, "reached maximum number of retries, last call failed with: %s" % ''.join(traceback.format_exception_only(*sys.exc_info()[:2])) 
rhts.RetryError: reached maximum number of retries, last call failed with: error: (101, 'Network is unreachable')

