Description of problem:
When running kernel testing I have noticed recipes not completing properly.

How reproducible:
Intermittent

Actual results:
If you look at the console log:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/03/597/59725/121746/console.log
everything looks fine: the last test is reported as finished and the next test should start running, but it never does.

Recipe link:
https://beaker.engineering.redhat.com/recipes/121746#task1328306

Expected results:
Testing should continue as expected.

Additional info:
I am trying to reproduce this. I have caught (and fixed) one case, but it is different from this one.
I have run a batch of 10+ jobs but could not reproduce it, and I do not want to tie up all of Beaker's capacity for myself. May I ask you to use the following repo, which carries an experimental harness build, for your jobs? It should capture some information before the external watchdog strikes:

http://file.brq.redhat.com/~mcsontos/repos/beah-watchdog/el6

Add the extra repo as in https://beaker.engineering.redhat.com/jobs/62295:

<repo name="beaker-harness-testing"
      url="http://file.brq.redhat.com/~mcsontos/repos/beah-watchdog/el6"/>

There are also repos for older RHELs.
*** Bug 679019 has been marked as a duplicate of this bug. ***
Marian,

I kicked off a job last night:
https://beaker.engineering.redhat.com/matrix/?whiteboard_filter=&job_ids=62072

The results look good. I did not see those mid-run failures. The two systems that did warn were another issue.

Thanks,
Jeff
It may be all right now, but I am afraid it will be back... Shall I lower the priority and spend more effort on getting the watchdog catcher upstream? Could you use the harness from Comment 2 anyway, please?
This looks like it may be the same issue:
https://beaker.engineering.redhat.com/recipes/137309#task1490916

The only difference is that we are also getting a blocked-task message.
[-- MARK -- Mon Mar 28 03:25:00 2011]
logger: 2011-03-28 03:26:22 /usr/bin/rhts-test-runner.sh 9193 8220 hearbeat...
logger: /usr/bin/rhts-test-runner.sh report_finish start...
jobsl = [1]+ 54955 Done    make run >> /mnt/testarea/TESTOUT.log 2>&1
status = Done
no machines specified. exiting
03/28/11 03:26:56 testID:1490913 finish: Nothing to restore.
logger: /usr/bin/rhts-test-runner.sh report_finish stop...
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01.
cpu: Processor 1 stopped
INFO: task beah-rhts-task:54847 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
beah-rhts-tas D 0000000000118976     0 54847   1633 0x00000200
000000000238fcb0 000000000238fcb0 00000000008a5e00 00000000021aa9d8
0000000000000001 00000000010f4e00 00000000021aa540 000000000013cbae
000000000238fcb0 0000000000000000 00000000021aa540 000000000080ee98
00000000008a5e00 00000000021aa9d8 000000001f992040 00000000010f4e00
00000000004c7c80 00000000004bd7d6 000000000238fd00 000000000238feb8
Call Trace:
([<00000000004bd7d6>] schedule+0x5aa/0xf84)
 [<0000000000118976>] io_reschedule+0xa/0x12
 [<0000020000868c5e>] 0x20000868c5e
[-- MARK -- Mon Mar 28 03:30:00 2011]
We are still hitting this issue. http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/04/719/71938/147402/console.log Any updates or thoughts?
Marian, What debugging can Jeff turn on to help diagnose this?
Not to pile on, but here is another one. Let me know if there is anything I can do.
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/04/723/72382/148342/console.log
I tried to reproduce this, but without any success. Approximately how often does it happen? (I do not know whether I need to run 10, 100 or 1000 jobs.)

Turning the watchdog handler on and enabling additional logging would be helpful. Please add the following to the recipe:

<ks_appends>
  <ks_append>
<![CDATA[
%post
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-srv
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-beaker-backend
chkconfig --add beah-watchdog-backend || echo "BAD!"
chkconfig --level 345 beah-watchdog-backend on || echo "BAD!"
%end
]]>
  </ks_append>
</ks_appends>
Marian,

How much overhead does this logging add? Can we make it the default for all test runs? That way, if we hit the problem in any test, we will have the data. If not, I am going to have to put a hack into cmd_workflow_kernel.py to add this ks_appends section to the recipes that the maintainers start. Thoughts?
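Roughly what I have in mind is something like the sketch below - only a sketch, assuming the workflow command has the job XML as a string before submitting it; the helper name and the XML handling are illustrative, not the real cmd_workflow_kernel.py code:

# Sketch: add the debug <ks_appends> snippet to every recipe in the job XML
# before submission. Purely illustrative; not the actual workflow code.
import xml.etree.ElementTree as ET

DEBUG_KS = """%post
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-srv
echo 'OPTIONS="$OPTIONS -vv"' >> /etc/sysconfig/beah-beaker-backend
chkconfig --add beah-watchdog-backend || echo "BAD!"
chkconfig --level 345 beah-watchdog-backend on || echo "BAD!"
%end
"""

def add_debug_ks_append(job_xml):
    """Return the job XML with the beah debug snippet added to each recipe."""
    job = ET.fromstring(job_xml)
    for recipe in job.iter('recipe'):
        ks_appends = recipe.find('ks_appends')
        if ks_appends is None:
            ks_appends = ET.SubElement(recipe, 'ks_appends')
        ET.SubElement(ks_appends, 'ks_append').text = DEBUG_KS
    return ET.tostring(job)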
Jeff, I was hoping we could just run one kernel workflow with these debug options. Does it not happen every time?
> How much overhead is this logging? Can we make this the default for all tests
> running? That way if we hit it in any test we should have the data.

I have not measured the overhead. It's a good idea and I will run some tests.
Probably the same problem: the task finished and successfully reported results, but it was still killed by the external watchdog:
https://beaker.engineering.redhat.com/recipes/156394#task1710791
I have redirected logging to the console to allow pinpointing the problem. Jeff, any EWDs recently?
I have not seen the EWD since the addition of the extra output. Perhaps this is a timing issue: the additional logging may be slowing things down enough that we are not hitting it. I will continue to run and monitor the results. Hopefully we will get some data soon.
I think it may have happened, but with all the logging the console looks a little different. Please look at the following recipe:
https://beaker.engineering.redhat.com/recipes/173489#task1885697

Looking at the second one that may have hit this does not help; the console log is broken:
https://beaker.engineering.redhat.com/recipes/173499#task1885927
(In reply to comment #18)
> I think it may have happened, but with all the logging the console looks a
> little different. Please look at the following recipe:
> https://beaker.engineering.redhat.com/recipes/173489#task1885697
>
> Looking at the second one that may have hit this does not help; the console
> log is broken:
> https://beaker.engineering.redhat.com/recipes/173499#task1885927

Hi Jeff, thanks for reporting. I looked at it, and from the clues in console.log it seems to be a different issue - Bug 705026. If you hit any other EWDs, please let me know.
All,

Here is an instance of the BeahBug:

https://beaker.engineering.redhat.com/jobs/97136
Recipe ID R:197953 T:2145230 /distribution/kernelinstall
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/971/97136/197953//console.log

<-snip->
06/13/11 15:15:31 JobID:97136 Test:/distribution/kernelinstall Response:1
<-snip->
Initializing cgroup subsys cpu
Linux version 2.6.32-71.32.1.el6.i686.debug
<-snip->
/mnt/tests/distribution/kernelinstall
/mnt/tests/distribution/kernelinstall
yum used greatest stack depth: 4680 bytes left
logger: /usr/bin/rhts-test-runner.sh report_finish start...
jobsl = [1]+ 1679 Done    make run >> /mnt/testarea/TESTOUT.log 2>&1
status = Done
no machines specified. exiting
2011-06-13 15:22:35,361 rhts_task checkin_finish: INFO resetting nohup
06/13/11 15:22:35 testID:2145230 finish: Nothing to restore.
logger: /usr/bin/rhts-test-runner.sh report_finish stop...
2011-06-13 15:22:35,506 rhts_task task_exited: INFO task_exited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 15.
])
2011-06-13 15:22:35,508 rhts_task on_exit: INFO quitting...
2011-06-13 15:22:36,511 rhts_task task_ended: INFO task_ended([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 15.
])
[-- MARK -- Mon Jun 13 15:25:00 2011]
readahead-colle used greatest stack depth: 4572 bytes left
readahead-collector: sorting
readahead-collector: finished
[-- MARK -- Mon Jun 13 15:30:00 2011]
[-- MARK -- Mon Jun 13 15:35:00 2011]
[-- MARK -- Mon Jun 13 15:40:00 2011]
[-- MARK -- Mon Jun 13 15:45:00 2011]
<-snip->

Best,
-pbunyan
All,

Here is an instance of the BeahBug: the /kernel/misc/amtu test completes successfully, yet testing does not move on to the next requested test:

https://beaker.engineering.redhat.com/recipes/208448
http://tinyurl.com/69ybxmg
http://tinyurl.com/67d5tne

-pbunyan
Thanks, Paul. The child process dies and the parent does not "wait" for it. I have no idea when or why this happens. I will check Twisted's bug tracker and will try to come up with something to prevent this.
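For reference, this is roughly how a Twisted-based harness learns that a task process has finished - a minimal sketch only, not beah's actual code; the class name and the spawned command are illustrative. The point is that the "ended" callback only fires once the reactor has reaped (wait()ed on) the child; if that never happens, the harness just sits there until the external watchdog strikes.

# Minimal sketch of Twisted process handling; illustrative names, not beah code.
from twisted.internet import protocol, reactor

class TaskProtocol(protocol.ProcessProtocol):
    def outReceived(self, data):
        # Task output arrives here while the child is running.
        print "task output: %r" % data

    def processEnded(self, reason):
        # Fires only after the reactor has reaped the child (SIGCHLD + waitpid).
        # If the child's exit is never noticed, this callback never runs.
        print "task ended: %s" % reason.value
        reactor.stop()

reactor.spawnProcess(TaskProtocol(), "/bin/sleep", ["sleep", "2"])
reactor.run()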
Marian,

Looks like I hit another instance of this:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/1031/103199/211446/console.log

Thanks,
Jeff
All,

Here is another instance of the BeahBug: the /distribution/sleep test completes successfully, but testing does not move on to complete the recipe set.

https://beaker.engineering.redhat.com/recipes/221630
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/07/1081/108146/221630//console.log

-pbunyan
(In reply to comment #24)
> All,
>
> Here is another instance of the BeahBug: the /distribution/sleep test
> completes successfully, but testing does not move on to complete the recipe
> set.
>
> https://beaker.engineering.redhat.com/recipes/221630
> http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/07/1081/108146/221630//console.log

This is a different issue: the task.Stop call failed.

> 2011-07-07 13:09:30,755 backend handle_Stop: ERROR Task 2410626 done. Stop failed: [Failure instance: Traceback: <class 'xmlrpclib.Fault'>: <Fault 1: 'socket.timeout:timed out'>

...and if there were other tasks, the run would go on regardless.
Marian,

Regarding comments #23 and #24: is there a BZ for the "task.Stop failed" issue? The recipe should have "finished" and reported success, following the success of the sleep test.

-pbunyan
Nope, there is not an open bug for that issue. The closest one is Bug 636093. The calls are repeated in case of some failures - maybe I will need to add this one to the list.
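For context, the retry idea is roughly the sketch below - a sketch only; the server URL, method name and arguments are placeholders, not Beaker's actual XML-RPC API:

# Sketch: repeat a transient-failing XML-RPC call (e.g. the Stop call) a few
# times before giving up. Placeholder names; not the beah/Beaker implementation.
import socket
import time
import xmlrpclib

def call_with_retries(method, args, retries=5, delay=30):
    for attempt in range(1, retries + 1):
        try:
            return method(*args)
        except (socket.timeout, socket.error), exc:
            print "attempt %d failed: %s" % (attempt, exc)
            if attempt == retries:
                raise
            time.sleep(delay)

server = xmlrpclib.ServerProxy('https://beaker.example.com/RPC2')
call_with_retries(server.task.Stop, ('2410626',))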
Hi,
I hit this issue again [1]. Is there a workaround I could try while the bug still exists - for example, setting a higher watchdog time for the tests?

[1] https://beaker.engineering.redhat.com/recipes/238555#task2592325
(In reply to comment #28)
> Hi,
> I hit this issue again [1]. Is there a workaround I could try while the bug
> still exists - for example, setting a higher watchdog time for the tests?
>
> [1] https://beaker.engineering.redhat.com/recipes/238555#task2592325

This is not the same issue. See job 116394:
https://beaker.engineering.redhat.com/jobs/116394

Search for the following string in the console:

Red Hat Enterprise Linux Server 6.1

Console: http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/08/1163/116394/239063//console.log

It looks like the run-level is changed after the task finishes, and after a few attempts it breaks. I have no idea what may be causing that. If the problem persists, please open a new BZ.
I spoke with Marian and he confirmed that we have hit the issue of EWD killing finished tasks on the following recipe: https://beaker.engineering.redhat.com/recipes/242560 Thanks for looking at this Jeff
We have hit the issue again here:
https://beaker.engineering.redhat.com/recipes/246982#task2675663

Regards,
Jeff
Let's get this one closed. So far it looks on track for the next release.
The patches have been submitted for review and the updated harness package is available on stage.

Jeff, could you insert the following snippet into your tasks to check that the EWD does not happen?

<repos>
  <repo name="beaker-harness-testing"
        url="http://beaker-stage.app.eng.bos.redhat.com/harness/*RedHatEnterpriseLinuxServer5*/"/>
</repos>

(Replace *RedHatEnterpriseLinuxServer5* with the appropriate major version.)

I know it proves nothing, as the issue is not 100% reproducible, but I tested a similar scenario and the tasks now time out instead of hanging indefinitely and causing an EWD.
Marian,

I have run a test:
https://beaker.engineering.redhat.com/matrix/?whiteboard_filter=&job_ids=122072

I did not see the signature of the failure, but as you know it doesn't always happen. I will run several more jobs.

Thanks,
Jeff
During our latest tier testing we've observed quite a lot of killed-in-the-middle recipes. It seems this was happening especially on the s390x architecture. Could it be related to this bug?

https://beaker.engineering.redhat.com/recipes/261913
https://beaker.engineering.redhat.com/recipes/262153
https://beaker.engineering.redhat.com/recipes/262363
https://beaker.engineering.redhat.com/recipes/262981

Furthermore, my colleagues encountered aborts in the following recipes:
jscotka: R:263424 R:262895 R:261590 R:261398 R:261060 R:260959...
zmraz: R:263554 R:263555 R:263557 R:263558 R:263615 R:263552 R:263553 R:263415 R:263416 R:263418...

I've rescheduled the job with several s390x recipes using the testing repo mentioned in comment #33:
https://beaker.engineering.redhat.com/jobs/127524
The jobs you submitted were all aborted for a reason unknown to me. My clone with the STABLE RHEL6 distro worked fine:
https://beaker.engineering.redhat.com/jobs/127555
Marian, I'm taking your above comment to mean that it's verified.
Absence of failure does not prove absence of the bug - see #c33 and #c34 - but it does seem to have helped.
Beaker 0.7.1 has been released.
(In reply to comment #40)
> Beaker 0.7.1 has been released.

0.7.1 does not fix the issue; see:
https://beaker.engineering.redhat.com/jobs/138884
https://beaker.engineering.redhat.com/jobs/138883
https://beaker.engineering.redhat.com/jobs/138855
That's a different issue. I am not sure what's going on, but every test that was killed by the external watchdog was also killed by the local watchdog. I see a lot of local watchdogs there too, but those sometimes do recover. One of the tests may be at fault - it has never worked before and always ends up with an external watchdog:
https://beaker.engineering.redhat.com/tasks/7352
And also on one of the consoles I found the following traceback:

Traceback (most recent call last):
  File "/usr/bin/rhts-submit-log", line 88, in ?
    main()
  File "/usr/bin/rhts-submit-log", line 84, in main
    report_log(recipetestid, logname, start=start)
  File "/usr/bin/rhts-submit-log", line 34, in report_log
    rhts.uploadWrapper(session, logname, recipetestid, start=start)
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 180, in uploadWrapper
    if callMethod(session, 'results.uploadFile', recipetestid, name, sz, digest, offset, data):
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 113, in callMethod
    return _callMethod(session, name, args, opts)
  File "/usr/lib/python2.4/site-packages/rhts/__init__.py", line 138, in _callMethod
    raise RetryError, "reached maximum number of retries, last call failed with: %s" % ''.join(traceback.format_exception_only(*sys.exc_info()[:2]))
rhts.RetryError: reached maximum number of retries, last call failed with: error: (101, 'Network is unreachable')