Bug 839601

Summary: [Beaker][BEAH] Harness recipes not completing as expected.
Product: [Retired] Beaker Reporter: Jeff Burke <jburke>
Component: beahAssignee: beaker-dev-list
Status: CLOSED INSUFFICIENT_DATA QA Contact: tools-bugs <tools-bugs>
Severity: high Docs Contact:
Priority: unspecified    
Version: 0.9CC: azelinka, bpeck, dcallagh, dowang, ebaak, jburke, jhutar, jpazdziora, jscotka, jstancek, mcsontos, pbunyan, rjoost, stl, tools-bugs, xtian, yshao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: Misc
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-05 13:05:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 545868    

Description Jeff Burke 2012-07-12 12:16:43 UTC
Description of problem:
 When running kernel testing I have noticed recipes not completing as we expect.
Lookign at the tasks logs the system is running normally. It completes the current
task. It should move on to the next task but it does not. The system gets a EWD

Version-Release number of selected component (if applicable):
0.9.0 (But this issue has been around since before the 0.9.0)

How reproducible:
Intermittent

Actual results:
 From serial console log:

07/11/12 14:35:48  JobID:260045 Test:/kernel/tracepoints/operational Response:1  
2012-07-11 14:35:48,086 backend async_proc: INFO Extending Watchdog for task 6849856 by 10800..  
logger: /usr/bin/rhts-test-runner.sh rhts-test-update 127.0.0.1:7086 6849856 start  rh-tests-kernel-tracepoints-operational  
2012-07-11 14:35:48,149 rhts_task checkin_start: INFO setting nohup  
07/11/12 14:35:48  testID:6849856 start:  
/mnt/tests/kernel/tracepoints/operational /mnt/tests/kernel/tracepoints/operational  
2012-07-11 14:35:48,217 backend _send_cmd: INFO Command ['Command', 'forward', '72b4b606-df2e-43f3-9ebe-1f457b859bb4', {'event': ['Event', 'extend_watchdog', '61781720-4910-4e37-bf7a-e93deafa3160', {'source': 'beah_beaker_backend', 'id': '0b1bdd73-cae0-4677-9b9f-a1c308caaff7'}, None, {'timeout': 10794}]}] sent.  
logger: 2012-07-11 14:36:04 /usr/bin/rhts-test-runner.sh 11924 2940 hearbeat...  
logger: 2012-07-11 14:37:03 /usr/bin/rhts-test-runner.sh 11924 3000 hearbeat...  
logger: 2012-07-11 14:39:03 /usr/bin/rhts-test-runner.sh 11924 3120 hearbeat...  
[-- MARK -- Wed Jul 11 14:40:00 2012] 
logger: 2012-07-11 14:40:04 /usr/bin/rhts-test-runner.sh 11924 3180 hearbeat...  
logger: 2012-07-11 14:41:03 /usr/bin/rhts-test-runner.sh 11924 3240 hearbeat...  
logger: 2012-07-11 14:42:04 /usr/bin/rhts-test-runner.sh 11924 3300 hearbeat...  
logger: 2012-07-11 14:43:03 /usr/bin/rhts-test-runner.sh 11924 3360 hearbeat...  
logger: 2012-07-11 14:44:03 /usr/bin/rhts-test-runner.sh 11924 3420 hearbeat...  
[-- MARK -- Wed Jul 11 14:45:00 2012] 
logger: 2012-07-11 14:45:03 /usr/bin/rhts-test-runner.sh 11924 3480 hearbeat...  
logger: 2012-07-11 14:46:03 /usr/bin/rhts-test-runner.sh 11924 3540 hearbeat...  
logger: 2012-07-11 14:47:03 /usr/bin/rhts-test-runner.sh 11924 3600 hearbeat...  
logger: 2012-07-11 14:48:03 /usr/bin/rhts-test-runner.sh 11924 3660 hearbeat...  
logger: 2012-07-11 14:49:03 /usr/bin/rhts-test-runner.sh 11924 3720 hearbeat...  
[-- MARK -- Wed Jul 11 14:50:00 2012] 
logger: 2012-07-11 14:50:03 /usr/bin/rhts-test-runner.sh 11924 3780 hearbeat...  
logger: 2012-07-11 14:51:03 /usr/bin/rhts-test-runner.sh 11924 3840 hearbeat...  
logger: 2012-07-11 14:52:03 /usr/bin/rhts-test-runner.sh 11924 3900 hearbeat...  
logger: 2012-07-11 14:53:03 /usr/bin/rhts-test-runner.sh 11924 3960 hearbeat...  
logger: 2012-07-11 14:54:03 /usr/bin/rhts-test-runner.sh 11924 4020 hearbeat...  
[-- MARK -- Wed Jul 11 14:55:00 2012] 
logger: 2012-07-11 14:55:03 /usr/bin/rhts-test-runner.sh 11924 4080 hearbeat...  
logger: 2012-07-11 14:56:03 /usr/bin/rhts-test-runner.sh 11924 4140 hearbeat...  
logger: 2012-07-11 14:57:03 /usr/bin/rhts-test-runner.sh 11924 4200 hearbeat...  
logger: 2012-07-11 14:58:03 /usr/bin/rhts-test-runner.sh 11924 4260 hearbeat...  
logger: 2012-07-11 14:59:03 /usr/bin/rhts-test-runner.sh 11924 4320 hearbeat...  
[-- MARK -- Wed Jul 11 15:00:00 2012] 
logger: 2012-07-11 15:00:03 /usr/bin/rhts-test-runner.sh 11924 4380 hearbeat...  
logger: 2012-07-11 15:01:03 /usr/bin/rhts-test-runner.sh 11924 4440 hearbeat...  
logger: /usr/bin/rhts-test-runner.sh report_finish start...  
jobsl = [1]+ 22482 Done make run >> /mnt/testarea/TESTOUT.log 2>&1  
status = Done  
no machines specified. exiting  
2012-07-11 15:01:43,242 rhts_task checkin_finish: INFO resetting nohup  
07/11/12 15:01:43  testID:6849856 finish:  
Nothing to restore.  
logger: /usr/bin/rhts-test-runner.sh report_finish stop...  
+ rc=0  
+ '[' -n '' ']'  
+ exit 0  
2012-07-11 15:01:43,322 rhts_task task_exited: INFO task_exited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 15.  
])  
2012-07-11 15:01:43,323 rhts_task on_exit: INFO quitting...  
2012-07-11 15:01:44,328 rhts_task task_ended: INFO task_ended([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 15.  
])  
2012-07-11 15:01:44,362 beah processExited: INFO TaskStdoutProtocol:processExited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.  
])  
2012-07-11 15:01:44,362 beah processEnded: INFO TaskStdoutProtocol:processEnded([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.  
])  
2012-07-11 15:01:44,370 beah task_finished: INFO Task 0b1bdd73-cae0-4677-9b9f-a1c308caaff7 has finished.  
2012-07-11 15:01:44,371 backend async_proc: INFO Task 6849856 done. Submitting logs...  
[-- MARK -- Wed Jul 11 15:05:00 2012] 
[-- MARK -- Wed Jul 11 15:10:00 2012] 
[-- MARK -- Wed Jul 11 15:15:00 2012] 
[-- MARK -- Wed Jul 11 15:20:00 2012] 
[-- MARK -- Wed Jul 11 15:25:00 2012] 
[-- MARK -- Wed Jul 11 15:30:00 2012] 
[-- MARK -- Wed Jul 11 15:35:00 2012] 

Expected results:
It should have continued to the next task

Additional info:

Comment 2 Jeff Burke 2012-07-12 12:27:08 UTC
One thing I did notice when this happens is that we don't get all of the BEAH debug logs submitted. For example we typically get the following logs:
 debug/.task_beah_raw
 debug/task_output_stdout
 debug/task_beah_unexpected
 debug/task_log

When I see this issue we only get these two logs:
 debug/.task_beah_raw
 debug/task_log

In the WebUI looking at the results I see the last subresult for that tasks was 
./jbd2_commit_locking with a score of 101. In this instance the score 101 indicates the count. We had 101 subreults so far. Looking at the debug/task_log it only goes up to 97. Looking at the debug/.task_beah_raw it goes to 102. Not sure if any of that actually means anything, it was just my observation.

Comment 7 Nick Coghlan 2012-10-17 04:33:01 UTC
Bulk reassignment of issues as Bill has moved to another team.

Comment 11 Amit Saha 2013-03-11 03:18:27 UTC
(In reply to comment #2)
> One thing I did notice when this happens is that we don't get all of the
> BEAH debug logs submitted. For example we typically get the following logs:
>  debug/.task_beah_raw
>  debug/task_output_stdout
>  debug/task_beah_unexpected
>  debug/task_log
> 
> When I see this issue we only get these two logs:
>  debug/.task_beah_raw
>  debug/task_log
> 
> In the WebUI looking at the results I see the last subresult for that tasks
> was 
> ./jbd2_commit_locking with a score of 101. In this instance the score 101
> indicates the count. We had 101 subreults so far. Looking at the
> debug/task_log it only goes up to 97. Looking at the debug/.task_beah_raw it
> goes to 102. Not sure if any of that actually means anything, it was just my
> observation.

This could be interesting from the point of view of what is causing the problem. I had a look at the console logs of all the failed recipes, and the last line is similar to:

"2012-07-25 20:22:49,944 backend async_proc: INFO Task 7108104 done. Submitting logs...  "

for all of them.

Comment 13 Amit Saha 2013-03-11 04:31:30 UTC
(In reply to comment #11)
> (In reply to comment #2)
> > One thing I did notice when this happens is that we don't get all of the
> > BEAH debug logs submitted. For example we typically get the following logs:
> >  debug/.task_beah_raw
> >  debug/task_output_stdout
> >  debug/task_beah_unexpected
> >  debug/task_log
> > 
> > When I see this issue we only get these two logs:
> >  debug/.task_beah_raw
> >  debug/task_log
> > 
> > In the WebUI looking at the results I see the last subresult for that tasks
> > was 
> > ./jbd2_commit_locking with a score of 101. In this instance the score 101
> > indicates the count. We had 101 subreults so far. Looking at the
> > debug/task_log it only goes up to 97. Looking at the debug/.task_beah_raw it
> > goes to 102. Not sure if any of that actually means anything, it was just my
> > observation.
> 
> This could be interesting from the point of view of what is causing the
> problem. I had a look at the console logs of all the failed recipes, and the
> last line is similar to:
> 
> "2012-07-25 20:22:49,944 backend async_proc: INFO Task 7108104 done.
> Submitting logs...  "

It seems to me that the extend_watchdog event is failing. In beakerlc.py:async_proc(), there is a extend_watchdog operation:

log.info('Task %s done. Submitting logs...', id)
d = self.proxy.callRemote('extend_watchdog', id, 4*3600)


In the console.log of successful jobs, this is visible - there is a extend_watchdog event registered, just when submitting_logs starts. In the recipes where this bug is seen, I don't see that.

Just speculations, right now.






> 
> for all of them.

Comment 23 Dan Callaghan 2014-07-14 04:29:31 UTC
Jeff Burke reported hitting this same issue again on R:1436032.