950646 – OverflowError in beah test harness poll call

Summary: OverflowError in beah test harness poll call

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	scheduler
Sub Component:
Version:	0.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	beaker-dev-list
QA Contact:	tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	798718

Reported:	2013-04-10 14:50 UTC by Petr Sklenar
Modified:	2018-02-06 00:41 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-10-02 04:36:46 UTC
Embargoed:

Description Petr Sklenar 2013-04-10 14:50:20 UTC

Description of problem:
Beaker job is aborted when test rpm is not downloaded

Version-Release number of selected component (if applicable):
Version - 0.11.3

How reproducible:
rarely

Steps to Reproduce:
1. Sometimes (for a reason unknown to me) yum doesn't download the rpm:
 https://beaker.engineering.redhat.com/recipes/844894#task11845642
# I don't know why the rpm cannot be downloaded. It's possible that someone created another rpm in the meantime, but Beaker shouldn't abort the whole job.

Actual results:
console.log
2006-12-31 10:00:40,254 backend.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
  File "/usr/bin/beah-beaker-backend", line 9, in <module> 
    load_entry_point('beah==0.6.43.dev201303102204', 'console_scripts', 'beah-beaker-backend')() 
  File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 2007, in main 
    debug.runcall(reactor.run) 
  Fi

? wait_for_xmitr+0xa0/0xa0
[
266 [-- MARK -- Wed Apr 10 03:55:00 2013]
[-- MARK -- Wed Apr 10 04:00:00 2013]
...
[-- MARK -- Wed Apr 10 06:10:00 2013] 
[-- MARK -- Wed Apr 10 06:15:00 2013]
--------
job aborted

Expected results:
no abort for the whole job

Additional info:
https://beaker.engineering.redhat.com/recipes/842046#task11796119
https://beaker.engineering.redhat.com/recipes/844894

Comment 1 Dan Callaghan 2013-04-11 08:17:38 UTC

(In reply to comment #0)
> Description of problem:
> Beaker job is aborted when test rpm is not downloaded
> 
> Version-Release number of selected component (if applicable):
> Version - 0.11.3
> 
> How reproducible:
> rarely
> 
> Steps to Reproduce:
> 1. Sometimes (for a reason unknown to me) yum doesn't download the rpm:
>  https://beaker.engineering.redhat.com/recipes/844894#task11845642
> # I don't know why the rpm cannot be downloaded. It's possible that someone
> created another rpm in the meantime, but Beaker shouldn't abort the whole job.

The util-linux-ng package wasn't installed because it's not present in the RHEL7 tree you used.

$ repoquery --disablerepo=* --enablerepo=RHEL-7.0-20130306.0 --repofrompath=RHEL-7.0-20130306.0,http://download.eng.bos.redhat.com/rel-eng/RHEL-7.0-20130306.0/compose/Server/x86_64/os/ util-linux-ng
$ repoquery --disablerepo=* --enablerepo=RHEL-7.0-20130306.0 --repofrompath=RHEL-7.0-20130306.0,http://download.eng.bos.redhat.com/rel-eng/RHEL-7.0-20130306.0/compose/Server/x86_64/os/ util-linux
util-linux-0:2.22.1-2.4.el7.x86_64

But that didn't abort your job.

The actual error seems to be here:

2013-04-09 21:49:25,697 backend async_proc: INFO Extending Watchdog for task 11845649 by 9000.. 
04/09/13 21:49:25  JobID:402207 Test:/CoreOS/vixie-cron/Regression/bug-232439_fail_on_first_Jan Response:1 
2013-04-09 21:49:25,804 rhts_task checkin_start: INFO setting nohup 
04/09/13 21:49:25  testID:11845649 start: 
2006-12-31 09:56:00,282 backend.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
  File "/usr/bin/beah-beaker-backend", line 9, in <module> 
    load_entry_point('beah==0.6.43.dev201303102204', 'console_scripts', 'beah-beaker-backend')() 
  File "/usr/lib/python2.7/site-packages/beah/backends/beakerlc.py", line 2007, in main 
    debug.runcall(reactor.run) 
  File "/usr/lib/python2.7/site-packages/beah/core/debug.py", line 11, in runcall 
    a_callable(*args, **kwargs) 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run 
    self.mainLoop() 
--- <exception caught here> --- 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop 
    self.doIteration(t) 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll 
    l = self._poller.poll(timeout, len(self._selectables)) 
exceptions.OverflowError: timeout is too large 

The OverflowError is repeated continuously until the watchdog aborts the job. I'm not sure why this would happen; it seems like it must be a harness bug, particularly since you had the same thing happen at the same point in your recipe on another system.
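
For illustration only, here is a minimal sketch of how a poll timeout can become "too large". It rests on assumptions that are not confirmed anywhere in this report: that the timeout passed to poll() is the number of seconds until the next scheduled call, and that the system clock moved backwards in between (the log timestamps jump from 2013-04-09 to 2006-12-31 right after the bug-232439_fail_on_first_Jan task starts). CPython's epoll poll() takes a timeout in seconds and raises OverflowError when the value in milliseconds does not fit in a C int. The helper names below are hypothetical, and clamping is just one conceivable workaround, not what beah or Twisted actually do.

```python
import calendar

# Largest value of a C int on common Linux platforms (assumption: 32-bit int).
INT_MAX = 2**31 - 1


def epoll_timeout_overflows(timeout_seconds):
    """Return True if the timeout, expressed in milliseconds, exceeds a C int."""
    return timeout_seconds * 1000.0 > INT_MAX


def clamp_timeout_ms(timeout_seconds):
    """Hypothetical workaround: cap the timeout (in ms) at INT_MAX."""
    return min(int(timeout_seconds * 1000), INT_MAX)


# Timestamps copied from the log lines above, treated as UTC for illustration.
scheduled = calendar.timegm((2013, 4, 9, 21, 49, 25, 0, 0, 0))
now_after_jump = calendar.timegm((2006, 12, 31, 9, 56, 0, 0, 0, 0))

# Assumed: a call scheduled before the jump is now "due" years in the future.
timeout = scheduled - now_after_jump

print("timeout in seconds:", timeout)
print("overflows a C int in ms:", epoll_timeout_overflows(timeout))
print("clamped timeout in ms:", clamp_timeout_ms(timeout))
```

Under these assumptions the limit is reached at roughly 24.8 days (INT_MAX milliseconds), so a gap of several years is far beyond it, while an ordinary timeout of a few minutes is not.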

I also noticed on the console log for R:844894 a very large number of RAID and SCSI offline errors from the kernel. Are those expected as part of the util-linux tests?

Comment 3 Nick Coghlan 2013-09-30 01:22:52 UTC

Hi Petr, as per Dan's question above, could you provide a bit more info on the expected impact of the util-linux tests?

Comment 4 Petr Sklenar 2013-09-30 07:59:30 UTC

Hi,
I think that this is not due to util-linux(-ng) on RHEL7.
We have a set of tier tests with roughly 100 tests for the whole team.

If some user creates several updates to one of the tests while the job is being scheduled, bumping the version several times, then the whole job is aborted instead of just one test failing.

I will try it to be sure and let you know.

Comment 5 Nick Coghlan 2013-09-30 08:15:56 UTC

Petr, bug 880855 affected versions prior to Beaker 0.13 and could result in jobs failing due to new task versions being uploaded. That's not the bug covered by this issue though - we're interested in the OverflowError noted above.

Comment 6 Petr Sklenar 2013-10-01 09:14:44 UTC

I tried to reproduce it but did not succeed. I ran the same sets of tests and it works now (J:506814 or J:506813).

FYI, the util-linux(|-ng) test cases do not expect any RAID/SCSI errors.

Comment 7 Nick Coghlan 2013-10-02 04:36:46 UTC

OK, we made a few reliability improvements to both beah and task repo creation over the last few releases, so it's quite plausible that this has been fixed since it was first encountered.

Closing this one - please file a new bug report if anything similar recurs.