Bug 849587

Summary:	exceptions.OverflowError: timeout is too large
Product:	[Retired] Beaker	Reporter:	Jan Stancek <jstancek>
Component:	beah	Assignee:	Bill Peck <bpeck>
Status:	CLOSED DUPLICATE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	0.9	CC:	bpeck, dcallagh, gozen, jburke, kbaker, mcsontos, pbunyan, rmancy
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-09-14 12:55:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jan Stancek 2012-08-20 09:40:22 UTC

Description of problem:
While running RHEL7 recipe, harness appeared to be stuck in a loop and kept emitting following error to console.log:

2012-08-20 03:34:49,093 rhts_task.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
  File "/bin/beah-rhts-task", line 9, in <module> 
    load_entry_point('beah==0.6.40.dev201208031240', 'console_scripts', 'beah-rhts-task')() 
  File "/usr/lib/python2.7/site-packages/beah/tasks/rhts_xmlrpc.py", line 677, in main 
    reactor.run() 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run 
    self.mainLoop() 
--- <exception caught here> --- 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop 
    self.doIteration(t) 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll 
    l = self._poller.poll(timeout, len(self._selectables)) 
exceptions.OverflowError: timeout is too large 

After ~30min I rebooted host from WEBUI and recipe started to progress.

Version-Release number of selected component (if applicable):
rhts-test-env.noarch 0:4.51-1.el7
RHEL-7.0-20120711.2 Server ppc64

How reproducible:
sporadic

Steps to Reproduce:
1. provision RHEL-7.0-20120711.2 Server ppc64

Actual results:
harness was emitting same error for ~30min (presumably stuck)

Expected results:
harness should not get stuck and keep filling logs with this message

Additional info:

Comment 2 Marian Csontos 2012-09-14 12:51:30 UTC

Likely the problem is related to "unstable" system time:

    02/14/14 05:40:09  testID:7516670 start: 
    2012-08-20 03:34:49,093 rhts_task.twisted emit: ERROR Unhandled Error 
    ...

So the time was shift 2 years back. Apparently it is too much for select/epoll...

We had seen the issue in other job where the console log got stuck on:

    09/14/12 12:42:55  testID:7940822 start: 
    2012-09-14 12:42:55,774 backend _send_cmd: INFO Command ['Command', 'forward', 'efec2e09-190b-462b-a349-188d3c2a25a0', {'event': ['Event', 'extend_watchdog', '5ab5aa1a-dc94-4be2-aa39-7b4781eaed1b', {'source': 'beah_beaker_backend', 'id': 'c5b0053b-1a4a-4b79-8d1b-eec56280399f'}, None, {'timeout': 2994}]}] sent. 
    [-- MARK -- Fri Sep 14 04:45:00 2012] 
    2012-09-14 12:45:08,723 rhts_task checkin_finish: INFO resetting nohup 
    09/14/12 12:45:08  testID:7940822 finish: 
    2012-09-14 12:45:08,772 rhts_task task_exited: INFO task_exited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0. 
]) 
    2012-09-14 12:45:08,773 rhts_task on_exit: INFO quitting... 
    2012-09-14 12:45:08,773 rhts_task task_ended: INFO task_ended([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0. 
]) 

`beah-rhts-task` process was hang here, waiting for "1 second", but the time was shifted few hours back during the second, so it was actually waiting for few hours.

Shifting time 4 hours to the future has unstucked the job.

*Solution?*

twisted handles delayedCall using system time (time.time) - see _timeFunction in twisted.python.runtime.

Not sure changing the _timeFunction for unix systems to use uptime would be safe.

Function getting uptime in seconds has to be (somehow?) passed to:
- all DelayedCall (easy: we could pass it as seconds argument?)[1]
  - but it does not deal with any internal ocurences
- all reactor.callLater calls :-/
  - by setting reactor.seconds to the function?

Also time.time is used elsewhere to get time-delta.

[1] http://twistedmatrix.com/documents/8.1.0/api/twisted.internet.base.DelayedCall.html#__init__

Comment 3 Marian Csontos 2012-09-14 12:55:14 UTC


*** This bug has been marked as a duplicate of bug 639255 ***