Bug 1063647 - Twisted's callLater() method is called with too large a value, resulting in OverflowError: timeout is too large
Summary: Twisted's callLater() method is called with too large a value, resulting in O...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Beaker
Classification: Retired
Component: beah
Version: 0.15
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: beaker-dev-list
QA Contact: tools-bugs
URL:
Whiteboard:
: 1201642 1272538 1545394 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-02-11 07:52 UTC by Nick Coghlan
Modified: 2018-02-23 03:41 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-23 03:41:00 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 639255 0 high CLOSED setting date few days into the future make test freezes 2021-02-22 00:41:40 UTC

Internal Links: 639255

Description Nick Coghlan 2014-02-11 07:52:51 UTC
A Beaker system in a /reservesys task started spewing the following in the console log and beah error log:

=======================================
Unhandled Error
Traceback (most recent call last):
  File "/usr/bin/beah-rhts-task", line 9, in <module>
    load_entry_point('beah==0.7.dev201401300133', 'console_scripts', 'beah-rhts-task')()
  File "/usr/lib/python2.7/site-packages/beah/tasks/rhts_xmlrpc.py", line 698, in main
    reactor.run()
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run
    self.mainLoop()
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop
    self.doIteration(t)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll
    l = self._poller.poll(timeout, len(self._selectables))
exceptions.OverflowError: timeout is too large
=======================================

This looks like we may be triggering: https://twistedmatrix.com/trac/ticket/6259

We should be able to avoid doing that - there are only a few places where we use a configurable timeout, and given the daemon that is failing, the likely culprit here is a large setting for TIMEOUT.

Comment 4 Nick Coghlan 2014-02-13 00:48:04 UTC
I was overly optimistic as to how quickly we'd be able to track down the *exact* cause of the problem and come up with a simpler reproducer. Bumping to 0.16 - we'll just have to keep an eye out for this one in the meantime.

Comment 5 Dan Callaghan 2014-02-14 00:54:24 UTC
So we have seen this OverflowError in a few previous bug reports: bug 950646, bug 849587.

The upstream Twisted ticket #6259 is about the epoll reactor going into an infinite loop when an excessively large delay is passed to callLater, but doesn't cover *where* such a large value could be coming from.

On bug 849587 Marian determined that in our case it was due to shifting system clock: bug 639255.

Comment 6 Nick Coghlan 2014-02-26 07:22:24 UTC
This is still under investigation, but we need to understand it better before we can commit to fixing it (and one possible fix may end up being to get restraint up to the point where it can be used as a replacement harness)

Comment 7 Amit Saha 2014-03-20 01:37:26 UTC
I could reproduce this on Python 2.7/Fedora 19.

Here is a reproducer script based off the Twisted ticket:

from twisted.internet import reactor
import time

def myfun():
    print 'Called'
application = service.Application('blah')
reactor.callLater(10, myfun)

In one terminal:

$ twistd -noy reproducer.py

In another terminal:

# date -s "100 days ago"

And soon I can see:

exceptions.OverflowError: timeout is too large


So indeed, as per comment:5 above, this is most likely due to a shifting system clock in the test.

Comment 8 Amit Saha 2014-03-24 04:26:30 UTC
Some more investigative feedback follows.

*System time change theory*

The test in this case doesn't explicitly change the system clock. However, something else may have caused the system clock to change. 

/distribution/reservesys task was aborted:

filling the disk on BRQ conserver 	2014-02-11 17:29:54 +10:00

This will be most likely 2014-02-11 03:29:54 as per the local time zone for the test system. However, the last console log entry is currently at 2014-02-10 20:26:59,939. The console log was truncated, else it would have been easy to see the time going backwards. However, even assuming this is the case (8h time lag), I cannot trigger this error when using the reproducer above in comment:7 .

So likely this theory is flawed or I haven't got the various time zone calculation correct.

Depending on when (w.r.t beah's task execution machinery) the system time may have been changed, the call trace obtained for this error may be different. For example, here is another:

2013-12-13 17:29:33,456 beah.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
  File "/usr/bin/beah-srv", line 9, in <module> 
    load_entry_point('beah==0.7.3.dev201402232050', 'console_scripts', 'beah-srv')() 
  File "/usr/lib/python2.7/site-packages/beah/bin/srv.py", line 39, in main 
    debug.runcall(reactor.run) 
  File "/usr/lib/python2.7/site-packages/beah/core/debug.py", line 11, in runcall 
    a_callable(*args, **kwargs) 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1169, in run 
    self.mainLoop() 
--- <exception caught here> --- 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1181, in mainLoop 
    self.doIteration(t) 
  File "/usr/lib64/python2.7/site-packages/twisted/internet/epollreactor.py", line 362, in doPoll 
    l = self._poller.poll(timeout, len(self._selectables)) 
exceptions.OverflowError: timeout is too large 


In this case, the system date was set to "100 days back" at the beginning of this task. The errors start appearing when when a call to extend_watchdog() was made in the current task:

2014-03-23 18:27:14,050 backend _send_cmd: INFO Command ['Command', 'forward', '38e9d71f-98ef-4312-98ef-4b4af5a27a83', {'event': ['Event', 'extend_watchdog', 'e93a371d-ac02-4f25-bef0-517b83a1468c', {'source': 'beah_beaker_backend', 'id': '2d989b8e-dba7-44d9-aeeb-2690a6e917e7'}, None, {'timeout': 2994}]}] sent. 

The errors stop showing up when the next task starts execution (most likely because 'beah-rhts-task' process exits and a new process starts in the next task - I am not very sure here).


*Bug in Beah theory*

The second possibility of this happening is a bug within beah itself. We make use of callLater() at several places and one of these calls may have been with an exceptionally large value for the time, and hence trigger the error messages. There are only the following instances of callLater() where we have a time delay which can assume different values (the other calls are always with fixed values, barring one random integral value but the min and max limits are fixed)

(Hint: use git grep "callLater")

beah/wires/internals/repeatingproxy.py:

reactor.callLater(delay, self.resend): If a connection to the beaker proxy could not be established for some reason, it keeps retrying using an increasing delay. The console log would have been useful to verify if at all this may have happened. However, it's truncated.

beah/backends/watchdog.py:

_set_watchdog(): self.watchdog_call = reactor.callLater(timeout, self.on_watchdog) #pylint: disable=E1101: This function is called by set_watchdog(), which in turn is called by the "extend_watchdog" event processing function, "proc_evt_extend_watchdog". This event is generated by extend_watchdog() in beah/core/event.py and is invoked at a few places. Theoretically, a sufficiently large value of "timeout" here can cause such errors. However, I don't see that case here. 


beah/wires/internals/twtask.py:

self.delayed_call = reactor.callLater(timeout, self.kill): This call handles the timeout event. However, I can't see any event being generated with a variable value which may be a large number.

*Summary*

I am keen to go with the system change from an external source theory, but it's really difficult to reason esp. in the absence of the entire console log. The system time change would have been clearly seen there. I haven't been able to get the *exact* traceback that the original report describes. This is likely because the traceback will depend on which specific service/process that run as part of beah first has to deal with the time change.

Comment 9 David Spurek 2014-07-14 13:09:08 UTC
Hi, any progress here?
I have this problem very often with fedora 20 (fedora installed on beaker machine).

Comment 10 Amit Saha 2014-07-14 13:33:06 UTC
(In reply to David Spurek from comment #9)
> Hi, any progress here?
> I have this problem very often with fedora 20 (fedora installed on beaker
> machine).

Does the test perform any system time change? can you point to an example recipe/job?

Comment 12 Nick Coghlan 2014-07-15 21:45:55 UTC
I've set the logs in the linked jobs to be retained for longer than normal. We'll aim to allocate some time in the next sprint to have another investigation now we have a couple of new instances of the problem to look at in detail.

Comment 13 Dan Callaghan 2015-03-15 23:53:07 UTC
*** Bug 1201642 has been marked as a duplicate of this bug. ***

Comment 14 Dan Callaghan 2015-10-18 22:59:48 UTC
*** Bug 1272538 has been marked as a duplicate of this bug. ***

Comment 15 Roman Joost 2018-02-14 22:53:25 UTC
This happened again as reported by Bug 1545394. Looking in the console log I saw quite a huge jump in the time stamp:

[  579.154644] beah-beaker-backend[8107]: 2018-02-14 13:07:06,570 backend _send_cmd: INFO Command ['Command', 'forward', '21ba47a5-8bf3-4eff-bdb6-366a1dbefb45', {'event': ['Event', 'extend_watchdog', 'e6d306d1-5520-45c7-ae19-3ee190963ce5', {'source': 'beah_beaker_backend', 'id': 'c19e69b7-7a4d-45d1-8dcf-9eef3fb4ba46'}, None, {'timeout': 2094}]}] sent.·                                                                        

and next line is:

[4294967883.399424] beah-srv[8105]: 2066-03-31 08:14:22,960 beah processExited: INFO TaskStdoutProtocol:processExited([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.·                                                                                                                               [4294967883.400595] beah-srv[8105]: ])·                                                                                                       [4294967883.400847] beah-srv[8105]: 2066-03-31 08:14:22,960 beah processEnded: INFO TaskStdoutProtocol:processEnded([Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessDone'>: A process has ended without apparent errors: process finished with exit code 0.·

Comment 16 Dalibor Pospíšil 2018-02-15 11:27:05 UTC
Is there any chance to fix this issue quickly? Let's say in next four years?

Comment 17 Dalibor Pospíšil 2018-02-15 11:28:50 UTC
*** Bug 1545394 has been marked as a duplicate of this bug. ***

Comment 18 Dan Callaghan 2018-02-15 22:12:42 UTC
Our best understanding is that it is a bug in Twisted. It relies on clock time for callLater instead of a monotonic time source. Given the Twisted bug's long history, does not appear likely to be fixed any time soon.

If you have tests which need to warp the system clock, we recommend using the Restraint harness instead of Beah. At least if it has similar issues there is a good chance we will be able to fix them.

Comment 19 Roman Joost 2018-02-15 22:45:12 UTC
I agree with Dan to move to restraint if that is feasible.

There might be a some hope for this tho. I've looked up the corresponding upstream bug and traced it back to the release notes:

https://github.com/twisted/twisted/blob/twisted-17.9.0/NEWS.rst#bugfixes-61
(Bug # 6259)

We're using twisted-core 12.2 while this is fixed in twisted-core 13.0. I'll grab the reproducer from comment 8 and see if it would fix this. The problem then would shift to how we could update twisted-core for RHEL7.

Comment 20 Roman Joost 2018-02-20 07:32:37 UTC
I've tested with a more recent version of Fedora (26 to be precise) here:

https://beaker-devel.app.eng.bos.redhat.com/jobs/13912

F26 comes with twisted 16.3 including the fix. The reproducer didn't go in an infinite loop.

The only problem I can see with such a big clock change is the fact that the scheduled function would not be called until the scheduled time + clock change. Looking at the job in Bug 1545394 the change was quite significant which was 52 years. The chances are high that the watchdog would have just aborted the job without beah reporting anything.

Comment 21 Dan Callaghan 2018-02-23 03:36:02 UTC
So this is no longer an issue on recent Fedora, which has a recent Twisted which has the bug fix.

RHEL5 and RHEL6 ship Twisted 8.

RHEL7 ships Twisted 12 in Optional.

In general we avoid shipping packages in the Beaker harness repo which replace RHEL packages, because it "poisons" the test system (in the sense of, making it different to what a real custom would get) and has the potential to ruin or invalidate test results.

In particular, suddenly shipping a much newer Twisted has a high potential to break any tests or other software in RHEL which was using Twisted. Even though they are good at backwards compatibility I'm sure it is not perfect.

Comment 22 Roman Joost 2018-02-23 03:41:00 UTC
I'm closing this as WONTFIX for RHEL < 8 and suggest for these versions to use restraint.

Furthermore ensure that your tests don't change the system time introducing huge jumps.


Note You need to log in before you can comment on or make changes to this bug.