Bug 1368076
Summary: | threading wait(timeout) doesn't return after timeout if system clock is set backward | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Steve Best <sbest> | ||||||||||
Component: | python | Assignee: | Charalampos Stratakis <cstratak> | ||||||||||
Status: | CLOSED ERRATA | QA Contact: | Branislav Náter <bnater> | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 7.3 | CC: | bnater, cstratak, jeder, jskarvad, jstancek, pviktori | ||||||||||
Target Milestone: | rc | Keywords: | Patch | ||||||||||
Target Release: | --- | ||||||||||||
Hardware: | ppc64le | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | python-2.7.5-52.el7 | Doc Type: | No Doc Update | ||||||||||
Doc Text: |
undefined
|
Story Points: | --- | ||||||||||
Clone Of: | |||||||||||||
: | 1565560 (view as bug list) | Environment: | |||||||||||
Last Closed: | 2017-08-01 20:25:03 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 1380359, 1393868 | ||||||||||||
Attachments: |
|
Description
Steve Best
2016-08-18 09:59:53 UTC
I requested loan of the machine. (In reply to Jaroslav Škarvada from comment #1) > I requested loan of the machine. it fails on other systems Created attachment 1193359 [details]
Reproducer
The problem is in python threading library, which doesn't use monotonic timer when handling wait timeouts. Reproducer:
$ date -s "2016-08-19 12:59:42 EDT"
$ ./test.py &
$ date -s "2016-08-19 09:10:26 EDT"
And it stops counting, i.e. the wait doesn't return after the timeout as it should.
python-libs-2.7.5-48.el7.x86_64 Also reproducible with python-2.7.12-2.fc24.x86_64 and python3-3.5.1-13.fc24.x86_64. Created attachment 1195414 [details]
Proposed fix
Workaround for the reproducer is the following (regarding unblocking the process, for e.g. shutdown):
- setup signal handler
- set the threading event inside the signal handler when the signal is recevied
- send signal to the python process which will interrupt the syscall
Unfortunately it's going complicated with the multithreading - you cannot setup signal handler for the thread, so AFAIK there is no way how to interrupt the syscall. By default Python sleeps no more than 50 ms, i.e. it wakes up, but waking the process more than 20x per second is very bad for power consumption. That's why we have experimental low power consumption patch in RHEL which allows applications aware of this feature to sleep more than 50 ms. While this improves power consumption dramatically this may cause the thread to block.
Proposed patch is a fix which shouldn't break anything while ensuring that the wait will not sleep more than 2x timeout (in the worst case) which is much better then sleeping for hours or days which can happen now.
Created attachment 1251887 [details]
Reproducer
Fixed reproducer.
How to reproduce:
# date -s "2016-08-19 12:59:42 EDT"
# ./test.py &
# date -s "2016-08-19 09:10:26 EDT"
But maybe better to use two terminals - the first for foreground run of ./test.py, the second for date manipulation.
Created attachment 1256676 [details]
Proposed fix
Well, I probably finally sorted it out, sorry for confusion which was caused by my "special" testing environment (is described bellow).
With the previous patch it correctly set the delay to the timeout in case (endtime - _time()) > timeout, i.e. when the time shift happen, this guaranteed that it will never wait longer than timeout if in non-balancing mode (the RHEL specific mode). So let's say it should sleep 5 seconds, but the time shift to the past caused that endtime - _time() is e.g. 900 seconds. With the previous patch it correctly set the delay to 5 seconds. Then it looped to the remaining time calculation and the endtime - _time() is now 895 seconds (i.e. 900 - 5), so it set the timout again to min(895, 5) which is 5. This means again 5 second sleep. The effect is that instead of one long sleep(900) we have multiple sleep(5), so the result seems the same as without the patch.
This new patch improves the situation. It recalculates the endtime if in non-balancing mode. For the previous example it would behave the following way:
endtime - _time() = 900, timeout = 5, so remaining = 5
now new endtime will be set to _time() + 5, then it waits 5 seconds, then it recalculates remaining which will become:
endtime - _time() = 0, timeout 5, so remaining = 0 and it will correctly exit
The patch should be safe because it only modifies non-balancing (RHEL specific) behavior, balancing mode (the default) is not touched. I think the drawback could be small lost of precision due to the arithmetic especially if the sleep is interrupted multiple times. This shouldn't be an issue because the non-balancing mode is targeted to low power consumption and high precision is not the goal.
Now why it previously worked in my testing environment. The explanation is simple, I used VirtualBox which by default periodically synchronizes guest clock with host clock. This means that in few seconds the endtime - _time() becomes lower than timeout and due to the multiple sleep the loop is correctly exited. You can only notice slight increase in the delay caused by delay of the time synchronization, which unfortunately didn't catch my attention to realize that it's not working as expected.
Thanks Jaroslav for the fix. It works now and I can reproduce it. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1868 |