Bug 1368076 - threading wait(timeout) doesn't return after timeout if system clock is set backward
Summary: threading wait(timeout) doesn't return after timeout if system clock is set b...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: python
Version: 7.3
Hardware: ppc64le
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Charalampos Stratakis
QA Contact: Branislav Náter
URL:
Whiteboard:
Depends On:
Blocks: 1380359 1393868
TreeView+ depends on / blocked
 
Reported: 2016-08-18 09:59 UTC by Steve Best
Modified: 2017-08-01 20:25 UTC (History)
6 users (show)

Fixed In Version: python-2.7.5-52.el7
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
: 1565560 (view as bug list)
Environment:
Last Closed: 2017-08-01 20:25:03 UTC
Target Upstream Version:


Attachments (Terms of Use)
Reproducer (154 bytes, text/x-python)
2016-08-23 16:15 UTC, Jaroslav Škarvada
no flags Details
Proposed fix (504 bytes, patch)
2016-08-29 15:30 UTC, Jaroslav Škarvada
no flags Details | Diff
Reproducer (161 bytes, text/x-python)
2017-02-17 12:26 UTC, Jaroslav Škarvada
no flags Details
Proposed fix (657 bytes, patch)
2017-02-22 18:43 UTC, Jaroslav Škarvada
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Python 12822 0 None None None 2016-08-24 14:11:27 UTC
Python 23428 0 None None None 2016-08-24 14:10:40 UTC
Red Hat Product Errata RHSA-2017:1868 0 normal SHIPPED_LIVE Moderate: python security and bug fix update 2017-08-01 18:24:21 UTC

Description Steve Best 2016-08-18 09:59:53 UTC
Description of problem:
On reboot for power  system I see the following message

The job below doesn't complete before time out value 1 minute 30 secs

A stop job is running for Dynamic S...m Tuning Daemon

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.Install RHEL 7.3 nightly I used 817 I've seen it in other nightly also  
2.reboot the system
3.

Actual results:

A stop job is running for Dynamic S...m Tuning Daemon times out at 1 minute 30 sec

Expected results:
The job should stop successful

Additional info:
[  OK  ] Started Show Plymouth Reboot Screen.
[  OK  ] Started Restore /run/initramfs.
[  OK  ] Stopped Postfix Mail Transport Agent.
[  OK  ] Stopped Dynamic System Tuning Daemon.
[  OK  ] Stopped target Network.
         Stopping Network Manager...

I was using the following system
ibm-p8-garrison-04.lab4.eng.bos.redhat.com

Comment 1 Jaroslav Škarvada 2016-08-18 10:05:26 UTC
I requested loan of the machine.

Comment 3 Steve Best 2016-08-18 11:21:19 UTC
(In reply to Jaroslav Škarvada from comment #1)
> I requested loan of the machine.

it fails on other systems

Comment 11 Jaroslav Škarvada 2016-08-23 16:15:22 UTC
Created attachment 1193359 [details]
Reproducer

The problem is in python threading library, which doesn't use monotonic timer when handling wait timeouts. Reproducer:

$ date -s "2016-08-19 12:59:42 EDT"
$ ./test.py &
$ date -s "2016-08-19 09:10:26 EDT"

And it stops counting, i.e. the wait doesn't return after the timeout as it should.

Comment 12 Jaroslav Škarvada 2016-08-23 16:40:04 UTC
python-libs-2.7.5-48.el7.x86_64

Comment 13 Jaroslav Škarvada 2016-08-24 08:18:16 UTC
Also reproducible with python-2.7.12-2.fc24.x86_64 and python3-3.5.1-13.fc24.x86_64.

Comment 17 Jaroslav Škarvada 2016-08-29 15:30:02 UTC
Created attachment 1195414 [details]
Proposed fix

Workaround for the reproducer is the following (regarding unblocking the process, for e.g. shutdown):
- setup signal handler
- set the threading event inside the signal handler when the signal is recevied
- send signal to the python process which will interrupt the syscall

Unfortunately it's going complicated with the multithreading - you cannot setup signal handler for the thread, so AFAIK there is no way how to interrupt the syscall. By default Python sleeps no more than 50 ms, i.e. it wakes up, but waking the process more than 20x per second is very bad for power consumption. That's why we have experimental low power consumption patch in RHEL which allows applications aware of this feature to sleep more than 50 ms. While this improves power consumption dramatically this may cause the thread to block.

Proposed patch is a fix which shouldn't break anything while ensuring that the wait will not sleep more than 2x timeout (in the worst case) which is much better then sleeping for hours or days which can happen now.

Comment 23 Jaroslav Škarvada 2017-02-17 12:26:21 UTC
Created attachment 1251887 [details]
Reproducer

Fixed reproducer.

How to reproduce:

# date -s "2016-08-19 12:59:42 EDT"
# ./test.py &
# date -s "2016-08-19 09:10:26 EDT"

But maybe better to use two terminals - the first for foreground run of ./test.py, the second for date manipulation.

Comment 24 Jaroslav Škarvada 2017-02-22 18:43:09 UTC
Created attachment 1256676 [details]
Proposed fix

Well, I probably finally sorted it out, sorry for confusion which was caused by my "special" testing environment (is described bellow).

With the previous patch it correctly set the delay to the timeout in case (endtime - _time()) > timeout, i.e. when the time shift happen, this guaranteed that it will never wait longer than timeout if in non-balancing mode (the RHEL specific mode). So let's say it should sleep 5 seconds, but the time shift to the past caused that endtime - _time() is e.g. 900 seconds. With the previous patch it correctly set the delay to 5 seconds. Then it looped to the remaining time calculation and the endtime - _time() is now 895 seconds (i.e. 900 - 5), so it set the timout again to min(895, 5) which is 5. This means again 5 second sleep. The effect is that instead of one long sleep(900) we have multiple sleep(5), so the result seems the same as without the patch.

This new patch improves the situation. It recalculates the endtime if in non-balancing mode. For the previous example it would behave the following way:
endtime - _time() = 900, timeout = 5, so remaining = 5
now new endtime will be set to _time() + 5, then it waits 5 seconds, then it recalculates remaining which will become:
endtime - _time() = 0, timeout 5, so remaining = 0 and it will correctly exit

The patch should be safe because it only modifies non-balancing (RHEL specific) behavior, balancing mode (the default) is not touched. I think the drawback could be small lost of precision due to the arithmetic especially if the sleep is interrupted multiple times. This shouldn't be an issue because the non-balancing mode is targeted to low power consumption and high precision is not the goal.

Now why it previously worked in my testing environment. The explanation is simple, I used VirtualBox which by default periodically synchronizes guest clock with host clock. This means that in few seconds the endtime - _time() becomes lower than timeout and due to the multiple sleep the loop is correctly exited. You can only notice slight increase in the delay caused by delay of the time synchronization, which unfortunately didn't catch my attention to realize that it's not working as expected.

Comment 25 Charalampos Stratakis 2017-02-23 10:12:56 UTC
Thanks Jaroslav for the fix. It works now and I can reproduce it.

Comment 28 errata-xmlrpc 2017-08-01 20:25:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1868


Note You need to log in before you can comment on or make changes to this bug.