1368076 – threading wait(timeout) doesn't return after timeout if system clock is set backward

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1368076 - threading wait(timeout) doesn't return after timeout if system clock is set backward

Summary: threading wait(timeout) doesn't return after timeout if system clock is set b...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	python
Sub Component:
Version:	7.3
Hardware:	ppc64le
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Charalampos Stratakis
QA Contact:	Branislav Náter
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1380359 1393868
TreeView+	depends on / blocked

Reported:	2016-08-18 09:59 UTC by Steve Best
Modified:	2017-08-01 20:25 UTC (History)
CC List:	6 users (show)
Fixed In Version:	python-2.7.5-52.el7
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Clones:	1565560 (view as bug list)
Environment:
Last Closed:	2017-08-01 20:25:03 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Reproducer (154 bytes, text/x-python) 2016-08-23 16:15 UTC, Jaroslav Škarvada	no flags	Details
Proposed fix (504 bytes, patch) 2016-08-29 15:30 UTC, Jaroslav Škarvada	no flags	Details \| Diff
Reproducer (161 bytes, text/x-python) 2017-02-17 12:26 UTC, Jaroslav Škarvada	no flags	Details
Proposed fix (657 bytes, patch) 2017-02-22 18:43 UTC, Jaroslav Škarvada	no flags	Details \| Diff
Show Obsolete (2) View All

Links
System	ID	Priority	Status	Summary	Last Updated
Python	12822	None	None	None	2016-08-24 14:11:27 UTC
Python	23428	None	None	None	2016-08-24 14:10:40 UTC
Red Hat Product Errata	RHSA-2017:1868	normal	SHIPPED_LIVE	Moderate: python security and bug fix update	2017-08-01 18:24:21 UTC

Description Steve Best 2016-08-18 09:59:53 UTC

Description of problem:
On reboot for power  system I see the following message

The job below doesn't complete before time out value 1 minute 30 secs

A stop job is running for Dynamic S...m Tuning Daemon

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.Install RHEL 7.3 nightly I used 817 I've seen it in other nightly also  
2.reboot the system
3.

Actual results:

A stop job is running for Dynamic S...m Tuning Daemon times out at 1 minute 30 sec

Expected results:
The job should stop successful

Additional info:
[  OK  ] Started Show Plymouth Reboot Screen.
[  OK  ] Started Restore /run/initramfs.
[  OK  ] Stopped Postfix Mail Transport Agent.
[  OK  ] Stopped Dynamic System Tuning Daemon.
[  OK  ] Stopped target Network.
         Stopping Network Manager...

I was using the following system
ibm-p8-garrison-04.lab4.eng.bos.redhat.com

Comment 1 Jaroslav Škarvada 2016-08-18 10:05:26 UTC

I requested loan of the machine.

Comment 3 Steve Best 2016-08-18 11:21:19 UTC

(In reply to Jaroslav Škarvada from comment #1)
> I requested loan of the machine.

it fails on other systems

Comment 11 Jaroslav Škarvada 2016-08-23 16:15:22 UTC

Created attachment 1193359 [details]
Reproducer

The problem is in python threading library, which doesn't use monotonic timer when handling wait timeouts. Reproducer:

$ date -s "2016-08-19 12:59:42 EDT"
$ ./test.py &
$ date -s "2016-08-19 09:10:26 EDT"

And it stops counting, i.e. the wait doesn't return after the timeout as it should.

Comment 12 Jaroslav Škarvada 2016-08-23 16:40:04 UTC

python-libs-2.7.5-48.el7.x86_64

Comment 13 Jaroslav Škarvada 2016-08-24 08:18:16 UTC

Also reproducible with python-2.7.12-2.fc24.x86_64 and python3-3.5.1-13.fc24.x86_64.

Comment 17 Jaroslav Škarvada 2016-08-29 15:30:02 UTC

Created attachment 1195414 [details]
Proposed fix

Workaround for the reproducer is the following (regarding unblocking the process, for e.g. shutdown):
- setup signal handler
- set the threading event inside the signal handler when the signal is recevied
- send signal to the python process which will interrupt the syscall

Unfortunately it's going complicated with the multithreading - you cannot setup signal handler for the thread, so AFAIK there is no way how to interrupt the syscall. By default Python sleeps no more than 50 ms, i.e. it wakes up, but waking the process more than 20x per second is very bad for power consumption. That's why we have experimental low power consumption patch in RHEL which allows applications aware of this feature to sleep more than 50 ms. While this improves power consumption dramatically this may cause the thread to block.

Proposed patch is a fix which shouldn't break anything while ensuring that the wait will not sleep more than 2x timeout (in the worst case) which is much better then sleeping for hours or days which can happen now.

Comment 23 Jaroslav Škarvada 2017-02-17 12:26:21 UTC

Created attachment 1251887 [details]
Reproducer

Fixed reproducer.

How to reproduce:

# date -s "2016-08-19 12:59:42 EDT"
# ./test.py &
# date -s "2016-08-19 09:10:26 EDT"

But maybe better to use two terminals - the first for foreground run of ./test.py, the second for date manipulation.

Comment 24 Jaroslav Škarvada 2017-02-22 18:43:09 UTC

Created attachment 1256676 [details]
Proposed fix

Well, I probably finally sorted it out, sorry for confusion which was caused by my "special" testing environment (is described bellow).

With the previous patch it correctly set the delay to the timeout in case (endtime - _time()) > timeout, i.e. when the time shift happen, this guaranteed that it will never wait longer than timeout if in non-balancing mode (the RHEL specific mode). So let's say it should sleep 5 seconds, but the time shift to the past caused that endtime - _time() is e.g. 900 seconds. With the previous patch it correctly set the delay to 5 seconds. Then it looped to the remaining time calculation and the endtime - _time() is now 895 seconds (i.e. 900 - 5), so it set the timout again to min(895, 5) which is 5. This means again 5 second sleep. The effect is that instead of one long sleep(900) we have multiple sleep(5), so the result seems the same as without the patch.

This new patch improves the situation. It recalculates the endtime if in non-balancing mode. For the previous example it would behave the following way:
endtime - _time() = 900, timeout = 5, so remaining = 5
now new endtime will be set to _time() + 5, then it waits 5 seconds, then it recalculates remaining which will become:
endtime - _time() = 0, timeout 5, so remaining = 0 and it will correctly exit

The patch should be safe because it only modifies non-balancing (RHEL specific) behavior, balancing mode (the default) is not touched. I think the drawback could be small lost of precision due to the arithmetic especially if the sleep is interrupted multiple times. This shouldn't be an issue because the non-balancing mode is targeted to low power consumption and high precision is not the goal.

Now why it previously worked in my testing environment. The explanation is simple, I used VirtualBox which by default periodically synchronizes guest clock with host clock. This means that in few seconds the endtime - _time() becomes lower than timeout and due to the multiple sleep the loop is correctly exited. You can only notice slight increase in the delay caused by delay of the time synchronization, which unfortunately didn't catch my attention to realize that it's not working as expected.

Comment 25 Charalampos Stratakis 2017-02-23 10:12:56 UTC

Thanks Jaroslav for the fix. It works now and I can reproduce it.

Comment 28 errata-xmlrpc 2017-08-01 20:25:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1868

Note You need to log in before you can comment on or make changes to this bug.