Description of problem:
On reboot for power system I see the following message
The job below doesn't complete before time out value 1 minute 30 secs
A stop job is running for Dynamic S...m Tuning Daemon
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1.Install RHEL 7.3 nightly I used 817 I've seen it in other nightly also
2.reboot the system
A stop job is running for Dynamic S...m Tuning Daemon times out at 1 minute 30 sec
The job should stop successful
[ OK ] Started Show Plymouth Reboot Screen.
[ OK ] Started Restore /run/initramfs.
[ OK ] Stopped Postfix Mail Transport Agent.
[ OK ] Stopped Dynamic System Tuning Daemon.
[ OK ] Stopped target Network.
Stopping Network Manager...
I was using the following system
I requested loan of the machine.
(In reply to Jaroslav Škarvada from comment #1)
> I requested loan of the machine.
it fails on other systems
Created attachment 1193359 [details]
The problem is in python threading library, which doesn't use monotonic timer when handling wait timeouts. Reproducer:
$ date -s "2016-08-19 12:59:42 EDT"
$ ./test.py &
$ date -s "2016-08-19 09:10:26 EDT"
And it stops counting, i.e. the wait doesn't return after the timeout as it should.
Also reproducible with python-2.7.12-2.fc24.x86_64 and python3-3.5.1-13.fc24.x86_64.
Created attachment 1195414 [details]
Workaround for the reproducer is the following (regarding unblocking the process, for e.g. shutdown):
- setup signal handler
- set the threading event inside the signal handler when the signal is recevied
- send signal to the python process which will interrupt the syscall
Unfortunately it's going complicated with the multithreading - you cannot setup signal handler for the thread, so AFAIK there is no way how to interrupt the syscall. By default Python sleeps no more than 50 ms, i.e. it wakes up, but waking the process more than 20x per second is very bad for power consumption. That's why we have experimental low power consumption patch in RHEL which allows applications aware of this feature to sleep more than 50 ms. While this improves power consumption dramatically this may cause the thread to block.
Proposed patch is a fix which shouldn't break anything while ensuring that the wait will not sleep more than 2x timeout (in the worst case) which is much better then sleeping for hours or days which can happen now.
Created attachment 1251887 [details]
How to reproduce:
# date -s "2016-08-19 12:59:42 EDT"
# ./test.py &
# date -s "2016-08-19 09:10:26 EDT"
But maybe better to use two terminals - the first for foreground run of ./test.py, the second for date manipulation.
Created attachment 1256676 [details]
Well, I probably finally sorted it out, sorry for confusion which was caused by my "special" testing environment (is described bellow).
With the previous patch it correctly set the delay to the timeout in case (endtime - _time()) > timeout, i.e. when the time shift happen, this guaranteed that it will never wait longer than timeout if in non-balancing mode (the RHEL specific mode). So let's say it should sleep 5 seconds, but the time shift to the past caused that endtime - _time() is e.g. 900 seconds. With the previous patch it correctly set the delay to 5 seconds. Then it looped to the remaining time calculation and the endtime - _time() is now 895 seconds (i.e. 900 - 5), so it set the timout again to min(895, 5) which is 5. This means again 5 second sleep. The effect is that instead of one long sleep(900) we have multiple sleep(5), so the result seems the same as without the patch.
This new patch improves the situation. It recalculates the endtime if in non-balancing mode. For the previous example it would behave the following way:
endtime - _time() = 900, timeout = 5, so remaining = 5
now new endtime will be set to _time() + 5, then it waits 5 seconds, then it recalculates remaining which will become:
endtime - _time() = 0, timeout 5, so remaining = 0 and it will correctly exit
The patch should be safe because it only modifies non-balancing (RHEL specific) behavior, balancing mode (the default) is not touched. I think the drawback could be small lost of precision due to the arithmetic especially if the sleep is interrupted multiple times. This shouldn't be an issue because the non-balancing mode is targeted to low power consumption and high precision is not the goal.
Now why it previously worked in my testing environment. The explanation is simple, I used VirtualBox which by default periodically synchronizes guest clock with host clock. This means that in few seconds the endtime - _time() becomes lower than timeout and due to the multiple sleep the loop is correctly exited. You can only notice slight increase in the delay caused by delay of the time synchronization, which unfortunately didn't catch my attention to realize that it's not working as expected.
Thanks Jaroslav for the fix. It works now and I can reproduce it.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.