Red Hat Bugzilla – Bug 984420
Skewed time on one node results in flapping happiness of all agents
Last modified: 2016-04-26 19:07:37 EDT
Description of problem:
When one server has an incorrect time, all agents keep switching between XXX and :-) in the output of
watch -n1 quantum agent-list
Version-Release number of selected component (if applicable):
(:, when the time is not 2013-07-15 07:57:27+00:00
Steps to Reproduce:
1. Have multiagent deployment
2. date -s '2013-07-15 07:57:27+00:00' on one node
Actual results:
all agents flapping
nothing in logs
Expected results:
one agent XXX
WARNING about clock skew in logs
I think the stack should compute the median of the agents' timestamps and mark as XXX those agents that are more than some threshold away from the median.
I don't think it would be good to use the average, because the average is easily pulled off by outliers.
Also, as Eoghan Glynn noted, the algorithm could be made even smarter by recording each agent's drift and accepting the agent if its long-term drift matches the drifted timestamps. But then we would have a problem after correcting the time on the drifted server.
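To make the median proposal concrete, here is a minimal sketch in plain Python. It is not taken from the quantum code base: the function name find_skewed_agents, the agent names, and the 30-second threshold are all assumptions chosen for illustration. It flags only the agents whose reported timestamp is further than the threshold from the median, and it lets the mean be swapped in to show why the average is a poor choice.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, median


def find_skewed_agents(heartbeats, threshold=timedelta(seconds=30), center=median):
    """Return the sorted names of agents whose timestamp is too far from the center.

    heartbeats: dict mapping a (hypothetical) agent name to an aware datetime.
    threshold:  maximum tolerated distance from the center (assumed value).
    center:     function used to compute the reference time (median by default).
    """
    if not heartbeats:
        return []
    # Convert to seconds since the epoch so median/mean are well defined.
    seconds = {name: ts.timestamp() for name, ts in heartbeats.items()}
    reference = center(seconds.values())
    limit = threshold.total_seconds()
    return sorted(name for name, s in seconds.items() if abs(s - reference) > limit)


base = datetime(2013, 7, 15, 7, 57, 27, tzinfo=timezone.utc)
heartbeats = {
    "agent-a": base,
    "agent-b": base + timedelta(seconds=2),
    "agent-c": base - timedelta(seconds=1),
    "agent-d": base + timedelta(hours=3),  # the node with the wrong clock
}

skewed = find_skewed_agents(heartbeats)
for name in skewed:
    print(f"WARNING: clock skew suspected on {name}")

# With the mean as reference, the single outlier drags the reference time
# so far that every agent ends up outside the threshold.
skewed_by_mean = find_skewed_agents(heartbeats, center=mean)
```

With the median, only agent-d is reported. With the mean, all four agents exceed the threshold, which is the whole-stack failure mode this bug describes.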
I think it is a decent assumption that NTP is installed on the hypervisors.
I assume that not having NTP can cause more issues but I wouldn't focus on fixing them.
There is an option to configure NTP in PackStack.
I'm closing this bug, feel free to reopen if you think I missed something.
I have two issues with closing this as WONTFIX:
* I think there should be an informative message in some log indicating that time skew is the issue.
* A single node whose time is skewed for whatever reason, such as a problem connecting to an external NTP server, causes the whole stack to malfunction.
The second item in comment 4 seems like a real bug. I think we should at least try to reproduce this behaviour.
I would check what happens if the administrator fixes the skewed time on the node: does everything go back to normal?
I think answering the above question could help us determine the priority of this bug.
Agreed. Jaroslav, can you confirm whether setting the time back in sync resolves the flapping? It would also be interesting to retest. I couldn't get things to "flap" in a simple VM-based multinode environment using Havana and upstream trunk code. Some of the agent state update code has been modified quite a bit, so this might have been resolved.
Since this is not critical we can look into this in the Icehouse time frame.