Bug 984420 - Skewed time on one node results of flapping happiness of all agents [NEEDINFO]
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Hardware: x86_64 Linux
Priority: low, Severity: medium
: ---
: 6.0 (Juno)
Assigned To: RHOS Maint
Jaroslav Henner
: Reopened, ZStream
Depends On:
Reported: 2013-07-15 04:12 EDT by Jaroslav Henner
Modified: 2016-04-26 19:07 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2015-04-02 05:11:59 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
jhenner: needinfo-
beagles: needinfo? (jhenner)

External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1201316 None None None Never

Description Jaroslav Henner 2013-07-15 04:12:17 EDT
Description of problem:
When one server has an incorrect time, all agents keep switching between XXX and :-) in the output of
watch -n1 quantum agent-list

Version-Release number of selected component (if applicable):

How reproducible:
(:, when the time is not 2013-07-15 07:57:27+00:00

Steps to Reproduce:
1. Have a multi-agent deployment
2. Run date -s '2013-07-15 07:57:27+00:00' on one node

Actual results:
all agents flapping
nothing in logs

Expected results:
one agent XXX
WARNING about clock skew in logs

Additional info:
I think the stack should compute the median of the agents' timestamps and mark as XXX those agents that are more than some threshold away from the median.

I don't think it would be good to use the average, because the average is easily pulled away by outliers.

Also, as Eoghan Glynn noted, the algorithm could be made even smarter by recording each agent's drift and accepting the agent if its long-term drift matches its drifted timestamps. But then we would have a problem after correcting the time on the drifted server.
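
The median idea above could be sketched roughly as follows. This is only an illustration of the proposal, not code from quantum/neutron: the function name find_skewed_agents, the heartbeats mapping, and the 75-second threshold are all assumptions made for the example, and it presumes the timestamps were reported at about the same moment.

```python
import logging
from datetime import datetime, timedelta, timezone
from statistics import median

LOG = logging.getLogger(__name__)


def find_skewed_agents(heartbeats, threshold=timedelta(seconds=75)):
    """Return the names of agents whose reported timestamp lies more than
    `threshold` away from the median of all reported timestamps.

    `heartbeats` is a hypothetical mapping of agent name -> timezone-aware
    datetime taken from the agent's own clock.
    """
    if not heartbeats:
        return []

    # Work in epoch seconds so the median is a plain number.
    epoch = {name: ts.timestamp() for name, ts in heartbeats.items()}
    mid = median(epoch.values())

    skewed = []
    for name, secs in sorted(epoch.items()):
        offset = secs - mid
        if abs(offset) > threshold.total_seconds():
            # The warning the report asks for under "Expected results".
            LOG.warning("Clock skew suspected on agent %s: %+.0f s from median",
                        name, offset)
            skewed.append(name)
    return skewed


if __name__ == "__main__":
    base = datetime(2013, 7, 15, 7, 57, 27, tzinfo=timezone.utc)
    sample = {
        "l3": base,
        "dhcp": base + timedelta(seconds=2),
        "ovs-1": base + timedelta(seconds=1),
        "ovs-2": base - timedelta(hours=3),  # the node with the wrong clock
    }
    print(find_skewed_agents(sample))
```

Because the median ignores how far away an outlier is, a single node that is hours off does not move the reference point, which is the reason given above for preferring it over the average.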
Comment 3 lpeer 2013-08-13 07:02:55 EDT
I think it is a decent assumption that NTP is installed on the hypervisors.
I assume that not having NTP can cause more issues but I wouldn't focus on fixing them.
There is an option to configure NTP in PackStack.

I'm closing this bug, feel free to reopen if you think I missed something.
Comment 4 Jaroslav Henner 2013-08-13 14:34:24 EDT
I have two issues with closing this as WONTFIX:
 * I think there should be an informative message in some log saying that time skew is the issue.
 * A single node whose time is skewed for whatever reason, such as a problem connecting to an external NTP server, causes the whole stack to malfunction.
Comment 5 Bob Kukura 2013-11-14 14:26:44 EST
The second item in comment 4 seems like a real bug. I think we should at least try to reproduce this behaviour.
Comment 6 lpeer 2013-11-17 07:04:49 EST
I would check what happens if the administrator fixes the skewed time on the node: does everything go back to normal?

I think answering the above question could help us determine the priority of this bug.
Comment 7 Brent Eagles 2013-11-18 15:41:41 EST
Agreed. Jaroslav, can you confirm whether setting the time back in sync resolves the flapping? It would also be interesting to retest. I couldn't get things to "flap" in a simple VM-based multinode environment using Havana and upstream trunk code. Some of the agent state update code has been modified quite a bit, so this might have been resolved.
Comment 8 lpeer 2013-12-03 07:55:11 EST
Since this is not critical we can look into this in the Icehouse time frame.
