Bug 984420

Summary: Skewed time on one node results in flapping happiness of all agents
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 3.0
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: low
Target Milestone: ---
Target Release: 6.0 (Juno)
Reporter: Jaroslav Henner <jhenner>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Jaroslav Henner <jhenner>
CC: beagles, chrisw, jhenner, lpeer, nyechiel
Keywords: Reopened, ZStream
Flags: jhenner: needinfo-
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-04-02 09:11:59 UTC

Description Jaroslav Henner 2013-07-15 08:12:17 UTC
Description of problem:
When one server has an incorrect time, it causes all agents to flap between XXX and :-) in the output of:
watch -n1 quantum agent-list
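
For context, agent happiness is decided by comparing heartbeat timestamps against a clock, so skew on one side can flip the verdict on every refresh. A minimal sketch of such a check (illustrative only, not the actual Neutron code; agent_down_time is the real neutron.conf option for the timeout):

  from datetime import datetime, timedelta

  AGENT_DOWN_TIME = timedelta(seconds=75)  # cf. agent_down_time in neutron.conf

  def is_agent_alive(last_heartbeat, server_now):
      # :-) while the heartbeat is recent enough, XXX once it falls outside
      # the window; a skewed clock on either side changes the outcome.
      return server_now - last_heartbeat <= AGENT_DOWN_TIME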

Version-Release number of selected component (if applicable):
openstack-quantum-2013.1.2-4.el6ost.noarch

How reproducible:
Always (:, as long as the real time is no longer 2013-07-15 07:57:27+00:00, so that step 2 below actually skews the clock.

Steps to Reproduce:
1. Have multiagent deployment
2. date -s '2013-07-15 07:57:27+00:00' on one node


Actual results:
All agents are flapping.
Nothing relevant in the logs.


Expected results:
Only the one skewed agent goes XXX.
A WARNING about clock skew appears in the logs.

Additional info:
I think the stack should compute the median of the agents' heartbeat timestamps and mark as XXX those agents whose timestamps are more than some threshold away from the median.

I don't think it would be good to use the average, because the average is prone to being pulled away by outliers.
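
A minimal sketch of that median check (hypothetical helper, not actual Neutron code):

  import statistics

  def flag_skewed_agents(heartbeats, threshold_seconds=75.0):
      """heartbeats: mapping of agent id -> last heartbeat (Unix timestamp).

      Returns the ids of agents more than threshold_seconds away from the
      median heartbeat; a single badly skewed node cannot drag the median
      toward itself the way it drags the mean.
      """
      if not heartbeats:
          return set()
      median = statistics.median(heartbeats.values())
      return {aid for aid, ts in heartbeats.items()
              if abs(ts - median) > threshold_seconds}

For example, flag_skewed_agents({'a': 1000.0, 'b': 1002.0, 'c': 5000.0}) returns {'c'}: only the outlier gets marked, instead of the whole stack flapping.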

Also, as Eoghan Glynn noted, the algorithm could be made even smarter by recording each agent's drift and accepting an agent whose current offset matches its long-term drift. But then we would have a problem after the time on the drifted server is corrected.
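
A rough sketch of that drift-tracking refinement (hypothetical names, not a worked-out design):

  def update_drift(avg_offset, new_offset, alpha=0.1):
      # Exponential moving average of the offsets observed for one agent.
      return (1 - alpha) * avg_offset + alpha * new_offset

  def matches_long_term_drift(avg_offset, new_offset, tolerance=5.0):
      # Accept the agent while its current offset is consistent with its
      # historical drift, i.e. there is no sudden jump.
      return abs(new_offset - avg_offset) <= tolerance

As noted above, once the admin corrects the clock on the drifted server, the jump back would itself look like a new skew until the recorded average catches up.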

Comment 3 lpeer 2013-08-13 11:02:55 UTC
I think it is a decent assumption that NTP is installed on the hypervisors.
I assume that not having NTP can cause more issues, but I wouldn't focus on fixing them.
There is an option to configure NTP in PackStack.
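
For reference, PackStack takes the NTP servers in its answer file along these lines (CONFIG_NTP_SERVERS is the answer-file key; the server list here is just an example):

  CONFIG_NTP_SERVERS=0.pool.ntp.org,1.pool.ntp.org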

I'm closing this bug, feel free to reopen if you think I missed something.

Comment 4 Jaroslav Henner 2013-08-13 18:34:24 UTC
I have two issues with closing this as WONTFIX:
 * I think there should be an informative message in some log stating that the time skew is the issue.
 * A single node whose time is skewed for whatever reason (e.g. a problem reaching an external NTP server) causes the whole stack to malfunction.

Comment 5 Bob Kukura 2013-11-14 19:26:44 UTC
The second item in comment 4 seems like a real bug. I think we should at least try to reproduce this behaviour.

Comment 6 lpeer 2013-11-17 12:04:49 UTC
Brent, 
I would check what happens when the administrator fixes the skewed time on the node: does everything go back to normal?

I think answering the above question could help us determine the priority of this bug.

Comment 7 Brent Eagles 2013-11-18 20:41:41 UTC
Agreed. Jaroslav, can you confirm whether setting the time back in sync resolves the flapping? It would also be interesting to retest: I couldn't get things to "flap" in a simple VM-based multinode environment using Havana and upstream trunk code. Some of the agent state update code has been modified quite a bit, so this might have been resolved.
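
For the retest, something along these lines should show whether the flapping clears once the clock is back in sync (standard commands, nothing specific to Neutron):

  # on the node with the skewed clock: resync against an NTP server
  ntpdate -u 0.pool.ntp.org
  # then watch the agent states settle
  watch -n1 quantum agent-list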

Comment 8 lpeer 2013-12-03 12:55:11 UTC
Since this is not critical, we can look into this in the Icehouse time frame.