Bug 984420 - Skewed time on one node results in flapping happiness of all agents
Summary: Skewed time on one node results in flapping happiness of all agents
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 6.0 (Juno)
Assignee: RHOS Maint
QA Contact: Jaroslav Henner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-07-15 08:12 UTC by Jaroslav Henner
Modified: 2019-01-17 13:05 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-04-02 09:11:59 UTC
Target Upstream Version:
Embargoed:
jhenner: needinfo-




Links: Launchpad 1201316

Description Jaroslav Henner 2013-07-15 08:12:17 UTC
Description of problem:
When one server has incorrect time, it causes all agents to flap between XXX and :-) in
watch -n1 quantum agent-list

Version-Release number of selected component (if applicable):
openstack-quantum-2013.1.2-4.el6ost.noarch

How reproducible:
Always (whenever the current time is not 2013-07-15 07:57:27+00:00)

Steps to Reproduce:
1. Have multiagent deployment
2. date -s '2013-07-15 07:57:27+00:00' on one node


Actual results:
all agents flapping
nothing in logs


Expected results:
one agent XXX
WARNING about clock skew in logs
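The expected WARNING could look roughly like the following sketch. This is illustrative only: `MAX_SKEW`, `check_report_time`, and the log wording are assumptions, not actual Neutron code.

```python
import logging
from datetime import datetime, timedelta, timezone

LOG = logging.getLogger(__name__)

# Illustrative threshold; a real implementation would likely tie this
# to the existing agent_down_time / report_interval settings.
MAX_SKEW = timedelta(seconds=30)

def check_report_time(agent_id, report_time):
    """Warn when an agent's state-report timestamp differs from the
    server clock by more than MAX_SKEW. Returns True if skew was seen."""
    delta = abs(datetime.now(timezone.utc) - report_time)
    if delta > MAX_SKEW:
        LOG.warning("Clock skew of %s detected for agent %s; "
                    "check NTP on that node", delta, agent_id)
        return True
    return False
```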

Additional info:
I think the stack should compute the median over the agents' timestamps and mark as XXX those agents whose timestamps are more than some threshold away from the median.

I don't think it would be good to use the average, because the average is prone to being dragged away by outliers.

Also, as Eoghan Glynn noted, the algorithm could be made even smarter by recording each agent's drift and accepting an agent if its long-term drift matches the drifted timestamps. But then we would have a problem after the time is corrected on the drifted server.
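The median-based idea above could be sketched as follows. This is a minimal sketch, not Neutron's actual implementation: `SKEW_THRESHOLD` and the `heartbeats` mapping are assumed names for illustration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold, e.g. derived from agent_down_time.
SKEW_THRESHOLD = timedelta(seconds=75)

def find_skewed_agents(heartbeats):
    """Return the agents whose last heartbeat timestamp is more than
    SKEW_THRESHOLD away from the median of all heartbeats.

    heartbeats: dict mapping agent id -> datetime of last report.
    """
    times = sorted(heartbeats.values())
    # The median is robust: a single skewed node cannot drag it away,
    # whereas a mean can be shifted arbitrarily far by one outlier.
    median_ts = times[len(times) // 2]
    return {
        agent for agent, ts in heartbeats.items()
        if abs(ts - median_ts) > SKEW_THRESHOLD
    }
```

With four agents where one clock is an hour behind, only that one agent would be marked, instead of all agents flapping.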

Comment 3 lpeer 2013-08-13 11:02:55 UTC
I think it is a decent assumption that NTP is installed on the hypervisors.
I assume that not having NTP can cause more issues but I wouldn't focus on fixing them.
There is an option to configure NTP in PackStack.

I'm closing this bug, feel free to reopen if you think I missed something.

Comment 4 Jaroslav Henner 2013-08-13 18:34:24 UTC
I have two issues with closing this as WONTFIX:
 * I think there should be an informative message in some log indicating that time skew is the issue.
 * A single node whose time is skewed, for whatever reason (e.g. a problem connecting to an external NTP server), causes a malfunction of the whole stack.

Comment 5 Bob Kukura 2013-11-14 19:26:44 UTC
The second item in comment 4 seems like a real bug. I think we should at least try to reproduce this behaviour.

Comment 6 lpeer 2013-11-17 12:04:49 UTC
Brent, 
I would check what happens if the administrator fixes the skewed time on the node: does everything go back to normal?

I think answering the above question could help us determine the priority of this bug.

Comment 7 Brent Eagles 2013-11-18 20:41:41 UTC
Agreed. Jaroslav, can you confirm whether setting the time back in sync resolves the flapping? Also, it would be interesting to retest. I couldn't get things to "flap" in a simple VM-based multinode environment using Havana and upstream trunk code. Some of the agent state update code has been modified quite a bit, so this might have been resolved.

Comment 8 lpeer 2013-12-03 12:55:11 UTC
Since this is not critical, we can look into this in the Icehouse time frame.

