1418938 – Openshift doesn't detected openstack VM outage

Summary: Openshift doesn't detected openstack VM outage

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OKD
Classification:	Red Hat
Component:	Pod
Sub Component:
Version:	3.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Maru Newby
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-02-03 08:33 UTC by a.vorozhbieva
Modified:	2017-04-12 18:04 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-04-12 18:04:09 UTC
Target Upstream Version:
Embargoed:

Description a.vorozhbieva 2017-02-03 08:33:57 UTC

OpenShift did not detect an OpenStack VM outage for 20 minutes.
After the host was switched off, OpenShift continued for 20 minutes to treat the VMs hosted on it as ready. It did not start pod evacuation; pods were shown as Running and nodes were shown as Ready in the OpenShift console.

Could you please explain this behaviour?

Version
OpenShift Master: v1.3.1
Kubernetes Master: v1.3.0+52492b4

Comment 1 Derek Carr 2017-02-03 19:01:26 UTC

I would have expected the ~5 min wait in the node controller.

Maru, can you help diagnose?

Comment 2 Maru Newby 2017-02-03 21:10:03 UTC

Against HEAD (not 1.3) I've just demonstrated that node removal is detected in ~30s.  I'll see if I can figure out what might be causing the problem you're seeing.

Comment 3 Maru Newby 2017-02-03 22:48:34 UTC

I can't replicate your issue with the detail provided.   

The node controller should be marking a node that hasn't checked in for 40s as NotReady, and then, as Derek says, the controller will evict pods from the node if the node has remained in that state for longer than the eviction timeout (defaults to 5m).
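
The timing described above can be put into numbers with a minimal sketch. It assumes only the two values mentioned in this comment (40s without a heartbeat, then a 5m eviction timeout); the constant and function names are hypothetical and are not taken from the OpenShift code base.

```python
from datetime import timedelta

# Values quoted in the comment above; the names are illustrative only.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)  # no heartbeat for this long -> NotReady
POD_EVICTION_TIMEOUT = timedelta(minutes=5)        # NotReady for longer than this -> eviction

# Time from the last heartbeat until eviction is expected to begin.
TIME_TO_EVICTION = NODE_MONITOR_GRACE_PERIOD + POD_EVICTION_TIMEOUT


def expected_state(time_since_heartbeat: timedelta) -> str:
    """Return the node state expected after the given silence from the kubelet."""
    if time_since_heartbeat < NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if time_since_heartbeat <= TIME_TO_EVICTION:
        return "NotReady"
    return "NotReady, pods evicted"


if __name__ == "__main__":
    for minutes in (0.5, 2, 20):
        silence = timedelta(minutes=minutes)
        print(f"{silence} without heartbeat: {expected_state(silence)}")
```

Under these assumptions a node that stops reporting should have its pods evicted a little under six minutes later, which is why a 20-minute delay points at something other than the default timers.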

For clarity, are you saying that openshift didn't detect that the node wasn't healthy until 20m had passed, but that it did detect the problem and evict the pods at that point?  

Or did the vm outage last 20m and for the duration of the outage openshift didn't react?  If so, is it possible that the kubelet on the vm in question retained connectivity to the master somehow?  It may be worth checking the master logs for the outage interval for mention of the affected node to figure out whether the node controller was receiving status updates from the kubelet.

Comment 4 a.vorozhbieva 2017-02-08 11:01:12 UTC

We have OpenShift 1.3.1 with OpenStack integration enabled. The /etc/origin/cloudprovider/openstack.conf file exists on each OpenShift node. 
The configurations of all OpenShift nodes reference this file. We use it to be able to work with cinder volumes.

Could you please answer the following questions:

1) How often does OpenShift query OpenStack about its nodes, and what exactly does it request?
2) Does each node make requests independently to receive information about itself? Or do the masters additionally request information about all nodes?
3) What can influence the decision that nodes are no longer available and should be set to the NotReady state?

In our case a hardware node went down and all the OpenShift nodes on it became unavailable. 
But the information that these OpenShift nodes had entered the NotReady state appeared in the log files of 1 (of 3) OpenShift masters only after ~20m.
Then pod evacuation started. 

4) Could the absence of a quorum between the remaining 2 masters (of 3) influence the decision?
5) Or could the absence of an adequate response from OpenStack influence the decision?

Comment 5 Maru Newby 2017-02-08 14:43:01 UTC

(In reply to a.vorozhbieva from comment #4)
> We have OpenShift 1.3.1 with enabled integration with OpenStack.
> /etc/origin/cloudprovider/openstack.conf file is exist on each OpenShift
> node. 
> Configurations of all Openshift nodes reference to this file. We use it to
> be able to work with cinder volumes.
> 
> Could you please answer the following questions:
> 
> 1) How often and what exactly does OpenShift request from Openstack about
> all its nodes ?

OpenStack doesn't have a concept of 'nodes', so OpenShift can't query OpenStack for details about nodes.  The node controller does check with the cloud provider to know when the vm instance associated with a node no longer exists so that an unhealthy node on a missing vm can be deleted.

For details about how the node controller works, see: https://kubernetes.io/docs/admin/node/#node-controller

> 2) Does each node do requests independently to receive information about
> itself ? Or Do masters additionally request information about all nodes ?

The master receives heartbeats from each node individually.


> 3) What can influence decision-making that nodes are no longer available and
> set NotReady state for them ?

The master will mark a node as unhealthy if heartbeats are not received from it for a configured minimum interval.


> In our case hardware node was off and all its Openshift nodes became
> unavailable. 
> But information about these Openshift nodes get NotReady stage appeared in
> log files of 1 (of 3) Openshift masters in ~20m.
> Then pod evacuation started. 
> 
> 4) Could the absence of a quorum between remaining 2 masters(of 3) influence
> decision-making ?

Are you running 'smashed' masters (i.e. etcd in the same process as openshift), and the current master was on the same hardware node that went down?  Then quorum failure would be likely, and that might explain the issue you're seeing.  

The leader election mechanism relies on etcd, and without a quorum (minimum 3) it would not be possible to elect a new master when the current master went offline.  The node controller runs as part of the master process, so handling of the node failure would have to wait until quorum was restored, a new master was elected, and the node controller had a chance to respond to the node state.

Basically, if you're running only 3 instances of etcd (or only 3 smashed masters) and one instance goes offline, the cluster control plane is effectively down.  Workloads on healthy nodes will continue to run, but anything requiring control plane coordination (like workload migration) won't happen until the control plane comes back up.  The quorum requirement is a fundamental property of etcd, and production use requires either HA or more instances to provide some guarantee of resiliency.


> 5) Or could absence adequate response from OpenStack influence
> decision-making ?

As per the explanation to 1), I think that's unlikely.

Comment 6 a.vorozhbieva 2017-02-13 12:17:19 UTC

1) > Are you running 'smashed' masters (i.e. etcd in the same process as openshift)
We don't have external etcd; we have 3 etcd instances installed along with the 3 masters. Is the etcd leader always on the same hardware node as the leader master?

2) > the current master was on the same hardware node that went down?
How can we find out whether the current master was on the same hardware node?

3) > if you're running only 3 instances of etcd (or only 3 smashed masters) and one instance goes offline, the cluster control plane is effectively down. Workloads on healthy nodes will continue to run, but anything requiring control plane coordination (like workload migration) won't happen until the control plane comes back up.
Do you really mean that 3 instances of etcd are needed for high availability (i.e. 2 instances are not enough)? 
1 etcd and 1 master went offline and were still offline after 20m had passed, but despite this OpenShift was able to detect the OpenStack VM outage and evacuate pods.

Comment 7 Maru Newby 2017-02-13 15:56:09 UTC

(In reply to a.vorozhbieva from comment #6)
> 1) > Are you running 'smashed' masters (i.e. etcd in the same process as
> openshift)
> We don't have external etcd, we have 3 etcd installed along with 3 masters.
> Is leader etcd always on the same hardware node as leader master?

There is a chance that they are.  You'd have to check the logs for evidence of an etcd leadership election to tell whether that was true in your case.


> 
> 2) > the current master was on the same hardware node that went down?
> How can we find out if the current master was on the same hardware node ?

Check the logs for evidence of the election of a new openshift master on the remaining masters.

> 
> 3) > if you're running only 3 instances of etcd (or only 3 smashed masters)
> and one instance goes offline, the cluster control plane is effectively
> down. Workloads on healthy nodes will continue to run, but anything
> requiring control plane coordination (like workload migration) won't happen
> until the control plane comes back up.
> Do you realy mean that for high availability 3 instances of etcd is needed
> (i.e. 2 instances is not enough)? 
> If 1 etcd and 1 master went offline - they were still offline after 20m
> passed, but despite this Openshift was able detect openstack VM outage and
> evacuate pods.

I realize I was conflating fault tolerance with quorum.  If a 3 node etcd cluster loses one member, it still has quorum; it just can't survive the loss of another member:

https://github.com/coreos/etcd/blob/master/Documentation/faq.md#what-is-failure-tolerance
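
The distinction can be made concrete with a short sketch of the majority arithmetic. The function names are hypothetical; the only assumption is that quorum means a strict majority of cluster members.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 5):
        print(f"{size} members: quorum {quorum(size)}, "
              f"tolerates {failure_tolerance(size)} failure(s)")
```

For 3 members the quorum is 2, so one failure is survivable. For 2 members the quorum is also 2, so a 2 member cluster tolerates no failures at all.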

Recovery from failure should not take 20m for a properly configured openshift cluster.  One reason it could take that long is clock skew.  Was ntp running on all the master hosts, and are their clocks able to stay closely synchronized as a result?

If you care about the reliability of your cluster, consider running an etcd cluster separately from the origin masters rather than using smashed masters.  I'd also recommend using anti-affinity placement on openstack VMs to avoid putting masters and nodes on the same VMs.  Separating failure domains is best practice for distributed systems - it ensures that when (not if) a failure occurs, the impact is minimized.

Comment 8 Maru Newby 2017-03-24 16:38:31 UTC

Have there been further developments on this issue?

Comment 9 a.vorozhbieva 2017-04-03 12:02:59 UTC

Our clocks were fine, so unfortunately we couldn't find out what happened. The problem has not reproduced since, and we decided not to spend more time on this issue.
