Bug 1400574 - OpenShift cluster fails when OpenStack api is down
Summary: OpenShift cluster fails when OpenStack api is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Seth Jennings
QA Contact: Gan Huang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-01 14:09 UTC by Brendan Mchugh
Modified: 2020-05-14 15:27 UTC
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where OpenShift nodes configured with OpenStack as the cloud provider move into the NotReady state if contact with the OpenStack API is lost. Now nodes remain in the Ready state even if the OpenStack API is not responding. Note that a new node process configured to use OpenStack cloud integration cannot start while the OpenStack API is unresponsive.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:17:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Brendan Mchugh 2016-12-01 14:09:33 UTC
Description of problem:

If the OpenStack API is down, the OpenShift cluster goes into "NotReady" (all nodes and masters).
We are using Cinder volumes from OpenStack.
We only use the OpenStack integration for Cinder volumes.
We expect to be able to use OpenShift even if the OpenStack API is down, with the already imported Cinder volumes.

Version-Release number of selected component (if applicable):
Openshift 3.3

We are experiencing this behavior only in our environment that has the cloud provider specified (openshift_cloudprovider_kind=openstack).

How reproducible:
Always

Steps to Reproduce:
1. Openstack API is down
2.
3.

Actual results:
Nodes and Masters go into state Not Ready

Expected results:
Nodes and Masters should stay in state Ready

Additional info:
Seems related to the following issue: https://github.com/kubernetes/kubernetes/issues/34455

Comment 1 Seth Jennings 2016-12-01 18:51:00 UTC
I am able to recreate.

This is caused by setNodeAddress(), a function in the defaultNodeStatusFuncs() chain called from setNodeStatus(), returning an error that aborts the entire node status update.

origin-node[19569]: W1201 18:39:05.090399   19620 openstack.go:285] Failed to find compute flavors: Get http://10.42.10.33:8774/v2.1/ec3b48e1bb3448e6a1348ccf82854277/flavors
origin-node[19569]: E1201 18:39:05.090430   19620 kubelet.go:2971] Error updating node status, will retry: failed to get instances from cloud provider

Since the node status is not being updated, the master eventually moves the nodes to the NotReady state because they are not reporting status.

The failure of setNodeAddress() should not abort the entire node status update chain.

Working on a fix.
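
To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern described above (illustrative only, not the actual kubelet source; the type and function names are hypothetical): a chain of node-status setter functions in which the first error aborts the whole update, contrasted with a variant that logs the error and keeps going, which is the direction of the fix.

// Illustrative sketch only -- not the actual kubelet source.
package main

import (
	"errors"
	"fmt"
)

// nodeStatus is a stand-in for the node status object the kubelet reports.
type nodeStatus struct {
	Addresses []string
	Ready     bool
}

// setNodeAddress simulates the cloud-provider address lookup failing
// because the OpenStack API is unreachable.
func setNodeAddress(s *nodeStatus) error {
	return errors.New("failed to get instances from cloud provider")
}

// setNodeReadyCondition simulates the setter that refreshes the Ready condition.
func setNodeReadyCondition(s *nodeStatus) error {
	s.Ready = true
	return nil
}

// setNodeStatusAbortOnError mirrors the problematic behavior: the first error
// stops the chain, the Ready condition is never refreshed, and the master
// eventually marks the node NotReady because it stops reporting status.
func setNodeStatusAbortOnError(s *nodeStatus, funcs []func(*nodeStatus) error) error {
	for _, f := range funcs {
		if err := f(s); err != nil {
			return err
		}
	}
	return nil
}

// setNodeStatusContinueOnError mirrors the fix direction: errors from
// individual setters are logged but the remaining setters still run.
func setNodeStatusContinueOnError(s *nodeStatus, funcs []func(*nodeStatus) error) {
	for _, f := range funcs {
		if err := f(s); err != nil {
			fmt.Println("warning:", err)
		}
	}
}

func main() {
	funcs := []func(*nodeStatus) error{setNodeAddress, setNodeReadyCondition}

	before := &nodeStatus{}
	if err := setNodeStatusAbortOnError(before, funcs); err != nil {
		fmt.Println("abort-on-error: update aborted:", err, "-> Ready =", before.Ready) // Ready = false
	}

	after := &nodeStatus{}
	setNodeStatusContinueOnError(after, funcs)
	fmt.Println("continue-on-error: Ready =", after.Ready) // Ready = true
}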

Comment 3 Seth Jennings 2017-01-04 19:17:14 UTC
Fix merged upstream kube:
https://github.com/kubernetes/kubernetes/pull/37846

Comment 4 Seth Jennings 2017-01-19 19:41:31 UTC
Origin PR:
https://github.com/openshift/origin/pull/12570

Comment 5 Troy Dawson 2017-01-27 17:34:05 UTC
This has been merged into OCP and is included in OCP v3.5.0.10 or newer.

Comment 6 Gan Huang 2017-02-07 09:30:01 UTC
1) Reproduced with 
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node became "NotReady" after stopping openstack-nova-*

2) Tested against
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node was still in "Ready" status after stopping openstack-nova-*, and the router and docker-registry were still accessible.

But restarting atomic-openshift-node failed:

Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: I0207 04:26:39.956311   67595 openstack_instances.go:42] openstack.I
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: W0207 04:26:39.957012   67595 openstack_instances.go:75] Failed to f
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: F0207 04:26:39.957050   67595 node.go:323] failed to run Kubelet: fa
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.

Comment 7 Gan Huang 2017-02-07 09:33:42 UTC
Pasting the detailed log:

Feb  7 04:30:33 openshift-200 atomic-openshift-node: I0207 04:30:33.978698   70028 openstack_instances.go:42] openstack.Instances() called
Feb  7 04:30:33 openshift-200 atomic-openshift-node: W0207 04:30:33.979632   70028 openstack_instances.go:75] Failed to find compute flavors: Get http://10.66.147.11:8774/v2/640d994684f3480baf62328d55de6ae7/flavors/detail: dial tcp 10.66.147.11:8774: getsockopt: connection refused
Feb  7 04:30:33 openshift-200 atomic-openshift-node: F0207 04:30:33.979684   70028 node.go:323] failed to run Kubelet: failed to get instances from cloud provider
Feb  7 04:30:33 openshift-200 systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb  7 04:30:33 openshift-200 systemd: Failed to start Atomic OpenShift Node

Comment 8 Seth Jennings 2017-02-08 15:22:51 UTC
Gan,

It is true that the OpenStack API being down still prevents the node process from starting. That is a separate issue, though.

This PR fixes the issue where already running node processes transition into the NotReady state if the OpenStack API stops responding.

It seems that you confirmed that this is working.
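
For clarity on the startup-vs-running distinction, here is a minimal Go sketch (illustrative only, not the real node startup code; cloudProvider, openstackDown, and runKubelet are hypothetical names) of why a freshly started node process still fails fatally when it cannot reach the cloud provider's Instances() interface, matching the "failed to run Kubelet: failed to get instances from cloud provider" log above, while an already running node only touches the cloud provider during periodic status updates that now tolerate the failure.

// Illustrative sketch only -- not the real node startup code.
package main

import (
	"errors"
	"fmt"
	"os"
)

// instances and cloudProvider are stand-ins for the cloud provider
// integration interfaces the node consults at startup.
type instances interface {
	NodeAddresses(name string) ([]string, error)
}

type cloudProvider interface {
	Instances() (instances, error)
}

// openstackDown simulates an OpenStack endpoint that refuses connections.
type openstackDown struct{}

func (openstackDown) Instances() (instances, error) {
	return nil, errors.New("failed to get instances from cloud provider")
}

// runKubelet mirrors the fatal startup path seen in the logs above:
// the cloud provider must answer before the node process can come up.
func runKubelet(cloud cloudProvider) error {
	if _, err := cloud.Instances(); err != nil {
		return fmt.Errorf("failed to run Kubelet: %v", err)
	}
	// ...the rest of node startup would continue here...
	return nil
}

func main() {
	if err := runKubelet(openstackDown{}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(255) // matches the status=255 exit that systemd reports
	}
}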

Comment 9 Gan Huang 2017-02-09 07:32:31 UTC
Thanks for your clarification, Seth!

Yes, I have confirmed that it's working after the OpenStack API was down.

Moving to verified per comment 6 and comment 8.

Comment 11 errata-xmlrpc 2017-04-12 19:17:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

