Bug 1400574 - OpenShift cluster fails when OpenStack api is down
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Seth Jennings
QA Contact: Gan Huang
Docs Contact:
Depends On:
Blocks:
Reported: 2016-12-01 09:09 EST by Brendan Mchugh
Modified: 2017-07-24 10 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Fixes an issue where OpenShift nodes configured with OpenStack as the cloud provider move into NotReady state if contact with the OpenStack API is lost. Now nodes remain in Ready state even if the OpenStack API is not responding. Note that a new node process configured to use OpenStack cloud integration still cannot start unless the OpenStack API is responsive.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-12 15:17:34 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID: Red Hat Product Errata RHBA-2017:0884
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.5 RPM Release Advisory
Last Updated: 2017-04-12 18:50:07 EDT

Description Brendan Mchugh 2016-12-01 09:09:33 EST
Description of problem:

If the OpenStack API is down, the OpenShift cluster goes into "NotReady" (all nodes and masters).
We use the OpenStack integration only for Cinder volumes.
We expect to be able to use OpenShift even if the OpenStack API is down, with the already imported Cinder volumes.

Version-Release number of selected component (if applicable):
OpenShift 3.3

We are experiencing this behavior only in our environment that has the cloud provider specified (openshift_cloudprovider_kind=openstack).

How reproducible:
Always

Steps to Reproduce:
1. Bring the OpenStack API down.

Actual results:
Nodes and masters go into NotReady state.

Expected results:
Nodes and masters should stay in Ready state.

Additional info:
Seems related to the following issue: https://github.com/kubernetes/kubernetes/issues/34455
Comment 1 Seth Jennings 2016-12-01 13:51:00 EST
I am able to recreate.

This is caused by setNodeAddress(), a function in the defaultNodeStatusFuncs() chain called from setNodeStatus(), returning an error and causing the entire node status update to abort.

origin-node[19569]: W1201 18:39:05.090399   19620 openstack.go:285] Failed to find compute flavors: Get http://10.42.10.33:8774/v2.1/ec3b48e1bb3448e6a1348ccf82854277/flavors
origin-node[19569]: E1201 18:39:05.090430   19620 kubelet.go:2971] Error updating node status, will retry: failed to get instances from cloud provider

Since the node status is not being updated, the master eventually moves them to NotReady state as they are not reporting status.

The failure of setNodeAddress() should not abort the entire node status update chain.
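For illustration, a stand-alone, simplified Go sketch of that chain (the type and setter names are stand-ins mirroring the functions mentioned above, not the actual kubelet code): aborting on the first setter error leaves the ready condition untouched, while recording the error and continuing still reports the rest of the status.

package main

import (
	"errors"
	"fmt"
)

// node is a stand-in for the node object whose status is being updated.
type node struct {
	ready bool
}

// setNodeAddress stands in for the setter that asks the cloud provider
// (OpenStack here) for the node's addresses; it fails while the API is down.
func setNodeAddress(n *node) error {
	return errors.New("failed to get instances from cloud provider")
}

// setNodeReadyCondition stands in for setters that need nothing from the
// cloud provider.
func setNodeReadyCondition(n *node) error {
	n.ready = true
	return nil
}

// setNodeStatus runs every setter in the chain. Returning on the first error
// (current behavior) means an OpenStack outage blocks the whole update, so the
// master eventually marks the node NotReady; recording the error and
// continuing (the intended fix) still reports the remaining status.
func setNodeStatus(n *node, abortOnFirstError bool) error {
	setters := []func(*node) error{setNodeAddress, setNodeReadyCondition}
	var errs []error
	for _, f := range setters {
		if err := f(n); err != nil {
			if abortOnFirstError {
				return err // current behavior: the ready condition is never set
			}
			errs = append(errs, err) // intended fix: keep going
		}
	}
	return errors.Join(errs...) // requires Go 1.20+
}

func main() {
	current := &node{}
	fmt.Println(setNodeStatus(current, true), "ready:", current.ready) // ready: false
	fixed := &node{}
	fmt.Println(setNodeStatus(fixed, false), "ready:", fixed.ready) // ready: true
}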

Working on a fix.
Comment 3 Seth Jennings 2017-01-04 14:17:14 EST
Fix merged upstream kube:
https://github.com/kubernetes/kubernetes/pull/37846
Comment 4 Seth Jennings 2017-01-19 14:41:31 EST
Origin PR:
https://github.com/openshift/origin/pull/12570
Comment 5 Troy Dawson 2017-01-27 12:34:05 EST
This has been merged into OCP and is in OCP v3.5.0.10 or newer.
Comment 6 Gan Huang 2017-02-07 04:30:01 EST
1) Reproduced with 
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node became "NotReady" after stopping openstack-nova-*

2) Tested against
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node was still in "Ready" status after stopping openstack-nova-*, and I was able to access the router and docker-registry.

But restarting atomic-openshift-node failed:

Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: I0207 04:26:39.956311   67595 openstack_instances.go:42] openstack.I
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: W0207 04:26:39.957012   67595 openstack_instances.go:75] Failed to f
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: F0207 04:26:39.957050   67595 node.go:323] failed to run Kubelet: fa
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.
Comment 7 Gan Huang 2017-02-07 04:33:42 EST
Pasting the detailed log:

Feb  7 04:30:33 openshift-200 atomic-openshift-node: I0207 04:30:33.978698   70028 openstack_instances.go:42] openstack.Instances() called
Feb  7 04:30:33 openshift-200 atomic-openshift-node: W0207 04:30:33.979632   70028 openstack_instances.go:75] Failed to find compute flavors: Get http://10.66.147.11:8774/v2/640d994684f3480baf62328d55de6ae7/flavors/detail: dial tcp 10.66.147.11:8774: getsockopt: connection refused
Feb  7 04:30:33 openshift-200 atomic-openshift-node: F0207 04:30:33.979684   70028 node.go:323] failed to run Kubelet: failed to get instances from cloud provider
Feb  7 04:30:33 openshift-200 systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb  7 04:30:33 openshift-200 systemd: Failed to start Atomic OpenShift Node
Comment 8 Seth Jennings 2017-02-08 10:22:51 EST
Gan,

It is true that the OpenStack API being down still prevents the node process from starting. That is a separate issue, though.

This PR fixes the issue of already-running node processes transitioning into NotReady state if the OpenStack API stops responding.
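To make the distinction concrete, here is a minimal stand-alone sketch (stand-in names again, not the real code): at startup the cloud-provider instance lookup is a hard dependency, so the process exits, which matches the systemd failure in comment 6; once the process is running, the same failure only degrades the periodic status update.

package main

import (
	"errors"
	"log"
)

// cloudInstances stands in for the OpenStack cloud-provider instance lookup.
func cloudInstances() error {
	return errors.New("dial tcp 10.66.147.11:8774: connection refused")
}

func main() {
	// Startup path: the lookup is a hard dependency, so the node process exits
	// fatally and systemd reports "Failed to start Atomic OpenShift Node".
	if err := cloudInstances(); err != nil {
		log.Fatalf("failed to run Kubelet: failed to get instances from cloud provider: %v", err)
	}
	// Steady-state path (not reached here): the fixed status update loop would
	// only log such an error and leave the node Ready.
}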

It seems that you have confirmed that it is working.
Comment 9 Gan Huang 2017-02-09 02:32:31 EST
Thanks for your clarification, Seth!

Yes, I have confirmed that it's working after the OpenStack API was down.

Moving to VERIFIED per comment 6 and comment 8.
Comment 11 errata-xmlrpc 2017-04-12 15:17:34 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884
