Description of problem:
If the OpenStack API is down, the whole OpenShift cluster goes NotReady (all nodes and masters). We use Cinder volumes from OpenStack; that is our only OpenStack integration. We expect to be able to use OpenShift with the already provisioned Cinder volumes even while the OpenStack API is down.

Version-Release number of selected component (if applicable):
OpenShift 3.3. We see this behavior only in the environment that has the cloud provider configured (openshift_cloudprovider_kind=openstack).

How reproducible:
Always

Steps to Reproduce:
1. Take the OpenStack API down

Actual results:
Nodes and masters go into the NotReady state.

Expected results:
Nodes and masters stay in the Ready state.

Additional info:
Seems related to the following issue: https://github.com/kubernetes/kubernetes/issues/34455
I am able to recreate. This is caused by setNodeAddress(), a function in the defaultNodeStatusFuncs() chain called from setNodeStatus(), returning an error that aborts the entire node status update:

origin-node[19569]: W1201 18:39:05.090399 19620 openstack.go:285] Failed to find compute flavors: Get http://10.42.10.33:8774/v2.1/ec3b48e1bb3448e6a1348ccf82854277/flavors
origin-node[19569]: E1201 18:39:05.090430 19620 kubelet.go:2971] Error updating node status, will retry: failed to get instances from cloud provider

Since the node is no longer reporting status, the master eventually moves it to the NotReady state. The failure of setNodeAddress() should not abort the entire node status update chain. Working on a fix.
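For context, here is a minimal, self-contained sketch of the failure mode. The Node and statusSetter types and the setter bodies are simplified stand-ins, not the actual kubelet source; the point is only that setNodeStatus() runs a chain of setters and returns on the first error, so one failed cloud provider lookup blocks the whole update, including the ready condition:

package main

import (
	"errors"
	"fmt"
)

// Node stands in for the real v1.Node object (simplified for illustration).
type Node struct {
	Addresses []string
	Ready     bool
}

// statusSetter mirrors the shape of the defaultNodeStatusFuncs() chain.
type statusSetter func(*Node) error

// setNodeAddress fails whenever the (hypothetical) cloud provider API
// is unreachable -- the situation described in this bug.
func setNodeAddress(n *Node) error {
	return errors.New("failed to get instances from cloud provider")
}

func setNodeReadyCondition(n *Node) error {
	n.Ready = true
	return nil
}

// setNodeStatus aborts on the first error, so the ready condition is
// never refreshed and the master eventually marks the node NotReady.
func setNodeStatus(n *Node, setters []statusSetter) error {
	for _, f := range setters {
		if err := f(n); err != nil {
			return err // one failing setter blocks the entire update
		}
	}
	return nil
}

func main() {
	n := &Node{}
	err := setNodeStatus(n, []statusSetter{setNodeAddress, setNodeReadyCondition})
	fmt.Println(err, n.Ready) // error is non-nil and Ready was never set
}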
Fix merged upstream kube: https://github.com/kubernetes/kubernetes/pull/37846
Origin PR: https://github.com/openshift/origin/pull/12570
This has been merged into OCP and is available in OCP v3.5.0.10 or newer.
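For reference, the general shape of the fix, as a hedged sketch (cloudAddresses() is a hypothetical stand-in for the OpenStack NodeAddresses() lookup; this is not the exact upstream diff): make the address setter non-fatal by logging the failure and keeping the previously reported addresses, so the rest of the status update still runs and the node keeps reporting Ready.

package main

import (
	"errors"
	"fmt"
)

type Node struct {
	Addresses []string
}

// cloudAddresses is a hypothetical stand-in for the cloud provider call;
// here it always fails, as when the OpenStack API is down.
func cloudAddresses() ([]string, error) {
	return nil, errors.New("dial tcp 10.66.147.11:8774: connection refused")
}

// setNodeAddress logs and keeps the previously known addresses instead of
// returning an error, so the rest of the status chain still runs.
func setNodeAddress(n *Node) error {
	addrs, err := cloudAddresses()
	if err != nil {
		fmt.Printf("W: failed to get node addresses from cloud provider: %v\n", err)
		return nil // non-fatal: fall back to the last reported addresses
	}
	n.Addresses = addrs
	return nil
}

func main() {
	n := &Node{Addresses: []string{"10.66.147.12"}} // last-known address
	if err := setNodeAddress(n); err != nil {
		fmt.Println("status update aborted:", err)
		return
	}
	fmt.Println("status update continues; addresses:", n.Addresses)
}

Whether the fallback uses a cache or the node object's last-reported addresses is an implementation detail of the upstream PR; the key point is that the error no longer propagates out of the setter.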
1) Reproduced with:
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node became "NotReady" after stopping openstack-nova-*.

2) Tested against:
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node was still in "Ready" status after stopping openstack-nova-*, and the router and docker-registry were still accessible. But restarting atomic-openshift-node failed:

Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: I0207 04:26:39.956311 67595 openstack_instances.go:42] openstack.I
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: W0207 04:26:39.957012 67595 openstack_instances.go:75] Failed to f
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: F0207 04:26:39.957050 67595 node.go:323] failed to run Kubelet: fa
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.
Pasting the detailed log:

Feb 7 04:30:33 openshift-200 atomic-openshift-node: I0207 04:30:33.978698 70028 openstack_instances.go:42] openstack.Instances() called
Feb 7 04:30:33 openshift-200 atomic-openshift-node: W0207 04:30:33.979632 70028 openstack_instances.go:75] Failed to find compute flavors: Get http://10.66.147.11:8774/v2/640d994684f3480baf62328d55de6ae7/flavors/detail: dial tcp 10.66.147.11:8774: getsockopt: connection refused
Feb 7 04:30:33 openshift-200 atomic-openshift-node: F0207 04:30:33.979684 70028 node.go:323] failed to run Kubelet: failed to get instances from cloud provider
Feb 7 04:30:33 openshift-200 systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 7 04:30:33 openshift-200 systemd: Failed to start Atomic OpenShift Node
Gan, it is true that the OpenStack API being down still prevents the node process from starting. That is a separate issue, though. This PR fixes the issue of already running node processes transitioning into the NotReady state when the OpenStack API stops responding, and it seems that you confirmed that is working.
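To illustrate why the restart still fails, a minimal sketch of the startup path (instancesFromCloud() and runKubelet() are hypothetical simplifications; the real fatal call is in node.go as shown in the log above): during kubelet initialization the cloud provider instances lookup is treated as fatal, so the process exits before the status-update code path is ever reached.

package main

import (
	"errors"
	"log"
)

// instancesFromCloud stands in for the cloud provider's Instances()
// lookup performed during kubelet startup.
func instancesFromCloud() (string, error) {
	return "", errors.New("failed to get instances from cloud provider")
}

// runKubelet mirrors the startup path: a cloud provider failure here is
// returned as an error rather than tolerated.
func runKubelet() error {
	if _, err := instancesFromCloud(); err != nil {
		return err
	}
	return nil
}

func main() {
	// As in node.go:323, the startup error is fatal and the process exits,
	// which is why systemd reports a failed unit while the API is down.
	if err := runKubelet(); err != nil {
		log.Fatalf("failed to run Kubelet: %v", err)
	}
}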
Thanks for your clarification, Seth! Yes, I have confirmed that it works while the OpenStack API is down. Moving to VERIFIED per comment 6 and comment 8.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884