Bug 1400574
| Summary: | OpenShift cluster fails when OpenStack api is down | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Brendan Mchugh <bmchugh> |
| Component: | Node | Assignee: | Seth Jennings <sjenning> |
| Status: | CLOSED ERRATA | QA Contact: | Gan Huang <ghuang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.3.0 | CC: | aos-bugs, decarr, ghuang, jialiu, jokerman, mmccomas, tdawson |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Fixes an issue where OpenShift nodes configured with OpenStack as the cloud provider moved into NotReady state if contact with the OpenStack API was lost. Nodes now remain in Ready state even if the OpenStack API is not responding. Note that a new node process configured to use OpenStack cloud integration cannot start unless the OpenStack API is responsive.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-04-12 19:17:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Brendan Mchugh
2016-12-01 14:09:33 UTC
I am able to recreate. This is caused by setNodeAddress(), a function in the defaultNodeStatusFuncs() chain called from setNodeStatus(), returning an error that causes the entire node status update to abort.

origin-node[19569]: W1201 18:39:05.090399 19620 openstack.go:285] Failed to find compute flavors: Get http://10.42.10.33:8774/v2.1/ec3b48e1bb3448e6a1348ccf82854277/flavors
origin-node[19569]: E1201 18:39:05.090430 19620 kubelet.go:2971] Error updating node status, will retry: failed to get instances from cloud provider

Since the node status is not being updated, the master eventually moves the nodes to NotReady state because they are not reporting status. The failure of setNodeAddress() should not abort the entire node status update chain. Working on a fix.

Fix merged upstream kube: https://github.com/kubernetes/kubernetes/pull/37846

This has been merged into OCP and is in OCP v3.5.0.10 or newer.

1) Reproduced with
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node became "NotReady" after stopping openstack-nova-*

2) Test against
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Node was still in "Ready" status after stopping openstack-nova-*, and the router and docker-registry were still accessible.

But restarting atomic-openshift-node failed:

Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: I0207 04:26:39.956311 67595 openstack_instances.go:42] openstack.I
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: W0207 04:26:39.957012 67595 openstack_instances.go:75] Failed to f
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com atomic-openshift-node[67595]: F0207 04:26:39.957050 67595 node.go:323] failed to run Kubelet: fa
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 07 04:26:39 openshift-200.lab.eng.nay.redhat.com systemd[1]: Failed to start Atomic OpenShift Node.
Paste the detailed log:

Feb 7 04:30:33 openshift-200 atomic-openshift-node: I0207 04:30:33.978698 70028 openstack_instances.go:42] openstack.Instances() called
Feb 7 04:30:33 openshift-200 atomic-openshift-node: W0207 04:30:33.979632 70028 openstack_instances.go:75] Failed to find compute flavors: Get http://10.66.147.11:8774/v2/640d994684f3480baf62328d55de6ae7/flavors/detail: dial tcp 10.66.147.11:8774: getsockopt: connection refused
Feb 7 04:30:33 openshift-200 atomic-openshift-node: F0207 04:30:33.979684 70028 node.go:323] failed to run Kubelet: failed to get instances from cloud provider
Feb 7 04:30:33 openshift-200 systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Feb 7 04:30:33 openshift-200 systemd: Failed to start Atomic OpenShift Node

Gan,

It is true that the OpenStack API being down still prevents the node process from starting. That is a separate issue though. This PR fixes the issue with already running node processes transitioning into NotReady state if the OpenStack API stops responding. It seems that you confirmed that this is working.

Thanks for your clarification, Seth! Yes, I have confirmed that it's working after the OpenStack API was down.

Move to verified per comment 6 and comment 8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884 |
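
For readers following the root-cause analysis in this report, the failure mode (one function in the node status chain returning an error and aborting the whole update) can be illustrated with a small Go sketch. This is not the kubelet code and not necessarily how the upstream PR implements the fix. The types NodeStatus and statusFunc, the helpers updateStrict and updateTolerant, and the address 10.0.0.5 are hypothetical names invented for illustration. Only the function name setNodeAddress and the error text come from the report.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeStatus is a hypothetical, simplified stand-in for the status a node reports.
type NodeStatus struct {
	Addresses []string
	Ready     bool
}

// statusFunc is one step in a status update chain (hypothetical type).
type statusFunc func(*NodeStatus) error

// setNodeAddress returns a step that fails when the cloud API is unreachable,
// mimicking the error text quoted in the report.
func setNodeAddress(cloudUp bool) statusFunc {
	return func(s *NodeStatus) error {
		if !cloudUp {
			return errors.New("failed to get instances from cloud provider")
		}
		s.Addresses = []string{"10.0.0.5"} // hypothetical address
		return nil
	}
}

// setReadyCondition is a later step that does not depend on the cloud API.
func setReadyCondition(s *NodeStatus) error {
	s.Ready = true
	return nil
}

// updateStrict stops at the first error, so later steps never run.
// This corresponds to the behaviour described as the cause of the bug.
func updateStrict(s *NodeStatus, chain []statusFunc) error {
	for _, f := range chain {
		if err := f(s); err != nil {
			return err
		}
	}
	return nil
}

// updateTolerant records each error and keeps going, so one failing step
// does not prevent the remaining fields from being updated.
func updateTolerant(s *NodeStatus, chain []statusFunc) []error {
	var errs []error
	for _, f := range chain {
		if err := f(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	chain := []statusFunc{setNodeAddress(false), setReadyCondition}

	strict := &NodeStatus{}
	err := updateStrict(strict, chain)
	fmt.Println("strict:   ready =", strict.Ready, "err =", err)

	tolerant := &NodeStatus{}
	errs := updateTolerant(tolerant, chain)
	fmt.Println("tolerant: ready =", tolerant.Ready, "errors =", len(errs))
}
```

In the strict variant the ready condition is never set while the cloud API is down, which matches the observation that the master stops receiving status and marks the node NotReady. The tolerant variant still reports readiness and surfaces the address lookup error separately.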