Description of problem:
Installer failed at task "Wait for Node Registration". The failed node's logs show:

<--snip-->
Jul 16 04:24:00 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: W0716 04:24:00.640021 123167 sdn_controller.go:48] Could not find an allocated subnet for node: qe-smoke37-master-registry-router-1, Waiting...
Jul 16 04:24:03 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: I0716 04:24:03.319693 123167 cloud_request_manager.go:80] Waiting for 5s for cloud provider to provide node addresses
<--snip-->

It is likely the issue was newly introduced by:
https://github.com/kubernetes/kubernetes/pull/65226

No such issue in v3.7.57-1.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.58-1.git.37.6db1e6f.el7.noarch.rpm
# oc version
oc v3.7.58
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
always

Steps to Reproduce:
1. Install OCP-3.7.58-1.git.37.6db1e6f with the redhat/openshift-ovs-subnet plugin on GCE or OpenStack
2.
3.

Actual results:
TASK [openshift_manage_node : Wait for Node Registration] **********************
Sunday 15 July 2018 20:24:32 -0400 (0:01:04.076) 0:13:56.730 ***********
<--snip-->
fatal: [qe-smoke37-master-registry-router-1.0715-5ha.qe.rhcloud.com -> qe-smoke37-master-registry-router-1.0715-5ha.qe.rhcloud.com]: FAILED! => {"attempts": 50, "changed": false, "failed": true, "results": {"cmd": "/usr/local/bin/oc get node qe-smoke37-master-registry-router-1 -o json -n default", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): nodes \"qe-smoke37-master-registry-router-1\" not found\n", "stdout": ""}, "state": "list"}

# journalctl -u atomic-openshift-node
<--snip-->
Jul 16 04:24:00 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: W0716 04:24:00.640021 123167 sdn_controller.go:48] Could not find an allocated subnet for node: qe-smoke37-master-registry-router-1, Waiting...
Jul 16 04:24:03 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: I0716 04:24:03.319693 123167 cloud_request_manager.go:80] Waiting for 5s for cloud provider to provide node addresses
<--snip-->

Expected results:
Install success

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
As far as QE knows, this can be reproduced on both GCE and OpenStack, and it is blocking testing on both platforms. Released version v3.7.54 does not have this issue, so it is a regression.
I don't see any "Requesting node addresses from cloud provider for node ..." level-2 log message in the node logs, so I would assume the openshift node just does not start the cloud provider manager. Let me check the code.
How much time would it take to build another node image and run it through the test?
FYI. https://bugzilla.redhat.com/show_bug.cgi?id=1583129#c62
Hi Jan, this condition: https://github.com/kubernetes/kubernetes/pull/65226/files#diff-6a7b3a253c1cbcc3470d325d4a448e19R79 seems a bit odd to me. Maybe I am wrong, but I am just trying to understand. You are only retrying when the following is true: if len(nodeAddresses) == 0 && err == nil { .. } Why are you not retrying at least some N (maybe 5) times when there is an error?
If the condition holds, it means no node addresses are buffered yet, so we need to wait until the first request for node addresses succeeds. Otherwise, it's pointless to return an empty list of node addresses.
FYI, I will be replacing the openshift binary on those machines with my custom-built openshift binary for testing. Please let me know if there are any issues with it.
(In reply to Avesh Agarwal from comment #14) > FYI, I will replacing openshift binary on those machines with my custom > built openshift binary for testing. Please let me know if there is any > issues with it. Sure, go ahead, QE is okay with it.
Just FYI, I have got a fix and it seems to be working fine. I am going to send a PR upstream.
Here is upstream PR: https://github.com/kubernetes/kubernetes/pull/66350
The issue affects only containerized deployments. Downstream PR: https://github.com/openshift/ose/pull/1361 Weihua, feel free to destroy the cluster. Thanks for providing it.
Got it. The cluster used for debugging has been terminated. Thanks for the quick action.
PR merged upstream
Additional environments (cloud provider): Passed
AWS: AH + container
GCE: AH + container
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2337
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days