Bug 1601378
Summary: [3.7] Could not find an allocated subnet for node

Product: OpenShift Container Platform
Component: Node
Version: 3.7.1
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Regression, TestBlocker
Reporter: sheng.lao <shlao>
Assignee: Avesh Agarwal <avagarwa>
QA Contact: sheng.lao <shlao>
CC: akostadi, aos-bugs, avagarwa, ghuang, jchaloup, jialiu, jokerman, mifiedle, mmccomas, wmeng
Target Milestone: ---
Doc Type: Bug Fix
Doc Text:
Cause: The cloudResourceSyncManager was recently implemented to continuously fetch node addresses from cloud providers; the kubelet then receives node addresses from the cloudResourceSyncManager. At node registration or kubelet start, the kubelet fetches node addresses from the cloudResourceSyncManager in a blocking loop. The issue was that the cloudResourceSyncManager was not started before the kubelet first tried to fetch node addresses from it, so the kubelet got stuck in the blocking loop and never returned. This caused node failures at the network level, and no node could be registered. Also, because the kubelet blocked early, the cloudResourceSyncManager never got a chance to start.

Solution: The cloudResourceSyncManager is now started early in the kubelet startup process, so the kubelet does not block on it and the cloudResourceSyncManager is always started.
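The startup-ordering deadlock described above can be sketched in Go. This is a hypothetical simplification, not the actual Kubernetes code: `runSyncManager` and `nodeAddresses` are illustrative names, and a channel stands in for the manager's address cache.

```go
package main

import (
	"fmt"
	"time"
)

// runSyncManager plays the role of cloudResourceSyncManager: it
// repeatedly publishes the latest node addresses (here a fixed fake
// result) until told to stop.
func runSyncManager(addrs chan<- []string, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case addrs <- []string{"10.0.0.1"}: // pretend cloud API result
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// nodeAddresses plays the role of the kubelet's blocking fetch: it
// waits until the sync manager has produced at least one result.
func nodeAddresses(addrs <-chan []string) []string {
	return <-addrs // blocks until the manager sends something
}

func main() {
	addrs := make(chan []string)
	stop := make(chan struct{})

	// The fix: start the sync manager BEFORE the blocking fetch.
	// In the buggy ordering, nodeAddresses() was called first and
	// blocked forever, so the manager never got a chance to run.
	go runSyncManager(addrs, stop)

	got := nodeAddresses(addrs)
	close(stop)
	fmt.Println(got)
}
```

With the buggy ordering (calling `nodeAddresses` before starting the goroutine), this program would deadlock, which mirrors the kubelet hang described in the Doc Text.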
Story Points: ---
Clones: 1601749, 1601813, 1603611, 1603612 (view as bug list)
Last Closed: 2018-08-09 22:14:04 UTC
Type: Bug
Bug Blocks: 1538616, 1603611, 1603612
Description (sheng.lao, 2018-07-16 08:52:14 UTC):

As far as QE knows, this can be reproduced on GCE and OpenStack, and it is blocking testing on both. Released version v3.7.54 does not have this issue, so it is a regression.

---

I don't see any "Requesting node addresses from cloud provider for node ..." level 2 log message in the node logs. So I would assume the openshift node just does not start the cloud provider manager. Let me check the code. How much time would it take to build another node image and run it through the test?

---

Hi Jan,

This condition: https://github.com/kubernetes/kubernetes/pull/65226/files#diff-6a7b3a253c1cbcc3470d325d4a448e19R79 seems a bit weird to me. Maybe I am wrong, but I am just trying to understand. You are only retrying when the following is true:

if len(nodeAddresses) == 0 && err == nil { .. }

Why are you not retrying at least some N (maybe 5) times when there is an error?

---

If the condition holds, it means there is no node address buffered, so we need to wait until the first request for node addresses succeeds. Otherwise, it's pointless to return an empty list of node addresses.

---

FYI, I will be replacing the openshift binary on those machines with my custom-built openshift binary for testing. Please let me know if there are any issues with it.

---

(In reply to Avesh Agarwal from comment #14)
> FYI, I will be replacing the openshift binary on those machines with my
> custom-built openshift binary for testing. Please let me know if there are
> any issues with it.

Sure, go ahead, QE is okay with it.

---

Just FYI, I have a fix and it seems to be working fine. I am going to send a PR upstream.

---

Here is the upstream PR: https://github.com/kubernetes/kubernetes/pull/66350

The issue affects only the containerized deployment.

---

OSE PR: https://github.com/openshift/ose/pull/1361

---

Weihua, feel free to destroy the cluster. Thanks for it.

---

Got it. Cluster for debug terminated. Thanks for the quick action.
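The retry condition debated in the comments can be sketched as follows. This is a hypothetical simplification, not the actual kubelet code: `getNodeAddresses`, `poll`, and `maxRetries` are illustrative names. It shows why retrying only while `len(nodeAddresses) == 0 && err == nil` makes sense: an empty result with no error means nothing is buffered yet, while an error or a non-empty result can be returned immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// getNodeAddresses retries only while the cache is empty AND the
// last poll returned no error; a buffered error or a buffered
// address list is returned as-is.
func getNodeAddresses(poll func() ([]string, error), maxRetries int) ([]string, error) {
	for i := 0; i < maxRetries; i++ {
		addrs, err := poll()
		if len(addrs) == 0 && err == nil {
			// Nothing buffered yet: wait for the first request
			// for node addresses to succeed.
			continue
		}
		return addrs, err
	}
	return nil, errors.New("no node addresses buffered after retries")
}

func main() {
	calls := 0
	// Simulated cache: empty for the first two polls, then populated.
	poll := func() ([]string, error) {
		calls++
		if calls < 3 {
			return nil, nil // cache still empty, no error yet
		}
		return []string{"10.0.0.2"}, nil
	}
	addrs, err := getNodeAddresses(poll, 5)
	fmt.Println(addrs, err)
}
```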
PR merged upstream.

Additional environments (cloud-provider) passed:
- AWS: AH + container
- GCE: AH + container

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2337

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.