Bug 1601378 - [3.7] Could not find an allocated subnet for node [NEEDINFO]
Summary: [3.7] Could not find an allocated subnet for node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Avesh Agarwal
QA Contact: sheng.lao
URL:
Whiteboard:
Depends On:
Blocks: 1538616 1603611 1603612
 
Reported: 2018-07-16 08:52 UTC by sheng.lao
Modified: 2018-08-09 22:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
cloudResourceSyncManager was recently introduced to continuously fetch node addresses from the cloud provider; the kubelet then obtains node addresses from it. During node registration or kubelet start, the kubelet fetches node addresses from cloudResourceSyncManager in a blocking loop. The problem was that cloudResourceSyncManager had not been started by the time the kubelet first requested node addresses from it, so the kubelet got stuck in the blocking loop and never returned. This caused node failures at the network level, and no node could be registered. Because the kubelet blocked so early, cloudResourceSyncManager never got a chance to start. Solution: cloudResourceSyncManager is now started early in the kubelet startup process, so the kubelet no longer blocks on it and cloudResourceSyncManager is always started.
Clone Of:
: 1601749 1601813 1603611 1603612 (view as bug list)
Environment:
Last Closed: 2018-08-09 22:14:04 UTC
Target Upstream Version:
avagarwa: needinfo? (ghuang)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2337 None None None 2018-08-09 22:15:05 UTC

Description sheng.lao 2018-07-16 08:52:14 UTC
Description of problem:
Installer failed at task "Wait for Node Registration". 

Logs observed on the failed node:

<--snip-->
Jul 16 04:24:00 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: W0716 04:24:00.640021  123167 sdn_controller.go:48] Could not find an allocated subnet for node: qe-smoke37-master-registry-router-1, Waiting...
Jul 16 04:24:03 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: I0716 04:24:03.319693  123167 cloud_request_manager.go:80] Waiting for 5s for cloud provider to provide node addresses
<--snip-->

The issue was likely introduced by:
https://github.com/kubernetes/kubernetes/pull/65226

The issue does not occur in v3.7.57-1.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.58-1.git.37.6db1e6f.el7.noarch.rpm
# oc version
oc v3.7.58
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
always

Steps to Reproduce:
1. Install OCP-3.7.58-1.git.37.6db1e6f with the redhat/openshift-ovs-subnet plugin on GCE or OpenStack

Actual results:
TASK [openshift_manage_node : Wait for Node Registration] **********************
Sunday 15 July 2018  20:24:32 -0400 (0:01:04.076)       0:13:56.730 *********** 
<--snip-->
fatal: [qe-smoke37-master-registry-router-1.0715-5ha.qe.rhcloud.com -> qe-smoke37-master-registry-router-1.0715-5ha.qe.rhcloud.com]: FAILED! => {"attempts": 50, "changed": false, "failed": true, "results": {"cmd": "/usr/local/bin/oc get node qe-smoke37-master-registry-router-1 -o json -n default", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): nodes \"qe-smoke37-master-registry-router-1\" not found\n", "stdout": ""}, "state": "list"}

# journalctl -u atomic-openshift-node
<--snip-->
Jul 16 04:24:00 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: W0716 04:24:00.640021  123167 sdn_controller.go:48] Could not find an allocated subnet for node: qe-smoke37-master-registry-router-1, Waiting...
Jul 16 04:24:03 qe-smoke37-master-registry-router-1 atomic-openshift-node[123100]: I0716 04:24:03.319693  123167 cloud_request_manager.go:80] Waiting for 5s for cloud provider to provide node addresses


Expected results:
Installation succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Johnny Liu 2018-07-16 09:11:00 UTC
As far as QE knows, this can be reproduced on GCE and OpenStack, and it is blocking testing on both.

The released version v3.7.54 does not have this issue, so it is a regression.

Comment 4 Jan Chaloupka 2018-07-17 08:00:57 UTC
I don't see any "Requesting node addresses from cloud provider for node ..." message (logged at level 2) in the node logs, so I assume the openshift node simply does not start the cloud provider manager. Let me check the code.
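
The check described above can be sketched as a small log scan. This is only an illustration, not part of the node code: the function name syncNeverRequested is hypothetical, and the two message strings are copied from this report. Since the request message is only emitted at log level 2, its absence means something only if the node runs with at least that verbosity.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Message fragments quoted in this bug report.
const (
	requestMsg = "Requesting node addresses from cloud provider for node"
	waitMsg    = "Waiting for 5s for cloud provider to provide node addresses"
)

// syncNeverRequested (hypothetical helper) reports true when the log shows
// the kubelet waiting for node addresses without ever showing a request
// being sent to the cloud provider.
func syncNeverRequested(log string) bool {
	var waiting, requested bool
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, requestMsg) {
			requested = true
		}
		if strings.Contains(line, waitMsg) {
			waiting = true
		}
	}
	return waiting && !requested
}

func main() {
	// Sample lines taken from the description of this bug.
	sample := `W0716 04:24:00.640021  123167 sdn_controller.go:48] Could not find an allocated subnet for node: qe-smoke37-master-registry-router-1, Waiting...
I0716 04:24:03.319693  123167 cloud_request_manager.go:80] Waiting for 5s for cloud provider to provide node addresses`
	fmt.Println("waiting without any request logged:", syncNeverRequested(sample))
}
```

In practice the input would be the output of journalctl -u atomic-openshift-node rather than an inline string.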

Comment 5 Jan Chaloupka 2018-07-17 09:04:58 UTC
How much time would it take to build another node image and run it through the test?

Comment 8 Weihua Meng 2018-07-17 09:55:33 UTC
FYI.
https://bugzilla.redhat.com/show_bug.cgi?id=1583129#c62

Comment 10 Avesh Agarwal 2018-07-17 14:04:02 UTC
Hi Jan,

This condition: https://github.com/kubernetes/kubernetes/pull/65226/files#diff-6a7b3a253c1cbcc3470d325d4a448e19R79 seems a bit odd to me. Maybe I am wrong; I am just trying to understand. You only retry when the following is true:

if len(nodeAddresses) == 0 && err == nil {
..
}

Why not retry at least some N (maybe 5) times when there is an error?

Comment 11 Jan Chaloupka 2018-07-17 14:16:23 UTC
If the condition holds it means there is no node address buffered so we need to wait until the first request for node addresses succeeds. Otherwise, it's pointless to return an empty list of node addresses.
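
To make the wait condition concrete, here is a minimal sketch of a reader that keeps polling while len(addrs) == 0 && err == nil. It is not the kubelet code: addressSync, Run, NodeAddresses, demo, and the sample address are hypothetical stand-ins. The maxPolls limit exists only so the demo terminates; in the case reported here the kubelet's loop never returned.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errTimedOut = errors.New("timed out waiting for node addresses")

// addressSync is a hypothetical stand-in for the cloud resource sync
// manager: a background loop buffers addresses, readers wait for them.
type addressSync struct {
	mu    sync.Mutex
	addrs []string
	err   error
}

// Run simulates one pass of the background sync loop.
func (s *addressSync) Run(fetch func() ([]string, error)) {
	addrs, err := fetch()
	s.mu.Lock()
	s.addrs, s.err = addrs, err
	s.mu.Unlock()
}

// NodeAddresses keeps waiting while nothing is buffered and no error has
// been recorded. maxPolls is an assumption added for the demo only.
func (s *addressSync) NodeAddresses(poll time.Duration, maxPolls int) ([]string, error) {
	for i := 0; i < maxPolls; i++ {
		s.mu.Lock()
		addrs, err := s.addrs, s.err
		s.mu.Unlock()
		if len(addrs) == 0 && err == nil {
			time.Sleep(poll)
			continue
		}
		return addrs, err
	}
	return nil, errTimedOut
}

// demo reads addresses with the sync loop either started or never started.
func demo(startSync bool) ([]string, error) {
	s := &addressSync{}
	if startSync {
		go s.Run(func() ([]string, error) { return []string{"10.0.0.5"}, nil })
	}
	return s.NodeAddresses(time.Millisecond, 200)
}

func main() {
	for _, started := range []bool{true, false} {
		addrs, err := demo(started)
		fmt.Printf("sync started=%v addrs=%v err=%v\n", started, addrs, err)
	}
}
```

If the background loop is never started, neither an address nor an error is ever buffered, so the condition stays true on every poll. Without a deadline, the reader would wait indefinitely.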

Comment 14 Avesh Agarwal 2018-07-18 13:23:59 UTC
FYI, I will be replacing the openshift binary on those machines with my custom-built openshift binary for testing. Please let me know if there are any issues with it.

Comment 15 Johnny Liu 2018-07-18 13:36:05 UTC
(In reply to Avesh Agarwal from comment #14)
> FYI, I will be replacing the openshift binary on those machines with my
> custom-built openshift binary for testing. Please let me know if there are
> any issues with it.

Sure, go ahead, QE is okay with it.

Comment 16 Avesh Agarwal 2018-07-18 18:29:17 UTC
Just FYI, I have got a fix and it seems to be working fine. I am going to send a PR upstream.

Comment 17 Avesh Agarwal 2018-07-18 19:35:19 UTC
Here is upstream PR: https://github.com/kubernetes/kubernetes/pull/66350

Comment 18 Jan Chaloupka 2018-07-18 23:03:45 UTC
The issue affects only the containerized deployment. Upstream PR: https://github.com/openshift/ose/pull/1361

Weihua, feel free to destroy the cluster. Thanks for it.

Comment 19 Weihua Meng 2018-07-19 00:34:16 UTC
got it.
cluster for debug terminated.

Thanks for quick action.

Comment 20 Jan Chaloupka 2018-07-19 15:05:26 UTC
PR merged upstream

Comment 23 sheng.lao 2018-08-01 01:46:23 UTC
Additional Environment (cloud-provider): Passed
AWS: AH + container
GCE: AH + container

Comment 25 errata-xmlrpc 2018-08-09 22:14:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2337

