Bug 1601378
Summary: [3.7] Could not find an allocated subnet for node

Product: OpenShift Container Platform
Component: Node
Version: 3.7.1
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Regression, TestBlocker
Reporter: sheng.lao <shlao>
Assignee: Avesh Agarwal <avagarwa>
QA Contact: sheng.lao <shlao>
CC: akostadi, aos-bugs, avagarwa, ghuang, jchaloup, jialiu, jokerman, mifiedle, mmccomas, wmeng
Target Milestone: ---
Doc Type: Bug Fix
Doc Text:
Cause: The cloudResourceSyncManager was recently implemented to continuously fetch node addresses from cloud providers; the kubelet then receives node addresses from the cloudResourceSyncManager. At node registration or kubelet start, the kubelet fetches node addresses from the cloudResourceSyncManager in a blocking loop. The issue was that the cloudResourceSyncManager was not started before the kubelet first tried to fetch node addresses from it, so the kubelet got stuck in the blocking loop and never returned. This caused node failures at the network level, and no node could be registered. Also, because the kubelet blocked early, the cloudResourceSyncManager never got a chance to start.

Solution: The cloudResourceSyncManager is now started early in the kubelet startup process, so the kubelet does not block on it and the cloudResourceSyncManager is always started.
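The startup-ordering deadlock described above can be sketched in Go. This is a hypothetical simplification, not the actual Kubernetes code: `runSyncManager` and `nodeAddresses` are illustrative names, and a channel stands in for the manager's address cache.

```go
package main

import (
	"fmt"
	"time"
)

// runSyncManager plays the role of cloudResourceSyncManager: it
// repeatedly publishes the latest node addresses (here a fixed fake
// result) until told to stop.
func runSyncManager(addrs chan<- []string, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case addrs <- []string{"10.0.0.1"}: // pretend cloud API result
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// nodeAddresses plays the role of the kubelet's blocking fetch: it
// waits until the sync manager has produced at least one result.
func nodeAddresses(addrs <-chan []string) []string {
	return <-addrs // blocks until the manager sends something
}

func main() {
	addrs := make(chan []string)
	stop := make(chan struct{})

	// The fix: start the sync manager BEFORE the blocking fetch.
	// In the buggy ordering, nodeAddresses() was called first and
	// blocked forever, so the manager never got a chance to run.
	go runSyncManager(addrs, stop)

	got := nodeAddresses(addrs)
	close(stop)
	fmt.Println(got)
}
```

With the buggy ordering (calling `nodeAddresses` before starting the goroutine), this program would deadlock, which mirrors the kubelet hang described in the Doc Text.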
Story Points: ---
Clones: 1601749, 1601813, 1603611, 1603612 (view as bug list)
Last Closed: 2018-08-09 22:14:04 UTC
Type: Bug
Bug Blocks: 1538616, 1603611, 1603612
Description (sheng.lao, 2018-07-16 08:52:14 UTC):

As far as QE knows, this can be reproduced on GCE and OpenStack, and it is blocking testing on both. Released version v3.7.54 does not have this issue, so it is a regression.

---

I don't see any "Requesting node addresses from cloud provider for node ..." level 2 log message in the node logs. So I would assume the openshift node just does not start the cloud provider manager. Let me check the code. How much time would it take to build another node image and run it through the test?

---

Hi Jan,

This condition: https://github.com/kubernetes/kubernetes/pull/65226/files#diff-6a7b3a253c1cbcc3470d325d4a448e19R79 seems a bit weird to me. Maybe I am wrong, but I am just trying to understand. You are only retrying when the following is true:

if len(nodeAddresses) == 0 && err == nil { .. }

Why are you not retrying at least some N (maybe 5) times when there is an error?

---

If the condition holds, it means there is no node address buffered, so we need to wait until the first request for node addresses succeeds. Otherwise, it's pointless to return an empty list of node addresses.

---

FYI, I will be replacing the openshift binary on those machines with my custom-built openshift binary for testing. Please let me know if there are any issues with it.

---

(In reply to Avesh Agarwal from comment #14)
> FYI, I will be replacing the openshift binary on those machines with my
> custom-built openshift binary for testing. Please let me know if there are
> any issues with it.

Sure, go ahead, QE is okay with it.

---

Just FYI, I have a fix and it seems to be working fine. I am going to send a PR upstream.

---

Here is the upstream PR: https://github.com/kubernetes/kubernetes/pull/66350

The issue affects only the containerized deployment.

---

OSE PR: https://github.com/openshift/ose/pull/1361

---

Weihua, feel free to destroy the cluster. Thanks for it.

---

Got it. Cluster for debug terminated. Thanks for the quick action.
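The retry condition debated in the comments can be sketched as follows. This is a hypothetical simplification, not the actual kubelet code: `getNodeAddresses`, `poll`, and `maxRetries` are illustrative names. It shows why retrying only while `len(nodeAddresses) == 0 && err == nil` makes sense: an empty result with no error means nothing is buffered yet, while an error or a non-empty result can be returned immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// getNodeAddresses retries only while the cache is empty AND the
// last poll returned no error; a buffered error or a buffered
// address list is returned as-is.
func getNodeAddresses(poll func() ([]string, error), maxRetries int) ([]string, error) {
	for i := 0; i < maxRetries; i++ {
		addrs, err := poll()
		if len(addrs) == 0 && err == nil {
			// Nothing buffered yet: wait for the first request
			// for node addresses to succeed.
			continue
		}
		return addrs, err
	}
	return nil, errors.New("no node addresses buffered after retries")
}

func main() {
	calls := 0
	// Simulated cache: empty for the first two polls, then populated.
	poll := func() ([]string, error) {
		calls++
		if calls < 3 {
			return nil, nil // cache still empty, no error yet
		}
		return []string{"10.0.0.2"}, nil
	}
	addrs, err := getNodeAddresses(poll, 5)
	fmt.Println(addrs, err)
}
```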
PR merged upstream.

Additional environments (cloud-provider) passed:
- AWS: AH + container
- GCE: AH + container

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2337

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.