Bug 1572622
| Summary: | Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jaspreet Kaur <jkaur> |
| Component: | Master | Assignee: | Jordan Liggitt <jliggitt> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.7.0 | CC: | aos-bugs, bleanhar, bmorriso, ddelcian, jliggitt, jokerman, kmendez, kurktchiev, mmccomas, pportant, rhowe, rkant, sreber, tkimura, tparsons, travi, vlaad, zhizhang |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | 3.7.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause:
Established TCP connections from the kubelet to the API server could remain established after a network change to the IP addresses the API DNS name resolved to.
Consequence:
Kubelet heartbeating would fail until the TCP connections timed out (typically 15 minutes with default OS settings). This would cause workloads to be evicted from the node, since the node appeared unresponsive.
Fix:
Close the kubelet->api connections after heartbeating fails more than once.
Result:
The API DNS name is re-resolved quickly, and the kubelet->api connection recovers within ~30 seconds, well within the grace period for node responsiveness.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-08-28 14:05:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
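The fix described in the Doc Text above (close the kubelet->api connections after heartbeating fails more than once) can be illustrated with a minimal Python sketch. This is not the kubelet's actual code, which is written in Go. The class `HeartbeatTracker`, its `max_failures` parameter, and the `close_connections` callback are hypothetical names chosen for illustration. The sketch assumes that "more than once" means two consecutive failures, and that the counter starts over once the connections have been closed.

```python
from typing import Callable


class HeartbeatTracker:
    """Counts consecutive heartbeat failures and triggers a connection reset.

    Hypothetical illustration only; not the kubelet implementation.
    """

    def __init__(self, close_connections: Callable[[], None], max_failures: int = 1) -> None:
        # Callback that tears down the established client connections.
        self._close_connections = close_connections
        # Number of consecutive failures tolerated before closing connections.
        self._max_failures = max_failures
        self._consecutive_failures = 0

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one heartbeat attempt."""
        if succeeded:
            # A successful heartbeat proves the connection is healthy.
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        # Act only when failures exceed the tolerated count ("more than once").
        if self._consecutive_failures > self._max_failures:
            self._close_connections()
            # Assumption: give the fresh connection a clean slate.
            self._consecutive_failures = 0
```

Closing the connections forces the next heartbeat to open a new one, which is the point at which the API DNS name is resolved again instead of waiting for the stale TCP connection to time out.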
Description
Jaspreet Kaur
2018-04-27 12:41:32 UTC
This bug looks similar to bug 1464653 which should be fixed in 3.7.

(In reply to Ryan Howe from comment #2)
> This bug looks similar to bug 1464653 which should be fixed in 3.7.

Never mind, it looks as though this issue can still happen even with the fix in bug 1464653.

Reproducer: https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-370598132

PR that was closed: https://github.com/kubernetes/kubernetes/pull/48670#issuecomment-352257836

https://github.com/kubernetes/kubernetes/pull/48670

*** Bug 1577695 has been marked as a duplicate of this bug. ***

I'm working on it already, and will continue checking...

Tested on OCP with version:
openshift v3.7.56
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Set up an HA environment with an ELB on an AWS cluster. There were 3 ELB IPs before the outage, each matching one of qe-geliu-elbmaster-etcd-zone1, qe-geliu-elbmaster-etcd-zone2-1, and qe-geliu-elbmaster-etcd-zone2-2. Removed 1 IP from the ELB in the AWS UI, then checked that no node stayed in NotReady status for a long time. Also checked the atomic-openshift-node log: there were no critical errors of the kind referenced in comment 1 above.

*** Bug 1584471 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
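The verification above (removing one IP from the ELB and watching the nodes) depends on knowing which addresses the API DNS name resolves to before and after the change. A minimal Python sketch of that comparison follows, using the standard-library `socket.getaddrinfo` resolver. The function names `resolve_ips` and `diff_ip_sets` and the default port 443 are assumptions made for illustration and are not taken from the bug report.

```python
import socket
from typing import Iterable, Set, Tuple


def resolve_ips(hostname: str, port: int = 443) -> Set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def diff_ip_sets(before: Iterable[str], after: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (removed, added) addresses between two resolutions."""
    before_set, after_set = set(before), set(after)
    return before_set - after_set, after_set - before_set
```

Calling `resolve_ips` on the ELB name before and after the change and passing both results to `diff_ip_sets` shows which address was dropped. Any connection still established to a removed address is the kind that could stay stuck until it timed out.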