Bug 1572622

Summary: Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes
Product: OpenShift Container Platform Reporter: Jaspreet Kaur <jkaur>
Component: Master Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED CURRENTRELEASE QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 3.7.0 CC: aos-bugs, bleanhar, bmorriso, ddelcian, jliggitt, jokerman, kmendez, kurktchiev, mmccomas, pportant, rhowe, rkant, sreber, tkimura, tparsons, travi, vlaad, zhizhang
Target Milestone: --- Keywords: OpsBlocker
Target Release: 3.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Established TCP connections from the kubelet to the API server could remain established after a network change to the IP addresses the API DNS name resolved to.
Consequence: Kubelet heartbeating would fail until the TCP connections timed out (typically 15 minutes with default OS settings). This would cause workloads to be evicted from the node, since the node appeared unresponsive.
Fix: Close the kubelet->api connections after heartbeating fails more than once.
Result: The API DNS name is re-resolved quickly, and the kubelet->api connection recovers within ~30 seconds, well within the grace period for node responsiveness.
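The fix described in the Doc Text can be illustrated with a minimal sketch. This is not the kubelet's actual code: the class `HeartbeatMonitor`, its callbacks `send_heartbeat` and `close_connections`, and the `max_failures` parameter are all hypothetical names chosen for illustration. It only shows the idea of dropping connections once heartbeating has failed more than once, so that the next attempt has to dial again.

```python
from typing import Callable


class HeartbeatMonitor:
    """Counts consecutive heartbeat failures and resets connections.

    Hypothetical illustration only; not the kubelet's implementation.
    """

    def __init__(self, send_heartbeat: Callable[[], bool],
                 close_connections: Callable[[], None],
                 max_failures: int = 1) -> None:
        self.send_heartbeat = send_heartbeat
        self.close_connections = close_connections
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one heartbeat attempt; return True if it succeeded."""
        if self.send_heartbeat():
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        # "More than once": act only when failures exceed max_failures.
        if self.consecutive_failures > self.max_failures:
            # Dropping the connections forces the next attempt to dial
            # again, which is when the API DNS name gets re-resolved.
            self.close_connections()
            self.consecutive_failures = 0
        return False
```

With the default `max_failures=1`, the first failed heartbeat is tolerated and the second consecutive failure triggers `close_connections`.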
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-28 14:05:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jaspreet Kaur 2018-04-27 12:41:32 UTC
Description of problem: 
When IP addresses assigned to an Amazon ELB that load balances incoming requests for the Master API are removed, any atomic-openshift-node service that was bound to an IP address which no longer load balances Master API requests enters a NotReady state. We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load-balanced Master API FQDN and not attempt to reuse a half-closed connection.
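As a rough diagnostic for this condition, the sketch below compares the IP address a node is currently connected to against the addresses the Master API FQDN resolves to right now. The function names `resolve_ips` and `is_stale` are hypothetical, and how the peer IP is obtained (for example from the node's socket table) is left as an assumption outside the sketch.

```python
import socket
from typing import Callable, Set


def resolve_ips(hostname: str, port: int = 443) -> Set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def is_stale(peer_ip: str, hostname: str,
             resolver: Callable[[str], Set[str]] = resolve_ips) -> bool:
    """True if the connected peer IP is no longer among the resolved IPs."""
    return peer_ip not in resolver(hostname)
```

The `resolver` argument exists only so the check can be exercised without real DNS; by default it performs a live lookup.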



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:  Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes
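The 15-minute figure is consistent with Linux TCP retransmission defaults, assuming the "default OS settings" mentioned in the Doc Text refer to `net.ipv4.tcp_retries2 = 15` with a minimum retransmission timeout of 200 ms and a maximum of 120 s. That mapping is an assumption, not something stated in this bug. Under it, the hypothetical timeout can be worked out as follows (the function name is illustrative):

```python
def hypothetical_tcp_timeout(retries: int = 15, rto_min: float = 0.2,
                             rto_max: float = 120.0) -> float:
    """Sum the backoff waits for `retries` retransmissions plus the final wait.

    The timeout doubles after each retransmission and is capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


seconds = hypothetical_tcp_timeout()
print(round(seconds, 1), "seconds, about", round(seconds / 60, 1), "minutes")
```

With the assumed defaults the sum is 0.2 + 0.4 + ... + 102.4 (ten doubling intervals, 204.6 s) plus six capped intervals of 120 s, or 924.6 seconds, a little over 15 minutes.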


Expected results: We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load-balanced Master API FQDN and not attempt to reuse a half-closed connection.


Additional info:


Supporting upstream Kubernetes references:
 - https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312428731
 - https://github.com/kubernetes/client-go/issues/374
 - https://github.com/kubernetes/kubernetes/issues/48638
 - https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312230215
 - https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-355143494

Comment 2 Ryan Howe 2018-04-27 16:51:10 UTC
This bug looks similar to bug 1464653 which should be fixed in 3.7.

Comment 3 Ryan Howe 2018-04-27 17:22:45 UTC
(In reply to Ryan Howe from comment #2)
> This bug looks similar to bug 1464653 which should be fixed in 3.7.

Never mind, it looks as though this issue can still happen even with the fix in bug 1464653.

    Reproducer: 
     https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-370598132

    PR that was closed:
     https://github.com/kubernetes/kubernetes/pull/48670#issuecomment-352257836
     https://github.com/kubernetes/kubernetes/pull/48670

Comment 9 Jordan Liggitt 2018-05-14 17:07:39 UTC
*** Bug 1577695 has been marked as a duplicate of this bug. ***

Comment 17 ge liu 2018-07-02 09:25:30 UTC
I'm working on it already, and will continue checking...

Comment 18 ge liu 2018-07-03 06:19:28 UTC
Tested on OCP with versions:
openshift v3.7.56
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Set up an HA environment with an ELB on an AWS cluster. Before the outage there were 3 ELB IPs, each matching one of qe-geliu-elbmaster-etcd-zone1, qe-geliu-elbmaster-etcd-zone2-1, and qe-geliu-elbmaster-etcd-zone2-2. Removed 1 IP from the ELB in the AWS UI, then checked that no node stayed in NotReady status for a long time. Also checked the atomic-openshift-node log; the critical errors referenced in comment 1 above did not appear.
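A verification like the one above could be scripted along these lines. This is a sketch, not the procedure QA actually used: it assumes the text output of `oc get nodes` has a header row followed by rows whose first two columns are NAME and STATUS, and the node names in the sample are made up.

```python
from typing import List


def not_ready_nodes(oc_get_nodes_output: str) -> List[str]:
    """Return names of nodes whose STATUS column contains NotReady."""
    names = []
    lines = oc_get_nodes_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS may be a comma-separated list such as
        # "NotReady,SchedulingDisabled".
        if "NotReady" in status.split(","):
            names.append(name)
    return names


# Hypothetical sample output; node names are illustrative only.
sample = """NAME      STATUS     AGE   VERSION
node-a    Ready      10d   v1.7.6
node-b    NotReady   10d   v1.7.6
"""
print(not_ready_nodes(sample))
```

Running the check repeatedly after removing an ELB IP and recording how long any node stays in the returned list would give a measure of the recovery time.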

Comment 23 Michal Fojtik 2019-03-07 11:20:26 UTC
*** Bug 1584471 has been marked as a duplicate of this bug. ***

Comment 25 Red Hat Bugzilla 2023-09-15 00:07:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days