Bug 1572622 - Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes [NEEDINFO]
Summary: Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Jordan Liggitt
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1577695 1584471
Depends On:
Blocks:
Reported: 2018-04-27 12:41 UTC by Jaspreet Kaur
Modified: 2019-07-15 06:19 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Established TCP connections from the kubelet to the API server could remain established after a network change to the IP addresses the API DNS name resolved to.
Consequence: Kubelet heartbeating would fail until the TCP connections timed out (typically 15 minutes with default OS settings). This would cause workloads to be evicted from the node, since the node appeared unresponsive.
Fix: Close the kubelet->api connections after heartbeating fails more than once.
Result: The API DNS name is re-resolved quickly, and the kubelet->api connection recovers within ~30 seconds, well within the grace period for node responsiveness.
Clone Of:
Environment:
Last Closed: 2018-08-28 14:05:30 UTC
Target Upstream Version:
jkaur: needinfo? (jliggitt)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1584471 | 0 | medium | CLOSED | [3.5] All nodes got NotReady with network hiccups | 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) | 3571491 | 0 | None | None | None | 2018-08-21 14:50:25 UTC

Description Jaspreet Kaur 2018-04-27 12:41:32 UTC
Description of problem: 
When IP addresses assigned to an Amazon ELB that load balances incoming requests for the Master API are removed, any atomic-openshift-node service that was bound to an IP address which no longer load balances Master API requests enters a NotReady state. We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load balanced Master API FQDN and not attempt to reuse a half closed connection.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:  Amazon ELB IP change causes atomic-openshift-node NotReady state for 15 minutes


Expected results: We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load balanced Master API FQDN and not attempt to reuse a half closed connection. 


Additional info:


Supporting upstream Kubernetes documentation:
 - https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312428731
 - https://github.com/kubernetes/client-go/issues/374
 - https://github.com/kubernetes/kubernetes/issues/48638
 - https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312230215
 - https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-355143494

Comment 2 Ryan Howe 2018-04-27 16:51:10 UTC
This bug looks similar to bug 1464653 which should be fixed in 3.7.

Comment 3 Ryan Howe 2018-04-27 17:22:45 UTC
(In reply to Ryan Howe from comment #2)
> This bug looks similar to bug 1464653 which should be fixed in 3.7.

Never mind, it looks as though this issue can still happen even with the fix in bug 1464653.

    Reproducer: 
     https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-370598132

    PR that was closed:
     https://github.com/kubernetes/kubernetes/pull/48670#issuecomment-352257836
     https://github.com/kubernetes/kubernetes/pull/48670

Comment 9 Jordan Liggitt 2018-05-14 17:07:39 UTC
*** Bug 1577695 has been marked as a duplicate of this bug. ***

Comment 17 ge liu 2018-07-02 09:25:30 UTC
I'm working on it already, and will continue checking...

Comment 18 ge liu 2018-07-03 06:19:28 UTC
Tested on OCP with versions:
openshift v3.7.56
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Set up an HA environment with an ELB on an AWS cluster. Before the outage there were 3 ELB IPs, each matching one of qe-geliu-elbmaster-etcd-zone1, qe-geliu-elbmaster-etcd-zone2-1, and qe-geliu-elbmaster-etcd-zone2-2. Removed 1 IP from the ELB in the AWS UI, then confirmed that no node stayed in NotReady status for a long time. Also checked the atomic-openshift-node log and found none of the critical errors referenced in comment 1 above.

Comment 23 Michal Fojtik 2019-03-07 11:20:26 UTC
*** Bug 1584471 has been marked as a duplicate of this bug. ***

