Description of problem:
When IP addresses assigned to an Amazon ELB that load balances incoming requests for the Master API are removed, any atomic-openshift-node service that was bound to an IP address which no longer load balances Master API requests enters a NotReady state. We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load-balanced Master API FQDN and not attempt to reuse a half-closed connection.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
An Amazon ELB IP change causes atomic-openshift-node to stay in the NotReady state for 15 minutes.

Expected results:
We expect atomic-openshift-node to gracefully handle the loss of connectivity to a specific IP address associated with a load-balanced Master API FQDN and not attempt to reuse a half-closed connection.

Additional info:
Supporting upstream Kubernetes documentation:
- https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312428731
- https://github.com/kubernetes/client-go/issues/374
- https://github.com/kubernetes/kubernetes/issues/48638
- https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312230215
- https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-355143494
This bug looks similar to bug 1464653 which should be fixed in 3.7.
(In reply to Ryan Howe from comment #2)
> This bug looks similar to bug 1464653 which should be fixed in 3.7.

Never mind, it looks as though this issue can still happen even with the fix in bug 1464653.

Reproducer:
https://github.com/kubernetes/kubernetes/pull/52176#issuecomment-370598132

PR that was closed:
https://github.com/kubernetes/kubernetes/pull/48670#issuecomment-352257836
https://github.com/kubernetes/kubernetes/pull/48670
*** Bug 1577695 has been marked as a duplicate of this bug. ***
I'm working on it already, and will continue checking...
Tested on OCP with the following versions:
openshift v3.7.56
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Set up an HA environment with an ELB on an AWS cluster. Before the outage there were 3 ELB IPs, each matching one of qe-geliu-elbmaster-etcd-zone1, qe-geliu-elbmaster-etcd-zone2-1, and qe-geliu-elbmaster-etcd-zone2-2. Removed 1 IP from the ELB in the AWS UI, then confirmed that no node stayed in NotReady status for a long time. Also checked the atomic-openshift-node log: none of the critical errors referenced in comment 1 above appeared.
*** Bug 1584471 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days