Bug 1809665

Summary: Router and default exposed frontends (oauth and console) should gracefully terminate
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Routing
Assignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED DEFERRED
QA Contact: Hongan Li <hongli>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.4
CC: alchan, amcdermo, aos-bugs, bbennett, bparees, dmace, fiezzi, hongkliu, knarra, lmohanty, miabbott, mmasters, mnewby, oarribas, pstrick, sdodson, skunkerk, wking, wvoesch
Target Milestone: ---
Keywords: Reopened, Upgrades
Target Release: 4.6.0
Flags: wking: needinfo? (ccoleman)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1809667 1824163 1869785
Environment: Cluster frontend ingress remain available
Last Closed: 2020-08-22 02:25:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1809667, 1805690, 1809742, 1818104, 1819147, 1868486, 1869785, 1871299

Description Clayton Coleman 2020-03-03 16:16:03 UTC
The router, console, and oauth endpoints should all gracefully terminate when their pods are marked deleted without dropping traffic.

Console and oauth can have simple "wait before shutdown" logic because they do not execute long running transactions.  The router needs to wait longer (it is a service load balancer) and then instruct HAProxy to gracefully terminate, then wait up to a limit, and then shut down.

In combination these fixes will ensure end users see no disruption of the control plane or web console, or their frontend web applications, during upgrade.

Comment 4 Clayton Coleman 2020-03-23 22:16:32 UTC
Oops, that was accidental.

Comment 5 Hongan Li 2020-03-24 09:30:10 UTC
During an upgrade from 4.5.0-0.nightly-2020-03-20-200807 to 4.5.0-0.nightly-2020-03-23-213917, the console was not reachable for about 60 seconds.

Curling the console route shows:

curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-upg807.example.com:443

Comment 6 Dan Mace 2020-03-31 17:41:09 UTC
*** Bug 1809742 has been marked as a duplicate of this bug. ***

Comment 11 Hongan Li 2020-04-26 10:23:06 UTC
Upgraded from 4.5.0-0.nightly-2020-04-21-103613 to 4.5.0-0.nightly-2020-04-25-17044 while continuously accessing the console and oauth routes in another window. Connection problems still occurred during the upgrade and lasted for about 50 seconds.

Sun 26 Apr 2020 05:29:34 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:43 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:46 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
403
Sun 26 Apr 2020 05:29:50 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:53 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
403
Sun 26 Apr 2020 05:29:57 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:30:00 PM CST
200
403
Sun 26 Apr 2020 05:30:04 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:30:07 PM CST
200
403
Sun 26 Apr 2020 05:30:11 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
000
403
Sun 26 Apr 2020 05:30:21 PM CST
200
403

Comment 12 Clayton Coleman 2020-04-28 13:38:55 UTC
Causes the "Cluster frontend ingress remain available" test to fail.

Comment 13 Clayton Coleman 2020-04-28 13:48:11 UTC
*** Bug 1828856 has been marked as a duplicate of this bug. ***

Comment 14 Clayton Coleman 2020-05-06 18:46:36 UTC
The current status of this issue is that router traffic is blackholed when it moves from the cloud service load balancer to the node after the router pod has started termination but before termination completes. This happens because we use the local traffic policy to ensure that the source IP of incoming packets is preserved, but in that mode kube-proxy too aggressively removes the local endpoint from iptables.

The proposed fix, which is being discussed upstream, is to keep the terminating pod in the local list until it actually stops, and only then blackhole the traffic.  The fix is still being explored to ensure it does not introduce regressions or undesirable behavior, but some testing in 4.5 indicates that the fix will correct the bulk of the unavailability.
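
To make the endpoint-selection idea concrete, here is a minimal Go sketch. It is not kube-proxy code and does not reflect the final upstream design: the Endpoint type, its fields, and localTargets are hypothetical. It also assumes one particular policy, namely preferring ready local endpoints and falling back to terminating local endpoints only when no ready one exists.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified view of a service endpoint.
type Endpoint struct {
	IP          string
	Local       bool // pod runs on this node
	Ready       bool
	Terminating bool // pod is marked deleted but has not stopped yet
}

// localTargets returns the endpoints on this node that should keep receiving
// traffic under a local traffic policy. Assumed policy: use ready local
// endpoints when there are any; otherwise keep terminating local endpoints
// in the list instead of dropping traffic while the pod shuts down.
func localTargets(eps []Endpoint) []Endpoint {
	var ready, terminating []Endpoint
	for _, ep := range eps {
		if !ep.Local {
			continue
		}
		switch {
		case ep.Ready && !ep.Terminating:
			ready = append(ready, ep)
		case ep.Terminating:
			terminating = append(terminating, ep)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return terminating
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.0.1", Local: true, Terminating: true},
		{IP: "10.0.0.2", Local: false, Ready: true},
	}
	for _, ep := range localTargets(eps) {
		fmt.Println("forward to", ep.IP)
	}
}
```

With the current behavior described above, the terminating endpoint would already be gone from the list and the node would have no local target at all.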

Comment 15 Miciah Dashiel Butler Masters 2020-05-06 18:49:26 UTC
Following are the URLs for Andrew Sy Kim's KEP and WIP PR to implement the proposed fix that Clayton mentions in comment 14:

https://github.com/kubernetes/enhancements/pull/1607

https://github.com/kubernetes/kubernetes/pull/89780

Comment 16 Andrew McDermott 2020-05-07 15:57:45 UTC
*** Bug 1819147 has been marked as a duplicate of this bug. ***

Comment 17 W. Trevor King 2020-05-08 04:26:36 UTC
Taking a stab at an impact statement:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  All clusters.  Longstanding kube-core issue; not a regression.  If edges are blocked, the goal will be to route users to fixed releases as quickly as possible, after which subsequent updates will proceed more smoothly.
What is the impact?  Is it serious enough to warrant blocking edges?
  Up to several minutes of disruption in router-fronted services, including OAuth, the web console, and user Routes, whenever a node holding a router pod reboots or the router pod is otherwise terminated.  Because it is not a regression, it is not serious enough to warrant blocking edges yet.  Once we have fixed releases, we may want to prune edges among unfixed releases to route users towards fixed releases more quickly.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Issue resolves itself after a few minutes, after load balancers feeding the old node realize that the router pod is dead.

Does that sound reasonable?  If I'm missing or misunderstanding anything, can we get an impact statement from someone closer to the problem?

Comment 18 Ben Bennett 2020-05-08 23:36:41 UTC
*** Bug 1818104 has been marked as a duplicate of this bug. ***

Comment 19 Samuel Padgett 2020-05-11 19:03:43 UTC
*** Bug 1812387 has been marked as a duplicate of this bug. ***

Comment 20 Miciah Dashiel Butler Masters 2020-05-19 23:19:31 UTC
We are continue to track the upstream test (https://github.com/kubernetes/kubernetes/pull/89780) in hopes of shipping it in 4.5.

Comment 21 Miciah Dashiel Butler Masters 2020-05-19 23:23:35 UTC
(In reply to W. Trevor King from comment #17)
> Does that sound reasonable?  If I'm missing or misunderstanding anything,
> can we get an impact statement from someone closer to the problem?

The impact statement looks good.

Comment 22 Miciah Dashiel Butler Masters 2020-05-19 23:25:01 UTC
(In reply to Miciah Dashiel Butler Masters from comment #20)
> We are continue to track the upstream test
> (https://github.com/kubernetes/kubernetes/pull/89780) in hopes of shipping
> it in 4.5.

Sorry, that got hopelessly garbled...

We are continuing to track kubernetes#89780 in hopes of shipping it in 4.5 on top of the other improvements linked to this Bugzilla report.

Comment 26 Andrew McDermott 2020-05-27 16:05:16 UTC
Moving to 4.6 as not a release blocker or an upgrade blocker.

Comment 27 Miciah Dashiel Butler Masters 2020-06-18 19:28:35 UTC
In the upcoming sprint, we'll continue to track https://github.com/kubernetes/kubernetes/pull/89780 and any other relevant PRs that would likely mitigate downtime during upgrades.

Comment 29 Miciah Dashiel Butler Masters 2020-07-09 05:09:24 UTC
We'll continue to track the upstream work in the upcoming sprint.

Comment 30 Miciah Dashiel Butler Masters 2020-07-30 08:30:39 UTC
We're continuing to track the upstream work.

Comment 31 Daneyon Hansen 2020-08-13 16:04:19 UTC
*** Bug 1868485 has been marked as a duplicate of this bug. ***

Comment 32 Daneyon Hansen 2020-08-13 16:05:44 UTC
*** Bug 1868486 has been marked as a duplicate of this bug. ***

Comment 33 Miciah Dashiel Butler Masters 2020-08-20 15:54:30 UTC
We're tracking https://github.com/kubernetes/kubernetes/pull/89780 in https://issues.redhat.com/browse/NE-348.

Comment 34 Maru Newby 2020-08-22 02:10:52 UTC
Reopening and re-titling to ensure we have a tracking bz for build watchers.

Comment 35 Maru Newby 2020-08-22 02:12:31 UTC
The failure rate for this test remains high and a tracking bz will ensure build watchers have traceability for the problem:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available

Comment 36 Maru Newby 2020-08-22 02:25:07 UTC
Reverting changes; I realized that sippy wants a 4.5 bz. I'll file separately and reference this one.