Bug 1809665 - Router and default exposed frontends (oauth and console) should gracefully terminate [NEEDINFO]
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1809742 1812387 1818104 1819147 1828856 1868485 1868486 (view as bug list)
Depends On:
Blocks: 1805690 1809667 1871299 1809742 1818104 1819147 1868486 1869785
 
Reported: 2020-03-03 16:16 UTC by Clayton Coleman
Modified: 2020-08-22 02:29 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1809667 1824163 1869785 (view as bug list)
Environment:
Cluster frontend ingress remain available
Last Closed: 2020-08-22 02:25:07 UTC
Target Upstream Version:
Flags: wking: needinfo?(ccoleman)




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 252 None closed Bug 1809665: The oauth server should wait until it is out of rotation to shut down 2020-09-14 13:21:10 UTC
Github openshift cluster-ingress-operator pull 363 None closed Bug 1809665: Tune AWS load balancers to be consistent with other platforms 2020-09-14 13:21:09 UTC
Github openshift cluster-ingress-operator pull 366 None closed Bug 1809665: Router should deploy with a very long grace period 2020-09-14 13:21:09 UTC
Github openshift cluster-ingress-operator pull 387 None closed Bug 1809665: Re-add pod disruption budget for ingress controllers 2020-09-14 13:21:09 UTC
Github openshift cluster-network-operator pull 524 None closed Bug 1807638: Fixes to reliably save/restore flows. 2020-09-14 13:21:09 UTC
Github openshift console-operator pull 385 None closed Bug 1809665: The console should wait until it is out of rotation to shut down 2020-09-14 13:21:09 UTC
Github openshift router pull 94 None closed Bug 1809665: Start graceful shutdown on SIGTERM 2020-09-14 13:21:09 UTC

Description Clayton Coleman 2020-03-03 16:16:03 UTC
The router, console, and oauth endpoints should all gracefully terminate when their pods are marked deleted without dropping traffic.

Console and oauth can have simple "wait before shutdown" logic because they do not execute long running transactions.  The router needs to wait longer (it is a service load balancer) and then instruct HAProxy to gracefully terminate, then wait up to a limit, and then shut down.

In combination these fixes will ensure end users see no disruption of the control plane or web console, or their frontend web applications, during upgrade.

Comment 4 Clayton Coleman 2020-03-23 22:16:32 UTC
Oops, that was accidental.

Comment 5 Hongan Li 2020-03-24 09:30:10 UTC
Upgraded from 4.5.0-0.nightly-2020-03-20-200807 to 4.5.0-0.nightly-2020-03-23-213917; the console was not reachable for about 60 seconds.

When curling the console route, it shows:

curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-upg807.example.com:443

Comment 6 Dan Mace 2020-03-31 17:41:09 UTC
*** Bug 1809742 has been marked as a duplicate of this bug. ***

Comment 11 Hongan Li 2020-04-26 10:23:06 UTC
Upgraded from 4.5.0-0.nightly-2020-04-21-103613 to 4.5.0-0.nightly-2020-04-25-17044 while continuously accessing the console and oauth routes in another window; still saw connection problems during the upgrade, lasting about 50 seconds.

Sun 26 Apr 2020 05:29:34 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:43 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:46 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
403
Sun 26 Apr 2020 05:29:50 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:29:53 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
403
Sun 26 Apr 2020 05:29:57 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:30:00 PM CST
200
403
Sun 26 Apr 2020 05:30:04 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 
000
Sun 26 Apr 2020 05:30:07 PM CST
200
403
Sun 26 Apr 2020 05:30:11 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
000
403
Sun 26 Apr 2020 05:30:21 PM CST
200
403

Comment 12 Clayton Coleman 2020-04-28 13:38:55 UTC
Causes "Cluster frontend ingress remain available" to fail.

Comment 13 Clayton Coleman 2020-04-28 13:48:11 UTC
*** Bug 1828856 has been marked as a duplicate of this bug. ***

Comment 14 Clayton Coleman 2020-05-06 18:46:36 UTC
The current status of this issue: router traffic is being blackholed when it moves from the cloud service load balancer to the node after the router pod has started terminating but before termination completes. This happens because we use the Local external traffic policy to preserve the source IP of incoming packets, but in that mode kube-proxy removes the local endpoint from iptables too aggressively.

The proposed fix, under discussion upstream, is to keep the terminating pod in the local endpoint list until it actually stops, rather than blackholing the traffic. The fix is still being evaluated to ensure it does not introduce regressions or undesirable behavior, but testing on 4.5 suggests it will correct the bulk of the unavailability.
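A rough sketch of the selection rule the upstream fix proposes for services with the Local external traffic policy. The field names here are illustrative, not the actual kube-proxy data structures: prefer ready local endpoints, and when none remain, fall back to terminating-but-still-serving local endpoints instead of blackholing.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    ready: bool        # passing readiness, in normal rotation
    terminating: bool  # pod has been marked deleted
    serving: bool      # still able to accept traffic while terminating

def local_endpoints(endpoints):
    """Prefer ready local endpoints; if none exist, fall back to
    terminating endpoints that are still serving, so traffic is not
    blackholed while the last local pod drains."""
    ready = [e for e in endpoints if e.ready]
    if ready:
        return ready
    return [e for e in endpoints if e.terminating and e.serving]
```

With the pre-fix behavior, the fallback branch does not exist: once the last local pod starts terminating, the node has no local endpoint and inbound connections are dropped, which matches the 000 results seen during upgrades.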

Comment 15 Miciah Dashiel Butler Masters 2020-05-06 18:49:26 UTC
Following are the URLs for Andrew Sy Kim's KEP and WIP PR to implement the proposed fix that Clayton mentions in comment 14:

https://github.com/kubernetes/enhancements/pull/1607

https://github.com/kubernetes/kubernetes/pull/89780

Comment 16 Andrew McDermott 2020-05-07 15:57:45 UTC
*** Bug 1819147 has been marked as a duplicate of this bug. ***

Comment 17 W. Trevor King 2020-05-08 04:26:36 UTC
Taking a stab at an impact statement:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  All clusters.  Longstanding kube-core issue; not a regression.  If edges are blocked, the goal will be to route users to fixed releases as quickly as possible, after which subsequent updates will proceed more smoothly.
What is the impact?  Is it serious enough to warrant blocking edges?
  Up to several minutes of disruption in router-fronted services, including OAuth, the web console, and user Routes, whenever a node holding a router pod reboots or the router pod is otherwise terminated.  Because it is not a regression, it is not serious enough to warrant blocking edges yet.  Once we have fixed releases, we may want to prune edges among unfixed releases to route users towards fixed releases more quickly.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Issue resolves itself after a few minutes, after load balancers feeding the old node realize that the router pod is dead.

Does that sound reasonable?  If I'm missing or misunderstanding anything, can we get an impact statement from someone closer to the problem?

Comment 18 Ben Bennett 2020-05-08 23:36:41 UTC
*** Bug 1818104 has been marked as a duplicate of this bug. ***

Comment 19 Samuel Padgett 2020-05-11 19:03:43 UTC
*** Bug 1812387 has been marked as a duplicate of this bug. ***

Comment 20 Miciah Dashiel Butler Masters 2020-05-19 23:19:31 UTC
We are continue to track the upstream test (https://github.com/kubernetes/kubernetes/pull/89780) in hopes of shipping it in 4.5.

Comment 21 Miciah Dashiel Butler Masters 2020-05-19 23:23:35 UTC
(In reply to W. Trevor King from comment #17)
> Does that sound reasonable?  If I'm missing or misunderstanding anything,
> can we get an impact statement from someone closer to the problem?

The impact statement looks good.

Comment 22 Miciah Dashiel Butler Masters 2020-05-19 23:25:01 UTC
(In reply to Miciah Dashiel Butler Masters from comment #20)
> We are continue to track the upstream test
> (https://github.com/kubernetes/kubernetes/pull/89780) in hopes of shipping
> it in 4.5.

Sorry, that got hopelessly garbled...

We are continuing to track kubernetes#89780 in hopes of shipping it in 4.5 on top of the other improvements linked to this Bugzilla report.

Comment 26 Andrew McDermott 2020-05-27 16:05:16 UTC
Moving to 4.6 as not a release blocker or an upgrade blocker.

Comment 27 Miciah Dashiel Butler Masters 2020-06-18 19:28:35 UTC
We'll continue to track https://github.com/kubernetes/kubernetes/pull/89780 and any other relevant PRs that would likely mitigate downtime during upgrades in the upcoming sprint.

Comment 29 Miciah Dashiel Butler Masters 2020-07-09 05:09:24 UTC
We'll continue tracking the upstream work in the upcoming sprint.

Comment 30 Miciah Dashiel Butler Masters 2020-07-30 08:30:39 UTC
We're continuing to track the upstream work.

Comment 31 Daneyon Hansen 2020-08-13 16:04:19 UTC
*** Bug 1868485 has been marked as a duplicate of this bug. ***

Comment 32 Daneyon Hansen 2020-08-13 16:05:44 UTC
*** Bug 1868486 has been marked as a duplicate of this bug. ***

Comment 33 Miciah Dashiel Butler Masters 2020-08-20 15:54:30 UTC
We're tracking https://github.com/kubernetes/kubernetes/pull/89780 in https://issues.redhat.com/browse/NE-348.

Comment 34 Maru Newby 2020-08-22 02:10:52 UTC
Reopening and re-titling to ensure we have a tracking bz for build watchers.

Comment 35 Maru Newby 2020-08-22 02:12:31 UTC
The failure rate for this test remains high, and a tracking bz will ensure build watchers have traceability for the problem:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available

Comment 36 Maru Newby 2020-08-22 02:25:07 UTC
Reverting my changes; I realized that Sippy wants a 4.5 bz. I'll file one separately and reference this one.

