Bug 1809667 - Router and default exposed frontends (oauth and console) should gracefully terminate
Summary: Router and default exposed frontends (oauth and console) should gracefully terminate
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Clayton Coleman
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1794169 1811858
Depends On: 1809665 1869785
Blocks: 1809668 1794169
 
Reported: 2020-03-03 16:17 UTC by Clayton Coleman
Modified: 2020-10-27 14:16 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1809665
Clones: 1809668
Environment:
Last Closed:
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 255 None closed [4.4] Bug 1809667: The oauth server should wait until it is out of rotation to shut down 2020-10-22 19:38:53 UTC
Github openshift cluster-ingress-operator pull 368 None closed Bug 1809667: Router should deploy with a very long grace period 2020-10-22 19:39:06 UTC
Github openshift cluster-ingress-operator pull 370 None closed Bug 1809667: Tune AWS load balancers to be consistent with other platforms 2020-10-22 19:39:06 UTC
Github openshift console-operator pull 387 None closed [release-4.4] Bug 1809667: The console should wait until it is out of rotation to shut down 2020-10-22 19:38:54 UTC
Github openshift router pull 97 None closed [release-4.4] Bug 1809667: Start graceful shutdown on SIGTERM 2020-10-22 19:39:07 UTC

Description Clayton Coleman 2020-03-03 16:17:08 UTC
+++ This bug was initially created as a clone of Bug #1809665 +++

The router, console, and oauth endpoints should all terminate gracefully, without dropping traffic, when their pods are marked for deletion.

Console and oauth can use simple "wait before shutdown" logic because they do not execute long-running transactions. The router needs to wait longer (it is a service load balancer), then instruct HAProxy to terminate gracefully, wait up to a limit, and then shut down.

In combination these fixes will ensure end users see no disruption of the control plane or web console, or their frontend web applications, during upgrade.
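For illustration, a minimal Go sketch of the "wait before shutdown" pattern described above (the console/oauth case). The listen address, the 45-second delay, and the 30-second drain limit are assumptions for the sketch, not the values the actual operators use; the router variant additionally signals HAProxy to drain before exiting.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Illustrative listen address only.
	srv := &http.Server{Addr: ":8443"}
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until the kubelet sends SIGTERM after the pod is marked for deletion.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	// Keep answering requests while endpoints/load balancers take this pod
	// out of rotation. The 45s delay is an assumption for the sketch.
	time.Sleep(45 * time.Second)

	// Then drain in-flight requests, bounded by a grace limit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not complete cleanly: %v", err)
	}
}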

Comment 1 Ben Bennett 2020-03-06 14:19:26 UTC
*** Bug 1794169 has been marked as a duplicate of this bug. ***

Comment 2 Dan Mace 2020-03-10 12:36:01 UTC
*** Bug 1811858 has been marked as a duplicate of this bug. ***

Comment 5 Hongan Li 2020-03-20 10:11:26 UTC
Upgraded from 4.4.0-0.nightly-2020-03-19-181721 to 4.4.0-0.nightly-2020-03-19-205629 and the console is still not accessible for about 70 seconds.
During that period it seems the nodes are being updated by machine-config, and the node status keeps changing:

$ oc get node
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-129-101.us-east-2.compute.internal   Ready                         master   3h12m   v1.17.1
ip-10-0-136-49.us-east-2.compute.internal    Ready,SchedulingDisabled      worker   3h      v1.17.1
ip-10-0-144-234.us-east-2.compute.internal   Ready                         master   3h12m   v1.17.1
ip-10-0-159-139.us-east-2.compute.internal   Ready                         worker   3h      v1.17.1
ip-10-0-160-16.us-east-2.compute.internal    NotReady,SchedulingDisabled   master   3h12m   v1.17.1


$ oc get node
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-129-101.us-east-2.compute.internal   Ready,SchedulingDisabled   master   3h13m   v1.17.1
ip-10-0-136-49.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   3h2m    v1.17.1
ip-10-0-144-234.us-east-2.compute.internal   Ready                      master   3h13m   v1.17.1
ip-10-0-159-139.us-east-2.compute.internal   Ready                      worker   3h2m    v1.17.1
ip-10-0-160-16.us-east-2.compute.internal    Ready                      master   3h13m   v1.17.1

Comment 6 Dan Mace 2020-03-23 11:39:51 UTC
If nodes are flapping erratically, let's open a new issue for that. I'm afraid of this issue getting too confused with things that aren't related to ingress.

Comment 7 David Ffrench 2020-03-23 15:40:18 UTC
Do you know which release this fix will be included in, and whether it will be backported to 4.3?

Comment 8 Hongan Li 2020-03-24 01:49:55 UTC
Hi Dan, I think the erratic node flapping is caused by the machine-config operator upgrade.
I agree that it may not be related to ingress, but this BZ already includes fixes from other components such as console and auth.

Another reason is that the linked customer issues are all about upgrade, and the BZ description says "In combination these fixes will ensure end users see no disruption of the control plane or web console, or their frontend web applications, during upgrade", so opening a new issue may not be a good way to trace and verify the upgrade problem.

WDYT?

Comment 10 Hongan Li 2020-03-30 06:25:10 UTC
Upgraded from 4.4.0-0.nightly-2020-03-27-223052 to 4.4.0-0.nightly-2020-03-29-132004; the console route is not accessible for about 30 seconds during the upgrade.

During this period, the result of `curl $console_route` is:
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
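
For context, a rough Go sketch of the kind of probe that measures this outage window. The route URL argument and the 1-second poll interval are assumptions for the example; the 5-second client timeout mirrors the curl command above, and TLS certificate handling is omitted for brevity.

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Polls a URL (e.g. the console route) once per second and reports how long
// it stays unreachable.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: probe <url>")
		os.Exit(1)
	}
	url := os.Args[1]
	client := &http.Client{Timeout: 5 * time.Second}

	var downSince time.Time
	for {
		resp, err := client.Get(url)
		if err != nil {
			if downSince.IsZero() {
				downSince = time.Now()
			}
			fmt.Printf("%s unreachable: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			resp.Body.Close()
			if !downSince.IsZero() {
				fmt.Printf("recovered after %s\n", time.Since(downSince).Round(time.Second))
				downSince = time.Time{}
			}
		}
		time.Sleep(1 * time.Second)
	}
}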

Comment 14 Ben Bennett 2020-05-08 19:59:31 UTC
Waiting for the master work to complete.

Comment 15 Andrew McDermott 2020-05-28 16:05:55 UTC
Per comment #14 - Waiting for the master work to complete.

Comment 16 milti leonard 2020-06-27 20:48:54 UTC
Is there any update on the work regarding this BZ, please?

Comment 17 Andrew McDermott 2020-07-09 12:12:50 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 18 Andrew McDermott 2020-07-30 10:11:38 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 19 Miciah Dashiel Butler Masters 2020-08-21 05:11:44 UTC
We'll continue tracking this issue in the upcoming sprint.

Comment 20 Andrew McDermott 2020-09-10 11:55:34 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 21 Miciah Dashiel Butler Masters 2020-10-26 05:37:29 UTC
The remaining known issue is being tracked in https://issues.redhat.com/browse/NE-348 (graceful termination for LoadBalancer-type services using the "Local" external traffic policy); no backport of NE-348 is planned for 4.4.z.

