Bug 1809667

Summary: Router and default exposed frontends (oauth and console) should gracefully terminate
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Networking
Assignee: Clayton Coleman <ccoleman>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: high
CC: amcdermo, aos-bugs, bbennett, dffrench, dmace, hongli, ikarpukh, jeder, ltsai, mgahagan, mleonard, mmasters, wking
Version: 4.4
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1809665
: 1809668
Environment:
Last Closed: 2020-11-06 14:19:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1809665, 1869785
Bug Blocks: 1794169, 1809668

Description Clayton Coleman 2020-03-03 16:17:08 UTC
+++ This bug was initially created as a clone of Bug #1809665 +++

The router, console, and oauth endpoints should all gracefully terminate when their pods are marked deleted without dropping traffic.

Console and oauth can use simple "wait before shutdown" logic because they do not execute long-running transactions.  The router needs to wait longer (it is a service load balancer), then instruct HAProxy to terminate gracefully, wait up to a limit, and then shut down.
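The "wait before shutdown" idea can be sketched roughly as follows. This is a minimal illustration of the pattern, not the actual console/oauth code; the delay value and function name are hypothetical, and a real component would wait several seconds, not 0.2:

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical delay; a real server waits long enough for endpoint
# removal to propagate to load balancers (typically several seconds).
ENDPOINT_PROPAGATION_DELAY = 0.2


class OkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the example quiet
        pass


def serve_with_graceful_shutdown(port=0):
    """Serve until SIGTERM, then keep serving briefly before stopping.

    The process keeps answering requests while its endpoint is being
    removed from load balancers, so traffic is not dropped the instant
    the pod is marked deleted.
    """
    server = HTTPServer(("127.0.0.1", port), OkHandler)
    stop = threading.Event()
    # The kubelet sends SIGTERM when the pod is marked deleted.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())

    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()

    while not stop.is_set():  # wait for the termination signal
        stop.wait(0.05)

    # "Wait before shutdown": keep serving while load balancers and
    # kube-proxy drop this endpoint, then stop accepting connections.
    time.sleep(ENDPOINT_PROPAGATION_DELAY)
    server.shutdown()
    worker.join()
    return "shutdown complete"
```

The router's version is more involved: after the delay it must also signal HAProxy to drain existing connections and enforce an upper bound on how long draining may take.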

In combination, these fixes will ensure end users see no disruption of the control plane, the web console, or their frontend web applications during upgrade.

Comment 1 Ben Bennett 2020-03-06 14:19:26 UTC
*** Bug 1794169 has been marked as a duplicate of this bug. ***

Comment 2 Dan Mace 2020-03-10 12:36:01 UTC
*** Bug 1811858 has been marked as a duplicate of this bug. ***

Comment 5 Hongan Li 2020-03-20 10:11:26 UTC
Upgraded from 4.4.0-0.nightly-2020-03-19-181721 to 4.4.0-0.nightly-2020-03-19-205629 and found that the console is still not accessible for about 70 seconds.
During that period, it seems the nodes are being updated by machine-config and the node status keeps changing:

$ oc get node
NAME                                         STATUS                        ROLES    AGE     VERSION
ip-10-0-129-101.us-east-2.compute.internal   Ready                         master   3h12m   v1.17.1
ip-10-0-136-49.us-east-2.compute.internal    Ready,SchedulingDisabled      worker   3h      v1.17.1
ip-10-0-144-234.us-east-2.compute.internal   Ready                         master   3h12m   v1.17.1
ip-10-0-159-139.us-east-2.compute.internal   Ready                         worker   3h      v1.17.1
ip-10-0-160-16.us-east-2.compute.internal    NotReady,SchedulingDisabled   master   3h12m   v1.17.1


$ oc get node
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-10-0-129-101.us-east-2.compute.internal   Ready,SchedulingDisabled   master   3h13m   v1.17.1
ip-10-0-136-49.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   3h2m    v1.17.1
ip-10-0-144-234.us-east-2.compute.internal   Ready                      master   3h13m   v1.17.1
ip-10-0-159-139.us-east-2.compute.internal   Ready                      worker   3h2m    v1.17.1
ip-10-0-160-16.us-east-2.compute.internal    Ready                      master   3h13m   v1.17.1

Comment 6 Dan Mace 2020-03-23 11:39:51 UTC
If nodes are flapping erratically, let's open a new issue for that. I'm afraid of this issue becoming conflated with things that aren't related to ingress.

Comment 7 David Ffrench 2020-03-23 15:40:18 UTC
Do you know which release this fix will be included in, and whether it will be backported to 4.3?

Comment 8 Hongan Li 2020-03-24 01:49:55 UTC
Hi Dan, I think the erratic node flapping is caused by the machine-config operator upgrade.
I agree that it may not be related to ingress, but this BZ does include fixes from other components such as console and oauth.

Another reason is that the linked customer issues are all about upgrade, and the BZ description says "In combination these fixes will ensure end users see no disruption of the control plane or web console, or their frontend web applications, during upgrade", so opening a new issue would make it harder to trace and verify the upgrade problem.

WDYT?

Comment 10 Hongan Li 2020-03-30 06:25:10 UTC
Upgraded from 4.4.0-0.nightly-2020-03-27-223052 to 4.4.0-0.nightly-2020-03-29-132004; the console route is not accessible for about 30 seconds during the upgrade process.

In this period, the result of `curl $console_route` is:
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received
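The kind of availability measurement behind these numbers can be sketched as a polling probe. This is a hypothetical helper for illustration, not QE's actual tooling; the function name and defaults are made up:

```python
import time
import urllib.request


def longest_outage(url, duration=60.0, interval=1.0, timeout=5.0):
    """Poll `url` for `duration` seconds; return the longest
    continuous outage window observed, in seconds."""
    worst = 0.0
    outage_start = None
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        now = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=timeout)
            if outage_start is not None:  # route recovered
                worst = max(worst, now - outage_start)
                outage_start = None
        except OSError:  # timeouts and connection errors both count
            if outage_start is None:  # outage begins
                outage_start = now
        time.sleep(interval)
    if outage_start is not None:  # still down when the window ended
        worst = max(worst, time.monotonic() - outage_start)
    return worst
```

Run against the console route during an upgrade, a result near zero would indicate the graceful-termination fixes are working; the ~30 and ~70 second windows reported above correspond to a nonzero worst-case gap.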

Comment 14 Ben Bennett 2020-05-08 19:59:31 UTC
Waiting for the master work to complete.

Comment 15 Andrew McDermott 2020-05-28 16:05:55 UTC
Per comment #14 - Waiting for the master work to complete.

Comment 16 milti leonard 2020-06-27 20:48:54 UTC
Is there any update on the work regarding this BZ, please?

Comment 17 Andrew McDermott 2020-07-09 12:12:50 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 18 Andrew McDermott 2020-07-30 10:11:38 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 19 Miciah Dashiel Butler Masters 2020-08-21 05:11:44 UTC
We'll continue tracking this issue in the upcoming sprint.

Comment 20 Andrew McDermott 2020-09-10 11:55:34 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 21 Miciah Dashiel Butler Masters 2020-10-26 05:37:29 UTC
The remaining known issue is being tracked in https://issues.redhat.com/browse/NE-348 (graceful termination for LoadBalancer-type services using the "Local" external traffic policy); no backport of NE-348 is planned for 4.4.z.

Comment 22 Scott Dodson 2020-11-05 18:22:04 UTC
CLOSED DEFERRED, then?