Bug 1710376 - Router does not gracefully shut down and drain traffic
Summary: Router does not gracefully shut down and drain traffic
Keywords:
Status: CLOSED DUPLICATE of bug 1709958
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On: 1709958
Blocks:
Reported: 2019-05-15 12:52 UTC by Ben Bennett
Modified: 2022-08-04 22:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1709958
Environment:
Last Closed: 2019-07-30 14:16:31 UTC
Target Upstream Version:
Embargoed:



Description Ben Bennett 2019-05-15 12:52:44 UTC
+++ This bug was initially created as a clone of Bug #1709958 +++

The router needs to have logic similar to the apiserver that is roughly:

1. On delete, wait a fixed amount of time (long enough for other proxies to drain, usually 5-30s for kube-proxy and 20-40s for ELB)
2. Then start refusing new connections
3. Then wait for connections to drain, up to a maximum time (we should already have this configured in HAProxy)
4. Then exit with code zero

If a second TERM or INT comes in, shut down immediately.  The grace period for the pod needs to be set longer than the combined time for steps 1 and 3.
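
The sequence above can be sketched roughly as follows. This is a minimal illustration in Python, not the router's actual implementation: the class name, the `stop_accepting` and `active_connections` callbacks, the polling interval, and the `margin` in `min_grace_period` are all hypothetical, and the 30s/10s values in `main` are placeholders.

```python
import signal
import sys
import time


class GracefulShutdown:
    """Hypothetical helper that walks through the four shutdown steps."""

    def __init__(self, pre_stop_delay, drain_timeout, stop_accepting,
                 active_connections, sleep=time.sleep, clock=time.monotonic,
                 poll=0.1):
        self.pre_stop_delay = pre_stop_delay          # step 1 wait, seconds
        self.drain_timeout = drain_timeout            # step 3 maximum, seconds
        self.stop_accepting = stop_accepting          # callback for step 2
        self.active_connections = active_connections  # returns open connection count
        self.sleep = sleep
        self.clock = clock
        self.poll = poll
        self.signals = 0

    def on_signal(self, signum=None, frame=None):
        # Only count signals here; the waits below poll this counter.
        self.signals += 1

    def _forced(self):
        # A second TERM or INT means "stop waiting and shut down now".
        return self.signals >= 2

    def _wait(self, seconds, done=lambda: False):
        deadline = self.clock() + seconds
        while not self._forced() and not done() and self.clock() < deadline:
            self.sleep(self.poll)

    def run(self):
        # Step 1: keep serving while upstream proxies drain.
        self._wait(self.pre_stop_delay)
        # Step 2: refuse new connections.
        self.stop_accepting()
        # Step 3: wait for open connections to finish, up to the maximum.
        self._wait(self.drain_timeout, lambda: self.active_connections() == 0)
        # Step 4: exit code zero. The description does not say which code a
        # forced shutdown should use; this sketch returns 0 either way.
        return 0


def min_grace_period(pre_stop_delay, drain_timeout, margin=5.0):
    """Pod grace period covering steps 1 and 3; the margin is an assumption."""
    return pre_stop_delay + drain_timeout + margin


def main():
    shutdown = GracefulShutdown(
        pre_stop_delay=30.0,   # placeholder value
        drain_timeout=10.0,    # placeholder value
        stop_accepting=lambda: print("refusing new connections"),
        active_connections=lambda: 0,
    )
    signal.signal(signal.SIGTERM, shutdown.on_signal)
    signal.signal(signal.SIGINT, shutdown.on_signal)
    while shutdown.signals == 0:
        time.sleep(0.1)
    sys.exit(shutdown.run())
```

`main` is not called automatically; it only shows how the handler might be wired to TERM and INT. With these placeholder values, `min_grace_period(30.0, 10.0)` suggests a pod grace period of at least 45 seconds.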

This needs to be fixed in 4.1.Z, but can miss 4.1.0.  We need an e2e upgrade test that verifies that the router continues to serve traffic via the ELB without interruption during a node upgrade.  We should also verify that the router has a disruption budget that prevents all router pods from being taken down at once.

This was an oversight when reviewing the product for upgrade tests.  Fortunately, the use of node ports should limit the impact to just a second or so on unloaded clusters (which is why this is a 4.1.Z candidate), and we should be able to fix this before customers begin running high loads.

During a normal upgrade, the router must answer 100% of connections successfully.
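
One way to check the 100% requirement is to probe a route continuously while the upgrade runs and count failures. The sketch below is only an illustration of that idea, not the actual e2e test: `probe_http`, `measure_availability`, and their parameters are hypothetical names, and the URL to probe is left to the caller.

```python
import http.client
import time
import urllib.request


def probe_http(url, timeout=2.0):
    """Return True if a single GET to url succeeds (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, timeout, or an HTTP error status.
        return False


def measure_availability(probe, attempts, interval=0.0, sleep=time.sleep):
    """Call probe() attempts times and return the fraction that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        if not probe():
            failures += 1
        sleep(interval)
    return (attempts - failures) / attempts
```

A test built on this would run `measure_availability(lambda: probe_http(route_url), ...)` for the duration of the upgrade, with `route_url` supplied by the caller, and fail if the result is anything below 1.0.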

Comment 1 Ben Bennett 2019-07-30 14:16:31 UTC
Closing this duplicate.

*** This bug has been marked as a duplicate of bug 1709958 ***

