Bug 1710376

Summary: Router does not gracefully shut down and drain traffic
Product: OpenShift Container Platform Reporter: Ben Bennett <bbennett>
Component: Networking Assignee: Dan Mace <dmace>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: medium CC: aos-bugs, ccoleman, dmace, hongli
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1709958 Environment:
Last Closed: 2019-07-30 14:16:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1709958    
Bug Blocks:    

Description Ben Bennett 2019-05-15 12:52:44 UTC
+++ This bug was initially created as a clone of Bug #1709958 +++

The router needs to have logic similar to the apiserver that is roughly:

1. On delete, wait a fixed amount of time (long enough for other proxies to drain, usually 5-30s for kube-proxy and 20-40s for ELB)
2. then start refusing new connections
3. then wait for connections to drain with a max time (we should already have this configured in haproxy)
4. then exit with code zero

If a second TERM or INT arrives during this sequence, shut down immediately.  The grace period for the pod needs to be set longer than the combined time for steps 1 and 3.

This needs to be fixed in 4.1.Z, but can miss 4.1.0.  We need an e2e upgrade test that verifies that the router continues to serve traffic via the ELB without interruption during a node upgrade.  We should also verify that the router has a disruption budget that prevents all of its replicas from being taken down at once.

This was an oversight when reviewing the product for upgrade tests.  Fortunately the use of node ports should limit the impact to about a second on unloaded clusters (which is why this is a 4.1.Z candidate), and we should be able to fix this before customers begin running high loads.

During a normal upgrade, the router must answer 100% of connections successfully.

Comment 1 Ben Bennett 2019-07-30 14:16:31 UTC
Closing this duplicate.

*** This bug has been marked as a duplicate of bug 1709958 ***