Bug 1709958

Summary: Router does not gracefully shut down and drain traffic
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: RoutingAssignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED DEFERRED QA Contact: Hongan Li <hongli>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0CC: aos-bugs, bbennett, dmace, erich, mmasters
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1710376 (view as bug list) Environment:
Last Closed: 2019-10-11 16:29:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1710376    

Description Clayton Coleman 2019-05-14 16:10:47 UTC
The router needs to have logic similar to the apiserver that is roughly:

1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)
2. then start refusing new connections
3. then wait for connections to drain with a max time (we should already have this configured in haproxy)
4. then exit with code zero

If a second TERM or INT comes in, you shut down.  Grace period for the pod needs to be set longer than time for 1+3.

This needs to be fixed in 4.1.Z, but can miss 4.1.0.  We need an e2e upgrade test that verifies that the router continues to serve traffic via the ELB without interruption during a node upgrade.  We should also verify that the router has a disruption budget that prevents it from being all taken down at once.

This was an oversight during reviewing the product for upgrade tests.  Fortunately the use of node ports should minimize the impact to just a second or so on unloaded clusters (which is why this is a 4.1.Z candidate) and we should be able to fix this before customers begin running high loads.

During a normal upgrade, the router must answer 100% of connections successfully.

Comment 1 Miciah Dashiel Butler Masters 2019-05-20 17:52:11 UTC
Would you mind clarifying some of these requirements?

> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)

Can you explain the purpose of this step?  I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router.  Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

I am implementing this requirement with a sleep before sending any signal to HAProxy: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR32

> 2. then start refusing new connections
> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)

I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

> 4. then exit with code zero

I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

> During a normal upgrade, the router must answer 100% of connections successfully.

Do you mean only that the router should seamlessly continue accepting new connections? We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

For reference, here are the relevant PRs to implement these requirements:

• https://github.com/openshift/router/pull/28 (add shell script for graceful shutdown).
• https://github.com/openshift/cluster-ingress-operator/pull/240 (use shutdown script, increase the termination grace period, and create a pod disruption budget)
• https://github.com/openshift/origin/pull/22852 (add an upgrade test)

I have manually tested the first two changes but have not gotten the upgrade test passing CI yet.

Comment 2 Clayton Coleman 2019-05-21 18:25:25 UTC
> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)

> Can you explain the purpose of this step?  I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router.  Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

You have to wait for load balancers to stop sending new traffic to you before you start accepting traffic.  You have to wait longer than the max propagation time of endpoints.  Talk to Stefan on the master team if you need more details.

> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)

> I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

That is reasonable - the shutdown time should be proportional to the max wait for idle connections (if not identical)

> 4. then exit with code zero

> I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

Before exiting, ensure all processes stop.  Don't exit if those processes don't stop.

>  During a normal upgrade, the router must answer 100% of connections successfully.

> Do you mean only that the router should seamlessly continue accepting new connections? We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

During an upgrade, 100% of new connection requests must succeed.  100% of requests that finish <max_shutdown should be the same as their upstream (no new failures injected).  100% of idle websocket/tls/tcp connections must shut down gracefully.  Connections that don't close within max_shutdown must close with the appropriate error for a proxy.

Comment 3 Ben Bennett 2019-07-30 14:16:31 UTC
*** Bug 1710376 has been marked as a duplicate of this bug. ***

Comment 5 Dan Mace 2019-09-17 17:13:42 UTC
We've decided to fix this in 4.2.

Comment 6 Dan Mace 2019-09-18 21:50:48 UTC
Now that we've got a more clear picture of the solution, we've decided it's too risky to do in 4.2. Moving it back to 4.3 and will focus on backporting a fix ASAP.

Comment 7 Dan Mace 2019-10-11 16:29:08 UTC
We're going to continue tracking this in https://jira.coreos.com/browse/NE-203.