Bug 1709958 - Router does not gracefully shut down and drain traffic
Summary: Router does not gracefully shut down and drain traffic
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.3.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
: 1710376 (view as bug list)
Depends On:
Blocks: 1710376
TreeView+ depends on / blocked
 
Reported: 2019-05-14 16:10 UTC by Clayton Coleman
Modified: 2019-10-11 16:29 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1710376 (view as bug list)
Environment:
Last Closed: 2019-10-11 16:29:08 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 240 None closed Bug 1709958: Use graceful shutdown for router 2020-06-17 22:05:07 UTC
Github openshift cluster-ingress-operator pull 280 None closed Bug 1709958: Avoid dropping traffic during upgrade 2020-06-17 22:05:07 UTC
Github openshift router pull 28 None closed Add shell script for graceful shutdown 2020-06-17 22:05:07 UTC
Origin (Github) 22852 None None None 2019-05-20 17:52:10 UTC

Description Clayton Coleman 2019-05-14 16:10:47 UTC
The router needs to have logic similar to the apiserver that is roughly:

1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)
2. then start refusing new connections
3. then wait for connections to drain with a max time (we should already have this configured in haproxy)
4. then exit with code zero

If a second TERM or INT comes in, you shut down.  Grace period for the pod needs to be set longer than time for 1+3.

This needs to be fixed in 4.1.Z, but can miss 4.1.0.  We need an e2e upgrade test that verifies that the router continues to serve traffic via the ELB without interruption during a node upgrade.  We should also verify that the router has a disruption budget that prevents it from being all taken down at once.

This was an oversight during reviewing the product for upgrade tests.  Fortunately the use of node ports should minimize the impact to just a second or so on unloaded clusters (which is why this is a 4.1.Z candidate) and we should be able to fix this before customers begin running high loads.

During a normal upgrade, the router must answer 100% of connections successfully.

Comment 1 Miciah Dashiel Butler Masters 2019-05-20 17:52:11 UTC
Would you mind clarifying some of these requirements?

> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)

Can you explain the purpose of this step?  I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router.  Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

I am implementing this requirement with a sleep before sending any signal to HAProxy: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR32

> 2. then start refusing new connections
> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)

I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

> 4. then exit with code zero

I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

> During a normal upgrade, the router must answer 100% of connections successfully.

Do you mean only that the router should seamlessly continue accepting new connections? We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

For reference, here are the relevant PRs to implement these requirements:

• https://github.com/openshift/router/pull/28 (add shell script for graceful shutdown).
• https://github.com/openshift/cluster-ingress-operator/pull/240 (use shutdown script, increase the termination grace period, and create a pod disruption budget)
• https://github.com/openshift/origin/pull/22852 (add an upgrade test)

I have manually tested the first two changes but have not gotten the upgrade test passing CI yet.

Comment 2 Clayton Coleman 2019-05-21 18:25:25 UTC
> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)

> Can you explain the purpose of this step?  I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router.  Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

You have to wait for load balancers to stop sending new traffic to you before you start accepting traffic.  You have to wait longer than the max propagation time of endpoints.  Talk to Stefan on the master team if you need more details.

> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)

> I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

That is reasonable - the shutdown time should be proportional to the max wait for idle connections (if not identical)

> 4. then exit with code zero

> I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

Before exiting, ensure all processes stop.  Don't exit if those processes don't stop.

>  During a normal upgrade, the router must answer 100% of connections successfully.

> Do you mean only that the router should seamlessly continue accepting new connections? We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

During an upgrade, 100% of new connection requests must succeed.  100% of requests that finish <max_shutdown should be the same as their upstream (no new failures injected).  100% of idle websocket/tls/tcp connections must shut down gracefully.  Connections that don't close within max_shutdown must close with the appropriate error for a proxy.

Comment 3 Ben Bennett 2019-07-30 14:16:31 UTC
*** Bug 1710376 has been marked as a duplicate of this bug. ***

Comment 5 Dan Mace 2019-09-17 17:13:42 UTC
We've decided to fix this in 4.2.

Comment 6 Dan Mace 2019-09-18 21:50:48 UTC
Now that we've got a more clear picture of the solution, we've decided it's too risky to do in 4.2. Moving it back to 4.3 and will focus on backporting a fix ASAP.

Comment 7 Dan Mace 2019-10-11 16:29:08 UTC
We're going to continue tracking this in https://jira.coreos.com/browse/NE-203.


Note You need to log in before you can comment on or make changes to this bug.