Bug 1709958
Summary: Router does not gracefully shut down and drain traffic

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | Clayton Coleman <ccoleman> |
| Component | Networking | Assignee | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component | router | QA Contact | Hongan Li <hongli> |
| Status | CLOSED DEFERRED | Severity | high |
| Priority | high | CC | aos-bugs, bbennett, dmace, erich, mmasters |
| Version | 4.1.0 | Target Release | 4.3.0 |
| Duplicates | 1710376 (view as bug list) | Bug Blocks | 1710376 |
| Last Closed | 2019-10-11 16:29:08 UTC | Type | Bug |
Description
Clayton Coleman, 2019-05-14 16:10:47 UTC
Would you mind clarifying some of these requirements?

> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)

Can you explain the purpose of this step? I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router. Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

I am implementing this requirement with a sleep before sending any signal to HAProxy: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR32

> 2. then start refusing new connections
> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)

I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

> 4. then exit with code zero

I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

> During a normal upgrade, the router must answer 100% of connections successfully.

Do you mean only that the router should seamlessly continue accepting new connections? We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

For reference, here are the relevant PRs to implement these requirements:

• https://github.com/openshift/router/pull/28 (add shell script for graceful shutdown)
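Taken together, the four steps above (sleep, soft-stop HAProxy, bounded drain wait, then kill stragglers) could be sketched as a shell function. This is only an illustration of the sequence as described in this comment; the variable names, defaults, and the use of pgrep are assumptions, not the actual code from openshift/router#28.

```shell
#!/bin/bash
# Illustrative sketch only; see openshift/router#28 for the real script.

graceful_shutdown() {
    drain_delay="${DRAIN_DELAY:-30}"    # step 1: time for kube-proxy/ELBs to drain
    max_shutdown="${MAX_SHUTDOWN:-60}"  # step 3: max time to wait for connections

    # 1. Wait for upstream load balancers to stop sending new traffic.
    sleep "$drain_delay"

    # 2/3. SIGUSR1 is HAProxy's graceful ("soft") stop: the process
    # refuses new connections and lets existing ones finish.
    pids=$(pgrep -x haproxy || true)
    if [ -n "$pids" ]; then
        kill -USR1 $pids
    fi

    # Give the processes up to max_shutdown seconds to drain and exit.
    waited=0
    while [ "$waited" -lt "$max_shutdown" ] && pgrep -x haproxy >/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done

    # 4. SIGTERM any stragglers; the real script must confirm they are
    # gone before exiting with code zero.
    pids=$(pgrep -x haproxy || true)
    if [ -n "$pids" ]; then
        kill -TERM $pids
    fi
    return 0
}
```

With no HAProxy processes running, the function falls straight through the drain wait and returns success, which mirrors the "exit with code zero" requirement.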
• https://github.com/openshift/cluster-ingress-operator/pull/240 (use the shutdown script, increase the termination grace period, and create a pod disruption budget)
• https://github.com/openshift/origin/pull/22852 (add an upgrade test)

I have manually tested the first two changes but have not gotten the upgrade test passing CI yet.

> 1. On delete, wait a fixed amount of time (enough time for other proxies to drain, usually 5-30s for Kube-proxy and 20-40s for ELB)
> Can you explain the purpose of this step? I am guessing that the intention is that some process will initiate both a drain of kube-proxy and ELBs and the graceful shutdown of the router. Am I guessing correctly? What is this process? Is this process outside of the scope of this Bugzilla report?

You have to wait for load balancers to stop sending new traffic to you before you stop accepting traffic. You have to wait longer than the max propagation time of endpoints. Talk to Stefan on the master team if you need more details.

> 3. then wait for connections to drain with a max time (we should already have this configured in haproxy)
> I am implementing these two requirements by inhibiting reloads and sending all HAProxy processes SIGUSR1, giving them a fixed amount of time to terminate: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR35

That is reasonable: the shutdown time should be proportional to the max wait for idle connections, if not identical.

> 4. then exit with code zero
> I assume that it is implied that we should kill any remaining HAProxy processes using SIGTERM: https://github.com/openshift/router/pull/28/files#diff-b1dd751246c919dac57cc6cd3628a57aR47

Before exiting, ensure all processes stop. Don't exit if those processes don't stop.

> During a normal upgrade, the router must answer 100% of connections successfully.
> Do you mean only that the router should seamlessly continue accepting new connections?
> We cannot feasibly preserve open, long-lived connections (such as WebSocket connections).

During an upgrade:

• 100% of new connection requests must succeed.
• 100% of requests that finish within max_shutdown should behave the same as they would against their upstream (no new failures injected).
• 100% of idle WebSocket/TLS/TCP connections must shut down gracefully.
• Connections that don't close within max_shutdown must close with the appropriate error for a proxy.

*** Bug 1710376 has been marked as a duplicate of this bug. ***

We've decided to fix this in 4.2.

Now that we've got a clearer picture of the solution, we've decided it's too risky to do in 4.2. We are moving it back to 4.3 and will focus on backporting a fix ASAP.

We're going to continue tracking this in https://jira.coreos.com/browse/NE-203.
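The first upgrade requirement ("100% of new connection requests must succeed") is the kind of thing the upgrade test PR measures. As a hypothetical standalone check, one could open a fresh connection through the router once per second and count failures; the function name and the use of curl here are assumptions, not the actual test code.

```shell
# Hypothetical probe: each curl invocation opens a brand-new connection,
# so any nonzero failure count means a new connection was refused or
# errored during the window.
count_new_connection_failures() {
    url="$1"            # any URL served through the router (assumed)
    duration="${2:-60}" # how many one-second probes to run
    failures=0
    i=0
    while [ "$i" -lt "$duration" ]; do
        curl -skf -o /dev/null --max-time 5 "$url" || failures=$((failures + 1))
        sleep 1
        i=$((i + 1))
    done
    echo "$failures"
}
```

Running this in a loop across the whole upgrade window and asserting the printed count is 0 would approximate the "100% of new connections succeed" check.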