+++ This bug was initially created as a clone of Bug #1709958 +++

The router needs shutdown logic similar to the apiserver's, roughly:

1. On delete, wait a fixed amount of time (enough for other proxies to drain, usually 5-30s for kube-proxy and 20-40s for ELB)
2. then start refusing new connections
3. then wait for connections to drain, up to a maximum time (we should already have this configured in haproxy)
4. then exit with code zero

If a second TERM or INT comes in, shut down immediately. The grace period for the pod needs to be set longer than the combined time for steps 1 and 3.

This needs to be fixed in 4.1.Z, but can miss 4.1.0.

We need an e2e upgrade test that verifies that the router continues to serve traffic via the ELB without interruption during a node upgrade. We should also verify that the router has a disruption budget that prevents all of its replicas from being taken down at once.

This was an oversight when reviewing the product for upgrade tests. Fortunately, the use of node ports should limit the impact to about a second on unloaded clusters (which is why this is a 4.1.Z candidate), and we should be able to fix this before customers begin running high loads.

During a normal upgrade, the router must answer 100% of connections successfully.
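For illustration only, the four-step sequence and the grace-period constraint above could be sketched as follows. This is not the router's actual code: `RouterShutdown`, `min_grace_period`, the `stop_accepting` and `active_connections` callbacks, the 5s safety margin, and the non-zero exit code on a second signal are all hypothetical names and assumptions introduced for this sketch.

```python
import threading
import time
from typing import Callable, List


def min_grace_period(pre_stop_delay: float, max_drain: float,
                     margin: float = 5.0) -> float:
    """Pod grace period must exceed step 1 + step 3; margin is an assumption."""
    return pre_stop_delay + max_drain + margin


class RouterShutdown:
    """Hypothetical sketch of the delayed, draining shutdown sequence."""

    def __init__(self, pre_stop_delay: float, max_drain: float,
                 stop_accepting: Callable[[], None],
                 active_connections: Callable[[], int],
                 poll_interval: float = 0.05) -> None:
        self.pre_stop_delay = pre_stop_delay
        self.max_drain = max_drain
        self.stop_accepting = stop_accepting
        self.active_connections = active_connections
        self.poll_interval = poll_interval
        self.force = threading.Event()  # set when a second TERM/INT arrives
        self.phases: List[str] = []

    def second_signal(self) -> None:
        """Call from the signal handler when a second TERM or INT arrives."""
        self.force.set()

    def run(self) -> int:
        """Run after the first TERM/INT; returns the process exit code."""
        # Step 1: keep serving while upstream proxies drain.
        self.phases.append("waiting")
        if self.force.wait(self.pre_stop_delay):
            self.phases.append("forced")
            return 1  # assumption: the report does not specify this code

        # Step 2: start refusing new connections.
        self.stop_accepting()
        self.phases.append("refusing")

        # Step 3: wait for existing connections to drain, up to max_drain.
        deadline = time.monotonic() + self.max_drain
        while self.active_connections() > 0 and time.monotonic() < deadline:
            if self.force.wait(self.poll_interval):
                self.phases.append("forced")
                return 1

        # Step 4: exit with code zero.
        self.phases.append("exiting")
        return 0
```

With a 30s initial delay and a 60s drain limit, `min_grace_period(30, 60)` gives 95 seconds under the assumed 5s margin, so the pod's termination grace period would need to be at least that long.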
Closing this duplicate. *** This bug has been marked as a duplicate of bug 1709958 ***