The router, console, and oauth endpoints should all gracefully terminate when their pods are marked for deletion, without dropping traffic. Console and oauth can use simple "wait before shutdown" logic because they do not execute long-running transactions. The router (a service load balancer) needs to wait longer, then instruct HAProxy to terminate gracefully, wait up to a limit, and then shut down. In combination, these fixes will ensure end users see no disruption to the control plane, the web console, or their frontend web applications during upgrade.
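For illustration, a minimal Go sketch of the "wait before shutdown" pattern described above; the port, the 30-second delay, and the drain timeout are assumed values, and this is not the actual console or oauth implementation.

    package main

    import (
    	"context"
    	"net/http"
    	"os"
    	"os/signal"
    	"syscall"
    	"time"
    )

    func main() {
    	srv := &http.Server{Addr: ":8443"}

    	go func() {
    		sigs := make(chan os.Signal, 1)
    		signal.Notify(sigs, syscall.SIGTERM)
    		<-sigs // pod has been marked for deletion

    		// Keep serving while load balancers and kube-proxy stop routing
    		// new connections to this pod.
    		time.Sleep(30 * time.Second)

    		// Then stop accepting new connections and drain in-flight requests.
    		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    		defer cancel()
    		_ = srv.Shutdown(ctx)
    	}()

    	_ = srv.ListenAndServe()
    }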
Oops, that was accidental.
Upgrading from 4.5.0-0.nightly-2020-03-20-200807 to 4.5.0-0.nightly-2020-03-23-213917, the console is not reachable for about 60 seconds. Curling the console route shows:
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-upg807.example.com:443
*** Bug 1809742 has been marked as a duplicate of this bug. ***
Upgraded from 4.5.0-0.nightly-2020-04-21-103613 to 4.5.0-0.nightly-2020-04-25-17044 while continuously accessing the console and auth routes in another window; connection problems were still seen during the upgrade, lasting about 50 seconds.
Sun 26 Apr 2020 05:29:34 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received 000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
Sun 26 Apr 2020 05:29:43 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
Sun 26 Apr 2020 05:29:46 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 000 403
Sun 26 Apr 2020 05:29:50 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
Sun 26 Apr 2020 05:29:53 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 000 403
Sun 26 Apr 2020 05:29:57 PM CST
200
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
Sun 26 Apr 2020 05:30:00 PM CST
200 403
Sun 26 Apr 2020 05:30:04 PM CST
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com:443 000
Sun 26 Apr 2020 05:30:07 PM CST
200 403
Sun 26 Apr 2020 05:30:11 PM CST
curl: (28) Operation timed out after 5000 milliseconds with 0 out of 0 bytes received 000 403
Sun 26 Apr 2020 05:30:21 PM CST
200 403
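The log above comes from repeatedly curling the two routes. For anyone who wants to reproduce the measurement, here is a rough Go equivalent of that probe loop; the 3-second interval and the TLS-verification skip are assumptions, only the URLs and the 5-second timeout come from the output above.

    package main

    import (
    	"crypto/tls"
    	"fmt"
    	"net/http"
    	"time"
    )

    func main() {
    	routes := []string{
    		"https://console-openshift-console.apps.hongli-45bv.qe.devcluster.openshift.com",
    		"https://oauth-openshift.apps.hongli-45bv.qe.devcluster.openshift.com",
    	}
    	client := &http.Client{
    		Timeout: 5 * time.Second, // matches curl's 5000 ms timeout above
    		Transport: &http.Transport{
    			// Skipping verification like `curl -k`; an assumption about
    			// how the routes were probed.
    			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    		},
    	}
    	for {
    		fmt.Println(time.Now().Format(time.UnixDate))
    		for _, url := range routes {
    			resp, err := client.Get(url)
    			if err != nil {
    				// Connection resets and timeouts show up here during the upgrade.
    				fmt.Println(err)
    				continue
    			}
    			resp.Body.Close()
    			fmt.Println(resp.StatusCode) // e.g. 200 or 403, as in the log above
    		}
    		time.Sleep(3 * time.Second)
    	}
    }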
Causes the "Cluster frontend ingress remain available" test to fail.
*** Bug 1828856 has been marked as a duplicate of this bug. ***
The current status of this issue is that router traffic is blackholed when it moves from the cloud service load balancer to the node after the router pod has started terminating but before it finishes. This happens because we use the Local external traffic policy to preserve the source IP of incoming packets, but in that mode kube-proxy removes the local endpoint from iptables too aggressively. The proposed fix, being discussed upstream, is to keep the terminating pod in the local endpoint list until it actually stops, rather than blackholing the traffic immediately. The fix is still being explored to ensure it does not introduce regressions or undesirable behavior, but we have some testing in 4.5 indicating that it will correct the bulk of the unavailability.
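As a rough illustration of the behavior the upstream proposal describes, here is a simplified Go sketch of endpoint selection for a Local-traffic-policy Service; the types and function are hypothetical, not kube-proxy's actual code.

    package main

    import "fmt"

    // endpoint is a simplified stand-in for a kube-proxy endpoint entry.
    type endpoint struct {
    	ip          string
    	ready       bool // passing readiness and not terminating
    	terminating bool // backing pod has a deletion timestamp
    	serving     bool // still answering readiness probes while terminating
    }

    // localEndpoints sketches endpoint selection for a Service with
    // externalTrafficPolicy: Local on one node. Today only ready endpoints
    // are programmed, so a node whose only router pod is terminating
    // blackholes traffic. The proposed fix keeps terminating-but-serving
    // endpoints as a fallback until the pod actually stops.
    func localEndpoints(all []endpoint) []endpoint {
    	var ready, terminatingServing []endpoint
    	for _, ep := range all {
    		switch {
    		case ep.ready:
    			ready = append(ready, ep)
    		case ep.terminating && ep.serving:
    			terminatingServing = append(terminatingServing, ep)
    		}
    	}
    	if len(ready) > 0 {
    		return ready
    	}
    	return terminatingServing // rather than returning nothing (blackhole)
    }

    func main() {
    	// The only endpoint on this node is a router pod that is shutting
    	// down but still serving: traffic keeps flowing instead of dropping.
    	eps := localEndpoints([]endpoint{{ip: "10.128.2.7", terminating: true, serving: true}})
    	fmt.Println(eps)
    }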
Following are the URLs for Andrew Sy Kim's KEP and WIP PR implementing the proposed fix that Clayton mentions in comment 14:
https://github.com/kubernetes/enhancements/pull/1607
https://github.com/kubernetes/kubernetes/pull/89780
*** Bug 1819147 has been marked as a duplicate of this bug. ***
Taking a stab at an impact statement:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
All clusters. Longstanding kube-core issue; not a regression. If edges are blocked, the goal will be to route users to fixed releases as quickly as possible, after which subsequent updates will proceed more smoothly.

What is the impact? Is it serious enough to warrant blocking edges?
Up to several minutes of disruption in router-fronted services, including OAuth, the web console, and user Routes, whenever a node holding a router pod reboots or the router pod is otherwise terminated. Because it is not a regression, it is not serious enough to warrant blocking edges yet. Once we have fixed releases, we may want to prune edges among unfixed releases to route users towards fixed releases more quickly.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Issue resolves itself after a few minutes, after load balancers feeding the old node realize that the router pod is dead.

Does that sound reasonable? If I'm missing or misunderstanding anything, can we get an impact statement from someone closer to the problem?
*** Bug 1818104 has been marked as a duplicate of this bug. ***
*** Bug 1812387 has been marked as a duplicate of this bug. ***
We are continuing to track the upstream fix (https://github.com/kubernetes/kubernetes/pull/89780) in hopes of shipping it in 4.5.
(In reply to W. Trevor King from comment #17)
> Does that sound reasonable? If I'm missing or misunderstanding anything,
> can we get an impact statement from someone closer to the problem?

The impact statement looks good.
(In reply to Miciah Dashiel Butler Masters from comment #20) To expand on that: we are continuing to track kubernetes#89780 in hopes of shipping it in 4.5, on top of the other improvements linked to this Bugzilla report.
Moving to 4.6 as not a release blocker or an upgrade blocker.
In the upcoming sprint, we'll continue to track https://github.com/kubernetes/kubernetes/pull/89780 and any other relevant PRs that are likely to mitigate downtime during upgrades.
We'll continue tracking the upstream work in the upcoming sprint.
We're continuing to track the upstream work.
*** Bug 1868485 has been marked as a duplicate of this bug. ***
*** Bug 1868486 has been marked as a duplicate of this bug. ***
We're tracking https://github.com/kubernetes/kubernetes/pull/89780 in https://issues.redhat.com/browse/NE-348.
Reopening and re-titling to ensure we have a tracking bz for build watchers.
The failure rate for this test remains high and a tracking bz will ensure build watchers have traceability for the problem: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job&search=Cluster+frontend+ingress+remain+available
Reverting those changes; I realized that sippy wants a 4.5 bz. I'll file one separately and reference this one.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days