The current shell script built into RHCOS that manages routes on GCP is in many ways unsatisfying. I'd appreciate a solution that builds upon my PR to move that script's functionality into MCO's gcp-routes-controller, see: https://github.com/openshift/machine-config-operator/pull/1489 Better still would be to find a way to get rid of that container altogether and manage GCP LB routes natively in a sane manner.
The API LB is created by the installer and manipulated indirectly with gcp-routes-controller via the MCO, in support of the apiserver. So it's not really directly related to ingress as it pertains to routes. I'm not sure if SDN is really the right home for the bug, but SDN seems closer in this case. If somebody feels differently, please re-assign!
It's not really SDN, because the SDN doesn't manage the internal load balancer; that would be MCO.

Antonio, the problem is that when the gcp-routes service is stopped, any established connection that requires those routes is dropped. And connections must not be dropped as early as they are being dropped right now.

Regarding the health check, this is what you want to achieve in any implementation:

> - at N+0: /readyz turns red

At this point you must preserve established connections but should stop accepting new ones and start monitoring /healthz.

> - at N+70: kube-apiserver stops listening, /healthz is red

At this point you must stop sending new connections but must keep established connections, and wait 60 seconds to reach the next point.

> - at N+70+60: kube-apiserver terminates at the latest.

Keep in mind the 60 seconds here may actually be less. At this point the server isn't listening at all and there aren't any established connections to keep alive. (A rough sketch of this sequencing is included after the references below.)

For GCP in particular (although I'm not incredibly familiar with GCP or this health check), based on Stefan's comment, the GCP docs (1), and some *non-exhaustive* checks on my own GCP cluster: how to solve this is actually pretty complex due to GCP limitations (2). When /readyz stops working, you want to send all new connections to a different server, and this seems to require the gcp-routes service to stop (2), but you don't want to drop existing connections, and that requires the gcp-routes service to keep running.

I see in the GCP documentation (3) there is a method called instanceGroupManagers.abandonInstances (3) which may be good enough, except it requires connections to stop within up to 60 seconds and we'd ideally want 120, but it's still better than what we have right now. I haven't actually tested this, so it may not work.

I'm reassigning this to MCO because it's really an MCO component and I don't know the MCO flows well enough nor how the LBs are managed, but I'm happy to assist with any network-related queries. Although at this point we want someone who really has experience with GCP load balancers to assist.

References:
1- https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/abandonInstances
2- https://github.com/openshift/machine-config-operator/pull/1031#issue-303605505
3- https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/abandonInstances
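To make the health-check sequencing above concrete, here is a minimal sketch. It is not the actual gcp-routes-controller logic; the endpoint, port, and the stop_new_connections hook are assumptions for illustration only.

```bash
#!/bin/bash
# Sketch of the health-check driven sequence described above. The endpoint,
# port and the hook function are assumptions, not real implementation details.
API="https://localhost:6443"

stop_new_connections() {
  # Placeholder (assumption): e.g. withdraw the LB VIP for NEW flows only,
  # leaving established flows untouched.
  :
}

# Phase 1 (N+0): wait until /readyz goes red, then stop accepting new
# connections while preserving established ones.
while curl -ks --fail "${API}/readyz" > /dev/null; do sleep 1; done
stop_new_connections

# Phase 2 (N+70): wait until /healthz goes red, i.e. the kube-apiserver has
# stopped listening; established connections are still draining.
while curl -ks --fail "${API}/healthz" > /dev/null; do sleep 1; done

# Phase 3 (up to N+70+60): give draining connections up to 60s before the
# kube-apiserver terminates; after that there is nothing left to preserve.
sleep 60
```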
Thanks for the summary! My PR that I linked above (https://github.com/openshift/machine-config-operator/pull/1489) doesn't yet work properly and isn't passing e2e tests. Nonetheless, instead of fixing this in the shell script built into RHCOS, the preferable way forward is to move all of this functionality (including the requirements you laid out) into the gcp-routes-controller in MCO, so it's independent of things built into the base OS (for OKD we use Fedora CoreOS as the base, which does not have the script at all). If this is not an immediate priority, I will work this into my PR once I get around to tackling it again.
We designed a test to reproduce the issue. We consistently see the issue on a GCP cluster; on the other hand, the issue does not exist on an AWS cluster. For a detailed report please see https://github.com/tkashem/graceful/blob/master/report/report.md
(In reply to Juan Luis de Sousa-Valadas from comment #4) > I'm reassigning this to MCO because it's really an MCO component and I don't > know the MCO flows well enough nor how the LBs are managed, but I'm happy to > assist with any network-related queries. Although at this point we want > someone who really has experience with GCP load balancers to assist. As a side note, GCP routes are really not system configuration, nor upgrade (MCO), nor operating system concerns. I understand that this functionality currently resides within the MCO, but the reality is it should live where cloud configuration occurs OR possibly where special cloud workarounds exist (e.g.: an agent -- afterburn).
I'm quoting some notes from https://bugzilla.redhat.com/show_bug.cgi?id=1793592#c6 In meetings with folks on the MCO team it's quite apparent that this functionality: - Shouldn't be in the MCO - Isn't something the team has a lot of knowledge or comfort around - Isn't something that will likely be doable in a day I believe that, for 4.4 at least, it's too late to try to place cloud configuration in a more expected place, but there are a few things we could do: - Pass this off to a person or team who does have more background and comfort in this area to fix it as is - Move this off to 4.5 and fix the root issue (move it out of MCO to a better component that is comfortable with it) As noted by Luca Bruno, the actual place this should happen is probably NetworkManager's nm-cloud-setup _if_ it is to be at the OS level. Stefan, what do you think?
Some more context on the new proposed location of this functionality, nm-cloud-setup: https://lists.fedoraproject.org/archives/list/cloud@lists.fedoraproject.org/thread/FSQR6KL4KA37WHTUAXL774SXSWIBSYGI/#FSQR6KL4KA37WHTUAXL774SXSWIBSYGI
Broken connections, EOFs, and connection-refused errors are unacceptable for the kube-apiserver in any context.
I'm going to treat this as the blocker to "GCP drops connections on graceful apiserver reload, which makes apiserver reload not graceful". Moving to urgent; this will have to be backported or worked around.
https://superuser.com/questions/510630/change-the-default-route-without-affecting-existing-tcp-connections might give an idea: we can list existing connections (preferably not just by source IP, because that's not the goal when switching over) and do magic to keep them alive, while changing the route to the LB.
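For the "list existing connections" part, a rough sketch, assuming conntrack-tools is installed and 10.0.0.2:6443 is the internal LB VIP (both assumptions):

```bash
# Dump the flows that must survive the route change: established TCP
# connections to the (assumed) internal LB VIP on the apiserver port.
conntrack -L -p tcp --dst 10.0.0.2 --dport 6443 --state ESTABLISHED
```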
And another hint https://serverfault.com/questions/828703/iptables-redirecting-established-connections:

iptables -t nat -A PREROUTING -p tcp --dport 8080 -j REDIRECT --to-ports 8180

This redirects new connections (those not yet in the conntrack table) from one port to another. We want to redirect to another IP, not another port.
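A hedged sketch of the IP (rather than port) variant, with made-up addresses: since NAT rules in PREROUTING are only evaluated for the first packet of a connection, a DNAT rule like this only affects new flows, and connections already in the conntrack table keep their original destination.

```bash
# Send NEW connections for the (assumed) VIP 10.0.0.2:6443 to another master
# at 10.0.0.3; existing flows keep their cached conntrack NAT decision and
# are not rewritten.
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.2 --dport 6443 \
  -j DNAT --to-destination 10.0.0.3:6443
```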
The technical release team is tracking this as a significant cause of GCP upgrade failures; it is a blocker for 4.4 at this point. Resetting target.
Ultimately, we should get some code similar to this into upstream nm-cloud-setup. While GCP suggests using local routing-table manipulation, that approach doesn't degrade gracefully. Perhaps it can be an option - routing or conntrack-based redirection.
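For comparison, a minimal sketch of the local routing-table manipulation mentioned here (the VIP is an assumption); this is the variant that does not degrade gracefully, since withdrawing the route cuts off existing flows as well:

```bash
# Answer traffic for the (assumed) internal LB VIP locally via a local route.
ip route add local 10.0.0.2/32 dev lo

# Withdrawing it stops new AND existing flows at once, which is exactly the
# abrupt behaviour a conntrack-based redirection would avoid.
ip route del local 10.0.0.2/32 dev lo
```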
Discussion around nm-cloud-setup: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/407
The mentioned PR was merged against master, not 4.4. Do we need a backport and a clone of this bug?
We do need it backported. Moving to 4.5 and cloning it back to 4.4.
Verified upgrades from 4.4 -> 4.5 and 4.5 -> 4.5 using clusterbot:

test upgrade quay.io/openshift-release-dev/ocp-release:4.4.0-rc.7-x86_64 registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-024845 gcp
job succeeded: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/585

test upgrade registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-024845 registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-133703 gcp
job succeeded: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/586
Found another issue with the internal GCP LB networking:

The gcp-routes service shuts down when /readyz of the local kube-apiserver goes red. New local connections won't be routed locally from that point on, but go to the other two masters (this is what we want). But external connections can still come in, because the Google LB also needs time to reconcile (by pinging /readyz), so the Google network keeps sending traffic. And because the gcp-routes service is down, the iptables rules are gone which were routing that traffic to the kube-apiserver instance (which keeps serving traffic for some time) via nat+REDIRECT. Hence, those requests fail with connection refused.

In other words:
- gcp-routes has to route new local connections elsewhere (to the other two masters) as soon as /readyz goes red (this is what we have now)
- but it has to keep receiving requests from outside the node (this is missing).
More info: gcp-routes-controller takes around 50 seconds (FailureThreshold=10 * Interval=5s) before it stops the gcp-routes service: https://github.com/openshift/machine-config-operator/blob/master/cmd/gcp-routes-controller/run.go#L87-L103 And the gcp-routes service processes routes every 30s: https://github.com/openshift/installer/pull/3067/files#diff-f3a509446e9615909e1407b8d19b3dcdR81.
(In reply to Stefan Schimanski from comment #27) > In other words: > - gcp-routes has to route new local connections elsewhere (to the other two > masters) as soon as /readyz goes red (this is what we have now) > - but it has to keep receiving requests from outside the node (this is missing). This is possible but not easy; what if we instead just tighten the timing and stop accepting new connections (while preserving existing ones) after 20 seconds of /readyz failure?
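One way to approximate "stop accepting new connections while preserving existing ones" with plain iptables, as a hedged sketch (VIP and port are assumptions, and the interaction with the existing nat REDIRECT rules would need care):

```bash
# Reject only NEW connections to the (assumed) VIP; flows already present in
# the conntrack table remain ESTABLISHED and keep being delivered.
iptables -I INPUT -d 10.0.0.2 -p tcp --dport 6443 \
  -m conntrack --ctstate NEW -j REJECT --reject-with tcp-reset
```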
Going forward with my original idea: add "downfile" support to gcp-routes.sh, so we can mark only the internal LB VIP as down. Also, tighten the timings a bit (a rough sketch of the downfile idea follows after the PR links). Where this leaves us: 1. Connections to the external API LB won't be disrupted, because we won't remove that VIP. 2. Connections internally to the service IP won't be affected (and never were). 3. Only the kubelet and kube-proxy will be affected, and they tolerate reconnecting. PRs: https://github.com/openshift/machine-config-operator/pull/1670 https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/899
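A rough sketch of what the downfile check could look like inside gcp-routes.sh; the file path, VIP, and exact iptables rules below are assumptions for illustration, not the merged implementation:

```bash
#!/bin/bash
# Hypothetical downfile handling: mark only the internal LB VIP as down while
# leaving the external API VIP rules alone.
INTERNAL_VIP="10.0.0.2"                          # assumed internal API LB VIP
DOWNFILE="/run/gcp-routes/${INTERNAL_VIP}.down"  # assumed marker file path

rule=(PREROUTING -d "${INTERNAL_VIP}" -p tcp --dport 6443 -j REDIRECT --to-ports 6443)

if [[ -e "${DOWNFILE}" ]]; then
  # VIP marked down: stop answering it locally so NEW connections go to the
  # other masters; established flows keep their existing conntrack NAT decision.
  iptables -t nat -D "${rule[@]}" 2>/dev/null || true
else
  # VIP up: make sure the redirect exists so local traffic to the VIP reaches
  # this node's kube-apiserver.
  iptables -t nat -C "${rule[@]}" 2>/dev/null || iptables -t nat -A "${rule[@]}"
fi
```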
Spoke with Ryan and he thinks that the problems we are seeing in MCO CI (which is currently blocked on the gcp-op job) are the result of this bug. See: https://bugzilla.redhat.com/show_bug.cgi?id=1826463 Which might block the linked fix, since it is hitting the issue I wrote about in the above BZ...
I'm told that one symptom of this is:
```
E0418 13:23:26.401722 1 leaderelection.go:331] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ci-op-kjv25557-1354f.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.0.2:6443: i/o timeout
```
but since that's not in the BZ title, this bug is impossible to find. If nobody objects, can we update the title of this bug to include it?
(In reply to Kirsten Garrison from comment #32) > I'm told that one symptom of this is: ... It *can* be, but that also happens any time the API server falls over. It's not a sufficient error message. Remember, this only affects openshift-sdn, kubelet, and multus. Everyone else talks to the service IP, which is unaffected.
> It *can* be, but that also happens any time the API server falls over. It's not a sufficient error message. Note that it is a "dial tcp: i/o timeout". So if everything works as expected, this should neither happen on the service network IP, nor on the external load balancer. But as Casey wrote, if you see this when speaking *not* to the *internal* LB, it is not this BZ.
As discussed with Mrunal Patel this bug is required for https://bugzilla.redhat.com/show_bug.cgi?id=1826329 as it fixes the main underlying issues. Hence marking this as upgrade blocker.
*** Bug 1800780 has been marked as a duplicate of this bug. ***
Michael, do we have a timeline for when we'd be able to verify this bug? I'd really like to get this backported to 4.4.z this week, which requires that we verify it in 4.5.
Hi sdodson, we are currently running a series of tests and will hopefully have a result out soon.
(In reply to Abu Kashem from comment #43) > Hi sdodson, > we are currently running a series of tests and hopefully have a result out > soon. Perfect, thank you!
Thanks Abu, I'm moving this over to kube-apiserver but leaving it assigned to Casey. While the change was made in the MCO this mostly affects kube-apiserver and teams associated with kube-apiserver are best equipped to verify the fix.
Hi kewang, I have a couple of suggestions:
- can you run the test for longer than the entire kube-apiserver rollout (all 3 master nodes)?
- run the test on all master nodes
- remove the 'sleep 1' wait

I have been doing some testing and am seeing the following errors during the rollout window:
- write tcp 10.0.0.6:52950->10.0.0.2:6443: write: broken pipe
- unexpected EOF

I have been capturing the results here: https://github.com/tkashem/graceful/blob/master/gcp-route-fix-test/report.md
Hi akashem, no problem, I will run a longer test tomorrow. I added 'sleep 1' to keep track of elapsed time; I will try without it.
Hi kewang, also, can we replace /readyz and /healthz with something like `oc get configmaps --all-namespaces -o yaml`? Thanks!
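For concreteness, a sketch of what such a client loop might look like (no 'sleep 1', running for the whole rollout on each master; the log path is an assumption):

```bash
# Hammer the apiserver continuously during the rollout and record any failure,
# including connection-level errors such as broken pipe or unexpected EOF.
while true; do
  if ! oc get configmaps --all-namespaces -o yaml > /dev/null 2>> /tmp/api-errors.log; then
    echo "$(date -Is) request failed" >> /tmp/api-errors.log
  fi
done
```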
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475