Bug 1802534 - gcp-routes mechanism leads to EOF and/or i/o timeout on switch-over on GCP
Summary: gcp-routes mechanism leads to EOF and/or i/o timeout on switch-over on GCP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Importance: urgent / high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Casey Callendrello
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1800780
Depends On:
Blocks: 1822603 1843928
 
Reported: 2020-02-13 11:16 UTC by Stefan Schimanski
Modified: 2021-04-05 17:47 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Race condition between the node and the GCP load balancer.
Consequence: Sometimes, on upgrades, the OpenShift apiserver would be added back to the GCP load balancer despite not yet being able to serve traffic, because routes on the node were misconfigured.
Fix: Move route configuration to iptables, and differentiate between local and non-local traffic. Always accept non-local traffic.
Result: During apiserver upgrades, connections are gracefully terminated, and new connections are load-balanced only to running apiservers.
Clone Of:
Cloned to: 1822603
Environment:
Last Closed: 2020-07-13 17:14:44 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/machine-config-operator pull 1670 (closed): "Bug 1802534: gcp-routes: move to MCO, implement downfile, tweak timing" (last updated 2020-12-11 02:36:09 UTC)
- Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:15:14 UTC)

Comment 2 Christian Glombek 2020-02-21 00:26:33 UTC
The current shell script built into RHCOS that manages routes on GCP is unsatisfactory in many ways.
I'd appreciate a solution that builds on my PR to move that script's functionality into the MCO's gcp-routes-controller; see: https://github.com/openshift/machine-config-operator/pull/1489

Even better would be to find a way to get rid of that container altogether and manage GCP LB routes natively in a sane manner.

Comment 3 Dan Mace 2020-02-21 14:29:30 UTC
The API LB is created by the installer and manipulated indirectly by the gcp-routes-controller via the MCO, in support of the apiserver. So it's not really related to ingress as far as routes are concerned. I'm not sure SDN is the right home for this bug either, but SDN seems closer in this case. If somebody feels differently, please re-assign!

Comment 4 Juan Luis de Sousa-Valadas 2020-02-25 15:24:21 UTC
It's not really SDN either, because the SDN doesn't manage the internal load balancer; that would be the MCO.

Antonio, the problem is that when the gcp-routes service is stopped, any established connection that requires those routes is dropped. Connections must not be dropped as early as they are right now.

Regarding the health check, this is what you want to achieve in any implementation:
>  - at N+0: /readyz turns red
        At this point you must preserve established connections, but should stop accepting new ones and start monitoring /healthz.

>  - at N+70: kube-apiserver stops listening, /healthz is red
        At this point you must stop sending new connections, but must keep established connections, and wait 60 seconds to reach the next point.

>  - at N+70+60: kube-apiserver terminates at the latest.
        Keep in mind the 60 seconds here may actually be less.
        At this point the server isn't listening at all and there aren't any established connections to keep alive.

For GCP in particular (I'm not incredibly familiar with GCP or this health check), the following is based on Stefan's comment, the GCP docs (1), and some *non-exhaustive* checks on my own GCP cluster:

How to solve this is actually pretty complex due to GCP limitations (2). When /readyz stops working, you want to send all new connections to a different server, which seems to require stopping the gcp-routes service (2); but you don't want to drop existing connections, which requires keeping the gcp-routes service running.
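
For illustration only (this is not the fix that eventually shipped, and 10.0.0.2 stands in for the internal api-int VIP): netfilter can in principle distinguish the two cases by conntrack state, turning away only new connections while established flows keep working.

```
VIP=10.0.0.2   # placeholder for the internal LB address

# Reject only *new* TCP connections whose original destination was the internal
# LB VIP; packets that belong to flows already in the conntrack table are still
# accepted, so in-flight requests can finish while clients re-dial and
# (hopefully) land on another master.
iptables -I INPUT -p tcp --dport 6443 \
  -m conntrack --ctorigdst "$VIP" --ctstate NEW \
  -j REJECT --reject-with tcp-reset
```

Whether a rejected client actually lands on a different master depends on the GCP data path discussed in the rest of this bug, so this only sketches the netfilter side.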

I see in the GCP documentation (3) there is a method called instanceGroupManagers.abandonInstances, which may be good enough, except it requires connections to stop within 60 seconds and we'd ideally want 120; still, that's better than what we have right now. I haven't actually tested this, so it may not work.

I'm reassigning this to MCO because it's really an MCO component and I don't know the MCO flows well enough nor how the LBs are managed, but I'm happy to assist with any network-related queries. Although at this point we want someone who really has experience with GCP load balancers to assist.

References:
1- https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/abandonInstances
2- https://github.com/openshift/machine-config-operator/pull/1031#issue-303605505
3- https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/abandonInstances

Comment 5 Christian Glombek 2020-02-25 22:40:33 UTC
Thanks for the summary!
My PR that I linked above (https://github.com/openshift/machine-config-operator/pull/1489) doesn't yet work properly and isn't passing e2e tests.
Nonetheless, instead of fixing this in the shell script built into RHCOS, the preferable way forward is to move all of this functionality (including the requirements you laid out) into the gcp-routes-controller in the MCO, so it's independent of things built into the base OS (for OKD we use Fedora CoreOS as the base, which does not have the script at all).

If this is not an immediate priority, I will work this into my PR once I get around to tackling it again.

Comment 6 Abu Kashem 2020-02-26 14:35:25 UTC
We designed a test to reproduce the issue. We consistently see it on a GCP cluster; on the other hand, it does not occur on an AWS cluster.
For a detailed report please see - https://github.com/tkashem/graceful/blob/master/report/report.md

Comment 7 Steve Milner 2020-03-03 14:17:14 UTC
(In reply to Juan Luis de Sousa-Valadas from comment #4)
> I'm reassigning this to MCO because it's really an MCO component and I don't
> know the MCO flows well enough nor how the LBs are managed, but I'm happy to
> assist with any network-related queries. Although at this point we want
> someone who really has experience with GCP load balancers to assist.

As a side note, GCP routes are really not system configuration, upgrades (MCO), or operating system concerns. I understand that this functionality currently resides within the MCO, but it really should live where cloud configuration happens, or possibly where special cloud workarounds exist (e.g. an agent such as Afterburn).

Comment 8 Antonio Murdaca 2020-03-06 14:37:54 UTC
I'm quoting some notes from https://bugzilla.redhat.com/show_bug.cgi?id=1793592#c6

From meetings with folks on the MCO team, it's quite apparent that this functionality:

- Shouldn't be in the MCO
- Isn't something the team has a lot of knowledge or comfort around
- Isn't something that will likely be doable in a day

I believe that, for 4.4 at least, it's too late to try to place cloud configuration in a more expected place, but there are a few things we could do:

- Pass this off to a person or team who does have more background and comfort in this to fix it as is
- Move this off to 4.5 and fix the root issue (move it out of MCO to a better component that is comfortable with it)

As noted by Luca Bruno, the place this should actually happen is probably NetworkManager's nm-cloud-setup, _if_ it is to live at the OS level.


Stefan, what do you think?

Comment 9 Christian Glombek 2020-03-07 00:55:30 UTC
Some more context on the new proposed location of this functionality, nm-cloud-setup: https://lists.fedoraproject.org/archives/list/cloud@lists.fedoraproject.org/thread/FSQR6KL4KA37WHTUAXL774SXSWIBSYGI/#FSQR6KL4KA37WHTUAXL774SXSWIBSYGI

Comment 11 Clayton Coleman 2020-03-10 20:30:39 UTC
Broken connections, EOFs, and connection refused errors are unacceptable for the kube-apiserver in any context.

Comment 12 Clayton Coleman 2020-03-10 20:31:31 UTC
I'm going to treat this as the blocker for "GCP drops connections on graceful apiserver reload, which makes the apiserver reload not graceful". Moving to urgent; this will have to be backported or worked around.

Comment 13 Stefan Schimanski 2020-04-03 12:11:16 UTC
https://superuser.com/questions/510630/change-the-default-route-without-affecting-existing-tcp-connections might give an idea:

We can list existing connections (preferably not just by source IP, because that's not the goal when switching over) and do magic to keep them alive, while changing the route to the LB.
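
For example, something along these lines could enumerate the connections that must survive the switch-over (10.0.0.2 is a placeholder for the api-int VIP; conntrack-tools and iproute2 assumed):

```
# Established TCP flows whose destination is the internal LB VIP, i.e. the
# connections a route change must not disturb.
conntrack -L -p tcp --dst 10.0.0.2 --dport 6443 --state ESTABLISHED

# The same picture from the socket side.
ss -tn 'dst 10.0.0.2:6443'
```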

Comment 14 Stefan Schimanski 2020-04-03 12:14:32 UTC
And another hint https://serverfault.com/questions/828703/iptables-redirecting-established-connections:

  iptables -t nat -A PREROUTING -p tcp --dport 8080 -j REDIRECT --to-ports 8180

This redirects new connections (those not yet in the conntrack table) from one port to another. We want to redirect to another IP, not another port.
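
A hypothetical variant for "another IP, not port" (all addresses are placeholders, and this is not the rule set that was eventually merged): NAT rules are consulted only for the first packet of a connection, so flows already in the conntrack table keep their existing mapping while new ones are steered elsewhere.

```
# Steer new connections aimed at the internal VIP (10.0.0.2) to another
# master's apiserver (10.0.0.5) instead of the local one.
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.2 --dport 6443 \
  -j DNAT --to-destination 10.0.0.5:6443

# Locally generated traffic never traverses PREROUTING, so it needs the same
# rule in the nat OUTPUT chain.
iptables -t nat -A OUTPUT -p tcp -d 10.0.0.2 --dport 6443 \
  -j DNAT --to-destination 10.0.0.5:6443
```

The fix that eventually merged took a different shape (comments 27 and 30); this is only meant to illustrate the NAT mechanics.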

Comment 15 Ben Parees 2020-04-03 13:31:26 UTC
The technical release team is tracking this as a significant cause of GCP upgrade failures; it is a blocker for 4.4 at this point. Resetting target.

Comment 19 Casey Callendrello 2020-04-07 12:09:15 UTC
Ultimately, we should get some code similar to this into upstream nm-cloud-setup. While GCP suggests using local routing table manipulation, that doesn't degrade gracefully. Perhaps it can be an option: routing-based or conntrack-based redirection.

Comment 20 Steve Milner 2020-04-07 15:00:49 UTC
Discussion around nm-cloud-setup: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/407

Comment 24 Stefan Schimanski 2020-04-09 13:17:33 UTC
The mentioned PR was merged against master, not 4.4. Do we need a backport and a clone of this bug?

Comment 25 Steve Milner 2020-04-09 13:30:03 UTC
We do need it backported. Moving to 4.5 and cloning it back to 4.4.

Comment 26 Michael Nguyen 2020-04-14 00:34:15 UTC
Verified upgrade from 4.4 -> 4.5 and 4.5 -> 4.5 using clusterbot

test upgrade quay.io/openshift-release-dev/ocp-release:4.4.0-rc.7-x86_64 registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-024845 gcp

job started, you will be notified on completion

job <https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/585> succeeded


test upgrade registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-024845 registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-13-133703 gcp

job started, you will be notified on completion

job <https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/586> succeeded

Comment 27 Stefan Schimanski 2020-04-16 14:09:10 UTC
Found another issue with the internal GCP LB networking:

The gcp-routes service shuts down when /readyz of the local kube-apiserver goes red. New local connections won't be routed locally from that point on, but go to the other two masters (this is what we want). But external connections can still come in, because the Google LB also needs time to reconcile (by pinging /readyz). Hence, the Google network keeps sending traffic. But because the gcp-routes service is down, the iptables rules that were routing that traffic to the kube-apiserver instance (which keeps serving traffic for some time) via nat+REDIRECT are gone. Hence, those requests fail with connection refused.
In other words:
- gcp-routes has to route new local connections elsewhere (to the other two masters) as soon as /readyz goes red (this is what we have now)
- but keep receiving requests from outside the node (this is missing).
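
In netfilter terms, the split Stefan describes could look roughly like this (a sketch, not the merged gcp-routes.sh; 10.0.0.2 stands in for the api-int VIP): traffic arriving from the GCP LB enters via PREROUTING and should always be accepted, while locally originated traffic goes through OUTPUT and is the only part worth withdrawing when /readyz goes red.

```
VIP=10.0.0.2   # placeholder for the internal LB address
PORT=6443

# Non-local traffic: packets the GCP LB delivers to this node with the VIP as
# destination. Keep redirecting them to the local kube-apiserver no matter
# what, so draining connections aren't cut off before the LB health check
# catches up.
iptables -t nat -A PREROUTING -p tcp -d "$VIP" --dport "$PORT" \
  -j REDIRECT --to-ports "$PORT"

# Local traffic: connections originated on this node (kubelet, kube-proxy).
# This rule is the one to delete as soon as /readyz goes red, so new local
# connections head for the other masters instead.
iptables -t nat -A OUTPUT -p tcp -d "$VIP" --dport "$PORT" \
  -j REDIRECT --to-ports "$PORT"
```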

Comment 28 Abu Kashem 2020-04-16 14:55:54 UTC
More info:
gcp-routes-controller takes around 50 seconds (FailureThreshold=10 * Interval=5) before it stops the gcp-routes service.
https://github.com/openshift/machine-config-operator/blob/master/cmd/gcp-routes-controller/run.go#L87-L103

And the gcp-routes service processes routes every 30s - https://github.com/openshift/installer/pull/3067/files#diff-f3a509446e9615909e1407b8d19b3dcdR81.
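
To make the timing concrete, a stripped-down version of that threshold/interval logic (a shell sketch with a placeholder endpoint and marker path; the real controller is the Go code linked above): 10 failures at a 5-second interval is roughly 50 seconds before the controller acts, and the gcp-routes sync period can add up to 30 more.

```
READYZ=https://127.0.0.1:6443/readyz   # placeholder health endpoint
INTERVAL=5
THRESHOLD=10
failures=0
mkdir -p /run/gcp-routes                # hypothetical directory for the marker

while true; do
  if curl -kfs -o /dev/null --max-time 2 "$READYZ"; then
    failures=0
  else
    failures=$((failures + 1))
  fi
  if [ "$failures" -ge "$THRESHOLD" ]; then
    # ~THRESHOLD*INTERVAL = 50s have passed; this is a stand-in for stopping
    # the gcp-routes service (pre-fix) or writing the downfile (comment 30).
    # The gcp-routes sync loop may take up to another 30s to react.
    touch /run/gcp-routes/internal-vip.down   # hypothetical marker path
    failures=0
  fi
  sleep "$INTERVAL"
done
```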

Comment 29 Casey Callendrello 2020-04-21 10:08:57 UTC
(In reply to Stefan Schimanski from comment #27)
> In other words:
> - gcp-routes has to route new local connections elsewhere (to the other two
> masters) as soon as /readyz goes red (this is what we have now)
> - but keep receiving requests from outside the node (this is missing).

This is possible but not easy; what if we just tighten the timing and stop accepting new connections (while preserving existing ones) after 20 seconds of /readyz failing?

Comment 30 Casey Callendrello 2020-04-21 10:40:36 UTC
Moving forward with my original idea: add "downfile" support to gcp-routes.sh, so we can mark only the internal LB VIP as down. Also, tighten the timings a bit.

Where this leaves us:

1. Connections to the external API LB won't be disrupted, because we won't remove that VIP
2. Connections internally to the service IP won't be affected (and never were)
3. Only kubelet and kube-proxy will be affected, and they tolerate reconnections

PRs:
https://github.com/openshift/machine-config-operator/pull/1670
https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/899
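
Roughly what the downfile idea amounts to on the gcp-routes.sh side, as a sketch only (the paths, addresses, and exact rules are placeholders, not the code in the PRs above): every sync, keep the non-local rules in place and toggle only the internal-VIP rule for locally originated traffic.

```
DOWNFILE=/run/gcp-routes/internal-vip.down   # hypothetical marker written by gcp-routes-controller
VIP=10.0.0.2                                 # placeholder internal api-int LB address
PORT=6443

ensure() { iptables -t nat -C "$@" 2>/dev/null || iptables -t nat -A "$@"; }
remove() { iptables -t nat -D "$@" 2>/dev/null || true; }

while true; do
  # Traffic arriving from the GCP LB is always accepted (points 1 and 2 above).
  ensure PREROUTING -p tcp -d "$VIP" --dport "$PORT" -j REDIRECT --to-ports "$PORT"

  if [ -e "$DOWNFILE" ]; then
    # Marked down: stop steering *new* local connections at this node's
    # apiserver; established flows keep their existing conntrack mapping.
    remove OUTPUT -p tcp -d "$VIP" --dport "$PORT" -j REDIRECT --to-ports "$PORT"
  else
    ensure OUTPUT -p tcp -d "$VIP" --dport "$PORT" -j REDIRECT --to-ports "$PORT"
  fi
  sleep 30
done
```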

Comment 31 Kirsten Garrison 2020-04-21 23:53:57 UTC
Spoke with Ryan, and he thinks that the problems we are seeing in MCO CI (which is currently blocked on the gcp-op job) are the result of this bug...

See: https://bugzilla.redhat.com/show_bug.cgi?id=1826463

Which might block the linked fix, since it is hitting the issue I wrote about in the above BZ...

Comment 32 Kirsten Garrison 2020-04-22 00:31:41 UTC
I'm told that one symptom of this is:
```
E0418 13:23:26.401722       1 leaderelection.go:331] error retrieving resource lock openshift-sdn/openshift-network-controller: Get https://api-int.ci-op-kjv25557-1354f.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-sdn/configmaps/openshift-network-controller: dial tcp 10.0.0.2:6443: i/o timeout
```

But since that's not in the BZ title, this bug is impossible to find. If no one objects, can we update the title to include it?

Comment 33 Casey Callendrello 2020-04-22 12:09:14 UTC
(In reply to Kirsten Garrison from comment #32)
> I'm told that one symptom of this is:
...

It *can* be, but that also happens any time the API server falls over. It's not a sufficient error message.

Remember, this only affects openshift-sdn, kubelet, and multus. Everyone else talks to the service IP, which is unaffected.

Comment 34 Stefan Schimanski 2020-04-22 15:11:03 UTC
> It *can* be, but that also happens any time the API server falls over. It's not a sufficient error message.

Note that it is a "dial tcp: i/o timeout". So if everything works as expected, this should happen neither on the service network IP nor on the external load balancer. But as Casey wrote, if you see this when speaking to something *other* than the *internal* LB, it is not this BZ.

Comment 35 Lalatendu Mohanty 2020-04-24 09:37:19 UTC
As discussed with Mrunal Patel, this bug is required for https://bugzilla.redhat.com/show_bug.cgi?id=1826329, as it fixes the main underlying issue. Hence marking this as an upgrade blocker.

Comment 37 Antonio Murdaca 2020-04-29 14:14:59 UTC
*** Bug 1800780 has been marked as a duplicate of this bug. ***

Comment 42 Scott Dodson 2020-06-08 14:31:01 UTC
Michael, do we have a timeline for when we'd be able to verify this bug? I'd really like to get this backported to 4.4.z this week, which requires that we verify it in 4.5.

Comment 43 Abu Kashem 2020-06-08 14:35:44 UTC
Hi sdodson,
We are currently running a series of tests and hope to have results out soon.

Comment 44 Steve Milner 2020-06-08 14:41:25 UTC
(In reply to Abu Kashem from comment #43)
> Hi sdodson,
> We are currently running a series of tests and hope to have results out
> soon.

Perfect, thank you!

Comment 45 Scott Dodson 2020-06-08 14:44:14 UTC
Thanks Abu. I'm moving this over to kube-apiserver but leaving it assigned to Casey. While the change was made in the MCO, it mostly affects the kube-apiserver, and the teams associated with the kube-apiserver are best equipped to verify the fix.

Comment 47 Abu Kashem 2020-06-10 14:10:46 UTC
Hi kewang,
I have a couple of suggestions:
- can you run the test longer than the entire kube-apiserver rollout (all 3 master nodes)?
- run the test on all master nodes
- remove the 'sleep 1' wait

I have been doing some testing and am seeing the following errors during the rollout window:
- write tcp 10.0.0.6:52950->10.0.0.2:6443: write: broken pipe
- unexpected EOF

I have been capturing the results here - https://github.com/tkashem/graceful/blob/master/gcp-route-fix-test/report.md

Comment 48 Ke Wang 2020-06-10 14:53:15 UTC
Hi akashem, no problem, I will run a longer test tomorrow. The 'sleep 1' was added for counting the time; I will try without it.

Comment 49 Abu Kashem 2020-06-10 14:53:56 UTC
Hi kewang,
Also, can we replace /readyz and /healthz with something like `oc get configmaps --all-namespaces -o yaml`?

Thanks!
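
Something like the following would cover both suggestions (a sketch only; run it from all three masters for longer than a full rollout, with no sleep between iterations):

```
# Hammer a real list call for the duration of the kube-apiserver rollout and
# log every failure with a timestamp, so errors can be correlated with the
# rollout window.
while true; do
  if ! oc get configmaps --all-namespaces -o yaml > /dev/null 2> /tmp/err.txt; then
    echo "$(date -u +%FT%TZ) request failed: $(tr '\n' ' ' < /tmp/err.txt)"
  fi
done
```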

Comment 52 errata-xmlrpc 2020-07-13 17:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 53 W. Trevor King 2021-04-05 17:47:24 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

