Bug 2015512 - disruption to load-balancer new connections on gcp-ovn
Summary: disruption to load-balancer new connections on gcp-ovn
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: jamo luhrsen
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 2015504 (view as bug list)
Depends On:
Blocks: 1929396 2024313
 
Reported: 2021-10-19 12:00 UTC by Devan Goodwin
Modified: 2023-09-15 01:16 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 2024313 (view as bug list)
Environment:
Last Closed: 2022-01-07 05:11:13 UTC
Target Upstream Version:
Embargoed:



Description Devan Goodwin 2021-10-19 12:00:00 UTC
Mean disruption for load-balancer on gcp with ovn is off our charts at over 100s, and possibly getting worse. The problem can be seen at:

https://datastudio.google.com/reporting/26007bde-88c8-44dd-8fb9-4f5e41dfb0ce/page/p_i6fa5cr1mc

Setting Release to 4.10 and Network to ovn, then scrolling down to "Mean Disruption by Platform for load-balancer", shows that gcp values are presently around 100-180.

For comparison, Azure is around 25 and AWS around 13.

TRT is rating sev high due to the risk to customer workloads during a cluster upgrade.

Comment 4 jamo luhrsen 2021-11-05 00:44:28 UTC
There is a very small PR in ovnk that I think can help with some of our disruptions.
https://github.com/openshift/ovn-kubernetes/pull/806

I've associated it with this bug for now and can test with our upgrade presubmit
job in that project. Hopefully it will be merged soon and we can see how
things behave across all the upgrade jobs.

Comment 7 jamo luhrsen 2021-11-05 15:41:01 UTC
*** Bug 2015504 has been marked as a duplicate of this bug. ***

Comment 11 David Eads 2021-11-29 13:26:13 UTC
This is still happening. We're seeing median gcp ovn new connection disruption of 80 seconds.

Comment 12 jamo luhrsen 2021-12-02 19:24:04 UTC
@dgoodwin, finally attempting to look more closely at this. learning on the fly...

looking at the data collected I can see that ovn-gcp is in the 80s range while azure and aws are ~10s.

My assumption, if someone could confirm it, is that this data is coming from the test case:
  "disruption_tests: [sig-network-edge] Application behind service load balancer with PDB is not disrupted"

That seems to match some results I see in individual GCP jobs, but when I look at an example aws [0] or azure [1]
job I'm not finding that test being run. I do see the test run on all three platforms [2][3][4] with
sdn networking. I'm a little confused because of this.

[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1440362887733317632
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-ovn-upgrade/1459611269538516992
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1466452267598090240
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1460686809745854464
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade/1465773777051717632

Comment 13 jamo luhrsen 2021-12-02 22:32:04 UTC
(In reply to jamo luhrsen from comment #12)

> that seems to match some results I see in individual GCP jobs, but when I
> look at an example aws [0] or azure [1]
> job I'm not finding that test being run.

Sorry about this comment, I missed it; the test is being run in those jobs.

Comment 14 jamo luhrsen 2021-12-03 19:41:12 UTC
I've been staring at CI logs for the past day and not getting very far, but wanted to update with some
notes. For this example [0], the disruption window for "disruption/service-loadbalancer-with-pdb connection/new"
looks like this:

15:34:05.000 10s
15:34:05.641 10s
15:34:21.000 10s
15:34:21.642 10s
15:37:25.000 10s
15:37:25.642 10s
15:37:38.000 20s
15:37:38.642 20s
15:40:38.000 10s
15:40:38.642 10s
15:41:21.000 10s
15:41:21.642 10s
15:41:35.000 10s
15:41:35.642 10s
15:41:47.000 10s
15:41:47.642 10s

I know it's obvious, but the reboot timestamps for each node all fall within this window.

Time of the actual 'Rebooting node' message in each node's journal (one way to pull these is sketched after the list):

15:33:24.522815 ./ci-op-lfwsrrrg-82914-2q27h-master-2
15:33:27.044693 ./ci-op-lfwsrrrg-82914-2q27h-worker-a-zh8nx
15:36:53.757970 ./ci-op-lfwsrrrg-82914-2q27h-worker-b-qc2q5
15:39:23.941157 ./ci-op-lfwsrrrg-82914-2q27h-master-0
15:40:51.169145 ./ci-op-lfwsrrrg-82914-2q27h-worker-c-lmxsq
15:46:23.254643 ./ci-op-lfwsrrrg-82914-2q27h-master-1
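
(One way to pull these timestamps, not necessarily how it was gathered here, is a debug shell on each node:)

  oc debug node/<node-name>
    chroot /host
    # the 'Rebooting node' message above comes from the host journal
    journalctl --no-pager | grep 'Rebooting node'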

One thing to note is that some containers on each node do not respond to SIGTERM, so although the reboot
is issued at the timestamp above, the host doesn't actually perform the reboot for another ~30s while
those containers wait to stop, finally time out, and are sent SIGKILL. In that 30s window I do see some
networking activity in the node journal. CI doesn't preserve the previous container logs, so I'm unable
to see what might have been going on with the relevant ovn pods.

This is not a problem on AWS or Azure, and I have no theory about why yet.

I am able to reproduce this with a cluster-bot cluster by running the upgrade
suite locally against it, so I have access to the cluster during and after the
problem is reproduced. I will try to dig deeper with that.


[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade/1466410046328934400

Comment 15 jamo luhrsen 2021-12-10 02:13:32 UTC
update.

I think I have an easier way to reproduce this. Essentially, create and curl an LB service and then reboot *any*
worker node in the cluster. If this is done on a cluster running in GCP w/ OVN there will be a disruption in the curl
requests; 10-30s is what I see. Doing this same test on AWS w/ OVN or on GCP w/out OVN, there is not really
any disruption (I say "really" because I did see one instance of a 10s disruption across twelve node reboots).

With this, I can try to drive down on what is getting in the way during these node reboots. The LB service is
running two pods across three worker nodes; even rebooting the node that is not running one of the LB pods
will produce this disruption.

steps used to reproduce:

- create 4.10 cluster bot cluster on GCP w/ OVN
- create the LB service used in CI
  - I was just running openshift-tests and aborting the process once I saw the LB pods up and running,
    but I'm sure there is a more proper way to do this
    - ./openshift-tests run-upgrade all --to-image registry.ci.openshift.org/ocp/release:4.9.0-0.ci-2021-12-09-065806 --options '' -o /tmp/logs/artifacts/e2e.log --junit-dir /tmp/logs/artifacts/junit
- put curl in a loop to hit the LB VIP (a timestamped variant of this loop is sketched after these steps)
  - while true; do curl -I 35.224.74.56 | head -n2 | tee -a /tmp/curl_output; sleep 3; done
- reboot any worker node
  - oc adm cordon <worker node>
  - oc adm drain --ignore-daemonsets --delete-emptydir-data --force <worker node>
  - oc debug node/<worker node>
    - chroot /host
    - systemctl reboot
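
(A possible variant of the curl loop above for spotting the disruption window more easily; this is a sketch rather than what CI runs, and 35.224.74.56 is just the example VIP from the steps above:)

  # print a UTC timestamp and the HTTP status for every request; the short
  # connect timeout makes a dropped SYN show up as a quick 000 line instead
  # of a multi-minute hang
  while true; do
    printf '%s ' "$(date -u +%H:%M:%S)"
    curl -s --connect-timeout 2 -o /dev/null -w '%{http_code}\n' http://35.224.74.56/
    sleep 3
  done | tee -a /tmp/curl_output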

Comment 16 jamo luhrsen 2021-12-14 22:42:07 UTC
What's happening is that every node in the cluster is handling the LB requests in turn, not only the nodes
running the LB pods that get created. I learned this in a discussion here [0]. I verified it by running
tcpdump in a debug pod on each node and watching the requests come in and get responded to from the VIP.
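
(Roughly what that tcpdump check looks like; a sketch, using the example VIP from comment 15, and if tcpdump isn't present in the host image it can be run from toolbox instead:)

  oc debug node/<worker-node>
    chroot /host
    # watch client requests for the LB VIP arriving on this node
    tcpdump -nn -i any 'tcp port 80 and host 35.224.74.56'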

If this is done repeatedly and a node is rebooted, eventually a request (the initial SYN packet) will
be dropped by the rebooting node, and the client then sits in a back-off/retry loop on that same
connection. Eventually the rebooting node comes back up and responds.
I'm not sure yet how to explain why this is happening with OVN and not with OpenShift-SDN.
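
(An aside, not something verified on these clusters: how long that client-side hang can last is bounded by the kernel's SYN retransmission budget, which on a stock Linux client works out to roughly two minutes, in the same ballpark as the 100s+ numbers in the description.)

  # default is 6 SYN retransmissions, i.e. waits of 1+2+4+8+16+32+64s,
  # or roughly 127s before connect() gives up
  sysctl net.ipv4.tcp_syn_retries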

still debugging...


[0] https://coreos.slack.com/archives/CDCP2LA9L/p1639517164443200

Comment 17 jamo luhrsen 2021-12-23 23:06:19 UTC
still fighting with this. more notes:

- The disruption time-frame starts when the rebooting node is going down and its networking is being torn
  down (e.g., ovs ports and bridges are deleted or marked down), and it stops when the node is coming back
  up from the reboot and networking comes back online. This is approx 30-45s for a worker node that has
  been drained.

- The LB VIP is being hit every 3s (I think) and since it may end up going to one of the other 5 nodes
  that are still up, it could take ~15s (assuming basic round-robin LB) before a connection is hung on
  the rebooting node.

- As part of my debugging I noticed that all nodes have a VIP in the OVN DB (output below; one way to pull
  it is sketched after the output). I was able to manually delete the VIP for a node before the reboot,
  thinking that might help, but it did not. The disruption was the same.

 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-worker-b-dlqfx"
 vips               : {"10.0.128.3:31446"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-master-2"
 vips               : {"10.0.0.4:31446"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-worker-c-8q5rn"
 vips               : {"10.0.128.4:31446"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-worker-a-c5bqx"
 vips               : {"10.0.128.2:31446"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-master-1"
 vips               : {"10.0.0.5:31446"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_cluster"
 vips               : {"172.30.235.51:80"="10.128.2.28:80,10.129.2.19:80", "35.224.222.44:80"="10.128.2.28:80,10.129.2.19:80"}
 --
 name               : "Service_e2e-k8s-service-lb-available-4373/service-test_TCP_node_router+switch_ci-ln-yndjmdk-72292-7278f-master-0"
 vips               : {"10.0.0.3:31446"="10.128.2.28:80,10.129.2.19:80"}
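
(Roughly how that output can be pulled; a sketch that assumes ovn-nbctl is run from the nbdb container of an ovnkube-master pod and that --columns is supported, which may differ by release:)

  OVN_POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o name | head -n1)
  oc -n openshift-ovn-kubernetes exec "$OVN_POD" -c nbdb -- \
    ovn-nbctl --columns=name,vips find load_balancer | grep -A1 'service-lb-available'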



workaround ideas for the test:

- If the LB service externalTrafficPolicy uses Local instead of Cluster, this disruption does not occur.
  In that case the LB requests only end up being handled by the worker nodes with the running LB pods.
  I have a PR [0] that will modify the test to use Local, but no idea if that is acceptable or not (a quick
  way to try this on a live cluster is sketched after this list). This was already attempted once here [1],
  but that was reverted here [2].

- Another option to lessen the severity of how this is reported in CI is to modify the test to use much
  shorter retry timeouts. The connection that hits the rebooting node can time out quickly, and the
  next request (~3s later) will likely hit another node that is up. I suspect this would cut the reported
  disruption time significantly.

- None of these workarounds fixes whatever the underlying problem is, but they could help improve the
  pass/fail numbers of our GCP+OVN upgrade job. This bug could stay open until the root cause is better
  understood and hopefully resolved.
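
(For reference, switching the policy on a live cluster looks something like this; the namespace and service name are taken from the OVN DB output above and will differ per run:)

  oc -n e2e-k8s-service-lb-available-4373 patch svc service-test \
    --type merge -p '{"spec":{"externalTrafficPolicy":"Local"}}'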


[0] https://github.com/openshift/origin/pull/26722
[1] https://github.com/openshift/origin/pull/25406
[2] https://github.com/openshift/origin/pull/26175

Comment 18 jamo luhrsen 2022-01-04 23:15:59 UTC
@stbenjam , back to looking at this after the shutdown and I noticed a few things:

- the disruption data for this bug seems to have resolved itself. Around 12/18 it looks like
  GCP settled back down to be similar to our other platforms.

- the tests themselves had some work done around them this past month as well. Specifically,
  the service LB and PDB tests were collapsed into a standard check [0] (part of this PR [1]).
  I'm not sure what that really means, but I know that the latest version of
  openshift-tests will not create the service LB I was using to recreate this issue.

- I searched through the last 20 failed 4.9->4.10 gcp-ovn-upgrade jobs and did not find any
  with this kind of service disruption.

- the problem I've been reproducing still exists. There is something in how the GCP Cluster
  LB is networked with OVN such that it still hands a new connection request to a node that
  is rebooting. With a quicker connection timeout in place, a client would retry a new connection
  and succeed on a different node. We can continue to try to drive down on this, but I
  am completely stuck and will need help to get further.

- I would argue that with this understanding we can at least drop this from blocker status. I'm also
  guessing that this problem is not going to be on anyone's priority list.



[0] https://github.com/openshift/origin/pull/26703/commits/defef8607c30ab3fd737939fd1adb05e38f2e2a7
[1] https://github.com/openshift/origin/pull/26703

Comment 19 jamo luhrsen 2022-01-04 23:17:55 UTC
@dgoodwin , I meant to @ you instead of Stephen in comment #18. sorry.

Comment 20 Devan Goodwin 2022-01-05 17:58:05 UTC
Around this same timeframe, deads rewrote a lot of the disruption code to fix some major flaws found in the original logic. We believe the new logic more accurately reflects the real world, so it looks like this is probably resolved. I agree it should not qualify as a blocker now; I will remove the flag.

Comment 21 jamo luhrsen 2022-01-07 05:11:13 UTC
Assuming it's ok to close this now, like we did with https://bugzilla.redhat.com/show_bug.cgi?id=2015523

Comment 22 Red Hat Bugzilla 2023-09-15 01:16:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

