Bug 1981736 - Slowness in services propagation after upgrading to v3.11.465
Summary: Slowness in services propagation after upgrading to v3.11.465
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-13 09:09 UTC by Joel Rosental R.
Modified: 2024-12-20 20:27 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-25 15:16:51 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift origin pull 26387 (last updated 2021-08-10 14:09:24 UTC)
Github openshift ose pull 1558 (last updated 2021-08-03 17:51:08 UTC)
Red Hat Knowledge Base (Solution) 3739751 (last updated 2021-08-26 22:08:11 UTC)
Red Hat Product Errata RHSA-2021:3193 (last updated 2021-08-25 15:17:05 UTC)

Description Joel Rosental R. 2021-07-13 09:09:28 UTC
Description of problem:
After upgrading OCP to v3.11.404, service endpoint propagation started to take an excessive amount of time (in some cases around 2.5 hours).

Service idling is properly disabled and iptablesSyncPeriod is set to 2h.
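
For reference, the setting can be checked on the node like this (assuming the default 3.11 node config location; the path may differ in this environment):

# grep -i iptablesSyncPeriod /etc/origin/node/node-config.yaml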

The initial problem was addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1963160. However, after applying the errata for that BZ, if the ovs and sdn pods are deleted from a node running v3.11.465, the node takes several minutes before it can reach some of the services.

Version-Release number of selected component (if applicable):
v3.11.465

How reproducible:


Steps to Reproduce:
1. Delete the ovs and sdn pods from a node running v3.11.465 so that they are recreated (an example command is sketched below).
2. From that node, try to reach services in the cluster.
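
For illustration only, assuming the default openshift-sdn namespace and the app=sdn / app=ovs pod labels used in 3.11 (pod names are placeholders):

# oc -n openshift-sdn get pods -o wide | grep <node-name>
# oc -n openshift-sdn delete pod <sdn-pod-on-node> <ovs-pod-on-node>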

Actual results:
It takes several minutes before services can be reached from this node.

Expected results:
The node should be able to reach services again quickly.

Additional info:

Comment 5 Dan Winship 2021-07-29 12:10:35 UTC
> if ovs and sdn pods from a node running on v3.11.465 version are deleted, the node takes some minutes to reach some of the services.

I assume that by "take some minutes to reach some of the services" you mean that it takes a long time for the node to have the expected set of iptables rules, NOT that the iptables rules are there but the TCP traffic is slow?
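
(To distinguish those two cases on the affected node, something along these lines should show whether the rules are programmed; the service cluster IP is a placeholder:)

# iptables-save | grep -c KUBE-SERVICES
# iptables-save | grep <service-cluster-ip>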

Also, why are you deleting the ovs and sdn pods? You should never do that. If you randomly kill the OVS pod on a running node, we do not make any claims about how long it will take the node to recover.

If you are trying to test "how long does it take a node to fully program itself at startup", the right way to do that is to reboot the node, not to just randomly kill infrastructure pods on it. But also, the fact that it takes a long time to set up all of the service IPs at startup is a totally different bug from it taking a long time to propagate changes to a running node...

> Here [0] are the following logs provided by customer:

I don't have access to that folder.


Can you please clarify: is there still any problem with propagation of service changes to running nodes?
And, why is the customer killing sdn and ovs pods and then expecting things to still work well?

Comment 12 Dan Winship 2021-08-03 16:37:00 UTC
OK, I see; I was misled by the "Caches are synced for service config controller" message before, which was just indicating when it had received all of the data internally, not when the proxy had actually *processed* all of the data.

In the .286 log, the last "sdn proxy: add ..." message appears 18 seconds after startup.

In the .465 log, the last "sdn proxy: add ..." message appears 21 *minutes* after startup, and that's not even the last one (e.g., the log cuts off before demo.verification.svc appears).
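
(For anyone repeating this comparison: the interval can be read straight off the pod log once it is saved to a file; "sdn.log" is just a placeholder name:)

# head -1 sdn.log
# grep "sdn proxy: add" sdn.log | tail -1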

I'm pretty sure this is due to an iptables locking fix introduced in 3.11.344, which is mostly unnoticeable in ordinary operation, but at startup time it could end up interfering with the initial bulk creation of services pretty badly.

Comment 14 zhaozhanqi 2021-08-16 10:40:26 UTC
Created about 3000 services on this build:


# oc version
oc v3.11.500
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-13-214.ec2.internal:8443
openshift v3.11.499
kubernetes v1.11.0+d4cacc0

# oc get svc -n z1 | wc -l
3006
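
(A simple loop roughly like the following can create that many services; illustrative only, not necessarily how they were created here:)

# for i in $(seq 1 3000); do oc -n z1 create service clusterip svc-$i --tcp=80:8080; done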

Then I deleted the sdn pod so that it would be recreated. In the newly created sdn pod, it takes 6 seconds from startup to the last "sdn proxy: add" message.

The logs are attached.

@Dan Winship, could you confirm whether this is enough to verify this bug?

Comment 16 Dan Winship 2021-08-16 12:38:48 UTC
(In reply to zhaozhanqi from comment #14)
> Then I deleted the sdn pod so that it would be recreated. In the newly
> created sdn pod, it takes 6 seconds from startup to the last "sdn proxy: add" message.
> 
> The logs are attached.
> 
> @Dan Winship, could you confirm whether this is enough to verify this bug?

Yeah, it would take several minutes to complete if the fix wasn't working.

Maybe just do some spot checks of a handful of the services to make sure they actually work?
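
(For example, from the affected node, with a placeholder service name and port:)

# oc -n z1 get svc <some-service>
# curl -sv --max-time 5 http://<cluster-ip>:<port>/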

Comment 17 zhaozhanqi 2021-08-16 13:21:30 UTC
Yes, I checked and all the services are working well.

Then I will move this bug to VERIFIED. Thanks, Dan.

Comment 21 errata-xmlrpc 2021-08-25 15:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 3.11.z security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3193

