Bug 1739401 - Resource sync among multiple clusters is very slow
Summary: Resource sync among multiple clusters is very slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Federation
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Maru Newby
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-09 08:53 UTC by Qin Ping
Modified: 2019-10-16 06:35 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:35:36 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:35:45 UTC)

Description Qin Ping 2019-08-09 08:53:58 UTC
Description of problem:
Resource sync among multiple clusters is very slow

Version-Release number of selected component (if applicable):
KubeFed controller-manager version: version.Info{Version:"v4.2.0", GitCommit:"b8ae65cee603cc9c746911debd3dc23b922222d8", GitTreeState:"clean", BuildDate:"2019-08-08T18:58:34Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
kubefedctl version: version.Info{Version:"v4.2.0", GitCommit:"b8ae65cee603cc9c746911debd3dc23b922222d8", GitTreeState:"clean", BuildDate:"2019-08-08T18:18:54Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster-scoped KubeFed controller manager
2. Add two clusters to the federation
$ oc get kubefedcluster -n openshift-operators
NAME       READY   AGE
cluster1   True    77m
cluster2   True    77m
3. Sync a namespace to both clusters
$ oc create ns test-namespace
$ kubefedctl federate ns test-namespace -e --kubefed-namespace=openshift-operators
4. Sync a secret to both clusters (this is not slow; a sketch of the generated FederatedSecret follows these steps)
$ oc create -f secret.yaml -n test-namespace
$ cat secret.yaml 
apiVersion: v1
data:
  token: xxxx
kind: Secret
metadata:
  name: kubefed
type: Opaque
$ kubefedctl federate secret -n test-namespace -e --kubefed-namespace=openshift-operators
$ oc get federatedsecret -n test-namespace --context=cluster1
NAME           AGE
kubefed        66m
$ for c in cluster1 cluster2; do oc get secret kubefed -n test-namespace --context=$c; done
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      71m
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      66m
5. Sync another secret to both clusters (this is slow: it takes about 5 minutes for the secret to be synced to cluster2)
$ oc create ns nofed-namespace
$ oc create -f secret.yaml -n nofed-namespace
$ kubefedctl federate secret -n nofed-namespace --kubefed-namespace=openshift-operators
$ kubefedctl federate ns nofed-namespace --kubefed-namespace=openshift-operators
$ oc get ns|grep nofed-namespace
nofed-namespace                                         Active        52m
$ oc get federatednamespace -n nofed-namespace
NAME              AGE
nofed-namespace   40m
$ oc get ns nofed-namespace --context=cluster2
NAME              STATUS   AGE
nofed-namespace   Active   40m
$ oc get federatedsecret -n nofed-namespace
NAME           AGE
kubefed        52m
$ for c in cluster1 cluster2; do oc get secret kubefed -n nofed-namespace --context=$c; done
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      55m
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      35m
6. Sync secret resources again
$ oc get federatedsecret kubefed-test -n nofed-namespace
NAME           AGE
kubefed-test   41m
$ for c in cluster1 cluster2; do oc get secret kubefed-test -n nofed-namespace --context=$c; done
NAME           TYPE     DATA   AGE
kubefed-test   Opaque   1      41m
NAME           TYPE     DATA   AGE
kubefed-test   Opaque   1      7m29s (this resource is synced after about 33 minutes)
$ oc get federatedsecret -n test1
NAME      AGE
kubefed   43m

$ for c in cluster1 cluster2; do oc get secret kubefed -n test1 --context=$c; done (the secret is synced to the target clusters after about 40 minutes)
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      2m26s
NAME      TYPE     DATA   AGE
kubefed   Opaque   1      2m29s
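
For reference, a rough sketch of what the FederatedSecret generated by `kubefedctl federate secret` in step 4 looks like (the field values are illustrative and taken from the secret.yaml above; the placement list reflects the two registered clusters):

$ oc get federatedsecret kubefed -n test-namespace -o yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedSecret
metadata:
  name: kubefed
  namespace: test-namespace
spec:
  placement:
    clusters:
    - name: cluster1          # propagate to both registered clusters
    - name: cluster2
  template:
    data:
      token: xxxx
    type: Opaque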

Actual results:
We can see that syncing of secret resources to the target clusters gets slower and slower.

Expected results:
Resources are propagated to the target clusters promptly after the federated resources are created.

Additional info:
There seems to be no helpful information in the kubefed controller-manager log.
$ oc logs kubefed-controller-manager-54f578454b-v96zg -n openshift-operators|grep "test1\/kubefed"
I0809 08:48:04.473679       1 controller.go:268] Starting to reconcile FederatedSecret "test1/kubefed"
I0809 08:48:04.473688       1 controller.go:276] Ensuring finalizer exists on FederatedSecret "test1/kubefed"
I0809 08:48:04.473697       1 controller.go:638] Adding finalizer kubefed.io/sync-controller to FederatedSecret "test1/kubefed"
I0809 08:48:04.482318       1 controller.go:303] Syncing Secret "test1/kubefed" in underlying clusters, selected clusters are: map[cluster1:{} cluster2:{}]
I0809 08:48:09.147178       1 controller.go:283] Finished reconciling FederatedSecret "test1/kubefed" (duration: 93ns)
I0809 08:48:37.134601       1 controller.go:268] Starting to reconcile FederatedSecret "test1/kubefed"
I0809 08:48:37.134606       1 controller.go:276] Ensuring finalizer exists on FederatedSecret "test1/kubefed"
I0809 08:48:37.134710       1 controller.go:303] Syncing Secret "test1/kubefed" in underlying clusters, selected clusters are: map[cluster1:{} cluster2:{}]
I0809 08:48:41.819347       1 controller.go:283] Finished reconciling FederatedSecret "test1/kubefed" (duration: 68ns)
I0809 08:48:46.456534       1 controller.go:268] Starting to reconcile FederatedSecret "test1/kubefed"
I0809 08:48:46.456559       1 controller.go:276] Ensuring finalizer exists on FederatedSecret "test1/kubefed"
I0809 08:48:46.456676       1 controller.go:303] Syncing Secret "test1/kubefed" in underlying clusters, selected clusters are: map[cluster1:{} cluster2:{}]
I0809 08:48:51.101569       1 controller.go:283] Finished reconciling FederatedSecret "test1/kubefed" (duration: 136ns)
I0809 08:48:55.740040       1 controller.go:268] Starting to reconcile FederatedSecret "test1/kubefed"
I0809 08:48:55.740048       1 controller.go:276] Ensuring finalizer exists on FederatedSecret "test1/kubefed"
I0809 08:48:55.740177       1 controller.go:303] Syncing Secret "test1/kubefed" in underlying clusters, selected clusters are: map[cluster1:{} cluster2:{}]
I0809 08:49:00.418007       1 controller.go:283] Finished reconciling FederatedSecret "test1/kubefed" (duration: 100ns)

Comment 1 Maru Newby 2019-08-12 12:21:00 UTC
I am able to replicate the behavior on the test clusters Ping was using:

 - a FederatedSecret named `test-ns2/kubefed` is propagating fine, and changes to the template or placement are propagated within ~30s (the host is very slow)
 - I created a new FederatedSecret named `test-ns2/kubefed-maru` copied from the `test-ns2/kubefed` secret and it took ~53m to be noticed by the controller manager

This suggests that the problem is not with the controller but with the watch. Updates to existing resources are noticed promptly by KubeFed controllers, but new resources are only noticed after a considerable interval. That suggests to me that KubeFed watches are misconfigured, or that watches are somehow broken on the build of OCP being used for testing.
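
A rough way to measure this propagation delay (a sketch; the manifest federated-secret-copy.yaml and the name kubefed-copy are hypothetical, standing in for the copied FederatedSecret described above):

# federated-secret-copy.yaml: same spec as test-ns2/kubefed, renamed to kubefed-copy
$ oc create -f federated-secret-copy.yaml -n test-ns2
$ date
# poll the member cluster until the propagated secret appears, then print the time again
$ until oc get secret kubefed-copy -n test-ns2 --context=cluster1 >/dev/null 2>&1; do sleep 5; done; date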

Comment 2 Maru Newby 2019-08-13 14:05:29 UTC
The problem was not the watch (i.e. select is not broken: https://blog.codinghorror.com/the-first-rule-of-programming-its-always-your-fault/). 

KubeFed was performing unnecessary updates and this was bogging down the cluster.  A fix has been proposed upstream:

https://github.com/kubernetes-sigs/kubefed/pull/1099
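
For context, one rough way to observe that kind of churn from the outside (a sketch, not the actual fix): if the sync controller rewrites an unchanged secret on every reconcile, the propagated object's resourceVersion in a member cluster keeps climbing even though its data never changes.

$ while true; do
    oc get secret kubefed -n test-namespace --context=cluster1 \
      -o jsonpath='{.metadata.resourceVersion}{"\n"}'
    sleep 10
  done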

Comment 3 Aniket Bhat 2019-08-23 17:33:45 UTC
Ping,

The fix should be available in the latest downstream image.

Comment 5 Qin Ping 2019-08-28 08:09:03 UTC
Verified with KubeFed controller-manager version: version.Info{Version:"v4.2.0", GitCommit:"7f002471b9dd8366e1e0f080b46bc79864682f71", GitTreeState:"clean", BuildDate:"2019-08-25T20:09:07Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}

Comment 6 errata-xmlrpc 2019-10-16 06:35:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

