Bug 1703234 - MCO is reading things frequently that it should not need to read
Summary: MCO is reading things frequently that it should not need to read
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Duplicates: 1706638 1707516
Depends On:
Blocks:
 
Reported: 2019-04-25 20:43 UTC by Clayton Coleman
Modified: 2019-10-16 06:28 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:06 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:28:21 UTC)

Description Clayton Coleman 2019-04-25 20:43:06 UTC
{client="machine-config-operator/v0.0.0 (linux/amd64) kubernetes/$Format/machine-config",resource="clusterrolebindings",scope="cluster",verb="GET"}	0.4666666666666667
{client="machine-config-operator/v0.0.0 (linux/amd64) kubernetes/$Format/machine-config",resource="customresourcedefinitions",scope="cluster",verb="GET"}	0.3888888888888889

This looks like a misconfiguration: these resources only rarely change, and the MCO shouldn't be reading them this frequently.

This contributes to idle read traffic on the cluster.

Comment 1 Antonio Murdaca 2019-04-25 21:06:14 UTC
We're just syncing the informer for ClusterRoleBindings and adding an event handler, so we're not really calling anything directly. That informer has been there since I started working on the MCO, so I don't know why we kick an operator resync when it fires (Abhinav might be able to help with that).
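
For illustration, here is a minimal client-go sketch of that wiring (hypothetical code, not the actual MCO source; operatorKey and watchClusterRoleBindings are made-up names). Any ClusterRoleBinding event re-enqueues the operator key and so triggers a full operator sync, even though the informer itself only lists/watches:

```
package mcoexample

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// operatorKey is a hypothetical fixed work-queue key: every event on the
// watched resource enqueues it, which triggers a full operator sync.
const operatorKey = "openshift-machine-config-operator"

// watchClusterRoleBindings wires a ClusterRoleBinding informer into the
// operator's work queue. The informer itself only lists/watches; it does not
// need to issue per-sync GETs against the API server.
func watchClusterRoleBindings(client kubernetes.Interface, queue workqueue.RateLimitingInterface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Rbac().V1().ClusterRoleBindings().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { queue.Add(operatorKey) },
		UpdateFunc: func(oldObj, newObj interface{}) { queue.Add(operatorKey) },
		DeleteFunc: func(obj interface{}) { queue.Add(operatorKey) },
	})

	factory.Start(stop)
	cache.WaitForCacheSync(stop, informer.HasSynced)
}
```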

The CRD informer/lister are used in the Operator sync:

```
func (optr *Operator) waitForCustomResourceDefinition(resource *apiextv1beta1.CustomResourceDefinition) error {
```

We basically make sure the CRD is there, and we do that every time the operator syncs (which sounds right: we need to make sure our stuff is there at every sync so we can reconcile anything that drifted).
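
For context, a rough sketch of what a check like that can look like, based only on the signature quoted above (hypothetical code, not the actual MCO implementation; waitForCRDEstablished is a made-up name, and it assumes 2019-era client-go/apiextensions APIs without context arguments). The live GET on every operator sync is exactly the kind of read traffic the metrics in the description show:

```
package mcoexample

import (
	"fmt"
	"time"

	apiextv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
	apiextclientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCRDEstablished GETs the CRD and polls until its Established
// condition is true. Calling this on every operator sync means at least one
// cluster-scoped GET per sync, even when nothing has changed.
func waitForCRDEstablished(client apiextclientset.Interface, crd *apiextv1beta1.CustomResourceDefinition) error {
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		found, err := client.ApiextensionsV1beta1().CustomResourceDefinitions().Get(crd.Name, metav1.GetOptions{})
		if err != nil {
			return false, fmt.Errorf("getting CustomResourceDefinition %s failed: %v", crd.Name, err)
		}
		for _, cond := range found.Status.Conditions {
			if cond.Type == apiextv1beta1.Established && cond.Status == apiextv1beta1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}
```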

Comment 2 Antonio Murdaca 2019-04-25 21:15:08 UTC
Actually, for both resources we might be wrongly updating them when we shouldn't (the CRDs aren't changing, so the ones in the cluster shouldn't need to be updated). I'll dig into this.

Comment 3 Antonio Murdaca 2019-04-25 21:18:23 UTC
(In reply to Antonio Murdaca from comment #2)
> Actually, for both resources we might be wrongly updating them when we
> shouldn't (the CRDs aren't changing, so the ones in the cluster shouldn't
> need to be updated). I'll dig into this.

Update: we GET those resources at every sync of the operator itself to make sure they haven't changed, and if they have, we re-apply them. That's why we GET them.
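
For illustration, a sketch of that GET-compare-reapply step for a ClusterRoleBinding (hypothetical code, not the actual MCO implementation; applyClusterRoleBinding is a made-up name, and it assumes 2019-era client-go signatures without context arguments). Serving the read from the informer's lister cache instead of a live GET is one way such a step can avoid per-sync API traffic:

```
package mcoexample

import (
	"reflect"

	rbacv1 "k8s.io/api/rbac/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
	rbaclisters "k8s.io/client-go/listers/rbac/v1"
)

// applyClusterRoleBinding makes sure the required ClusterRoleBinding exists
// and matches what we want. The read comes from the lister (the informer's
// in-memory cache), so no API request is made unless something drifted.
func applyClusterRoleBinding(client kubernetes.Interface, lister rbaclisters.ClusterRoleBindingLister, required *rbacv1.ClusterRoleBinding) error {
	existing, err := lister.Get(required.Name)
	if apierrors.IsNotFound(err) {
		_, err = client.RbacV1().ClusterRoleBindings().Create(required)
		return err
	}
	if err != nil {
		return err
	}

	// Only write back if the object actually drifted. (RoleRef is immutable
	// on ClusterRoleBindings, so only Subjects are reconciled here.)
	if reflect.DeepEqual(existing.Subjects, required.Subjects) {
		return nil
	}
	updated := existing.DeepCopy()
	updated.Subjects = required.Subjects
	_, err = client.RbacV1().ClusterRoleBindings().Update(updated)
	return err
}
```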

Comment 4 Antonio Murdaca 2019-04-26 21:03:07 UTC
Not as urgent as upgrade issues, so moving this to 4.2, but if cycles allow we're going to work on it as soon as we can.

Comment 5 Kirsten Garrison 2019-05-07 19:21:41 UTC
*** Bug 1706638 has been marked as a duplicate of this bug. ***

Comment 6 Antonio Murdaca 2019-05-07 19:23:27 UTC
*** Bug 1707516 has been marked as a duplicate of this bug. ***

Comment 9 Antonio Murdaca 2019-06-27 15:27:14 UTC
Clayton, how can we access the metrics you showed in https://bugzilla.redhat.com/show_bug.cgi?id=1703234#c0 to check that this BZ is solved? We've been adding custom logs in a custom payload to check this, but QE needs something that works on a pristine payload.

Comment 10 Micah Abbott 2019-07-01 19:22:20 UTC
I talked with Antonio about this and we agreed this is not an issue that an end user is likely to notice. That makes testing the change particularly difficult, if not impossible.

In the attached PR, Antonio notes that the MCD is spamming the logs with:

```
I0604 16:04:52.546178       1 operator.go:257] resyncing operator because "openshift-machine-config-operator/machine-config-controller" *v1.ConfigMap changed
I0604 16:04:59.840257       1 operator.go:257] resyncing operator because "openshift-machine-config-operator/machine-config" *v1.ConfigMap changed
```
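
One common way to cut that kind of churn is to drop update events whose ResourceVersion has not changed, i.e. periodic informer resyncs rather than real edits. A minimal, hypothetical sketch of such a filter (not necessarily what the actual fix does):

```
package mcoexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// filteredConfigMapHandler re-enqueues the operator on ConfigMap events, but
// skips updates where the ResourceVersion is unchanged (periodic resyncs),
// so they no longer trigger (and log) a full operator sync.
func filteredConfigMapHandler(queue workqueue.RateLimitingInterface) cache.ResourceEventHandler {
	const key = "openshift-machine-config-operator" // hypothetical work-queue key
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { queue.Add(key) },
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldCM, okOld := oldObj.(*corev1.ConfigMap)
			newCM, okNew := newObj.(*corev1.ConfigMap)
			if okOld && okNew && oldCM.ResourceVersion == newCM.ResourceVersion {
				// Nothing changed; this is just the informer resync.
				return
			}
			queue.Add(key)
		},
		DeleteFunc: func(obj interface{}) { queue.Add(key) },
	}
}
```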

If I inspect MCD logs on a 4.2 cluster, I see no evidence of those messages being repeated:

```
$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc -n openshift-machine-config-operator logs $i | grep "resyncing operator"; done
$
```

And just to show that the same query returns results when given a different message:

```
$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc -n openshift-machine-config-operator logs $i | grep "In desired config"; done
I0701 18:14:30.793968    4209 daemon.go:932] In desired config rendered-master-2abcc719829ba4399ef2bab30ad9de0f
I0701 18:22:59.323839    3082 daemon.go:932] In desired config rendered-worker-ceeb937c77626a4593a2feec69673203
I0701 18:22:56.663589    3128 daemon.go:932] In desired config rendered-worker-ceeb937c77626a4593a2feec69673203
I0701 18:14:20.316160    4851 daemon.go:932] In desired config rendered-master-2abcc719829ba4399ef2bab30ad9de0f
I0701 18:22:56.520222    3203 daemon.go:932] In desired config rendered-worker-ceeb937c77626a4593a2feec69673203
I0701 18:14:30.856923    4245 daemon.go:932] In desired config rendered-master-2abcc719829ba4399ef2bab30ad9de0f
```


Additionally, Antonio notes that he has been creating custom payloads to generate spammy logs while testing other changes:

https://github.com/openshift/machine-config-operator/issues/815#issuecomment-499281762

...but with the fix for this BZ applied, the MCD is not spamming the logs.


Since we don't have any direct steps to test/reproduce this, it would be hard for users to detect a change in behavior, and there is enough evidence to support that it has been fixed, I'm going to mark this VERIFIED using 4.2.0-0.nightly-2019-06-30-221852.

Comment 12 errata-xmlrpc 2019-10-16 06:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

