Bug 1760473 - dns-operator reconciliation hot-loop during apiserver connectivity outage windows
Summary: dns-operator reconciliation hot-loop during apiserver connectivity outage windows
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-10 15:42 UTC by Dan Mace
Modified: 2022-08-04 22:39 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-11 19:13:24 UTC
Target Upstream Version:
Embargoed:


Description Dan Mace 2019-10-10 15:42:06 UTC
When an apiserver client connection can't be established, the dns-operator appears to spin in a hot loop during reconciliation, resulting in voluminous log spam and spurious connection attempts.

Discovered while researching https://bugzilla.redhat.com/show_bug.cgi?id=1760103.

Evidence here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/artifacts/e2e-aws-upgrade/pods/openshift-dns-operator_dns-operator-5db7cc9bd6-6ksfp_dns-operator.log

Comment 1 W. Trevor King 2019-10-11 16:57:18 UTC
Inlining so it turns up in Bugzilla searches:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available

But I think a better example of flapping is:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists

Comment 2 Clayton Coleman 2019-10-11 17:15:28 UTC
Marking hot-loop bugs as high (in general, any operator looping behavior is pathological and has an outsized impact on customers and cluster health).

Comment 3 Dan Mace 2019-10-11 18:17:06 UTC
(In reply to W. Trevor King from comment #1)
> Inlining so it turns up in Bugzilla searches:
> [...]

The status flapping issue is interesting, but it's not related to the hot looping I'm referring to in the original report. Just want to be clear.

Comment 4 Dan Mace 2019-10-11 19:13:24 UTC
I misread the output.

The output actually shows sets of reflectors (managed by informers) failing to populate because of a connectivity issue, repeating in groups every few seconds. There's not actually a hot loop. The fixed interval for reflector restarts in client-go is actually 1 second [1], so our observed 3-second interval between retry attempts is, if anything, less aggressive than it could be.

Any operator or controller using client-go's primitives (i.e., the reflector) is subject to this behavior.

[1] https://github.com/kubernetes/client-go/blob/kubernetes-1.16.0/tools/cache/controller.go#L125

Comment 5 Dan Mace 2019-10-11 19:36:44 UTC
As for why we're seeing the log spam only in the 4.1 and 4.2 dns-operator: it's because the k8s 1.14 client library delegates the error to utilruntime.HandleError:

https://github.com/openshift/cluster-dns-operator/blob/release-4.2/vendor/k8s.io/client-go/tools/cache/reflector.go#L126

So whether this case produces log output at the k8s client level depends on how consumers set up global error handling.

