Bug 1760473
| Summary: | dns-operator reconciliation hot-loop during apiserver connectivity outage windows | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Mace <dmace> |
| Component: | Networking | Assignee: | Dan Mace <dmace> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED UPSTREAM | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, ccoleman, wking |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-11 19:13:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Mace
2019-10-10 15:42:06 UTC
Inlining so it turns up in Bugzilla searches:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
```

But I think a better example of flapping is:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
```

---

Marking hot loop bugs as high (in general, any operator looping behavior is pathological and has outsized impact on customers and cluster health).

---

(In reply to W. Trevor King from comment #1)

The status flapping issue is interesting, but it is not related to the hot looping I'm referring to in the original report. Just want to be clear.

---

I misread the output. The output is actually showing sets of reflectors (managed by informers) failing to populate due to a connectivity issue, repeating in groups every few seconds. There's not actually a hot loop. The fixed interval for reflector start processing in client-go is 1 second [1], so our observed 3-second interval between retry attempts is probably better than it could be. Any operator or controller using client-go's primitives (i.e. the reflector) is subject to this behavior.

[1] https://github.com/kubernetes/client-go/blob/kubernetes-1.16.0/tools/cache/controller.go#L125

As for why we're seeing the log spam only in the 4.1 and 4.2 dns-operator: the k8s 1.14 client library delegates the error to utilruntime.HandleError:

https://github.com/openshift/cluster-dns-operator/blob/release-4.2/vendor/k8s.io/client-go/tools/cache/reflector.go#L126

So whether the log is emitted in this particular case at the k8s level will depend on how consumers set up global error handling.