When an apiserver client connection fails to establish, the dns-operator seems to spin in a hot loop during reconciliation, resulting in voluminous log spam and spurious connection attempts.

Discovered while researching https://bugzilla.redhat.com/show_bug.cgi?id=1760103.

Evidence here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/artifacts/e2e-aws-upgrade/pods/openshift-dns-operator_dns-operator-5db7cc9bd6-6ksfp_dns-operator.log
Inlining so it turns up in Bugzilla searches:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available

But I think a better example of flapping is:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Marking hot-loop bugs as high severity (in general, any operator looping behavior is pathological and has an outsized impact on customers and cluster health).
(In reply to W. Trevor King from comment #1)
> But I think a better example of flapping is:
> [...]
> Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False:
> AsExpected: All desired DNS DaemonSets available and operand Namespace exists

The status flapping issue is interesting, but it is not related to the hot looping I'm referring to in the original report. Just want to be clear.
I misread the output. It's actually showing sets of reflectors (managed by informers) failing to populate due to a connectivity issue, repeating in groups every few seconds. There's not actually a hot loop. The fixed restart interval for reflector processing in client-go is 1 second [1], so our observed 3-second interval between retry attempts is probably better than it could be. Any operator or controller using client-go's primitives (i.e. the reflector) is subject to this behavior.

[1] https://github.com/kubernetes/client-go/blob/kubernetes-1.16.0/tools/cache/controller.go#L125
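To make the cadence concrete: the controller linked in [1] reruns its processing under wait.Until with a fixed one-second period and no backoff, so a dead apiserver connection is retried (and an error reported) roughly once per second per informer. Below is a minimal sketch of that pattern, not the actual controller code; listAndWatch is a hypothetical stand-in for a reflector's ListAndWatch call:

    package main

    import (
        "errors"
        "log"
        "time"

        utilruntime "k8s.io/apimachinery/pkg/util/runtime"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // listAndWatch is a stand-in for a reflector's ListAndWatch call; while
    // the apiserver is unreachable it fails immediately on every attempt.
    func listAndWatch() error {
        return errors.New("connection refused")
    }

    func main() {
        stop := make(chan struct{})
        go func() {
            time.Sleep(5 * time.Second)
            close(stop)
        }()

        // wait.Until reruns the func on a fixed 1s period with no backoff,
        // so a persistent connection failure produces one reported error
        // per second for every informer in the process.
        wait.Until(func() {
            if err := listAndWatch(); err != nil {
                utilruntime.HandleError(err) // default handler logs the error
            }
        }, time.Second, stop)
        log.Println("stopped")
    }

Run for five seconds this reports the same connection error about five times; multiply by the number of informers an operator runs and the log volume adds up quickly, even though nothing is hot-looping.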
As for why we're seeing the log spam only in the 4.1 and 4.2 dns-operator: the k8s 1.14 client library delegates the error to utilruntime.HandleError:

https://github.com/openshift/cluster-dns-operator/blob/release-4.2/vendor/k8s.io/client-go/tools/cache/reflector.go#L126

So whether this particular message actually gets emitted at the client-go level depends on how consumers set up global error handling.
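For example, in the client-go vintage under discussion, utilruntime.ErrorHandlers is a plain []func(error) that every HandleError call fans out to, so a consumer can install its own handler to rate-limit or drop this class of message. A minimal sketch; the one-second suppression policy is purely illustrative, and the handler is deliberately not concurrency-safe:

    package main

    import (
        "errors"
        "log"
        "time"

        utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    )

    func main() {
        var last time.Time // illustration only; real code would need locking
        // Replace the default handlers (which log every error) with one
        // that drops anything arriving within a second of the previous
        // message, taming spam from reflectors retrying a dead connection.
        utilruntime.ErrorHandlers = []func(error){
            func(err error) {
                if time.Since(last) < time.Second {
                    return
                }
                last = time.Now()
                log.Printf("client-go background error: %v", err)
            },
        }

        // Everything in the process that calls utilruntime.HandleError
        // (reflectors included) now flows through the handler above.
        for i := 0; i < 5; i++ {
            utilruntime.HandleError(errors.New("reflector: Failed to list: connection refused"))
        }
    }

With the default handlers in place all five errors would be logged; with the handler above only the first is, which is the "depends on how consumers set up global error handling" point.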