Bug 1760473
| Summary: | dns-operator reconciliation hot-loop during apiserver connectivity outage windows | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Mace <dmace> |
| Component: | Networking | Assignee: | Dan Mace <dmace> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED UPSTREAM | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, ccoleman, wking |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-11 19:13:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Mace
2019-10-10 15:42:06 UTC
Inlining so it turns up in Bugzilla searches:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
```

But I think a better example of flapping is:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
```

---

Marking hot loop bugs as high (in general, any operator looping behavior is pathological and has outsized impact on customers and cluster health).

---

(In reply to W. Trevor King from comment #1)

The status flapping issue is interesting, but it is not related to the hot looping I'm referring to in the original report. Just want to be clear.

---

I misread the output. The output is actually showing sets of reflectors (managed by informers) failing to populate due to a connectivity issue, repeating in groups every few seconds. There's not actually a hot loop. The fixed interval for reflector start processing in client-go is 1 second [1], so our observed 3-second interval between retry attempts is probably better than it could be. Any operator or controller using client-go's primitives (i.e. the reflector) is subject to this behavior.

[1] https://github.com/kubernetes/client-go/blob/kubernetes-1.16.0/tools/cache/controller.go#L125

As for why we're seeing the log spam only in the 4.1 and 4.2 dns-operator: the k8s 1.14 client library delegates the error to utilruntime.HandleError:

https://github.com/openshift/cluster-dns-operator/blob/release-4.2/vendor/k8s.io/client-go/tools/cache/reflector.go#L126

So whether the log is emitted in this particular case at the k8s level will depend on how consumers set up global error handling.