+++ This bug was initially created as a clone of Bug #1859918 +++

test: [sig-arch][Early] Managed cluster should start all core operators is failing frequently in CI, see search results:

https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D%5C%5BEarly%5C%5D+Managed+cluster+should+start+all+core+operators

FIXME: Replace this paragraph with a particular job URI from the search results to ground discussion. A given test may fail for several reasons, and this bug should be scoped to one of those reasons. Ideally you'd pick a job showing the most-common reason, but since that's hard to determine, you may also choose to pick a job at random. Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log

--- Additional comment from W. Trevor King on 2020-07-24 04:52:42 UTC ---

Picking an example job, here's 4.6.0-0.ci-2020-07-21-114552 [1]:

fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Jul 23 14:54:30.357: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default)

From the must-gather [2]:

$ cat namespaces/openshift-ingress-operator/ingress.operator.openshift.io/dnsrecords/default-wildcard.yaml
...
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-07-23T14:43:29Z"
      message: "The DNS provider failed to ensure the record: failed to update alias in zone Z045029334ANE5TVMEI0C: couldn't update DNS record in zone Z045029334ANE5TVMEI0C: Throttling: Rate exceeded\n\tstatus code: 400, request id: 98eb01d5-99d4-4149-bb44-a6a24bf10616"
      reason: ProviderError
      status: "True"
      type: Failed
    dnsZone:
      tags:
        Name: ci-op-9xsv30bx-1a302-cn55p-int
        kubernetes.io/cluster/ci-op-9xsv30bx-1a302-cn55p: owned
  - conditions:
    - lastTransitionTime: "2020-07-23T14:43:37Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: Z2GYOLTZHS5VK

Also "[sig-arch]" != Multi-Arch component (that's about s390x, etc.). sig-arch is about OpenShift/Kube architecture, while Multi-Arch is about CPU architecture. Moving to Routing for the ingress issue.

Note for the routing folks: "Some ingresscontrollers are degraded: default" does not make the next step:

  $ oc -n openshift-ingress-operator get -o yaml dnsrecords default-wildcard

(which presumably also has a web-console analog) very clear. It would be nice if we gave a more direct pointer in the ClusterOperator condition and/or bubbled some portion of the error up into the condition message.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.6/1286302245683466240

--- Additional comment from Andrew McDermott on 2020-07-24 17:34:55 UTC ---

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

--- Additional comment from on 2020-08-18 20:01:45 UTC ---

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
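To illustrate the suggestion above about bubbling the DNSRecord error up into the condition message, here is a rough Go sketch. It is not the actual ingress-operator code; the type and function names are illustrative, and the structs are simplified mirrors of the status.zones[].conditions[] fields shown in the must-gather snippet. It simply collects the Failed=True zone messages into one string that an operator could surface in its Degraded condition.

// Rough sketch only (not the actual ingress-operator code). Given zone
// statuses shaped like the dnsrecords YAML above, collect the Failed=True
// messages into a single string suitable for a Degraded condition message.
package main

import (
	"fmt"
	"strings"
)

// Simplified mirrors of the status.zones[].conditions[] fields shown above.
type ZoneCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

type ZoneStatus struct {
	ZoneID     string
	Conditions []ZoneCondition
}

// degradedMessage summarizes the zones whose Failed condition is True.
func degradedMessage(record string, zones []ZoneStatus) string {
	var problems []string
	for _, z := range zones {
		for _, c := range z.Conditions {
			if c.Type == "Failed" && c.Status == "True" {
				problems = append(problems, fmt.Sprintf("zone %s: %s", z.ZoneID, c.Message))
			}
		}
	}
	if len(problems) == 0 {
		return ""
	}
	return fmt.Sprintf("DNS record %s failed in %d zone(s): %s",
		record, len(problems), strings.Join(problems, "; "))
}

func main() {
	// Values taken from the must-gather snippet above.
	zones := []ZoneStatus{
		{ZoneID: "Z045029334ANE5TVMEI0C", Conditions: []ZoneCondition{{
			Type: "Failed", Status: "True", Reason: "ProviderError",
			Message: "The DNS provider failed to ensure the record: Throttling: Rate exceeded",
		}}},
		{ZoneID: "Z2GYOLTZHS5VK", Conditions: []ZoneCondition{{
			Type: "Failed", Status: "False", Reason: "ProviderSuccess",
			Message: "The DNS provider succeeded in ensuring the record",
		}}},
	}
	fmt.Println(degradedMessage("default-wildcard", zones))
}

With this kind of aggregation, the ClusterOperator message would point directly at the throttled zone instead of only saying "Some ingresscontrollers are degraded: default".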
--- Additional comment from Russell Teague on 2020-08-31 15:16:41 UTC ---

I'm seeing the same test fail regularly, occasionally with a different failure mode. Should I open a different bug for this?

I0831 04:52:52.846255     122 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Aug 31 04:52:52.893: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 31 04:52:52.995: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 31 04:52:53.063: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 31 04:52:53.063: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 31 04:52:53.063: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 31 04:52:53.087: INFO: e2e test version: v0.0.0-master+$Format:%h$
Aug 31 04:52:53.102: INFO: kube-apiserver version: v1.19.0-rc.2.473+f71a7ab366cffe-dirty
Aug 31 04:52:53.124: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/test.go:59
[It] start all core operators [Suite:openshift/conformance/parallel]
  github.com/openshift/origin@/test/extended/operators/operators.go:31
STEP: checking for the cluster version operator
STEP: ensuring cluster version is stable
STEP: ensuring all cluster operators are stable
Aug 31 04:52:53.216: FAIL: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Full Stack Trace
github.com/openshift/origin/test/extended/operators.glob..func7.1()
	github.com/openshift/origin@/test/extended/operators/operators.go:94 +0x1a2c
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001cb9d40, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x22442a0)
	github.com/openshift/origin@/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1.1()
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001c3fbd8)
	github.com/openshift/origin@/test/extended/util/test.go:167 +0x58
main.newRunTestCommand.func1(0xc001cc9680, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x0)
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x1be
github.com/spf13/cobra.(*Command).execute(0xc001cc9680, 0xc001ad3770, 0x1, 0x1, 0xc001cc9680, 0xc001ad3770)
	@/github.com/spf13/cobra/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc001cc8f00, 0x0, 0x6963c80, 0x9e9fd00)
	@/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
	@/github.com/spf13/cobra/command.go:864
main.main.func1(0xc001cc8f00, 0x0, 0x0)
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:61 +0x9c
main.main()
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:62 +0x36e
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on all nodes
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on node 1

fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 04:52:53.216: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Aug 31 04:52:52.790 I ns/openshift-monitoring pod/thanos-querier-78f465cc58-bw4xp node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.790 I ns/openshift-sdn pod/sdn-metrics-gff76 node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/test-ssh-bastion pod/ssh-bastion-5fcf8d7d9b-mnzkn node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-8mnnc node/ip-10-0-155-158.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-monitoring pod/grafana-5f4c8bff99-pcgrq node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-vrbjj node/ip-10-0-151-54.ec2.internal reason/Created

failed: (600ms) 2020-08-31T04:52:53 "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]"

--- Additional comment from Russell Teague on 2020-08-31 20:54:09 UTC ---

Looking at the last 24 hours of failures, the test failed on the dns operator 2 times and the ingress operator 5 times.
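For context on what the failure at operators.go:94 actually asserts, here is a rough Go sketch of the kind of per-operator condition check behind "Some cluster operators are not ready". This is not the origin test code itself; the helper names are illustrative and the real check lives in test/extended/operators/operators.go. It uses the openshift/api config/v1 types.

// Rough sketch only (not the origin test code): walk a ClusterOperator's
// conditions and flag operators that are not Available or are Degraded,
// formatted in the "name (Degraded=True Reason: message)" style seen above.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// findCondition returns the condition of the given type, or nil if it is absent.
func findCondition(co *configv1.ClusterOperator, t configv1.ClusterStatusConditionType) *configv1.ClusterOperatorStatusCondition {
	for i := range co.Status.Conditions {
		if co.Status.Conditions[i].Type == t {
			return &co.Status.Conditions[i]
		}
	}
	return nil
}

// notReady reports why an operator would be flagged by a check like this.
func notReady(co *configv1.ClusterOperator) []string {
	var problems []string
	if c := findCondition(co, configv1.OperatorAvailable); c == nil || c.Status != configv1.ConditionTrue {
		problems = append(problems, fmt.Sprintf("%s (Available!=True)", co.Name))
	}
	if c := findCondition(co, configv1.OperatorDegraded); c != nil && c.Status == configv1.ConditionTrue {
		problems = append(problems, fmt.Sprintf("%s (Degraded=True %s: %s)", co.Name, c.Reason, c.Message))
	}
	return problems
}

func main() {
	// Example modeled on the dns failure above.
	co := &configv1.ClusterOperator{}
	co.Name = "dns"
	co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
		{Type: configv1.OperatorDegraded, Status: configv1.ConditionTrue,
			Reason: "DNSDegraded", Message: "DNS default is degraded"},
	}
	fmt.Println(notReady(co)) // [dns (Degraded=True DNSDegraded: DNS default is degraded)]
}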
Cloned this as Build Watcher since it is also happening on 4.6, across numerous tests. Here's an example test failure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.6/1300441657195368448

fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 15:29:30.167: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded
Here's a narrower search that shows specifically the cases where the ingress operator is reporting Degraded:

https://search.ci.openshift.org/?search=Degraded%3DTrue+IngressControllersDegraded%3A+Some+ingresscontrollers+are+degraded%3A+default&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
This bug has not been hit on 4.6 CI at all in the past 14 days.

https://search.ci.openshift.org/?search=Degraded%3DTrue+IngressControllersDegraded%3A+Some+ingresscontrollers+are+degraded%3A+default&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

shows the failure 3 times in the last 14 days, but only on 4.5, not on 4.6.

Closing this bug as not a bug. Will address the 4.5 issue via the cloned BZ.