Bug 1859918
| Summary: | [sig-arch][Early] Managed cluster should start all core operators: AWS Route 53 ProviderError throttling | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Nikolaos Leandros Moraitis <nmoraiti> |
| Component: | Networking | Assignee: | Stephen Greene <sgreene> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, rteague, sgreene, wking |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1874513 (view as bug list) | Environment: | [sig-arch][Early] Managed cluster should start all core operators |
| Last Closed: | 2020-10-23 14:00:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1874513 | | |
|
Description
Nikolaos Leandros Moraitis 2020-07-23 10:06:37 UTC
Picking an example job, here's 4.6.0-0.ci-2020-07-21-114552 [1]:
fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Jul 23 14:54:30.357: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default)
From the must-gather [2]:
$ cat namespaces/openshift-ingress-operator/ingress.operator.openshift.io/dnsrecords/default-wildcard.yaml
...
status:
observedGeneration: 1
zones:
- conditions:
- lastTransitionTime: "2020-07-23T14:43:29Z"
message: "The DNS provider failed to ensure the record: failed to update alias
in zone Z045029334ANE5TVMEI0C: couldn't update DNS record in zone Z045029334ANE5TVMEI0C:
Throttling: Rate exceeded\n\tstatus code: 400, request id: 98eb01d5-99d4-4149-bb44-a6a24bf10616"
reason: ProviderError
status: "True"
type: Failed
dnsZone:
tags:
Name: ci-op-9xsv30bx-1a302-cn55p-int
kubernetes.io/cluster/ci-op-9xsv30bx-1a302-cn55p: owned
- conditions:
- lastTransitionTime: "2020-07-23T14:43:37Z"
message: The DNS provider succeeded in ensuring the record
reason: ProviderSuccess
status: "False"
type: Failed
dnsZone:
id: Z2GYOLTZHS5VK
Also "[sig-arch]" != Multi-Arch component (that's about s390x, etc.). sig-arch is about OpenShift/Kube architecture, while Multi-Arch is about CPU architecture. Moving to Routing for the ingress issue.
Note for the routing folks: "Some ingresscontrollers are degraded: default" does not make the next diagnostic step very clear:
$ oc -n openshift-ingress-operator get -o yaml dnsrecords default-wildcard
(which presumably also has a web-console analog). It would be nice if we either gave a more direct pointer in the ClusterOperator condition and/or bubbled some portion of the error up into the condition message.
[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.6/1286302245683466240
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.

I'm seeing the same test fail regularly, but with a different failure mode occasionally. Should I open a different bug for this?

I0831 04:52:52.846255 122 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Aug 31 04:52:52.893: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 31 04:52:52.995: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 31 04:52:53.063: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 31 04:52:53.063: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 31 04:52:53.063: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 31 04:52:53.087: INFO: e2e test version: v0.0.0-master+$Format:%h$
Aug 31 04:52:53.102: INFO: kube-apiserver version: v1.19.0-rc.2.473+f71a7ab366cffe-dirty
Aug 31 04:52:53.124: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/test.go:59
[It] start all core operators [Suite:openshift/conformance/parallel]
  github.com/openshift/origin@/test/extended/operators/operators.go:31
STEP: checking for the cluster version operator
STEP: ensuring cluster version is stable
STEP: ensuring all cluster operators are stable
Aug 31 04:52:53.216: FAIL: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Full Stack Trace
github.com/openshift/origin/test/extended/operators.glob..func7.1()
  github.com/openshift/origin@/test/extended/operators/operators.go:94 +0x1a2c
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001cb9d40, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x22442a0)
  github.com/openshift/origin@/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1.1()
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001c3fbd8)
  github.com/openshift/origin@/test/extended/util/test.go:167 +0x58
main.newRunTestCommand.func1(0xc001cc9680, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x0)
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x1be
github.com/spf13/cobra.(*Command).execute(0xc001cc9680, 0xc001ad3770, 0x1, 0x1, 0xc001cc9680, 0xc001ad3770)
  @/github.com/spf13/cobra/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc001cc8f00, 0x0, 0x6963c80, 0x9e9fd00)
  @/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
  @/github.com/spf13/cobra/command.go:864
main.main.func1(0xc001cc8f00, 0x0, 0x0)
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:61 +0x9c
main.main()
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:62 +0x36e
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on all nodes
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 04:52:53.216: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)
Aug 31 04:52:52.790 I ns/openshift-monitoring pod/thanos-querier-78f465cc58-bw4xp node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.790 I ns/openshift-sdn pod/sdn-metrics-gff76 node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/test-ssh-bastion pod/ssh-bastion-5fcf8d7d9b-mnzkn node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-8mnnc node/ip-10-0-155-158.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-monitoring pod/grafana-5f4c8bff99-pcgrq node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-vrbjj node/ip-10-0-151-54.ec2.internal reason/Created
failed: (600ms) 2020-08-31T04:52:53 "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]"

Looking at the last 24 hours of failures, the test failed on the dns operator 2 times and the ingress operator 5 times.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.