Bug 1859918
| Summary: | [sig-arch][Early] Managed cluster should start all core operators: AWS Route 53 ProviderError throttling | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Nikolaos Leandros Moraitis <nmoraiti> |
| Component: | Networking | Assignee: | Stephen Greene <sgreene> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, rteague, sgreene, wking |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1874513 (view as bug list) | Environment: | [sig-arch][Early] Managed cluster should start all core operators |
| Last Closed: | 2020-10-23 14:00:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1874513 | | |
|
Description
Nikolaos Leandros Moraitis 2020-07-23 10:06:37 UTC
Picking an example job, here's 4.6.0-0.ci-2020-07-21-114552 [1]:
fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Jul 23 14:54:30.357: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default)
From the must-gather [2]:
$ cat namespaces/openshift-ingress-operator/ingress.operator.openshift.io/dnsrecords/default-wildcard.yaml
...
status:
observedGeneration: 1
zones:
- conditions:
- lastTransitionTime: "2020-07-23T14:43:29Z"
message: "The DNS provider failed to ensure the record: failed to update alias
in zone Z045029334ANE5TVMEI0C: couldn't update DNS record in zone Z045029334ANE5TVMEI0C:
Throttling: Rate exceeded\n\tstatus code: 400, request id: 98eb01d5-99d4-4149-bb44-a6a24bf10616"
reason: ProviderError
status: "True"
type: Failed
dnsZone:
tags:
Name: ci-op-9xsv30bx-1a302-cn55p-int
kubernetes.io/cluster/ci-op-9xsv30bx-1a302-cn55p: owned
- conditions:
- lastTransitionTime: "2020-07-23T14:43:37Z"
message: The DNS provider succeeded in ensuring the record
reason: ProviderSuccess
status: "False"
type: Failed
dnsZone:
id: Z2GYOLTZHS5VK
Also "[sig-arch]" != Multi-Arch component (that's about s390x, etc.). sig-arch is about OpenShift/Kube architecture, while Multi-Arch is about CPU architecture. Moving to Routing for the ingress issue.
Note for the routing folks: "Some ingresscontrollers are degraded: default" does not make the next diagnostic step very clear:
$ oc -n openshift-ingress-operator get -o yaml dnsrecords default-wildcard
(which presumably also has a web-console analog). It would be nice if we either gave a more direct pointer in the ClusterOperator condition and/or bubbled some portion of the error up into the condition message.
[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.6/1286302245683466240
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.

I'm seeing the same test fail regularly, but with a different failure mode occasionally. Should I open a different bug for this?

I0831 04:52:52.846255 122 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Aug 31 04:52:52.893: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 31 04:52:52.995: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 31 04:52:53.063: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 31 04:52:53.063: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 31 04:52:53.063: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 31 04:52:53.087: INFO: e2e test version: v0.0.0-master+$Format:%h$
Aug 31 04:52:53.102: INFO: kube-apiserver version: v1.19.0-rc.2.473+f71a7ab366cffe-dirty
Aug 31 04:52:53.124: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/test.go:59
[It] start all core operators [Suite:openshift/conformance/parallel]
  github.com/openshift/origin@/test/extended/operators/operators.go:31
STEP: checking for the cluster version operator
STEP: ensuring cluster version is stable
STEP: ensuring all cluster operators are stable
Aug 31 04:52:53.216: FAIL: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Full Stack Trace
github.com/openshift/origin/test/extended/operators.glob..func7.1()
  github.com/openshift/origin@/test/extended/operators/operators.go:94 +0x1a2c
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001cb9d40, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x22442a0)
  github.com/openshift/origin@/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1.1()
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001c3fbd8)
  github.com/openshift/origin@/test/extended/util/test.go:167 +0x58
main.newRunTestCommand.func1(0xc001cc9680, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x0)
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x1be
github.com/spf13/cobra.(*Command).execute(0xc001cc9680, 0xc001ad3770, 0x1, 0x1, 0xc001cc9680, 0xc001ad3770)
  @/github.com/spf13/cobra/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc001cc8f00, 0x0, 0x6963c80, 0x9e9fd00)
  @/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
  @/github.com/spf13/cobra/command.go:864
main.main.func1(0xc001cc8f00, 0x0, 0x0)
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:61 +0x9c
main.main()
  github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:62 +0x36e
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on all nodes
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 04:52:53.216: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)
Aug 31 04:52:52.790 I ns/openshift-monitoring pod/thanos-querier-78f465cc58-bw4xp node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.790 I ns/openshift-sdn pod/sdn-metrics-gff76 node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/test-ssh-bastion pod/ssh-bastion-5fcf8d7d9b-mnzkn node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-8mnnc node/ip-10-0-155-158.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-monitoring pod/grafana-5f4c8bff99-pcgrq node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-vrbjj node/ip-10-0-151-54.ec2.internal reason/Created
failed: (600ms) 2020-08-31T04:52:53 "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]"

Looking at the last 24 hours of failures, the test failed on the dns operator 2 times and the ingress operator 5 times.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.