Bug 1874513 - [sig-arch][Early] Managed cluster should start all core operators: <numerous specific tests>
Summary: [sig-arch][Early] Managed cluster should start all core operators: <numerous ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrew McDermott
QA Contact: Hongan Li
URL:
Whiteboard: non-multi-arch
Depends On: 1859918
Blocks:
 
Reported: 2020-09-01 14:15 UTC by Brett Tofel
Modified: 2022-08-04 22:30 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1859918
Environment:
[sig-arch][Early] Managed cluster should start all core operators
Last Closed: 2020-09-23 19:59:14 UTC
Target Upstream Version:
Embargoed:



Description Brett Tofel 2020-09-01 14:15:03 UTC
+++ This bug was initially created as a clone of Bug #1859918 +++

test:
[sig-arch][Early] Managed cluster should start all core operators 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-arch%5C%5D%5C%5BEarly%5C%5D+Managed+cluster+should+start+all+core+operators


FIXME: Replace this paragraph with a particular job URI from the search results to ground the discussion. A given test may fail for several reasons, and this bug should be scoped to one of those reasons. Ideally you'd pick a job showing the most common reason, but since that's hard to determine, you may also choose to pick a job at random. Release-gating jobs (release-openshift-...) should be preferred over presubmits (pull-ci-...) because they are closer to the released product and less likely to have in-flight code changes that complicate analysis.

FIXME: Provide a snippet of the test failure or error from the job log

--- Additional comment from W. Trevor King on 2020-07-24 04:52:42 UTC ---

Picking an example job, here's 4.6.0-0.ci-2020-07-21-114552 [1]:

fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Jul 23 14:54:30.357: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default)

From the must-gather [2]:

$ cat namespaces/openshift-ingress-operator/ingress.operator.openshift.io/dnsrecords/default-wildcard.yaml 
...
status:
  observedGeneration: 1
  zones:
  - conditions:
    - lastTransitionTime: "2020-07-23T14:43:29Z"
      message: "The DNS provider failed to ensure the record: failed to update alias
        in zone Z045029334ANE5TVMEI0C: couldn't update DNS record in zone Z045029334ANE5TVMEI0C:
        Throttling: Rate exceeded\n\tstatus code: 400, request id: 98eb01d5-99d4-4149-bb44-a6a24bf10616"
      reason: ProviderError
      status: "True"
      type: Failed
    dnsZone:
      tags:
        Name: ci-op-9xsv30bx-1a302-cn55p-int
        kubernetes.io/cluster/ci-op-9xsv30bx-1a302-cn55p: owned
  - conditions:
    - lastTransitionTime: "2020-07-23T14:43:37Z"
      message: The DNS provider succeeded in ensuring the record
      reason: ProviderSuccess
      status: "False"
      type: Failed
    dnsZone:
      id: Z2GYOLTZHS5VK
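
(To re-run this part of the triage: a rough sketch of pulling a must-gather and locating the throttled Route 53 update. The --dest-dir path and the grep pattern are only illustrative; the pattern matches the message quoted above.)

  # Collect a must-gather from the affected cluster, then search the
  # ingress-operator namespace dump for the throttled DNS updates.
  $ oc adm must-gather --dest-dir=./must-gather
  $ grep -r "Throttling: Rate exceeded" ./must-gather/*/namespaces/openshift-ingress-operator/
  $ cat ./must-gather/*/namespaces/openshift-ingress-operator/ingress.operator.openshift.io/dnsrecords/default-wildcard.yaml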

Also "[sig-arch]" != Multi-Arch component (that's about s390x, etc.).  sig-arch is about OpenShift/Kube architecture, while Multi-Arch is about CPU architecture.  Moving to Routing for the ingress issue.

Note for the routing folks: "Some ingresscontrollers are degraded: default" does not make the next step very clear:

  $ oc -n openshift-ingress-operator get -o yaml dnsrecords default-wildcard

(which presumably also has a web-console analog). It would be nice if we either gave a more direct pointer in the ClusterOperator condition or bubbled some portion of the error up into the condition message.
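
For reference, the drill-down today looks roughly like this (a sketch, using the default object names from the failure above):

  # The ClusterOperator condition only says the default ingresscontroller is degraded:
  $ oc get clusteroperator ingress -o yaml
  # The ingresscontroller status points toward DNS:
  $ oc -n openshift-ingress-operator get ingresscontroller default -o yaml
  # Only the dnsrecord status carries the actual provider error (the Throttling message above):
  $ oc -n openshift-ingress-operator get -o yaml dnsrecords default-wildcard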

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-shared-vpc-4.6/1286302245683466240

--- Additional comment from Andrew McDermott on 2020-07-24 17:34:55 UTC ---

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

--- Additional comment from  on 2020-08-18 20:01:45 UTC ---

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

--- Additional comment from Russell Teague on 2020-08-31 15:16:41 UTC ---

I'm seeing the same test fail regularly, but occasionally with a different failure mode. Should I open a different bug for this?

I0831 04:52:52.846255     122 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Aug 31 04:52:52.893: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 31 04:52:52.995: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 31 04:52:53.063: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 31 04:52:53.063: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 31 04:52:53.063: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 31 04:52:53.087: INFO: e2e test version: v0.0.0-master+$Format:%h$
Aug 31 04:52:53.102: INFO: kube-apiserver version: v1.19.0-rc.2.473+f71a7ab366cffe-dirty
Aug 31 04:52:53.124: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/framework.go:1425
[BeforeEach] [Top Level]
  github.com/openshift/origin@/test/extended/util/test.go:59
[It] start all core operators [Suite:openshift/conformance/parallel]
  github.com/openshift/origin@/test/extended/operators/operators.go:31
STEP: checking for the cluster version operator
STEP: ensuring cluster version is stable
STEP: ensuring all cluster operators are stable
Aug 31 04:52:53.216: FAIL: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Full Stack Trace
github.com/openshift/origin/test/extended/operators.glob..func7.1()
	github.com/openshift/origin@/test/extended/operators/operators.go:94 +0x1a2c
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001cb9d40, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x22442a0)
	github.com/openshift/origin@/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1.1()
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001c3fbd8)
	github.com/openshift/origin@/test/extended/util/test.go:167 +0x58
main.newRunTestCommand.func1(0xc001cc9680, 0xc001ad37b0, 0x1, 0x1, 0x0, 0x0)
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:239 +0x1be
github.com/spf13/cobra.(*Command).execute(0xc001cc9680, 0xc001ad3770, 0x1, 0x1, 0xc001cc9680, 0xc001ad3770)
	@/github.com/spf13/cobra/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc001cc8f00, 0x0, 0x6963c80, 0x9e9fd00)
	@/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
	@/github.com/spf13/cobra/command.go:864
main.main.func1(0xc001cc8f00, 0x0, 0x0)
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:61 +0x9c
main.main()
	github.com/openshift/origin@/cmd/openshift-tests/openshift-tests.go:62 +0x36e
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on all nodes
Aug 31 04:52:53.227: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 04:52:53.216: Some cluster operators are not ready: dns (Degraded=True DNSDegraded: DNS default is degraded)

Aug 31 04:52:52.790 I ns/openshift-monitoring pod/thanos-querier-78f465cc58-bw4xp node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.790 I ns/openshift-sdn pod/sdn-metrics-gff76 node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/test-ssh-bastion pod/ssh-bastion-5fcf8d7d9b-mnzkn node/ip-10-0-221-16.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-8mnnc node/ip-10-0-155-158.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-monitoring pod/grafana-5f4c8bff99-pcgrq node/ip-10-0-151-54.ec2.internal reason/Created
Aug 31 04:52:52.791 I ns/openshift-sdn pod/sdn-metrics-vrbjj node/ip-10-0-151-54.ec2.internal reason/Created

failed: (600ms) 2020-08-31T04:52:53 "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]"
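
For the dns flavor of the failure, the analogous drill-down (again a rough sketch, assuming the default operand names) would be:

  # The ClusterOperator condition only reports "DNS default is degraded":
  $ oc get clusteroperator dns -o yaml
  # Operator-level view of the "default" DNS and its conditions:
  $ oc get dnses.operator.openshift.io default -o yaml
  # The CoreDNS daemonset and pods behind it:
  $ oc -n openshift-dns get daemonset,pods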

--- Additional comment from Russell Teague on 2020-08-31 20:54:09 UTC ---

Looking at the last 24 hours of failures, the test failed on the dns operator 2 times and on the ingress operator 5 times.

Comment 1 Brett Tofel 2020-09-01 14:21:27 UTC
Cloned this as Build Watcher because it's also happening under 4.6, across numerous tests as well.

Here's an example test fail:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.6/1300441657195368448

fail [github.com/openshift/origin@/test/extended/operators/operators.go:94]: Aug 31 15:29:30.167: Some cluster operators are not ready: ingress (Degraded=True IngressControllersDegraded

Comment 3 Andrew McDermott 2020-09-10 11:54:51 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 4 Stephen Greene 2020-09-23 19:59:14 UTC
This bug has not been hit on 4.6 CI at all in the past 14 days.

https://search.ci.openshift.org/?search=Degraded%3DTrue+IngressControllersDegraded%3A+Some+ingresscontrollers+are+degraded%3A+default&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

The search shows the failure 3 times in the last 14 days, but only on 4.5, not on 4.6.

Closing this bug as not a bug. Will address the 4.5 issue via the cloned BZ.

