Bug 1940309

Summary: CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Version: 4.8
Reporter: Praveen Kumar <prkumar>
Assignee: aos-network-edge-staff <aos-network-edge-staff>
QA Contact: Hongan Li <hongli>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: aos-bugs, cfergeau, mmasters
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-03-19 11:57:50 UTC
Type: Bug
Attachments:
must gather from cluster (flags: none)

Description Praveen Kumar 2021-03-18 06:25:02 UTC
Created attachment 1764281 [details]
must gather from cluster

Description of problem: Recently we have been seeing failures in our e2e-crc jobs for the installer repo. We also tried running the job manually and hit the same issue. The 4.8.0 nightlies were working fine for this job two days ago and then suddenly started failing.

Successful run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4700/pull-ci-openshift-installer-master-e2e-crc/1371721481884536832
Failure run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4766/pull-ci-openshift-installer-master-e2e-crc/1372208982986330112

Version-Release number of selected component (if applicable):
```
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.8.0-0.nightly-2021-03-18-000857
built from commit f8a81655daaa0a21c917c671f1dce9733e14c6f2
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:5be3b251ccd17fae881d43591dd1bebe763780f0c7e3386332722ccb2648954d
```

How reproducible:
Try to run OpenShift as a single node on the libvirt provider. The bootstrap completes successfully, but the cluster does not provision successfully.


Steps to Reproduce:
1. Use the latest `openshift-baremetal-install` binary
2. Choose the `libvirt` provider (a sketch of the setup follows below)
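
For reference, a minimal sketch of what the reproduction looks like. The install-config.yaml below is illustrative only: the base domain and cluster name are inferred from the `apps-crc.testing` hostnames in the logs, while the libvirt URI, pull secret, and SSH key are placeholder assumptions, not values from the failing job.

```
$ cat install-config.yaml
apiVersion: v1
baseDomain: testing
metadata:
  name: crc
controlPlane:
  name: master
  replicas: 1          # single-node control plane
compute:
- name: worker
  replicas: 0          # no dedicated workers
platform:
  libvirt:
    URI: qemu+tcp://192.168.122.1/system   # placeholder libvirt connection URI
pullSecret: '<pull secret>'
sshKey: '<ssh public key>'

$ ./openshift-baremetal-install create cluster --dir . --log-level debug
```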

Actual results:
```
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples                                                                   
ERROR Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps-crc.testing/healthz" not successful yet                       
INFO Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps-crc.testing/healthz" failed: dial tcp: i/o timeout
INFO OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                                         
ERROR Cluster operator console Degraded is True with RouteHealth_FailedLoadCA: RouteHealthDegraded: failed to read CA to check route health: configmaps "trusted-ca-bundle" not found                                                 
INFO Cluster operator console Available is Unknown with NoData:
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)                                                                                         
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights UploadDegraded is True with UploadFailed: Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: i/o timeout        
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html                                                                                                                 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples   
```
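
For anyone debugging the same symptom, a few read-only checks that can help narrow down why CanaryChecksSucceeding is False. The namespaces below are the standard ones used by the ingress operator and its canary; this is a suggested starting point, not the exact procedure used here.

```
$ oc get clusteroperators ingress authentication console
$ oc -n openshift-ingress-operator get ingresscontroller default -o yaml
$ oc -n openshift-ingress-canary get pods,route
$ oc -n openshift-ingress get pods -o wide
```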

Expected results:
The cluster should provision successfully.

Additional info:
This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1908389, where the canary checks were also failing, but in that case the cause was the load balancer, and no load balancer is available for libvirt.

Attached must-gather logs for more info.

Comment 1 Praveen Kumar 2021-03-19 11:57:50 UTC
Debugging this issue further, we found that it was happening because no DNS operator was scheduled on the cluster. That in turn happened because we were using the `single-node-developer` profile while https://github.com/openshift/cluster-dns-operator/pull/216 and https://github.com/operator-framework/operator-marketplace/pull/369 are still pending.
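
As a quick way to confirm this kind of root cause, one can check whether the DNS operator and its operands are scheduled at all (standard oc commands; the cluster-profile behaviour itself depends on the pending PRs linked above):

```
$ oc get clusteroperator dns
$ oc -n openshift-dns-operator get pods
$ oc -n openshift-dns get pods
```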

Closing this since it is not a bug in routing.