Created attachment 1764281 [details]
must gather from cluster

Description of problem:
Recently we have been seeing failures on our e2e-crc jobs for the installer repo. We also tried to reproduce it manually and hit the same issue. The 4.8.0 nightlies were working fine for this job two days ago and then suddenly started failing.

Successful run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4700/pull-ci-openshift-installer-master-e2e-crc/1371721481884536832
Failure run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4766/pull-ci-openshift-installer-master-e2e-crc/1372208982986330112

Version-Release number of selected component (if applicable):
```
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.8.0-0.nightly-2021-03-18-000857
built from commit f8a81655daaa0a21c917c671f1dce9733e14c6f2
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:5be3b251ccd17fae881d43591dd1bebe763780f0c7e3386332722ccb2648954d
```

How reproducible:
Run OpenShift as a single node on the libvirt provider; the bootstrap completes successfully but the cluster does not provision successfully.

Steps to Reproduce:
1. Use the latest `openshift-baremetal-install` binary
2. Choose the `libvirt` provider

Actual results:
```
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples
ERROR Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps-crc.testing/healthz" not successful yet
INFO Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps-crc.testing/healthz" failed: dial tcp: i/o timeout
INFO OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR Cluster operator console Degraded is True with RouteHealth_FailedLoadCA: RouteHealthDegraded: failed to read CA to check route health: configmaps "trusted-ca-bundle" not found
INFO Cluster operator console Available is Unknown with NoData:
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights UploadDegraded is True with UploadFailed: Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: i/o timeout
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples
```

Expected results:
The cluster should provision successfully.

Additional info:
This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1908389, where the canary check was also failing, but there it was caused by the load balancer, and on libvirt there is no load balancer available.

Attached must-gather logs for more info.
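For reference, the cluster state was inspected with standard troubleshooting commands along these lines (a rough sketch; the install directory and kubeconfig paths here are illustrative, not the exact ones used by the e2e-crc job):

```
# Point oc at the cluster created by the installer (path is illustrative)
export KUBECONFIG=./install-dir/auth/kubeconfig

# See which cluster operators are degraded or still progressing
oc get clusteroperators

# Collect diagnostic data (the attached must-gather was produced with a command like this)
oc adm must-gather --dest-dir=./must-gather

# Resume waiting for the installation, as suggested by the installer output above
./openshift-baremetal-install wait-for install-complete --dir ./install-dir --log-level=debug
```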
Debugging this issue further, we found that it was happening because no DNS operator was scheduled on the cluster. That was because we were using the `single-node-developer` cluster profile, and https://github.com/openshift/cluster-dns-operator/pull/216 and https://github.com/operator-framework/operator-marketplace/pull/369 are still pending. Closing this since it is not a bug in routing.
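For anyone hitting the same symptom, a quick way to confirm that the DNS operator is simply not scheduled (rather than failing) is something like the following sketch; the namespace name is the default used by cluster-dns-operator and is an assumption here:

```
# The dns clusteroperator never reports Available because its operator is not running
oc get clusteroperator dns

# Under the single-node-developer profile there is no DNS operator deployment or pod
# in the operator namespace (default namespace name assumed)
oc get all -n openshift-dns-operator
```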