Bug 1940309

Summary: CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Version: 4.8
Reporter: Praveen Kumar <prkumar>
Assignee: aos-network-edge-staff <aos-network-edge-staff>
QA Contact: Hongan Li <hongli>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: aos-bugs, cfergeau, mmasters
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-03-19 11:57:50 UTC
Type: Bug
Attachments:
must gather from cluster (flags: none)

Description Praveen Kumar 2021-03-18 06:25:02 UTC
Created attachment 1764281 [details]
must gather from cluster

Description of problem: Recently we have been seeing failures in our e2e-crc jobs for the installer repo. We also tried running the job manually and hit the same issue. The 4.8.0 nightlies were working fine for this job two days ago and then suddenly started failing.

Successful run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4700/pull-ci-openshift-installer-master-e2e-crc/1371721481884536832
Failure run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4766/pull-ci-openshift-installer-master-e2e-crc/1372208982986330112

Version-Release number of selected component (if applicable):
```
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.8.0-0.nightly-2021-03-18-000857
built from commit f8a81655daaa0a21c917c671f1dce9733e14c6f2
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:5be3b251ccd17fae881d43591dd1bebe763780f0c7e3386332722ccb2648954d
```

How reproducible:
Try to run OpenShift as a single node on the libvirt provider. The bootstrap completes successfully, but the cluster does not provision successfully.


Steps to Reproduce:
1. Use the latest `openshift-baremetal-install` binary
2. Choose the `libvirt` provider (a sketch of the setup follows below)
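
For reference, a minimal sketch of what the reproduction looks like. The install-config.yaml below is illustrative only: the base domain and cluster name are inferred from the `apps-crc.testing` hostnames in the logs, while the libvirt URI, pull secret, and SSH key are placeholder assumptions, not values from the failing job.

```
$ cat install-config.yaml
apiVersion: v1
baseDomain: testing
metadata:
  name: crc
controlPlane:
  name: master
  replicas: 1          # single-node control plane
compute:
- name: worker
  replicas: 0          # no dedicated workers
platform:
  libvirt:
    URI: qemu+tcp://192.168.122.1/system   # placeholder libvirt connection URI
pullSecret: '<pull secret>'
sshKey: '<ssh public key>'

$ ./openshift-baremetal-install create cluster --dir . --log-level debug
```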

Actual results:
```
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples                                                                   
ERROR Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: Request to "https://oauth-openshift.apps-crc.testing/healthz" not successful yet                       
INFO Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to "https://oauth-openshift.apps-crc.testing/healthz" failed: dial tcp: i/o timeout
INFO OAuthRouteCheckEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps-crc.testing/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)                                         
ERROR Cluster operator console Degraded is True with RouteHealth_FailedLoadCA: RouteHealthDegraded: failed to read CA to check route health: configmaps "trusted-ca-bundle" not found                                                 
INFO Cluster operator console Available is Unknown with NoData:
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)                                                                                         
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights UploadDegraded is True with UploadFailed: Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: i/o timeout        
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html                                                                                                                 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, baremetal, console, image-registry, openshift-samples   
```
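
For anyone debugging the same symptom, a few read-only checks that can help narrow down why CanaryChecksSucceeding is False. The namespaces below are the standard ones used by the ingress operator and its canary; this is a suggested starting point, not the exact procedure used here.

```
$ oc get clusteroperators ingress authentication console
$ oc -n openshift-ingress-operator get ingresscontroller default -o yaml
$ oc -n openshift-ingress-canary get pods,route
$ oc -n openshift-ingress get pods -o wide
```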

Expected results:
The cluster should provision successfully.

Additional info:
This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1908389, where the canary checks were also failing, but in that case the cause was the load balancer, and no load balancer is available for libvirt.

Attached must-gather logs for more info.

Comment 1 Praveen Kumar 2021-03-19 11:57:50 UTC
Debugging this issue further, we found that it was happening because no DNS operator was scheduled on the cluster. That in turn happened because we were using the `single-node-developer` profile while https://github.com/openshift/cluster-dns-operator/pull/216 and https://github.com/operator-framework/operator-marketplace/pull/369 are still pending.
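
As a quick way to confirm this kind of root cause, one can check whether the DNS operator and its operands are scheduled at all (standard oc commands; the cluster-profile behaviour itself depends on the pending PRs linked above):

```
$ oc get clusteroperator dns
$ oc -n openshift-dns-operator get pods
$ oc -n openshift-dns get pods
```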

Closing this since it is not a bug in routing.