Description of problem:
OpenShift Container Platform 4.7.30 (as an example). The Ingress Controller (even the default one) is restricted and configured for router sharding (https://docs.openshift.com/container-platform/4.7/networking/ingress-operator.html#nw-ingress-sharding_configuring-ingress). Because of the restrictions applied, no wildcard domain is configured; instead, all required routes are simply created in DNS as described in https://docs.openshift.com/container-platform/4.9/installing/installing_aws/installing-aws-user-infra.html#installation-create-ingress-dns-records_installing-aws-user-infra. Interestingly, even though the canary route is not listed in the above documentation and is therefore not created in DNS, the Ingress Operator is _NOT_ reporting DEGRADED state.

OpenShift release version:
- OpenShift Container Platform 4.7.30

Cluster Platform:
- AWS
- Azure
- VMware

How reproducible:
- N/A

Steps to Reproduce (in detail):
1. N/A

Actual results:
16 out of 17 OpenShift Container Platform clusters are _NOT_ reporting DEGRADED state, even though the route name cannot be resolved and therefore cannot be reached. However, one OpenShift Container Platform cluster that is set up the same way is reporting DEGRADED state.

Expected results:
It would be nice to understand why the Ingress cluster Operator does not report DEGRADED state when the route is not available and accessible. Further, it is crucial to understand why it is actually working as expected on one cluster. Based on those findings, some changes may need to be applied to prevent unexpected DEGRADED states on many OpenShift Container Platform 4 clusters.

Impact of the problem:
None so far, but it is important to understand the overall behavior, as otherwise a massive number of OpenShift Container Platform 4 clusters might start reporting DEGRADED state all of a sudden.

Additional info:
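For illustration of the sharding restriction described above, the default Ingress Controller can be restricted with a route selector similar to the one later used during verification in this bug (the `type: default` label is only an example; any selector that does not match the canary route has the same effect):

  oc patch -n openshift-ingress-operator ingresscontroller/default --type=merge \
    --patch='{"spec":{"routeSelector":{"matchLabels":{"type":"default"}}}}'

With such a selector in place, the canary route in the openshift-ingress-canary namespace carries no matching label and is therefore not admitted by the default controller.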
Setting blocker- and low priority because this is more of a documentation issue and not causing immediate problems. We need to look into this to understand the problem and possibly improve the documentation when time allows.
The purpose of the canary controller is to verify end-to-end connectivity for the default ingresscontroller. To this end, the canary controller creates a test application, service, and canary route, and once the canary route is admitted by the default ingresscontroller, the canary controller periodically sends a request to the canary route and verifies that the controller gets a response.

In comment 0, you said that the default ingresscontroller has sharding enabled. In the must-gather archives referenced in comment 1, one cluster's default ingresscontroller is not reporting the CanaryChecksSucceeding status condition. For this cluster, I can see that the default ingresscontroller has sharding configured, the canary route's namespace is not matched by the default ingresscontroller's namespace selector, and so the canary route's status indicates that it has not been admitted by the default ingresscontroller, which means that the canary controller does not check connectivity. For the other cluster, the default ingresscontroller is reporting CanaryChecksSucceeding=False; for this cluster, I can see that the default ingresscontroller has sharding configured too, but the canary route has nevertheless been admitted, and so the canary controller is checking the route. The fact that the canary route was admitted by the default ingresscontroller in one cluster and not in the other explains the discrepancy in the reporting of the CanaryChecksSucceeding status condition.

Is it possible that in the first cluster, sharding was configured before the ingress operator started, and in the second cluster (the one that reports CanaryChecksSucceeding=False), sharding was configured after the operator started, giving the canary route the opportunity to be admitted? That would explain the discrepancy in the admission status of the canary routes.
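As a side note for anyone checking this on their own cluster, the admission status can be read directly from the canary route's status (sketch only; the jsonpath expression is illustrative):

  oc get route canary -n openshift-ingress-canary \
    -o jsonpath='{.status.ingress[*].routerName}'

If the output does not include "default", the canary route has not been admitted by the default ingresscontroller, and the canary controller has nothing to probe.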
(In reply to Miciah Dashiel Butler Masters from comment #4)
> The purpose of the canary controller is to verify end-to-end connectivity
> for the default ingresscontroller. To this end, the canary controller
> creates a test application, service, and canary route, and once the canary
> route is admitted by the default ingresscontroller, the canary controller
> periodically sends a request to the canary route and verifies that the
> controller gets a response.

OK, but that still indicates that we have a couple of problems here:

- We don't highlight the need for the canary route in the documentation, and therefore in restricted environments this could expose problems. We should try to solve this.

- The canary check/controller does not seem to be aware of router sharding when the default IngressController is sharded. We should fix this by taking sharding into consideration and applying the respective labels on the `route` (see the sketch below), as otherwise we might have false positive alerts.

> Is it possible that in the first cluster, sharding was configured before
> the ingress operator started, and in the second cluster (the one that reports
> CanaryChecksSucceeding=False), sharding was configured after the operator
> started, giving the canary route the opportunity to be admitted? That would
> explain the discrepancy in the admission status of the canary routes.

So this appears to be a timing issue and thus depends mostly on luck. Meaning that if timing is not on our side, we might have a good number of OpenShift Container Platform clusters reporting a degraded state for the Ingress cluster Operator. But this again would be a false positive, as things may actually still work; the real issue is that the canary controller is not properly honoring restricted environments with DNS limitations and router sharding in use.
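To make the labeling idea above concrete, one possible (illustrative only) way to let a sharded default IngressController admit the canary route would be to give the route a label matching the shard's routeSelector, e.g. assuming the `type: default` selector used elsewhere in this bug:

  oc label route canary -n openshift-ingress-canary type=default

Note that the ingress operator manages the canary route and may revert manual changes, so this is only a sketch of the idea being discussed, not a supported workaround.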
I've been reviewing this bug and would like to add:

1. The canary route takes 5 minutes to fail. I didn't wait long enough and was beginning to get confused. Just an FYI for those looking at this.

2. I verified that what Miciah said is true: if you shard your default ingress controller and cause the canary route to not be admitted, there will be no error messages IF sharding was configured before the ingress operator starts AND your canary route has never been admitted before (there is no previous status).
- If the ingress operator is already running and you shard, then you WILL get an error for the canary failing.
- If the ingress operator is restarted after you shard, but your canary route has been admitted previously, then you WILL get an error for the canary failing.
  - This is because we have a "stale" status issue with routes (see BZ1944851).
- If the ingress operator is restarted after you shard AND you clear the status of the canary route manually, then you will NOT get an error for the canary failing.
  - This is the situation in this bug: a fresh cluster with sharding.
  - "oc delete -n openshift-ingress-canary route canary" manually clears the status of the canary route.

Long story short, I am going to add a new status of "Unknown" to the canary controller status. "Unknown" will mean the canary route is not admitted and therefore the canary controller isn't operating correctly. This will help with the situation in this bug, in which the ingress operator was not showing as degraded even though the canary controller was not working (see the example below for one way to inspect the new condition).
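For anyone verifying this, the new condition can be inspected directly on the default ingresscontroller with a jsonpath query along these lines (illustrative only):

  oc get ingresscontroller default -n openshift-ingress-operator \
    -o jsonpath='{.status.conditions[?(@.type=="CanaryChecksSucceeding")]}'

When the canary route is not admitted, the condition should report status Unknown with reason CanaryRouteNotAdmitted, which in turn marks the ingress clusteroperator Degraded, as shown in the verification transcript below.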
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         136m    Cluster version is 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest

melvinjoseph@mjoseph-mac Downloads % oc get co ingress
NAME      VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      89m

melvinjoseph@mjoseph-mac Downloads % oc get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-76699c5f9c-2jrr8   1/1     Running   0          96m
router-default-76699c5f9c-xhpfn   1/1     Running   0          96m

melvinjoseph@mjoseph-mac Downloads % oc get co
NAME                                        VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      136m
baremetal                                   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
cloud-controller-manager                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      153m
cloud-credential                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      154m
cluster-autoscaler                          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
config-operator                             4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
console                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      140m
csi-snapshot-controller                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
dns                                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
etcd                                        4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      150m
image-registry                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
ingress                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
insights                                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
kube-apiserver                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      147m
kube-controller-manager                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
kube-scheduler                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
kube-storage-version-migrator               4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
machine-api                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      147m
machine-approver                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
machine-config                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      150m
marketplace                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
monitoring                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      140m
network                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
node-tuning                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
openshift-apiserver                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      143m
openshift-controller-manager                4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      148m
openshift-samples                           4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
operator-lifecycle-manager                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
operator-lifecycle-manager-catalog          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
operator-lifecycle-manager-packageserver    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
service-ca                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
storage                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m

melvinjoseph@mjoseph-mac Downloads % oc patch -n openshift-ingress-operator ingresscontroller/default --patch='{"spec":{"routeSelector":{"matchLabels":{"type": "default"}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads %
melvinjoseph@mjoseph-mac Downloads % oc delete -n openshift-ingress-canary route canary
route.route.openshift.io "canary" deleted

melvinjoseph@mjoseph-mac Downloads % oc get co
NAME                                        VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    False       False         False      21s     OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com/healthz" returned "503 Service Unavailable"
baremetal                                   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
cloud-controller-manager                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      154m
cloud-credential                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      154m
cluster-autoscaler                          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
config-operator                             4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      153m
console                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    False       False         False      6s      RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com returns '503 Service Unavailable'
csi-snapshot-controller                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
dns                                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
etcd                                        4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
image-registry                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
ingress                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      146m
insights                                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      146m
kube-apiserver                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      147m
kube-controller-manager                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
kube-scheduler                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
kube-storage-version-migrator               4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      153m
machine-api                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      148m
machine-approver                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
machine-config                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
marketplace                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
monitoring                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      141m
network                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
node-tuning                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
openshift-apiserver                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      143m
openshift-controller-manager                4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
openshift-samples                           4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      145m
operator-lifecycle-manager                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
operator-lifecycle-manager-catalog          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
operator-lifecycle-manager-packageserver    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      146m
service-ca                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      153m
storage                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m

melvinjoseph@mjoseph-mac Downloads % oc get co ingress
NAME      VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      146m

melvinjoseph@mjoseph-mac Downloads % oc get route -n openshift-ingress-canary
NAME     HOST/PORT                                                                                      PATH   SERVICES         PORT   TERMINATION     WILDCARD
canary   canary-openshift-ingress-canary.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com            ingress-canary   8080   edge/Redirect   None

melvinjoseph@mjoseph-mac Downloads % oc get all -n openshift-ingress-canary
NAME                       READY   STATUS    RESTARTS   AGE
pod/ingress-canary-9qng2   1/1     Running   0          148m
pod/ingress-canary-dddcc   1/1     Running   0          148m
pod/ingress-canary-zg786   1/1     Running   0          145m

NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/ingress-canary   ClusterIP   172.30.187.227   <none>        8080/TCP,8888/TCP   148m

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/ingress-canary   3         3         3       3            3           kubernetes.io/os=linux   148m

NAME                              HOST/PORT                                                                                      PATH   SERVICES         PORT   TERMINATION     WILDCARD
route.route.openshift.io/canary   canary-openshift-ingress-canary.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com            ingress-canary   8080   edge/Redirect   None

melvinjoseph@mjoseph-mac Downloads %
melvinjoseph@mjoseph-mac Downloads %

After 5 minutes.....
melvinjoseph@mjoseph-mac Downloads % oc get co
NAME                                        VERSION                                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    False       False         True       6m25s   OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com/healthz" returned "503 Service Unavailable"
baremetal                                   4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
cloud-controller-manager                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      160m
cloud-credential                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      160m
cluster-autoscaler                          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
config-operator                             4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      159m
console                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    False       False         False      6m10s   RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.ci-ln-mz2t1p2-76ef8.origin-ci-int-aws.dev.rhcloud.com returns '503 Service Unavailable'
csi-snapshot-controller                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
dns                                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
etcd                                        4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      157m
image-registry                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
ingress                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         True       152m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
insights                                    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
kube-apiserver                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      154m
kube-controller-manager                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      155m
kube-scheduler                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      155m
kube-storage-version-migrator               4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      159m
machine-api                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      154m
machine-approver                            4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
machine-config                              4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      157m
marketplace                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
monitoring                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      147m
network                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      159m
node-tuning                                 4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
openshift-apiserver                         4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      149m
openshift-controller-manager                4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      155m
openshift-samples                           4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      151m
operator-lifecycle-manager                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
operator-lifecycle-manager-catalog          4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
operator-lifecycle-manager-packageserver    4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      152m
service-ca                                  4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      159m
storage                                     4.11.0-0.ci.test-2022-03-17-143404-ci-ln-mz2t1p2-latest    True        False         False      158m
melvinjoseph@mjoseph-mac Downloads %

Hence verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069