Description of problem:

When applying router sharding for the default IngressController, following https://docs.openshift.com/container-platform/latest/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller, the Ingress Operator fails to pick up that change and apply the required label to the canary route in the openshift-ingress-canary namespace.

> $ oc get ingresscontroller default -n openshift-ingress-operator -o json | jq '.spec.routeSelector'
> {
>   "matchLabels": {
>     "type": "sharded"
>   }
> }

Immediately after the change is applied, the Ingress Operator reports no problem: once the route has been admitted, the operator does not reconcile subsequent changes from the IngressController. But when the Ingress Operator is restarted, it reports a Degraded state, as the canary route can no longer be admitted.

> 2021-11-19T14:33:13.880Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "expected canary request body to contain \"Healthcheck requested\""}
> 2021-11-19T14:33:14.127Z ERROR operator.ingress_controller controller/controller.go:298 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
> 2021-11-19T14:33:14.127Z INFO operator.ingress_controller controller/controller.go:298 reconciling {"request": "openshift-ingress-operator/default"}
> 2021-11-19T14:33:14.188Z ERROR operator.ingress_controller controller/controller.go:298 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}

$ oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[...]
image-registry   4.9.5     True        False         False      2d4h
ingress          4.9.5     True        False         True       2d4h    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

> $ oc get co ingress -o json
> {
>   "apiVersion": "config.openshift.io/v1",
>   "kind": "ClusterOperator",
>   "metadata": {
>     "annotations": {
>       "include.release.openshift.io/ibm-cloud-managed": "true",
>       "include.release.openshift.io/self-managed-high-availability": "true",
>       "include.release.openshift.io/single-node-developer": "true"
>     },
>     "creationTimestamp": "2021-11-17T09:36:56Z",
>     "generation": 1,
>     "name": "ingress",
>     "ownerReferences": [
>       {
>         "apiVersion": "config.openshift.io/v1",
>         "kind": "ClusterVersion",
>         "name": "version",
>         "uid": "98163de5-938b-42b8-95b4-524586891a99"
>       }
>     ],
>     "resourceVersion": "978433",
>     "uid": "e86b4bde-b264-4d4b-8b54-3739cb6c83f9"
>   },
>   "spec": {},
>   "status": {
>     "conditions": [
>       {
>         "lastTransitionTime": "2021-11-17T09:48:08Z",
>         "message": "The \"default\" ingress controller reports Available=True.",
>         "reason": "IngressAvailable",
>         "status": "True",
>         "type": "Available"
>       },
>       {
>         "lastTransitionTime": "2021-11-17T09:48:08Z",
>         "message": "desired and current number of IngressControllers are equal",
>         "reason": "AsExpected",
>         "status": "False",
>         "type": "Progressing"
>       },
>       {
>         "lastTransitionTime": "2021-11-19T14:32:57Z",
>         "message": "The \"default\" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)",
>         "reason": "IngressDegraded",
>         "status": "True",
>         "type": "Degraded"
>       }
>     ],
>     "extension": null,
>     "relatedObjects": [
>       {
>         "group": "",
>         "name": "openshift-ingress-operator",
>         "resource": "namespaces"
>       },
>       {
>         "group": "operator.openshift.io",
>         "name": "",
>         "namespace": "openshift-ingress-operator",
>         "resource": "IngressController"
>       },
>       {
>         "group": "ingress.operator.openshift.io",
>         "name": "",
>         "namespace": "openshift-ingress-operator",
>         "resource": "DNSRecord"
>       },
>       {
>         "group": "",
>         "name": "openshift-ingress",
>         "resource": "namespaces"
>       },
>       {
>         "group": "",
>         "name": "openshift-ingress-canary",
>         "resource": "namespaces"
>       }
>     ],
>     "versions": [
>       {
>         "name": "operator",
>         "version": "4.9.5"
>       },
>       {
>         "name": "ingress-controller",
>         "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cebaea9de8e30add43caddb7158a5da9ac93bdcda2e17352929cbdcdc7b7b07b"
>       },
>       {
>         "name": "canary-server",
>         "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:22cab10687a9da592ab27fb20efbe72d288d87a6a974afed14da324fbb2b4bbd"
>       }
>     ]
>   }
> }

As router sharding (https://docs.openshift.com/container-platform/latest/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller) is core functionality of OpenShift Container Platform, and of the Ingress Operator in particular, it is expected that the canary route is automatically adjusted with the necessary configuration whenever the IngressController is changed (for example, to configure router sharding). Failing to do so raises many false-positive alerts, since everything would actually work if the route or namespace had the expected label in place for the canary controller.

OpenShift release version:
- OpenShift Container Platform 4.9.5

Cluster Platform:
- AWS, Azure, VMware, pretty much all of them

How reproducible:
- Always

Steps to Reproduce (in detail):
1. Configure router sharding for the default IngressController, following https://docs.openshift.com/container-platform/4.9/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller
2. Restart the Ingress Operator pod in the openshift-ingress-operator namespace

Actual results:
No problem is reported as long as the Ingress Operator is not restarted. But once it is restarted, it reports a Degraded state with the message "The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"

Expected results:
The Ingress Operator should pick up changes to the default IngressController and act on them, so that the canary route is updated when sharding is configured. This ensures the canary check keeps working when all conditions are met, instead of creating a false-positive alert.

Impact of the problem:
False-positive alert, and the Ingress Operator in a Degraded state for a reason that will likely be unknown to the cluster administrator.

Additional info:
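As an interim mitigation until the operator reconciles the canary route itself, the canary route can be labeled manually so that the sharded default controller admits it again. This is a hedged sketch, not an official fix; the label key/value (`type=sharded`) mirror the routeSelector used in this report, and the route name `canary` in namespace `openshift-ingress-canary` is taken from the operator's canary setup described above:

```shell
# Workaround sketch (assumption: routeSelector uses matchLabels type=sharded,
# as in this report). Give the canary route the label the sharded default
# IngressController selects on, so the route can be admitted again:
oc label route canary -n openshift-ingress-canary type=sharded

# Confirm the label is present on the canary route:
oc get route canary -n openshift-ingress-canary --show-labels
```

Note that the Ingress Operator manages the canary route, so a manually added label may be removed again the next time the operator reconciles it.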
Setting blocker- because this is not a regression but we do need to figure out whether/how we can support this configuration. This issue looks related to bug 2021446, so I will investigate both BZs.
https://github.com/openshift/cluster-ingress-operator/pull/723 merged on March 30, but automation didn't update the BZ status. The fix should be in nightlies since April, so I'm moving the BZ to ON_QA.
Failed to verify it with 4.11.0-0.nightly-2022-06-23-092832

Flexy id: 114986 (I will keep this cluster tonight)
kubeconfig: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/114986/artifact/workdir/install-dir/auth/kubeconfig

A:
1. % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-23-092832   True        False         3h59m   Cluster version is 4.11.0-0.nightly-2022-06-23-092832

2. Edit ingresscontroller default with namespaceSelector and nodePlacement:
% oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited

3. % oc -n openshift-ingress-operator get ingresscontroller default -oyaml | grep -A18 spec:
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  namespaceSelector:
    matchLabels:
      type: sharded
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  replicas: 2
  tuningOptions: {}
  unsupportedConfigOverrides: null

4. Delete the ingress-operator pod:
% oc -n openshift-ingress-operator get pods
NAME                                READY   STATUS    RESTARTS      AGE
ingress-operator-5b4c9d69df-cpdgb   2/2     Running   2 (63m ago)   70m
% oc -n openshift-ingress-operator delete pod ingress-operator-5b4c9d69df-cpdgb
pod "ingress-operator-5b4c9d69df-cpdgb" deleted

5. % oc -n openshift-ingress-operator get pods
NAME                                READY   STATUS    RESTARTS   AGE
ingress-operator-5b4c9d69df-lw78x   2/2     Running   0          17s
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-6485c44f88-gct4m   1/1     Running   0          2m45s
router-default-6485c44f88-w7lnt   1/1     Running   0          2m45s

6. The ingress ClusterOperator was degraded:
% oc get co
NAME                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication             4.11.0-0.nightly-2022-06-23-153912   False       False         True       28m     OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm01.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
baremetal                  4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m
cloud-controller-manager   4.11.0-0.nightly-2022-06-23-153912   True        False         False      97m
cloud-credential           4.11.0-0.nightly-2022-06-23-153912   True        False         False      97m
cluster-autoscaler         4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m
config-operator            4.11.0-0.nightly-2022-06-23-153912   True        False         False      96m
console                    4.11.0-0.nightly-2022-06-23-153912   False       False         False      28m     RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.shudi-411awspm01.qe.devcluster.openshift.com returns '503 Service Unavailable'
csi-snapshot-controller    4.11.0-0.nightly-2022-06-23-153912   True        False         False      96m
dns                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m
etcd                       4.11.0-0.nightly-2022-06-23-153912   True        False         False      94m
image-registry             4.11.0-0.nightly-2022-06-23-153912   True        False         False      83m
ingress                    4.11.0-0.nightly-2022-06-23-153912   True        False         True       87m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

B: Apply router-internal2.yaml with oc apply, then delete the ingress-operator pod; the ingress ClusterOperator becomes degraded, too.
1.
% oc apply -f router-internal2.yaml
E0624 19:59:34.031859 10669 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
Warning: resource ingresscontrollers/default is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
ingresscontroller.operator.openshift.io/default configured

% cat router-internal2.yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: default
    namespace: openshift-ingress-operator
  spec:
    domain: apps.shudi-411awspm01.qe.devcluster.openshift.com
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

2.
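The warning above appears because the default IngressController was created by the operator, not declaratively with oc apply, so it lacks the last-applied-configuration annotation. A possible alternative for this kind of change (a sketch, not part of the original test) is a merge patch that sets only the routeSelector and avoids that warning entirely:

```shell
# Hypothetical alternative to 'oc apply' for setting the routeSelector:
# patch just the field being changed, which sidesteps the missing
# last-applied-configuration annotation warning.
oc -n openshift-ingress-operator patch ingresscontroller default \
  --type=merge \
  -p '{"spec":{"routeSelector":{"matchLabels":{"type":"sharded"}}}}'
```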
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-23-153912   True        False         105m    Cluster version is 4.11.0-0.nightly-2022-06-23-153912

% oc get co
NAME                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE    MESSAGE
authentication             4.11.0-0.nightly-2022-06-23-153912   False       False         True       7m36s    OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm01.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
baremetal                  4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m
cloud-controller-manager   4.11.0-0.nightly-2022-06-23-153912   True        False         False      126m
cloud-credential           4.11.0-0.nightly-2022-06-23-153912   True        False         False      126m
cluster-autoscaler         4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m
config-operator            4.11.0-0.nightly-2022-06-23-153912   True        False         False      125m
console                    4.11.0-0.nightly-2022-06-23-153912   False       False         False      7m38s    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.shudi-411awspm01.qe.devcluster.openshift.com returns '503 Service Unavailable'
csi-snapshot-controller    4.11.0-0.nightly-2022-06-23-153912   True        False         False      125m
dns                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m
etcd                       4.11.0-0.nightly-2022-06-23-153912   True        False         False      123m
image-registry             4.11.0-0.nightly-2022-06-23-153912   True        False         False      112m
ingress                    4.11.0-0.nightly-2022-06-23-153912   True        False         True       15m      The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
Use a fresh cluster and do the test again with the oc apply command:

1. % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         23m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

% oc get route -o json -n openshift-authentication oauth-openshift | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-27T09:37:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.shudi-411awspm08.qe.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411awspm08.qe.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}

2. % oc get dns.config/cluster -oyaml | grep -i domain
  baseDomain: shudi-411awspm08.qe.devcluster.openshift.com

3. oc apply -f router-internal2.yaml
% cat router-internal2.yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: default
    namespace: openshift-ingress-operator
  spec:
    domain: apps.shudi-411awspm08.qe.devcluster.openshift.com
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
% oc apply -f router-internal2.yaml

4. More than 10 minutes passed:
% oc get route -o json -n openshift-authentication oauth-openshift | jq '.status'
{}

5. % oc get co | grep ingress
ingress   4.11.0-0.nightly-2022-06-25-081133   True   False   True   52m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)

6.
% oc get co | egrep "authentication|ingress|monitoring"
authentication   4.11.0-0.nightly-2022-06-25-081133   False   False   True   30m   OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm08.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
ingress          4.11.0-0.nightly-2022-06-25-081133   True    False   True   64m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
monitoring       4.11.0-0.nightly-2022-06-25-081133   False   True    True   15m   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

7. % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         58m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
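The collateral damage above follows from the same selector semantics: once the default IngressController carries a routeSelector of type=sharded, any route without that label (oauth-openshift, the console route, the canary route) is no longer admitted by it, which is why the oauth-openshift route's status dropped to {} and the authentication, console, and monitoring operators degraded. As an illustration only (hedged; such a label may be reverted by the operator that owns the route), labeling an affected route makes the sharded controller pick it up again:

```shell
# Illustration of the selector semantics (not a recommended fix): a route
# only matches the sharded default controller if it carries the selected
# label, e.g. type=sharded as used in this report.
oc label route oauth-openshift -n openshift-authentication type=sharded

# Inspect the route's admission status afterwards:
oc get route oauth-openshift -n openshift-authentication -o json | jq '.status'
```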
Failed to verify it with 4.11.0-0.nightly-2022-09-14-233224

1. % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-14-233224   True        False         47m     Cluster version is 4.11.0-0.nightly-2022-09-14-233224

2. % oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited

3. % oc -n openshift-ingress-operator get ingresscontroller default -oyaml | grep -A18 spec:
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  namespaceSelector:
    matchLabels:
      type: sharded
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  replicas: 2
  tuningOptions: {}
  unsupportedConfigOverrides: null

4. Wait until the two new router pods are created, then show the ingress-operator pod:
% oc -n openshift-ingress-operator get pods
NAME                                READY   STATUS    RESTARTS      AGE
ingress-operator-5b6d5b7fbc-zgj5b   2/2     Running   2 (62m ago)   68m

5. % oc -n openshift-ingress-operator delete pod ingress-operator-5b6d5b7fbc-zgj5b
pod "ingress-operator-5b6d5b7fbc-zgj5b" deleted
% oc -n openshift-ingress-operator get pods
NAME                                READY   STATUS    RESTARTS   AGE
ingress-operator-5b6d5b7fbc-gglvl   2/2     Running   0          15s

6. % oc get co | grep ingress
ingress   4.11.0-0.nightly-2022-09-14-233224   True   False   True   71m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)

7. % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-14-233224   True        False         61m     Error while reconciling 4.11.0-0.nightly-2022-09-14-233224: the cluster operator ingress is degraded
Facing it in a 4.11.0 SNO installation (OVN) without sharding; it prevents the installation from finishing.

ingress   4.11.0   True   False   True   22m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.11.5 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6536