Bug 2024946 - Ingress Canary does not respect router sharding on default IngressController
Summary: Ingress Canary does not respect router sharding on default IngressController
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.z
Assignee: Grant Spence
QA Contact: Shudi Li
Docs Contact: Jesse Dohmann
URL:
Whiteboard:
Depends On: 2108214
Blocks:
 
Reported: 2021-11-19 14:55 UTC by Simon Reber
Modified: 2023-04-11 10:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The documentation was unclear on the implications of sharding the default ingress controller. Additionally, the cluster operator didn't report an issue if the default ingress controller was not selecting the canary route.
Consequence: Users who shard their default ingress controller can subsequently break the canary route (or others).
Fix: Update the documentation to be clearer on the implications of sharding the default ingress controller, and add a cluster operator error if a user causes the canary route to not be selected by the default ingress controller.
Result: Users avoid sharding the default ingress controller in such a way that breaks their cluster.
Clone Of:
Environment:
Last Closed: 2022-09-20 16:34:44 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift api pull 1146 | 0 | None | Merged | Bug 1944851: Add selectors to ingress controller status for determining state | 2022-06-22 20:39:21 UTC
Github | openshift cluster-ingress-operator pull 723 | 0 | None | Merged | Bug 2021446: Set canary status as unknown if not admitted to default ingress controller | 2022-06-22 20:39:27 UTC
Github | openshift openshift-docs pull 41403 | 0 | None | open | OSDOCS-2135: Update routing and ingress sharding documentation | 2022-06-23 15:32:12 UTC
Red Hat Product Errata | RHSA-2022:6536 | 0 | None | None | None | 2022-09-20 16:35:00 UTC

Description Simon Reber 2021-11-19 14:55:10 UTC
Description of problem:

When router sharding is applied to the default IngressController following https://docs.openshift.com/container-platform/latest/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller, the Ingress Operator fails to pick up that change and does not apply the required label to the canary route in the openshift-ingress-canary namespace.

> $ oc get ingresscontroller default -n openshift-ingress-operator -o json | jq '.spec.routeSelector'
> {
>   "matchLabels": {
>     "type": "sharded"
>   }
> }

Once the change is applied, the Ingress Operator initially reports no problem: once the route has been admitted, the operator does not seem to reconcile later changes to the IngressController.

But when the Ingress Operator is restarted, it reports a Degraded state because the canary route can no longer be admitted.

> 2021-11-19T14:33:13.880Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "expected canary request body to contain \"Healthcheck requested\""}
> 2021-11-19T14:33:14.127Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
> 2021-11-19T14:33:14.127Z	INFO	operator.ingress_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/default"}
> 2021-11-19T14:33:14.188Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
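
For anyone filtering these operator logs, the following is a minimal Python sketch that splits one line into its fields. It assumes the lines are tab-separated with a trailing JSON object, as in the excerpt above; the function name `parse_operator_log` and the shortened sample line are illustrative and not part of any OpenShift tooling.

```python
import json

def parse_operator_log(line):
    """Split one operator log line into named fields.

    Assumes six tab-separated fields: timestamp, level, logger,
    caller, message, and a JSON object with extra context.
    """
    if line.startswith("> "):          # drop the quoting prefix used in this report
        line = line[2:]
    fields = line.rstrip("\n").split("\t", 5)
    if len(fields) != 6:
        raise ValueError("unexpected log line layout")
    timestamp, level, logger, caller, message, context = fields
    return {
        "timestamp": timestamp,
        "level": level,
        "logger": logger,
        "caller": caller,
        "message": message,
        "context": json.loads(context),
    }

# Sample modeled on the excerpt above; the error text is shortened for brevity.
SAMPLE_LINE = (
    "> 2021-11-19T14:33:14.127Z\tERROR\toperator.ingress_controller\t"
    "controller/controller.go:298\tgot retryable error; requeueing\t"
    '{"after": "1m0s", "error": "IngressController is degraded"}'
)

if __name__ == "__main__":
    entry = parse_operator_log(SAMPLE_LINE)
    print(entry["level"], entry["logger"], entry["context"]["error"])
```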

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
[...]   
image-registry                             4.9.5     True        False         False      2d4h    
ingress                                    4.9.5     True        False         True       2d4h    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

> $ oc get co ingress -o json
> {
>     "apiVersion": "config.openshift.io/v1",
>     "kind": "ClusterOperator",
>     "metadata": {
>         "annotations": {
>             "include.release.openshift.io/ibm-cloud-managed": "true",
>             "include.release.openshift.io/self-managed-high-availability": "true",
>             "include.release.openshift.io/single-node-developer": "true"
>         },
>         "creationTimestamp": "2021-11-17T09:36:56Z",
>         "generation": 1,
>         "name": "ingress",
>         "ownerReferences": [
>             {
>                 "apiVersion": "config.openshift.io/v1",
>                 "kind": "ClusterVersion",
>                 "name": "version",
>                 "uid": "98163de5-938b-42b8-95b4-524586891a99"
>             }
>         ],
>         "resourceVersion": "978433",
>         "uid": "e86b4bde-b264-4d4b-8b54-3739cb6c83f9"
>     },
>     "spec": {},
>     "status": {
>         "conditions": [
>             {
>                 "lastTransitionTime": "2021-11-17T09:48:08Z",
>                 "message": "The \"default\" ingress controller reports Available=True.",
>                 "reason": "IngressAvailable",
>                 "status": "True",
>                 "type": "Available"
>             },
>             {
>                 "lastTransitionTime": "2021-11-17T09:48:08Z",
>                 "message": "desired and current number of IngressControllers are equal",
>                 "reason": "AsExpected",
>                 "status": "False",
>                 "type": "Progressing"
>             },
>             {
>                 "lastTransitionTime": "2021-11-19T14:32:57Z",
>                 "message": "The \"default\" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)",
>                 "reason": "IngressDegraded",
>                 "status": "True",
>                 "type": "Degraded"
>             }
>         ],
>         "extension": null,
>         "relatedObjects": [
>             {
>                 "group": "",
>                 "name": "openshift-ingress-operator",
>                 "resource": "namespaces"
>             },
>             {
>                 "group": "operator.openshift.io",
>                 "name": "",
>                 "namespace": "openshift-ingress-operator",
>                 "resource": "IngressController"
>             },
>             {
>                 "group": "ingress.operator.openshift.io",
>                 "name": "",
>                 "namespace": "openshift-ingress-operator",
>                 "resource": "DNSRecord"
>             },
>             {
>                 "group": "",
>                 "name": "openshift-ingress",
>                 "resource": "namespaces"
>             },
>             {
>                 "group": "",
>                 "name": "openshift-ingress-canary",
>                 "resource": "namespaces"
>             }
>         ],
>         "versions": [
>             {
>                 "name": "operator",
>                 "version": "4.9.5"
>             },
>             {
>                 "name": "ingress-controller",
>                 "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cebaea9de8e30add43caddb7158a5da9ac93bdcda2e17352929cbdcdc7b7b07b"
>             },
>             {
>                 "name": "canary-server",
>                 "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:22cab10687a9da592ab27fb20efbe72d288d87a6a974afed14da324fbb2b4bbd"
>             }
>         ]
>     }
> }
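
The Degraded condition in the JSON above can also be extracted programmatically. The following is a minimal Python sketch, assuming the output of `oc get co ingress -o json` is available as a string; the helper name `degraded_message` and the trimmed sample document are illustrative only.

```python
import json

def degraded_message(cluster_operator_json):
    """Return the Degraded condition's message when its status is "True", else None."""
    co = json.loads(cluster_operator_json)
    for cond in co.get("status", {}).get("conditions", []):
        if cond.get("type") == "Degraded" and cond.get("status") == "True":
            return cond.get("message", "")
    return None

# Trimmed sample modeled on the output above (illustrative, not the full object).
SAMPLE = json.dumps({
    "kind": "ClusterOperator",
    "metadata": {"name": "ingress"},
    "status": {"conditions": [
        {"type": "Available", "status": "True", "reason": "IngressAvailable"},
        {"type": "Progressing", "status": "False", "reason": "AsExpected"},
        {"type": "Degraded", "status": "True", "reason": "IngressDegraded",
         "message": "CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures)"},
    ]},
})

if __name__ == "__main__":
    print(degraded_message(SAMPLE))
```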

As router sharding (https://docs.openshift.com/container-platform/latest/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller) is a core functionality of OpenShift Container Platform, and of the Ingress Operator in particular, it's expected that the Canary `route` is automatically adjusted with the necessary configuration if the IngressController is changed (to configure router sharding, for example).

Failing to do this will raise lots of false-positive alerts, as things would actually work if the route or namespace had the expected label in place for the Canary controller.
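
To illustrate why the canary route stops being selected, here is a minimal Python sketch of `matchLabels` semantics: a selector matches only when every key/value pair in it is present on the object. The `route_selector` value is the one from this report; the canary route labels in the example are hypothetical placeholders, and `matchExpressions` is deliberately not handled.

```python
def matches(match_labels, labels):
    """True when every key/value pair of the selector is present in labels.

    An empty selector matches every object.
    """
    return all(labels.get(key) == value for key, value in match_labels.items())

def missing_labels(match_labels, labels):
    """Labels that would have to be added for the selector to match."""
    return {k: v for k, v in match_labels.items() if labels.get(k) != v}

# routeSelector.matchLabels as configured in this report.
route_selector = {"type": "sharded"}

# Hypothetical labels on the canary route; the real labels may differ.
canary_route_labels = {"example.com/owner": "canary"}

if __name__ == "__main__":
    print(matches(route_selector, canary_route_labels))         # selector does not match
    print(missing_labels(route_selector, canary_route_labels))  # what is missing
    relabeled = {**canary_route_labels, **route_selector}
    print(matches(route_selector, relabeled))                   # matches once labeled
```

The same check applies to `namespaceSelector`, with the namespace's labels in place of the route's.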

OpenShift release version:

 - OpenShift Container Platform 4.9.5

Cluster Platform:

 - AWS, Azure, VMware, pretty much all of them

How reproducible:

 - Always

Steps to Reproduce (in detail):
1. Configure router sharding, following https://docs.openshift.com/container-platform/4.9/networking/configuring_ingress_cluster_traffic/configuring-ingress-cluster-traffic-ingress-controller.html#nw-ingress-sharding-route-labels_configuring-ingress-cluster-traffic-ingress-controller for the default IngressController
2. Restart the Ingress Operator pod in openshift-ingress-operator namespace


Actual results:

No problem is reported if the Ingress Operator is not restarted. But once it's restarted, it reports a Degraded state with the message "The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)".

Expected results:

The Ingress Operator should pick up changes to the default IngressController and act on them, so that the Canary route is updated if sharding is configured. This ensures that the Canary check always works if all conditions are met and does not create a false-positive alert.

Impact of the problem:

False-positive alert, and the Ingress Operator in a degraded state for a likely unknown reason

Additional info:

Comment 1 Miciah Dashiel Butler Masters 2021-11-23 17:14:05 UTC
Setting blocker- because this is not a regression but we do need to figure out whether/how we can support this configuration.  

This issue looks related to bug 2021446, so I will investigate both BZs.

Comment 14 Miciah Dashiel Butler Masters 2022-06-23 16:12:51 UTC
https://github.com/openshift/cluster-ingress-operator/pull/723 merged on March 30, but automation didn't update the BZ status.  The fix should be in nightlies since April, so I'm moving the BZ to ON_QA.

Comment 15 Shudi Li 2022-06-24 12:14:23 UTC
Failed to verify it with 4.11.0-0.nightly-2022-06-23-092832
Flexy id: 114986 (I will keep this cluster tonight)
kubeconfig: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/114986/artifact/workdir/install-dir/auth/kubeconfig

A:
1.
% oc get clusterversion                 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-23-092832   True        False         3h59m   Cluster version is 4.11.0-0.nightly-2022-06-23-092832
%

2. edit ingresscontroller default with namespaceSelector and nodePlacement
% oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited
%

3.
% oc -n openshift-ingress-operator get ingresscontroller default -oyaml | grep -A18 spec:
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  namespaceSelector:
    matchLabels:
      type: sharded
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  replicas: 2
  tuningOptions: {}
  unsupportedConfigOverrides: null
%

4. delete the ingress-operator pod
% oc -n openshift-ingress-operator get pods
NAME                                READY   STATUS    RESTARTS      AGE
ingress-operator-5b4c9d69df-cpdgb   2/2     Running   2 (63m ago)   70m
% oc -n openshift-ingress-operator delete pod ingress-operator-5b4c9d69df-cpdgb
pod "ingress-operator-5b4c9d69df-cpdgb" deleted
%

5.
% oc -n openshift-ingress-operator get pods                                    
NAME                                READY   STATUS    RESTARTS   AGE
ingress-operator-5b4c9d69df-lw78x   2/2     Running   0          17s
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-6485c44f88-gct4m   1/1     Running   0          2m45s
router-default-6485c44f88-w7lnt   1/1     Running   0          2m45s
%

6. ingress co was degraded
% oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.nightly-2022-06-23-153912   False       False         True       28m     OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm01.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
baremetal                                  4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m     
cloud-controller-manager                   4.11.0-0.nightly-2022-06-23-153912   True        False         False      97m     
cloud-credential                           4.11.0-0.nightly-2022-06-23-153912   True        False         False      97m     
cluster-autoscaler                         4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m     
config-operator                            4.11.0-0.nightly-2022-06-23-153912   True        False         False      96m     
console                                    4.11.0-0.nightly-2022-06-23-153912   False       False         False      28m     RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.shudi-411awspm01.qe.devcluster.openshift.com returns '503 Service Unavailable'
csi-snapshot-controller                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      96m     
dns                                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      95m     
etcd                                       4.11.0-0.nightly-2022-06-23-153912   True        False         False      94m     
image-registry                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      83m     
ingress                                    4.11.0-0.nightly-2022-06-23-153912   True        False         True       87m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)


B: Apply router-internal2.yaml with oc apply, then delete the ingress-operator pod; the ingress co is degraded, too.
1.
% oc apply -f router-internal2.yaml  
E0624 19:59:34.031859   10669 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
Warning: resource ingresscontrollers/default is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
ingresscontroller.operator.openshift.io/default configured
% 
% cat router-internal2.yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: default
    namespace: openshift-ingress-operator
  spec:
    domain: apps.shudi-411awspm01.qe.devcluster.openshift.com
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
%

2.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-23-153912   True        False         105m    Cluster version is 4.11.0-0.nightly-2022-06-23-153912
% oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.nightly-2022-06-23-153912   False       False         True       7m36s   OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm01.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
baremetal                                  4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m    
cloud-controller-manager                   4.11.0-0.nightly-2022-06-23-153912   True        False         False      126m    
cloud-credential                           4.11.0-0.nightly-2022-06-23-153912   True        False         False      126m    
cluster-autoscaler                         4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m    
config-operator                            4.11.0-0.nightly-2022-06-23-153912   True        False         False      125m    
console                                    4.11.0-0.nightly-2022-06-23-153912   False       False         False      7m38s   RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.shudi-411awspm01.qe.devcluster.openshift.com returns '503 Service Unavailable'
csi-snapshot-controller                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      125m    
dns                                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      124m    
etcd                                       4.11.0-0.nightly-2022-06-23-153912   True        False         False      123m    
image-registry                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      112m    
ingress                                    4.11.0-0.nightly-2022-06-23-153912   True        False         True       15m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

Comment 19 Shudi Li 2022-06-27 10:44:52 UTC
Used a fresh cluster and repeated the test with the oc apply command.

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         23m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
% oc get route -o json -n openshift-authentication   oauth-openshift | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-27T09:37:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.shudi-411awspm08.qe.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411awspm08.qe.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}
%

2.
% oc get dns.config/cluster -oyaml | grep -i domain
  baseDomain: shudi-411awspm08.qe.devcluster.openshift.com
% 

3. oc apply -f router-internal2.yaml
 % cat router-internal2.yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: default
    namespace: openshift-ingress-operator
  spec:
    domain: apps.shudi-411awspm08.qe.devcluster.openshift.com
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
    routeSelector:
      matchLabels:
        type: sharded
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
% oc apply -f router-internal2.yaml

4. After more than 10 minutes, the route status is still empty:
%  oc get route -o json -n openshift-authentication   oauth-openshift | jq '.status'
{}
% 

5.
% oc get co | grep ingress
ingress                                    4.11.0-0.nightly-2022-06-25-081133   True        False         True       52m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
% 

6.
% oc get co | egrep "authentication|ingress|monitoring"
authentication                             4.11.0-0.nightly-2022-06-25-081133   False       False         True       30m     OAuthServerRouteEndpointAccessibleControllerAvailable: "https://oauth-openshift.apps.shudi-411awspm08.qe.devcluster.openshift.com/healthz" returned "503 Service Unavailable"
ingress                                    4.11.0-0.nightly-2022-06-25-081133   True        False         True       64m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
monitoring                                 4.11.0-0.nightly-2022-06-25-081133   False       True          True       15m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
%

7.
% oc get clusterversion                                
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         58m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
%
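
The two `.status` outputs in steps 1 and 4 above differ in whether the route carries an `Admitted` condition from the `default` router. A minimal Python sketch of that check follows, assuming the `jq '.status'` output has been loaded into a dict; the function name `admitted_by` is illustrative, and the samples are trimmed copies of the outputs above.

```python
def admitted_by(route_status, router_name="default"):
    """True if the route status lists an Admitted=True condition for router_name."""
    for ingress in route_status.get("ingress", []):
        if ingress.get("routerName") != router_name:
            continue
        for cond in ingress.get("conditions", []):
            if cond.get("type") == "Admitted" and cond.get("status") == "True":
                return True
    return False

# Trimmed from step 1: status before the routeSelector was applied.
STATUS_BEFORE = {
    "ingress": [
        {
            "conditions": [{"status": "True", "type": "Admitted"}],
            "routerName": "default",
            "wildcardPolicy": "None",
        }
    ]
}

# From step 4: status after the routeSelector was applied.
STATUS_AFTER = {}

if __name__ == "__main__":
    print(admitted_by(STATUS_BEFORE))  # admitted by the default router
    print(admitted_by(STATUS_AFTER))   # no longer admitted
```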

Comment 38 Shudi Li 2022-09-15 15:49:59 UTC
Failed to verify it with 4.11.0-0.nightly-2022-09-14-233224
1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-14-233224   True        False         47m     Cluster version is 4.11.0-0.nightly-2022-09-14-233224
%

2.
% oc -n openshift-ingress-operator edit ingresscontroller default
ingresscontroller.operator.openshift.io/default edited
%

3.
% oc -n openshift-ingress-operator get ingresscontroller default -oyaml | grep -A18 spec:
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  namespaceSelector:
    matchLabels:
      type: sharded
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
  replicas: 2
  tuningOptions: {}
  unsupportedConfigOverrides: null
% 

4. Wait until two new router pods are created, then show the ingress-operator pod
% oc -n openshift-ingress-operator get pods  
NAME                                READY   STATUS    RESTARTS      AGE
ingress-operator-5b6d5b7fbc-zgj5b   2/2     Running   2 (62m ago)   68m
%

5.
% oc -n openshift-ingress-operator delete pod ingress-operator-5b6d5b7fbc-zgj5b
pod "ingress-operator-5b6d5b7fbc-zgj5b" deleted
% 
% oc -n openshift-ingress-operator get pods                                    
NAME                                READY   STATUS    RESTARTS   AGE
ingress-operator-5b6d5b7fbc-gglvl   2/2     Running   0          15s
%

6.
% oc get co | grep ingress
ingress                                    4.11.0-0.nightly-2022-09-14-233224   True        False         True       71m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
% 

7.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-14-233224   True        False         61m     Error while reconciling 4.11.0-0.nightly-2022-09-14-233224: the cluster operator ingress is degraded
%

Comment 39 Ramon Gordillo 2022-09-16 12:08:46 UTC
Facing this in a 4.11.0 SNO installation (OVN) without sharding; it prevents the installation from finishing.

ingress                                    4.11.0    True        False         True       22m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)

Comment 41 errata-xmlrpc 2022-09-20 16:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.11.5 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6536

