Bug 1955854 - Ingress clusteroperator reports Degraded=True/Available=False if any ingresscontroller is degraded or unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: jechen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-01 01:15 UTC by Miciah Dashiel Butler Masters
Modified: 2021-07-27 23:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:05:33 UTC
Target Upstream Version:




Links
System                  ID                                           Private  Priority  Status  Summary                                                           Last Updated
Github                  openshift cluster-ingress-operator pull 607  0        None      open    Bug 1955854: Compute Available and Degraded from default ingress  2021-05-01 01:35:33 UTC
Red Hat Product Errata  RHSA-2021:2438                               0        None      None    None                                                              2021-07-27 23:05:56 UTC

Description Miciah Dashiel Butler Masters 2021-05-01 01:15:01 UTC
Description of problem:

The ingress operator reports Available=False in the ingress clusteroperator status conditions if any ingresscontroller is unavailable, and Degraded=True if any ingresscontroller is degraded.  

The operator should report Available=False or Degraded=True only if the *default* ingresscontroller is unavailable or degraded, respectively.  

In addition, the ingress operator should report metrics and have alerting rules to report if ingresscontrollers are unavailable or degraded.  
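
The requested behavior can be sketched as follows. This is a minimal illustration in Python, not the operator's actual code (the operator itself is written in Go); the class and function names are hypothetical, and the sketch assumes each ingresscontroller's health has already been reduced to two booleans.

```python
from dataclasses import dataclass
from typing import Dict, List

DEFAULT_NAME = "default"  # name of the default ingresscontroller


@dataclass
class IngressController:
    """Hypothetical, simplified view of one ingresscontroller's status."""
    name: str
    available: bool
    degraded: bool


def compute_clusteroperator_conditions(
    controllers: List[IngressController],
) -> Dict[str, str]:
    """Derive Available/Degraded from the default ingresscontroller only."""
    default = next((ic for ic in controllers if ic.name == DEFAULT_NAME), None)
    if default is None:
        # This sketch does not model a cluster without a default ingresscontroller.
        raise ValueError("no default ingresscontroller found")
    return {
        "Available": "True" if default.available else "False",
        "Degraded": "True" if default.degraded else "False",
    }


def unhealthy_controllers(controllers: List[IngressController]) -> List[str]:
    """Names of ingresscontrollers that should surface through metrics or alerts."""
    return [ic.name for ic in controllers if not ic.available or ic.degraded]


controllers = [
    IngressController("default", available=True, degraded=False),
    IngressController("xyz", available=False, degraded=True),
]
print(compute_clusteroperator_conditions(controllers))
# {'Available': 'True', 'Degraded': 'False'}
print(unhealthy_controllers(controllers))
# ['xyz']
```

Keeping the two functions separate mirrors the split described above: the clusteroperator conditions follow the default ingresscontroller, while every unhealthy ingresscontroller is still reported through a separate channel.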


Version-Release number of selected component (if applicable):

4.8.0.


How reproducible:

100%.


Steps to Reproduce:

1. Launch a new cluster with <20 nodes.

2. On the cluster from Step 1, create an ingresscontroller with 1 replica:

    % oc create -f - <<'EOF'
    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: xyz
      namespace: openshift-ingress-operator
    spec:
      replicas: 1
      domain: xyz.com
      endpointPublishingStrategy:
        type: Private
    EOF

3. Check the ingress clusteroperator:

    % oc get clusteroperators/ingress
    NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
    ingress   4.8.0-0.ci-2021-04-30-171538   True        False         False      2m43s

4. Scale the ingresscontroller from Step 2 to 20 replicas:

    % oc -n openshift-ingress-operator scale ingresscontrollers/xyz --replicas=20
    ingresscontroller.operator.openshift.io/xyz scaled

5. Wait about 30 seconds and check the ingress clusteroperator again:

    % oc get clusteroperators/ingress
    NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
    ingress   4.8.0-0.ci-2021-04-30-171538   False       True          True       32s


Actual results:

After Step 5, the ingress clusteroperator reports Available=False and Degraded=True.


Expected results:

After Step 5, the ingress clusteroperator should continue to report Available=True and Degraded=False.  However, a metric or alert should indicate that the ingresscontroller from Step 2 is unavailable and degraded.  
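
The expected result can also be checked programmatically. The sketch below assumes the output of `oc get clusteroperators/ingress -o json` has been captured as a string; the helper names are hypothetical, and the sample document is hand-written for illustration rather than taken from a real cluster.

```python
import json
from typing import Dict


def condition_statuses(clusteroperator_json: str) -> Dict[str, str]:
    """Map each status condition type (e.g. "Available") to its status string."""
    obj = json.loads(clusteroperator_json)
    conditions = obj.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


def meets_expected_results(statuses: Dict[str, str]) -> bool:
    """True if the clusteroperator reports Available=True and Degraded=False."""
    return statuses.get("Available") == "True" and statuses.get("Degraded") == "False"


# Hand-written sample (an assumption for illustration, not captured output).
SAMPLE = json.dumps({
    "kind": "ClusterOperator",
    "metadata": {"name": "ingress"},
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Progressing", "status": "True"},
            {"type": "Degraded", "status": "False"},
        ]
    },
})

statuses = condition_statuses(SAMPLE)
print(meets_expected_results(statuses))  # True
```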


Additional info:

This problem was originally noticed because the extended/router/grpc-interop.go and extended/router/http2.go tests in openshift/origin create custom ingresscontrollers. These ingresscontrollers are briefly degraded and unavailable after their creation, which causes the ingress clusteroperator to report Available=False and Degraded=True.

Comment 2 jechen 2021-05-12 23:18:53 UTC
Verified in 4.8.0-0.nightly-2021-05-12-122225.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-12-122225   True        False         23m     Cluster version is 4.8.0-0.nightly-2021-05-12-122225

1. Create an ingresscontroller named xyz:
$ cat ingresscontroll-BZ1955854
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: xyz
  namespace: openshift-ingress-operator
spec:
  replicas: 1
  domain: xyz.com
  endpointPublishingStrategy:
    type: Private


$ oc create -f ingresscontroll-BZ1955854


2. Check the ingress clusteroperator:
$ oc get clusteroperators/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-05-12-122225   True        True          False      36m


3. Scale up the xyz ingresscontroller to 20 replicas:
$ oc -n openshift-ingress-operator scale ingresscontrollers/xyz --replicas=20
ingresscontroller.operator.openshift.io/xyz scaled


4. Waited more than 15 minutes; 3 of the 20 router-xyz replicas became ready, and the rest are still pending:
$ oc -n openshift-ingress get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-5fd58fd757-cbf8k   1/1     Running   0          47m
pod/router-default-5fd58fd757-qmqg5   1/1     Running   0          47m
pod/router-xyz-6bb7549fc9-44xqw       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-5sblt       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-89mvk       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-8k9gb       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-9thwx       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-9z5pn       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-b66dr       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-bwk6z       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-c5w4n       1/1     Running   0          23m
pod/router-xyz-6bb7549fc9-cctx4       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gfxkv       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gp4cd       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gwfc7       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-hlmls       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-nf7h6       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-nf99g       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-p8g52       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-tmzhm       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-vcw4q       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-z4zxx       0/1     Pending   0          21m

NAME                              TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
service/router-default            LoadBalancer   172.30.192.133   35.231.8.178   80:32358/TCP,443:30227/TCP   47m
service/router-internal-default   ClusterIP      172.30.144.6     <none>         80/TCP,443/TCP,1936/TCP      47m
service/router-internal-xyz       ClusterIP      172.30.158.15    <none>         80/TCP,443/TCP,1936/TCP      23m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/router-default   2/2     2            2           47m
deployment.apps/router-xyz       3/20    20           3           23m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/router-default-5fd58fd757   2         2         2       47m
replicaset.apps/router-xyz-6bb7549fc9       20        20        3       23m


5. Check the ingress clusteroperator again; AVAILABLE is still True and DEGRADED is still False:
$ oc get clusteroperators/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-05-12-122225   True        True          False      46m
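
The "3 of 20" readiness count from step 4 can be tallied from the pod listing with a short script. This is a rough sketch that assumes the default column layout of `oc get` output (NAME, READY, STATUS, ...); the function name is hypothetical, and the excerpt contains only a few of the lines shown above.

```python
from collections import Counter


def tally_pod_statuses(table: str, prefix: str) -> Counter:
    """Count pods whose NAME starts with prefix, grouped by the STATUS column."""
    counts: Counter = Counter()
    for line in table.splitlines():
        fields = line.split()
        # Expected columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3 and fields[0].startswith(prefix):
            counts[fields[2]] += 1
    return counts


# Excerpt of the listing from step 4 (not the full output).
EXCERPT = """\
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-5fd58fd757-cbf8k   1/1     Running   0          47m
pod/router-xyz-6bb7549fc9-44xqw       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-5sblt       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-bwk6z       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-c5w4n       1/1     Running   0          23m
"""

counts = tally_pod_statuses(EXCERPT, "pod/router-xyz-")
print(counts["Running"], counts["Pending"])  # 3 1
```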

Comment 3 jechen 2021-05-12 23:21:31 UTC
Unavailable and Degraded alerts have been received for the xyz ingresscontroller:
May 12, 2021, 6:45 PM
The openshift-ingress-operator/xyz ingresscontroller is degraded: .
May 12, 2021, 6:45 PM
Pod openshift-ingress/router-xyz-6bb7549fc9-gwfc7 has been in a non-ready state for longer than 15 minutes.

<--snip-->
May 12, 2021, 6:45 PM
The openshift-ingress-operator/xyz ingresscontroller is unavailable: .

Comment 6 errata-xmlrpc 2021-07-27 23:05:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

