1955854 – Ingress clusteroperator reports Degraded=True/Available=False if any ingresscontroller is degraded or unavailable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	jechen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-01 01:15 UTC by Miciah Dashiel Butler Masters
Modified:	2022-08-04 22:32 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:05:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-ingress-operator pull 607	0	None	Merged	Bug 1955854: Compute Available and Degraded from default ingress	2022-07-25 19:42:04 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:05:56 UTC

Description Miciah Dashiel Butler Masters 2021-05-01 01:15:01 UTC

Description of problem:

The ingress operator reports Available=False in the ingress clusteroperator status conditions if any ingresscontroller is unavailable, and Degraded=True if any ingresscontroller is degraded.

The operator should report Available=False or Degraded=True only if the *default* ingresscontroller is unavailable or degraded, respectively.

In addition, the ingress operator should report metrics and have alerting rules to report if ingresscontrollers are unavailable or degraded.
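
The difference between the current and the requested behavior can be sketched as follows. This is a minimal illustration in Python, not the operator's actual code; the function names and the dict-based data model are hypothetical, and the sketch assumes an ingresscontroller named "default" always exists.

```python
from typing import Dict, List

# Hypothetical data model: ingresscontroller name -> {condition type: status}.
Conditions = Dict[str, bool]


def aggregate_any(controllers: Dict[str, Conditions]) -> Conditions:
    """Behavior described in this report: any ingresscontroller can flip the
    clusteroperator to Available=False or Degraded=True."""
    return {
        "Available": all(c["Available"] for c in controllers.values()),
        "Degraded": any(c["Degraded"] for c in controllers.values()),
    }


def aggregate_default_only(controllers: Dict[str, Conditions]) -> Conditions:
    """Requested behavior: only the default ingresscontroller is considered.
    Assumes the "default" entry is present."""
    default = controllers["default"]
    return {"Available": default["Available"], "Degraded": default["Degraded"]}


def unhealthy_controllers(controllers: Dict[str, Conditions]) -> List[str]:
    """Names that a metric or alert (rather than the clusteroperator
    status) would need to surface."""
    return sorted(
        name
        for name, c in controllers.items()
        if not c["Available"] or c["Degraded"]
    )


controllers = {
    "default": {"Available": True, "Degraded": False},
    "xyz": {"Available": False, "Degraded": True},
}
print(aggregate_any(controllers))           # {'Available': False, 'Degraded': True}
print(aggregate_default_only(controllers))  # {'Available': True, 'Degraded': False}
print(unhealthy_controllers(controllers))   # ['xyz']
```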

Version-Release number of selected component (if applicable):

4.8.0.

How reproducible:

100%.

Steps to Reproduce:

1. Launch a new cluster with <20 nodes.

2. On the cluster from Step 1, create an ingresscontroller with 1 replica:

% oc create -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: xyz
  namespace: openshift-ingress-operator
spec:
  replicas: 1
  domain: xyz.com
  endpointPublishingStrategy:
    type: Private
EOF

3. Check the ingress clusteroperator:

% oc get clusteroperators/ingress
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.ci-2021-04-30-171538   True        False         False      2m43s

4. Scale the ingresscontroller from Step 2 to 20 replicas:

% oc -n openshift-ingress-operator scale ingresscontrollers/xyz --replicas=20
ingresscontroller.operator.openshift.io/xyz scaled

5. Wait about 30 seconds and check the ingress clusteroperator again:

% oc get clusteroperators/ingress
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.ci-2021-04-30-171538   False       True          True       32s

Actual results:

After Step 5, the ingress clusteroperator reports Available=False and Degraded=True.

Expected results:

After Step 5, the ingress clusteroperator should continue to report Available=True and Degraded=False. However, a metric or alert should indicate that the ingresscontroller from Step 2 is unavailable and degraded.
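
A reproduction script could check this expectation mechanically against the `oc get clusteroperators/ingress` output. The following is a rough sketch only: the helper names are hypothetical, and the parser assumes the six whitespace-separated columns shown in this report, none of which contains spaces.

```python
from typing import Dict, List


def parse_status_table(text: str) -> List[Dict[str, str]]:
    """Parse whitespace-separated `oc get clusteroperators` output into dicts.

    Assumes no column value contains spaces, which holds for the six
    columns shown in this report.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def meets_expectation(row: Dict[str, str]) -> bool:
    """Expected result from this report: Available=True and Degraded=False."""
    return row["AVAILABLE"] == "True" and row["DEGRADED"] == "False"


# Output captured in Step 5 of the reproduction (the actual, buggy result).
step5_output = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
ingress 4.8.0-0.ci-2021-04-30-171538 False True True 32s
"""

row = parse_status_table(step5_output)[0]
print(row["NAME"], meets_expectation(row))  # prints: ingress False
```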

Additional info:

This problem was originally noticed because the extended/router/grpc-interop.go and extended/router/http2.go tests in openshift/origin create custom ingresscontrollers. These ingresscontrollers are briefly degraded and unavailable after their creation, which causes the ingress clusteroperator to report Available=False and Degraded=True.

Comment 2 jechen 2021-05-12 23:18:53 UTC

Verified in 4.8.0-0.nightly-2021-05-12-122225.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-12-122225   True        False         23m     Cluster version is 4.8.0-0.nightly-2021-05-12-122225

1. Create an ingresscontroller named xyz:
$ cat ingresscontroll-BZ1955854
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: xyz
  namespace: openshift-ingress-operator
spec:
  replicas: 1
  domain: xyz.com
  endpointPublishingStrategy:
    type: Private


$ oc create -f ingresscontroll-BZ1955854


2. Check the ingress clusteroperator:
$ oc get clusteroperators/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-05-12-122225   True        True          False      36m


3. Scale up the xyz ingresscontroller:
$ oc -n openshift-ingress-operator scale ingresscontrollers/xyz --replicas=20
ingresscontroller.operator.openshift.io/xyz scaled


4. Waited more than 15 minutes; 3 of the 20 router-xyz pods became ready, and the rest are still Pending:
$ oc -n openshift-ingress get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-5fd58fd757-cbf8k   1/1     Running   0          47m
pod/router-default-5fd58fd757-qmqg5   1/1     Running   0          47m
pod/router-xyz-6bb7549fc9-44xqw       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-5sblt       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-89mvk       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-8k9gb       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-9thwx       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-9z5pn       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-b66dr       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-bwk6z       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-c5w4n       1/1     Running   0          23m
pod/router-xyz-6bb7549fc9-cctx4       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gfxkv       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gp4cd       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-gwfc7       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-hlmls       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-nf7h6       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-nf99g       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-p8g52       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-tmzhm       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-vcw4q       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-z4zxx       0/1     Pending   0          21m

NAME                              TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
service/router-default            LoadBalancer   172.30.192.133   35.231.8.178   80:32358/TCP,443:30227/TCP   47m
service/router-internal-default   ClusterIP      172.30.144.6     <none>         80/TCP,443/TCP,1936/TCP      47m
service/router-internal-xyz       ClusterIP      172.30.158.15    <none>         80/TCP,443/TCP,1936/TCP      23m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/router-default   2/2     2            2           47m
deployment.apps/router-xyz       3/20    20           3           23m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/router-default-5fd58fd757   2         2         2       47m
replicaset.apps/router-xyz-6bb7549fc9       20        20        3       23m
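
The "3 of 20" figure can be confirmed by tallying the STATUS column of the pod rows; in the full listing above, the router-xyz pods break down into 3 Running and 17 Pending. The helper below is a hypothetical sketch that assumes the NAME, READY, STATUS column order shown in this output, and it is demonstrated on a shortened sample of the listing.

```python
from collections import Counter


def count_pod_statuses(listing: str, prefix: str) -> Counter:
    """Tally the STATUS column for rows whose NAME starts with `prefix`.

    Assumes the column order NAME, READY, STATUS, as in the output above.
    Rows for services, deployments, and replicasets are skipped because
    their names do not start with the pod prefix.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith(prefix):
            counts[fields[2]] += 1
    return counts


# Shortened sample of the listing above (not the full 20 pods).
sample = """\
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-5fd58fd757-cbf8k   1/1     Running   0          47m
pod/router-xyz-6bb7549fc9-44xqw       1/1     Running   0          21m
pod/router-xyz-6bb7549fc9-5sblt       0/1     Pending   0          21m
pod/router-xyz-6bb7549fc9-89mvk       0/1     Pending   0          21m
"""

counts = count_pod_statuses(sample, prefix="pod/router-xyz-")
print(counts["Running"], counts["Pending"])  # prints: 1 2
```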


5. Check the ingress clusteroperator again; AVAILABLE is still True and DEGRADED is still False:
$ oc get clusteroperators/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-05-12-122225   True        True          False      46m

Comment 3 jechen 2021-05-12 23:21:31 UTC

Unavailable and Degraded alerts have been received for the xyz ingresscontroller:
May 12, 2021, 6:45 PM
The openshift-ingress-operator/xyz ingresscontroller is degraded: .
May 12, 2021, 6:45 PM
Pod openshift-ingress/router-xyz-6bb7549fc9-gwfc7 has been in a non-ready state for longer than 15 minutes.

<--snip-->
May 12, 2021, 6:45 PM
The openshift-ingress-operator/xyz ingresscontroller is unavailable: .
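
The trailing ": ." in the alert text above suggests that the message is built from a template whose reason field was empty. The sketch below only illustrates that reading; the template string, function name, and parameters are hypothetical and are not taken from the operator's alerting rules.

```python
def format_alert(namespace: str, name: str, state: str, reason: str = "") -> str:
    """Hypothetical message template: an empty `reason` leaves ': .' at the end."""
    return f"The {namespace}/{name} ingresscontroller is {state}: {reason}."


print(format_alert("openshift-ingress-operator", "xyz", "degraded"))
# prints: The openshift-ingress-operator/xyz ingresscontroller is degraded: .
```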

Comment 6 errata-xmlrpc 2021-07-27 23:05:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
