Bug 1972977 - The removed ingresscontrollers should not be counted in ingress_controller_conditions metrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Andrey Lebedev
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1998103
 
Reported: 2021-06-17 02:25 UTC by Junqi Zhao
Modified: 2021-10-18 17:35 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:34:57 UTC
Target Upstream Version:


Attachments
IngressControllerUnavailable alert from console (191.99 KB, image/png)
2021-06-17 02:25 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 640 0 None None None 2021-08-19 06:55:37 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:35:22 UTC

Description Junqi Zhao 2021-06-17 02:25:11 UTC
Created attachment 1791654 [details]
IngressControllerUnavailable alert from console

Description of problem:
IngressControllerUnavailable alert detail
***********************************
alert: IngressControllerUnavailable
expr: ingress_controller_conditions{condition="Available"} == 0
for: 5m
labels:
  severity: warning
annotations:
  message: The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is unavailable: {{ $labels.reason }}.
***********************************

We created custom ingresscontrollers named test-** and removed them after the test scenarios. A few minutes to hours later we still see the IngressControllerUnavailable alert in the console for the removed ingresscontrollers; see the attached picture.
Searching in Prometheus with
ingress_controller_conditions{condition="Available"} == 0
returns:
***********************************
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-22633", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-23169", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-27560", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-30066", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-30192", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40747", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40748", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40821", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"}
0
***********************************
# oc get ingresscontroller --all-namespaces
NAMESPACE                    NAME      AGE
openshift-ingress-operator   default   3d14h

The removed ingresscontrollers should not be counted in this case; the stale series are confusing.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-12-174011

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:


Expected results:


Additional info:

Comment 2 Junqi Zhao 2021-07-29 13:06:59 UTC
Hongan knows how to create a custom ingress controller

Comment 3 Hongan Li 2021-07-30 02:28:03 UTC
Please follow the steps below to create the custom ingresscontroller:

1. get baseDomain
$ oc get dnses.config.openshift.io cluster -ojson | jq -r '.spec.baseDomain'

2. Create the custom ingresscontroller file below, replacing <base_domain>:
$ cat custom-ingresscontroller.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test-12345
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: test-12345.<base_domain>              # <-- replace <base_domain>
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService

3. oc create -f custom-ingresscontroller.yaml

You can replace "test-12345" with other values to create more custom ingresscontrollers.
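The steps above can be scripted to stamp out several custom ingresscontrollers at once. A sketch, with an example base domain and example controller names; the `oc create` is left commented out because it needs a live cluster:

```shell
# Example base domain; on a real cluster use:
#   BASE_DOMAIN=$(oc get dnses.config.openshift.io cluster -ojson | jq -r '.spec.baseDomain')
BASE_DOMAIN="example.openshift.com"

# Generate one manifest per controller name from the template in this comment.
for NAME in test-12345 test-23456 test-34567; do
  cat > "${NAME}.yaml" <<EOF
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: ${NAME}
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: ${NAME}.${BASE_DOMAIN}
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService
EOF
  # oc create -f "${NAME}.yaml"
done

ls test-*.yaml
```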

Comment 4 Hongan Li 2021-07-30 03:01:33 UTC
Please note: this bug was found in an abnormal cluster; that is, apart from the default ingresscontroller, all custom ingresscontrollers created by the admin were unavailable because the worker nodes were unschedulable.

Steps to reproduce:

1. Ensure the default ingresscontroller works well.

2. Cordon all worker nodes:
$ oc adm cordon -l node-role.kubernetes.io/worker=
$ oc get node
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-128-134.us-east-2.compute.internal   Ready                      master   77m   v1.21.1+051ac4f
ip-10-0-159-68.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   74m   v1.21.1+051ac4f
ip-10-0-179-137.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   72m   v1.21.1+051ac4f
ip-10-0-185-129.us-east-2.compute.internal   Ready                      master   77m   v1.21.1+051ac4f
ip-10-0-210-43.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   72m   v1.21.1+051ac4f
ip-10-0-211-113.us-east-2.compute.internal   Ready                      master   81m   v1.21.1+051ac4f

3. Create some custom ingresscontrollers (see Comment 3).

4. Check the custom ingresscontroller status (Available=False):
$ oc -n openshift-ingress-operator get ingresscontroller/test-12345 -ojson | jq '.status.conditions[] | select(.type=="Available")'
{
  "lastTransitionTime": "2021-07-30T02:35:28Z",
  "message": "One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)",
  "reason": "IngressControllerUnavailable",
  "status": "False",
  "type": "Available"
}

5. Wait a few minutes for the alerts to fire.

6. Delete all custom ingresscontrollers; the alerts are still visible on the console.

Comment 5 Andrey Lebedev 2021-07-30 14:09:03 UTC
Hello,

I managed to reproduce the problem. The root cause seems to be that cluster-ingress-operator does not clean up the metrics at controller finalization time.
The result is stale metrics for controllers that no longer exist. Here is an example of metrics dumped from cluster-ingress-operator some time after the "test-12345" ingresscontroller was removed:

ingress_controller_conditions{condition="Available",name="default"} 1
ingress_controller_conditions{condition="Available",name="test-12345"} 0
ingress_controller_conditions{condition="Degraded",name="default"} 0
ingress_controller_conditions{condition="Degraded",name="test-12345"} 1

A potential fix would be to enhance the finalization of the deleted controllers [1] with the deletion [2] of all the timeseries related to the controller [3].




[1] https://github.com/openshift/cluster-ingress-operator/blob/e2cdf40beea5700785cc32db32b464850bd124dd/pkg/operator/controller/ingress/controller.go#L618
[2] https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#MetricVec.Delete
[3] https://github.com/openshift/cluster-ingress-operator/blob/e2cdf40beea5700785cc32db32b464850bd124dd/pkg/operator/controller/ingress/metrics.go#L24

Comment 6 Andrey Lebedev 2021-08-12 23:30:08 UTC
Pull request: https://github.com/openshift/cluster-ingress-operator/pull/640

Comment 8 Hongan Li 2021-08-23 08:19:55 UTC
Verified with 4.9.0-0.nightly-2021-08-22-070405 and passed.

Created a test ingresscontroller and could see the alerts below on the Console > Observe > Alerting page:
--->IngressControllerUnavailable
--->IngressControllerDegraded

After deleting the test ingresscontroller, the above alerts were removed as well.


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-22-070405   True        False         5h48m   Cluster version is 4.9.0-0.nightly-2021-08-22-070405

Comment 11 errata-xmlrpc 2021-10-18 17:34:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

