Created attachment 1791654 [details]
IngressControllerUnavailable alert from console

Description of problem:

IngressControllerUnavailable alert detail:
***********************************
alert: IngressControllerUnavailable
expr: ingress_controller_conditions{condition="Available"} == 0
for: 5m
labels:
  severity: warning
annotations:
  message: The {{ $labels.namespace }}/{{ $labels.name }} ingresscontroller is unavailable: {{ $labels.reason }}.
***********************************

Created custom ingresscontrollers named test-** and removed them after the scenarios. A few minutes/hours later we still see the IngressControllerUnavailable alert on the console for the removed ingresscontrollers; see the attached picture.

Searching in Prometheus with ingress_controller_conditions{condition="Available"} == 0 returns:
***********************************
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-22633", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-23169", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-27560", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-30066", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-30192", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40747", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40748", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
ingress_controller_conditions{condition="Available", container="kube-rbac-proxy", endpoint="metrics", instance="10.130.0.13:9393", job="metrics", name="test-40821", namespace="openshift-ingress-operator", pod="ingress-operator-547bbdcd9d-wbjjg", service="metrics"} 0
***********************************

But the corresponding ingresscontrollers no longer exist:

# oc get ingresscontroller --all-namespaces
NAMESPACE                    NAME      AGE
openshift-ingress-operator   default   3d14h

The removed ingresscontrollers should not be counted in this case; the stale alerts are confusing.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-12-174011

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
The IngressControllerUnavailable alert keeps firing for ingresscontrollers that have already been deleted.

Expected results:
The alert should clear once an ingresscontroller is deleted.

Additional info:
Hongan knows how to create a custom ingress controller
Please follow the steps below to create the custom ingresscontroller:

1. Get the baseDomain:
$ oc get dnses.config.openshift.io cluster -ojson | jq -r '.spec.baseDomain'

2. Create the custom ingresscontroller file below and replace <base_domain>:
$ cat custom-ingresscontroller.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test-12345
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: test-12345.<base_domain>    <--------- replace the base_domain
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService

3. oc create -f custom-ingresscontroller.yaml

You can replace "test-12345" with other values to create more custom ingresscontrollers.
Please note: this bug was found in an abnormal cluster; that is, except for the default ingresscontroller, all custom ingresscontrollers created by the admin were unavailable due to unschedulable worker nodes.

The reproduce steps:

1. Ensure the default ingresscontroller works well.

2. Cordon all worker nodes:
$ oc adm cordon -l node-role.kubernetes.io/worker=
$ oc get node
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-128-134.us-east-2.compute.internal   Ready                      master   77m   v1.21.1+051ac4f
ip-10-0-159-68.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   74m   v1.21.1+051ac4f
ip-10-0-179-137.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   72m   v1.21.1+051ac4f
ip-10-0-185-129.us-east-2.compute.internal   Ready                      master   77m   v1.21.1+051ac4f
ip-10-0-210-43.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   72m   v1.21.1+051ac4f
ip-10-0-211-113.us-east-2.compute.internal   Ready                      master   81m   v1.21.1+051ac4f

3. Create some custom ingresscontrollers (see Comment 3).

4. Check the custom ingresscontroller status: Available=False
$ oc -n openshift-ingress-operator get ingresscontroller/test-12345 -ojson | jq '.status.conditions[] | select(.type=="Available")'
{
  "lastTransitionTime": "2021-07-30T02:35:28Z",
  "message": "One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)",
  "reason": "IngressControllerUnavailable",
  "status": "False",
  "type": "Available"
}

5. Wait a few minutes to trigger the alerts.

6. Delete all custom ingresscontrollers; the alerts are still visible on the console.
Hello, I managed to reproduce the problem.

The root cause seems to be that cluster-ingress-operator does not clean up the metrics at controller finalization time, so the metrics remain for controllers which no longer exist. Here is an example of metrics dumped from cluster-ingress-operator some time after the "test-12345" ingress controller was removed:

ingress_controller_conditions{condition="Available",name="default"} 1
ingress_controller_conditions{condition="Available",name="test-12345"} 0
ingress_controller_conditions{condition="Degraded",name="default"} 0
ingress_controller_conditions{condition="Degraded",name="test-12345"} 1

A potential fix would be to enhance the finalization of deleted controllers [1] with the deletion [2] of all the timeseries related to the controller [3].

[1] https://github.com/openshift/cluster-ingress-operator/blob/e2cdf40beea5700785cc32db32b464850bd124dd/pkg/operator/controller/ingress/controller.go#L618
[2] https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#MetricVec.Delete
[3] https://github.com/openshift/cluster-ingress-operator/blob/e2cdf40beea5700785cc32db32b464850bd124dd/pkg/operator/controller/ingress/metrics.go#L24
PullRequest: https://github.com/openshift/cluster-ingress-operator/pull/640
Verified with 4.9.0-0.nightly-2021-08-22-070405 and passed.

Created a test ingresscontroller and could view the below alerts on the Console > Observe > Alerting page:
---> IngressControllerUnavailable
---> IngressControllerDegraded

After deleting the test ingresscontroller, the above alerts were removed as well.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-22-070405   True        False         5h48m   Cluster version is 4.9.0-0.nightly-2021-08-22-070405
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759