Bug 2030034

Summary: prometheusrules.openshift.io: dial tcp: lookup prometheus-operator.openshift-monitoring.svc on 172.30.0.10:53: no such host
Product: OpenShift Container Platform Reporter: Luis Sanchez <sanchezl>
Component: Monitoring Assignee: Jayapriya Pai <janantha>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: medium    
Version: 4.10 CC: amuller, anpicker, aos-bugs, pgough, spasquie, sthaha
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:40:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Luis Sanchez 2021-12-07 20:20:24 UTC
As part of the "Improving Supportability of API webhooks"[1] enhancement, the kube-apiserver-operator will check webhook configurations and mark the kube-apiserver as degraded if it detects a problem with a webhook configuration.

Occasionally, we observe:

Dec 07 15:36:32.866 - 8s    E clusteroperator/kube-apiserver condition/Degraded status/True reason/ValidatingAdmissionWebhookConfigurationDegraded: prometheusrules.openshift.io: dial tcp: lookup prometheus-operator.openshift-monitoring.svc on 172.30.0.10:53: no such host


The issue resolves itself quickly enough that tests continue to run successfully (hence the low severity), but this seems to point to a race condition where a ValidatingAdmissionWebhookConfiguration is being created before the underlying service has been created. Please take a look.
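For illustration only, the kind of lookup that fails in the message above can be sketched as follows. This is not the kube-apiserver-operator's implementation: the function names, the default port 443, and the injectable `resolve` parameter are all assumptions made for this example.

```python
import socket
from typing import Callable, Optional


def service_dns_name(service: str, namespace: str) -> str:
    """Build the in-cluster DNS name in the form seen in the error message."""
    return f"{service}.{namespace}.svc"


def check_webhook_service(
    service: str,
    namespace: str,
    port: int = 443,  # assumed port, not taken from the bug report
    resolve: Callable = socket.getaddrinfo,
) -> Optional[str]:
    """Return None if the service name resolves, otherwise an error string.

    `resolve` is injectable (hypothetical design choice) so the check can be
    exercised without a real cluster DNS server.
    """
    host = service_dns_name(service, namespace)
    try:
        resolve(host, port)
    except socket.gaierror as exc:
        return f"lookup {host}: {exc}"
    return None
```

Calling `check_webhook_service("prometheus-operator", "openshift-monitoring")` from a pod would return an error string while the name does not resolve and `None` once it does, which is consistent with a condition that flips to Degraded and clears again a few seconds later.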

[1]:  https://github.com/openshift/enhancements/blob/e878f045a66950b3436d00150b178681906ea2d8/enhancements/kube-apiserver/api-webhook-supportability.md

Comment 4 Philip Gough 2021-12-23 15:33:02 UTC
Thanks for flagging - I took a look at CMO based on your suggestion and the relevant block of code [1] can be summarised as:

1. Create the Service
...
2. Create the Deployment
...
3. Create the ValidatingWebhook

So it is not clear to me that CMO is doing anything wrong in that regard. Could it be solely DNS related?
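The ordering summarised above can be sketched roughly as below. This is a simplified illustration, not the CMO code in [1]: `reconcile_in_order` and the step callables are hypothetical, and the point is only that a later step (the webhook) is never applied unless the earlier ones (Service, Deployment) succeeded.

```python
from typing import Callable, List, Optional, Tuple

# A step is a (name, callable) pair; the callable raises on failure.
Step = Tuple[str, Callable[[], None]]


def reconcile_in_order(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Apply steps sequentially and stop at the first failure.

    Returns the names of the steps that were applied and an error string
    (or None if every step succeeded).
    """
    applied: List[str] = []
    for name, apply in steps:
        try:
            apply()
        except Exception as exc:
            # Later steps are skipped, so the webhook is never registered
            # before the Service and Deployment steps have succeeded.
            return applied, f"{name}: {exc}"
        applied.append(name)
    return applied, None
```

With steps named "Service", "Deployment" and "ValidatingWebhook", a failure in an earlier step leaves the webhook unregistered, so under this ordering the webhook configuration should not exist before the Service object does.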


Right now, the prometheus operator exposes that endpoint for the validating webhook, but we have an upstream issue[2] to decouple that and make the webhook a standalone, HA component.


I've dug around in Prometheus and the service does exist at the time the log was emitted, according to `kube_service_info{namespace="openshift-monitoring", service="prometheus-operator"}`[3].
There is a blip in `kube_endpoint_address_not_ready{namespace="openshift-monitoring",endpoint=~"prometheus-opera.+"}`[4], but it recovers more than 20 minutes prior to the error log.

It should be noted that because we only deploy the prometheus-operator as a single replica[5], there can be a transient disruption to service during upgrades, which will be addressed by the linked issue[2].


[1] https://github.com/openshift/cluster-monitoring-operator/blob/release-4.10/pkg/tasks/prometheusoperator.go#L77-L120
[2] https://github.com/prometheus-operator/prometheus-operator/issues/4437
[3] https://bugzilla.redhat.com/attachment.cgi?id=1847600
[4] https://bugzilla.redhat.com/attachment.cgi?id=1847601
[5] https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-operator/deployment.yaml#L13

Comment 8 Junqi Zhao 2022-06-06 07:33:59 UTC
Checked with 4.11.0-0.nightly-2022-06-04-014713; did not see prometheus-operator/prometheus-operator-admission-webhook degraded, see the picture

Comment 14 errata-xmlrpc 2022-08-10 10:40:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069