As part of the "Improving Supportability of API webhooks"[1] enhancement, the kube-apiserver-operator checks webhook configurations and marks the kube-apiserver operator Degraded if a problem is detected with a webhook configuration. Occasionally, we observe:

Dec 07 15:36:32.866 - 8s E clusteroperator/kube-apiserver condition/Degraded status/True reason/ValidatingAdmissionWebhookConfigurationDegraded: prometheusrules.openshift.io: dial tcp: lookup prometheus-operator.openshift-monitoring.svc on 172.30.0.10:53: no such host

The issue resolves itself quickly enough that tests continue to run successfully (hence the low severity), but this seems to point to a race condition where a ValidatingAdmissionWebhookConfiguration is created before the underlying Service exists. Please take a look.

[1]: https://github.com/openshift/enhancements/blob/e878f045a66950b3436d00150b178681906ea2d8/enhancements/kube-apiserver/api-webhook-supportability.md
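For illustration only, here is a minimal client-go sketch of the kind of guard that would close a race like this: wait for the webhook's backing Service to have a ready endpoint before registering the ValidatingWebhookConfiguration. The function name, package name, and polling intervals are hypothetical, not taken from any operator's actual code.

```go
package webhookguard

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForWebhookBackend polls until the webhook's backing Service has at
// least one ready endpoint address. Only once that holds would it be
// safe(ish) to create the ValidatingWebhookConfiguration, since the
// apiserver starts dialing <svc>.<ns>.svc as soon as the webhook exists.
func waitForWebhookBackend(ctx context.Context, c kubernetes.Interface, ns, svc string) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(ctx, svc, metav1.GetOptions{})
		if err != nil {
			return false, nil // Service/Endpoints not created yet; keep polling
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil // at least one ready address behind the Service
			}
		}
		return false, nil
	})
}
```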
Thanks for flagging - I took a look at CMO based on your suggestion, and the relevant block of code [1] can be summarised as:

1. Create the Service ...
2. Create the Deployment ...
3. Create the ValidatingWebhook

So it is not clear to me that CMO is doing anything wrong in that regard (see the sketch after the references). Could it be solely DNS related? Right now, the prometheus operator exposes that endpoint for the validating webhook, but we have an upstream issue[2] to decouple that and make the webhook a standalone, HA component.

I've dug around in Prometheus, and the Service does exist at the time the log was emitted, according to `kube_service_info{namespace="openshift-monitoring", service="prometheus-operator"}`[3]. There is a blip in `kube_endpoint_address_not_ready{namespace="openshift-monitoring",endpoint=~"prometheus-opera.+"}`[4], but it recovers more than 20 minutes prior to the error log.

It should be noted that because we deploy the prometheus-operator with only a single replica[5], there can be a transient disruption to the service during upgrades, which will be addressed by the linked issue[2].

[1] https://github.com/openshift/cluster-monitoring-operator/blob/release-4.10/pkg/tasks/prometheusoperator.go#L77-L120
[2] https://github.com/prometheus-operator/prometheus-operator/issues/4437
[3] https://bugzilla.redhat.com/attachment.cgi?id=1847600
[4] https://bugzilla.redhat.com/attachment.cgi?id=1847601
[5] https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-operator/deployment.yaml#L13
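For readers without the source handy, the ordering described above amounts to roughly the following. This is a hedged paraphrase of [1], not the actual CMO code; the type and method names are hypothetical.

```go
package tasks

import "context"

// PrometheusOperatorTaskSketch is a hypothetical stand-in for the real
// task in [1]; the field and method names are illustrative only.
type PrometheusOperatorTaskSketch struct {
	createOrUpdateService    func(context.Context) error
	createOrUpdateDeployment func(context.Context) error
	createOrUpdateWebhook    func(context.Context) error
}

func (t *PrometheusOperatorTaskSketch) Run(ctx context.Context) error {
	// 1. Service first, so the webhook's DNS name has something to
	//    resolve to once endpoints are ready.
	if err := t.createOrUpdateService(ctx); err != nil {
		return err
	}
	// 2. Deployment next, so pods can back the Service.
	if err := t.createOrUpdateDeployment(ctx); err != nil {
		return err
	}
	// 3. ValidatingWebhookConfiguration last. Even with this ordering, a
	//    single-replica deployment can leave the endpoint transiently
	//    unready during upgrades, which matches the observed DNS blip.
	return t.createOrUpdateWebhook(ctx)
}
```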
Checked with 4.11.0-0.nightly-2022-06-04-014713: did not see prometheus-operator/prometheus-operator-admission-webhook report Degraded; see the attached picture.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069