As part of the "Improving Supportability of API webhooks"[1] enhancement, the kube-apiserver-operator checks webhook configurations and marks the kube-apiserver operator Degraded if a problem is detected with a webhook configuration. Occasionally, we observe:

Dec 07 15:36:32.866 - 8s E clusteroperator/kube-apiserver condition/Degraded status/True reason/ValidatingAdmissionWebhookConfigurationDegraded: prometheusrules.openshift.io: dial tcp: lookup prometheus-operator.openshift-monitoring.svc on 172.30.0.10:53: no such host

The issue resolves itself quickly enough that tests continue to run successfully (hence the low severity), but this seems to point to a race condition where a ValidatingAdmissionWebhookConfiguration is created before the underlying Service exists. Please take a look.

[1]: https://github.com/openshift/enhancements/blob/e878f045a66950b3436d00150b178681906ea2d8/enhancements/kube-apiserver/api-webhook-supportability.md
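For illustration only, here is a minimal client-go sketch of the kind of guard that would close a race like this: wait for the webhook's backing Service to have a ready endpoint before registering the ValidatingWebhookConfiguration. The function name, package name, and polling intervals are hypothetical, not taken from any operator's actual code.

```go
package webhookguard

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForWebhookBackend polls until the webhook's backing Service has at
// least one ready endpoint address. Only once that holds would it be
// safe(ish) to create the ValidatingWebhookConfiguration, since the
// apiserver starts dialing <svc>.<ns>.svc as soon as the webhook exists.
func waitForWebhookBackend(ctx context.Context, c kubernetes.Interface, ns, svc string) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		ep, err := c.CoreV1().Endpoints(ns).Get(ctx, svc, metav1.GetOptions{})
		if err != nil {
			return false, nil // Service/Endpoints not created yet; keep polling
		}
		for _, subset := range ep.Subsets {
			if len(subset.Addresses) > 0 {
				return true, nil // at least one ready address behind the Service
			}
		}
		return false, nil
	})
}
```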
Thanks for flagging - I took a look at CMO based on your suggestion, and the relevant block of code [1] can be summarised as:

1. Create the Service ...
2. Create the Deployment ...
3. Create the ValidatingWebhook

So it is not clear to me that CMO is doing anything wrong in that regard (see the sketch after the references). Could it be solely DNS related? Right now, the prometheus operator exposes that endpoint for the validating webhook, but we have an upstream issue[2] to decouple that and make the webhook a standalone, HA component.

I've dug around in Prometheus, and the Service does exist at the time the log was emitted, according to `kube_service_info{namespace="openshift-monitoring", service="prometheus-operator"}`[3]. There is a blip in `kube_endpoint_address_not_ready{namespace="openshift-monitoring",endpoint=~"prometheus-opera.+"}`[4], but it recovers more than 20 minutes prior to the error log.

It should be noted that because we deploy the prometheus-operator with only a single replica[5], there can be a transient disruption to the service during upgrades, which will be addressed by the linked issue[2].

[1] https://github.com/openshift/cluster-monitoring-operator/blob/release-4.10/pkg/tasks/prometheusoperator.go#L77-L120
[2] https://github.com/prometheus-operator/prometheus-operator/issues/4437
[3] https://bugzilla.redhat.com/attachment.cgi?id=1847600
[4] https://bugzilla.redhat.com/attachment.cgi?id=1847601
[5] https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-operator/deployment.yaml#L13
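For readers without the source handy, the ordering described above amounts to roughly the following. This is a hedged paraphrase of [1], not the actual CMO code; the type and method names are hypothetical.

```go
package tasks

import "context"

// PrometheusOperatorTaskSketch is a hypothetical stand-in for the real
// task in [1]; the field and method names are illustrative only.
type PrometheusOperatorTaskSketch struct {
	createOrUpdateService    func(context.Context) error
	createOrUpdateDeployment func(context.Context) error
	createOrUpdateWebhook    func(context.Context) error
}

func (t *PrometheusOperatorTaskSketch) Run(ctx context.Context) error {
	// 1. Service first, so the webhook's DNS name has something to
	//    resolve to once endpoints are ready.
	if err := t.createOrUpdateService(ctx); err != nil {
		return err
	}
	// 2. Deployment next, so pods can back the Service.
	if err := t.createOrUpdateDeployment(ctx); err != nil {
		return err
	}
	// 3. ValidatingWebhookConfiguration last. Even with this ordering, a
	//    single-replica deployment can leave the endpoint transiently
	//    unready during upgrades, which matches the observed DNS blip.
	return t.createOrUpdateWebhook(ctx)
}
```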
Checked with 4.11.0-0.nightly-2022-06-04-014713: did not see prometheus-operator/prometheus-operator-admission-webhook report Degraded; see the attached picture.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069