Description of problem:

We found out that since ~December 10th, our openstack-proxy CI job fails with this error:

level=error msg=Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/ipi-install-install/artifacts/.openshift_install.log

Version-Release number of selected component (if applicable):
All releases, including 4.10

kube-rbac-proxy-rules.log shows that 9093 is correctly listening:

I0104 18:29:51.069593 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
I0104 18:29:51.070637 1 main.go:181] Valid token audiences:
I0104 18:29:51.070693 1 main.go:305] Reading certificate files
I0104 18:29:51.070830 1 main.go:339] Starting TCP socket on 0.0.0.0:9093
I0104 18:29:51.071260 1 main.go:346] Listening securely on 0.0.0.0:9093

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-77668cc68f-ml9c7_kube-rbac-proxy-rules.log

But alertmanager-proxy.log shows that the probe fails to connect:

2022/01/04 18:34:51 reverseproxy.go:490: http: proxy error: dial tcp [::1]:9093: connect: connection refused

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager-proxy.log

Not sure if this is related, but we also found this error in thanos-query.log:

level=error ts=2022-01-04T18:29:54.429239319Z caller=query.go:488 msg="failed to resolve addresses for storeAPIs" err="lookup SRV records \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": could not resolve \"_grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local\": all servers responded with errors to at least one search domain. Errs ;could not resolve _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.: exchange: read udp 10.128.2.15:38317->172.30.0.10:53: i/o timeout;could not resolve _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.openshift-monitoring.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.openshift-monitoring.svc.cluster.local.: exchange: read udp 10.128.2.15:33625->172.30.0.10:53: i/o timeout;could not resolve _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.svc.cluster.local.: no servers returned a viable answer. Errs ;resolution against server 172.30.0.10 for _grpc._tcp.prometheus-operated.openshift-monitoring.svc.cluster.local.svc.cluster.local.: exchange: read udp 10.128.2.15:39156->172.30.0.10:53: i/o timeout"

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-77668cc68f-ml9c7_thanos-query.log

We are wondering if this could be caused by https://github.com/openshift/cluster-monitoring-operator/commit/49960048ab89409fbe70f0c7cbf3b0a53895c681

Note: our OCP cluster is running behind a proxy and configured with httpProxy in install-config.yaml. It tests the use case of an OpenShift on OpenStack deployment where OpenShift is deployed on an isolated network, with a proxy between the Internet and that isolated network. The job worked fine in early December and suddenly broke. We can only reproduce it when openshift/release/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-commands.sh is executed in the cluster.
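For reference, the proxy settings mentioned above are declared in install-config.yaml; the sketch below uses placeholder values (hostname, port, noProxy), not the ones from the actual job. After installation, the effective settings can be read back from the cluster-wide Proxy object:

  # Sketch of the install-config.yaml stanza (placeholder values):
  #
  #   proxy:
  #     httpProxy: http://<proxy-host>:3128
  #     httpsProxy: http://<proxy-host>:3128
  #     noProxy: <isolated-network-cidr>
  #
  # On a running cluster, the resulting cluster-wide proxy configuration can be checked with:
  oc get proxy/cluster -o yaml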
Note that we run into a similar, if not the same, error in 4.9:

time="2022-01-04T11:16:16Z" level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas"

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/ipi-install-install/artifacts/.openshift_install.log
(In reply to Emilien Macchi from comment #0)
> kube-rbac-proxy-rules.log shows that 9093 is correctly listening:
>
> I0104 18:29:51.069593 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
> I0104 18:29:51.070637 1 main.go:181] Valid token audiences:
> I0104 18:29:51.070693 1 main.go:305] Reading certificate files
> I0104 18:29:51.070830 1 main.go:339] Starting TCP socket on 0.0.0.0:9093
> I0104 18:29:51.071260 1 main.go:346] Listening securely on 0.0.0.0:9093
>
> Source:
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-77668cc68f-ml9c7_kube-rbac-proxy-rules.log

Note that this is the log of the kube-rbac-proxy wrapping thanos-querier. In the alertmanager pod, port 9093 is served by alertmanager itself.

The alertmanager container fails to start:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager_previous.log

The next step in the startup phase should be to "wait for gossip to settle".
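For anyone retracing this from a live cluster rather than the CI artifacts, a minimal sketch of how to pull the crashed container's previous log and its restart count (pod and container names are taken from the artifacts above):

  # Logs of the previous (crashed) alertmanager container instance:
  oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager --previous

  # Restart count of the alertmanager container:
  oc -n openshift-monitoring get pod alertmanager-main-0 \
    -o jsonpath='{.status.containerStatuses[?(@.name=="alertmanager")].restartCount}{"\n"}'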
@emilien thanks for the detailed report!

Looking at the events, the kubelet restarts the alertmanager-main-* pods because they fail the startup probe, and as Jan pointed out, it looks like the alertmanager container never reaches the point where it starts listening on localhost:9093.

Indeed, https://github.com/openshift/cluster-monitoring-operator/commit/49960048ab89409fbe70f0c7cbf3b0a53895c681 added the startup probe, but I'm confused as to why Alertmanager isn't binding to port 9093 in the first place. The startup probe has an initial delay of 20 seconds, which should be plenty of time.
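A minimal sketch of how the startup-probe failures and the resulting restarts can be inspected on the cluster (pod name assumed from the artifacts above):

  # Events for the pod (look for "Startup probe failed" and "Killing"):
  oc -n openshift-monitoring get events --field-selector involvedObject.name=alertmanager-main-0

  # Probe configuration and last container state as reported by the kubelet:
  oc -n openshift-monitoring describe pod alertmanager-main-0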
Looking at a previous run of the job (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896), it appears that there was already a long delay (about 24s) before alertmanager starts listening on localhost:9093. From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-1_alertmanager.log:

level=info ts=2021-12-08T10:38:56.326Z caller=main.go:221 msg="Starting Alertmanager" version="(version=0.22.2, branch=rhaos-4.10-rhel-8, revision=9e57964e63000a473388ad1d1e7d58e8e4385bea)"
level=info ts=2021-12-08T10:38:56.326Z caller=main.go:222 build_context="(go=go1.17.2, user=root@b543cf51da38, date=20211109-19:30:02)"
level=info ts=2021-12-08T10:39:20.703Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.753Z caller=main.go:514 msg=Listening address=127.0.0.1:9093

In a standard openshift/origin e2e job run (on AWS), it takes approximately 5s between the start of the process and the bind operation, which is still longer than I would have expected (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1526/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-operator/1478568578566328320/artifacts/e2e-agnostic-operator/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log).
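The ~24s figure above is just the gap between the "Starting Alertmanager" and "Listening" timestamps; a minimal sketch of extracting it from the downloaded log (the local file name is an assumption):

  # Show when the process started and when it bound 127.0.0.1:9093:
  grep -E 'Starting Alertmanager|msg=Listening' alertmanager.log
  # 2021-12-08T10:38:56.326Z -> 2021-12-08T10:39:20.753Z, i.e. ~24.4s before the bind.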
Regarding comment 1 about a similar error in 4.9, it looks like an issue with the persistent volume that couldn't be attached to the pod, but I'm not sure why the issue doesn't resolve itself (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/persistentvolumes.json, the PVs are ok).

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/events.json:

{
  "apiVersion": "v1",
  "count": 25,
  "eventTime": null,
  "firstTimestamp": "2022-01-04T10:43:38Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "prometheus-k8s-0",
    "namespace": "openshift-monitoring",
    "resourceVersion": "25576",
    "uid": "1841a9e8-0df4-4efe-a006-bb139eca3df3"
  },
  "kind": "Event",
  "lastTimestamp": "2022-01-04T11:18:30Z",
  "message": "AttachVolume.Attach failed for volume \"pvc-cef8ec82-d0e7-4c65-a299-b6cf14ca0801\" : status of volume 936074d2-627c-4719-8110-ae634c42fd17 is error",
  "metadata": {
    "creationTimestamp": "2022-01-04T10:43:38Z",
    "name": "prometheus-k8s-0.16c70b4ddbacd93a",
    "namespace": "openshift-monitoring",
    "resourceVersion": "36627",
    "uid": "1b1b5569-0d96-491c-816a-7b3a0827a4a5"
  },
  "reason": "FailedAttachVolume",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "attachdetach-controller"
  },
  "type": "Warning"
}
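Since the event points at the backing Cinder volume being in an error state on the OpenStack side, a minimal sketch of how one might cross-check it (volume and PV names are taken from the event above; credentials for the underlying OpenStack cloud are assumed):

  # PVC/PV status as seen from the cluster:
  oc -n openshift-monitoring get pvc
  oc get pv pvc-cef8ec82-d0e7-4c65-a299-b6cf14ca0801

  # Status of the backing Cinder volume on the OpenStack side:
  openstack volume show 936074d2-627c-4719-8110-ae634c42fd17 -c status -c attachments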
Checked on openstack http_proxy with 4.10.0-0.nightly-2022-01-13-061145: the startupProbe is updated as below and all pods are healthy.

# oc -n openshift-monitoring get alertmanager main -oyaml
...
      - name: alertmanager
        resources: {}
        startupProbe:
          exec:
            command:
            - sh
            - -c
            - exec curl http://localhost:9093/-/ready
          failureThreshold: 4
          initialDelaySeconds: 20
          periodSeconds: 10
          timeoutSeconds: 3
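For reference, assuming standard kubelet startup-probe semantics (first probe after initialDelaySeconds, then one probe every periodSeconds, restart after failureThreshold consecutive failures), the values above give the container a rough grace period before it is restarted:

  # Approximate time budget before the kubelet restarts the container on startup-probe failure:
  #   initialDelaySeconds + failureThreshold * periodSeconds
  #   = 20 + 4 * 10
  #   = 60 seconds
  # which comfortably covers the ~24s delay observed above before Alertmanager binds to 9093.

  # Confirm the pods are healthy:
  oc -n openshift-monitoring get pods | grep alertmanager-main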
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056