Bug 2037073
Summary: | Alertmanager container fails to start because the startup probe never succeeds | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Emilien Macchi <emacchi>
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, emilien, erooth, jfajersk, wking
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | All | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-12 04:40:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Emilien Macchi
2022-01-04 20:36:48 UTC
Note that we run into a similar, if not the same, error in 4.9:

time="2022-01-04T11:16:16Z" level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas"

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/ipi-install-install/artifacts/.openshift_install.log

(In reply to Emilien Macchi from comment #0)
> kube-rbac-proxy-rules.log shows that 9093 is correctly listening:
>
> I0104 18:29:51.069593 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
> I0104 18:29:51.070637 1 main.go:181] Valid token audiences:
> I0104 18:29:51.070693 1 main.go:305] Reading certificate files
> I0104 18:29:51.070830 1 main.go:339] Starting TCP socket on 0.0.0.0:9093
> I0104 18:29:51.071260 1 main.go:346] Listening securely on 0.0.0.0:9093
>
> Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-77668cc68f-ml9c7_kube-rbac-proxy-rules.log

Note that this is the log of the kube-rbac-proxy wrapping thanos-querier. In the alertmanager pod, port 9093 is served by alertmanager itself. The alertmanager container fails to start:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager_previous.log

The next step in the startup phase should be to "wait for gossip to settle".

@emilien thanks for the detailed report! Looking at the events, the kubelet restarts the alertmanager-main-* pods because they fail the startup probe, and as Jan pointed out, it looks like the alertmanager-main container never reaches the point where it starts listening on localhost:9093. Indeed, https://github.com/openshift/cluster-monitoring-operator/commit/49960048ab89409fbe70f0c7cbf3b0a53895c681 added the startup probe, but I'm confused as to why Alertmanager isn't binding to port 9093 in the first place. The startup probe has an initial delay of 20 seconds, which should be plenty of time. Looking at a previous run of the job (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896), it appears that there was already a long delay (about 24s) before Alertmanager started listening on localhost:9093.
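As a side note on the probe arithmetic (a rough sketch, not quoted from this report): the kubelet gives a container roughly initialDelaySeconds + failureThreshold x periodSeconds to pass its startup probe before restarting it. Using the post-fix values quoted in the verification comment further down (the pre-fix values are not shown in this report), the budget works out to about 60s, comfortably above the ~24s bind delay observed:

# Hypothetical illustration only; substitute the values rendered on your cluster.
initial_delay=20; period=10; failure_threshold=4
echo "startup window ~ $((initial_delay + period * failure_threshold))s"   # -> ~60s

# Inspect the probe actually rendered on the pod:
oc -n openshift-monitoring get pod alertmanager-main-0 \
  -o jsonpath='{.spec.containers[?(@.name=="alertmanager")].startupProbe}{"\n"}'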
From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-1_alertmanager.log:

level=info ts=2021-12-08T10:38:56.326Z caller=main.go:221 msg="Starting Alertmanager" version="(version=0.22.2, branch=rhaos-4.10-rhel-8, revision=9e57964e63000a473388ad1d1e7d58e8e4385bea)"
level=info ts=2021-12-08T10:38:56.326Z caller=main.go:222 build_context="(go=go1.17.2, user=root@b543cf51da38, date=20211109-19:30:02)"
level=info ts=2021-12-08T10:39:20.703Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.753Z caller=main.go:514 msg=Listening address=127.0.0.1:9093

Looking at a standard openshift/origin e2e job run (on AWS), it takes approximately 5s between the start of the process and the bind operation, which is still longer than I would have expected (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1526/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-operator/1478568578566328320/artifacts/e2e-agnostic-operator/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log).

Regarding comment 1 about a similar error in 4.9, it looks like an issue with a persistent volume that couldn't be attached to the pod, but I'm not sure why the issue doesn't resolve itself (according to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/persistentvolumes.json, the PVs are ok).

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/events.json:

{
  "apiVersion": "v1",
  "count": 25,
  "eventTime": null,
  "firstTimestamp": "2022-01-04T10:43:38Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "prometheus-k8s-0",
    "namespace": "openshift-monitoring",
    "resourceVersion": "25576",
    "uid": "1841a9e8-0df4-4efe-a006-bb139eca3df3"
  },
  "kind": "Event",
  "lastTimestamp": "2022-01-04T11:18:30Z",
  "message": "AttachVolume.Attach failed for volume \"pvc-cef8ec82-d0e7-4c65-a299-b6cf14ca0801\" : status of volume 936074d2-627c-4719-8110-ae634c42fd17 is error",
  "metadata": {
    "creationTimestamp": "2022-01-04T10:43:38Z",
    "name": "prometheus-k8s-0.16c70b4ddbacd93a",
    "namespace": "openshift-monitoring",
    "resourceVersion": "36627",
    "uid": "1b1b5569-0d96-491c-816a-7b3a0827a4a5"
  },
  "reason": "FailedAttachVolume",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "attachdetach-controller"
  },
  "type": "Warning"
}
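As an aside on the 4.9 FailedAttachVolume event above, a quick triage sketch (the commands below are illustrative, not taken from this report): the PV object can look healthy while the backing Cinder volume sits in an "error" state on the OpenStack side, so it is worth checking both.

# List attach failures reported against the affected pod:
oc -n openshift-monitoring get events \
  --field-selector reason=FailedAttachVolume,involvedObject.name=prometheus-k8s-0

# Check the backing volume itself (ID taken from the event message above);
# the PV/PVC can be Bound even when the Cinder volume is in "error" state:
openstack volume show 936074d2-627c-4719-8110-ae634c42fd17 -c status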
Checked with 4.10.0-0.nightly-2022-01-13-061145 on an OpenStack http_proxy cluster: the startupProbe is updated as below and all pods are healthy.

# oc -n openshift-monitoring get alertmanager main -oyaml
...
- name: alertmanager
  resources: {}
  startupProbe:
    exec:
      command:
        - sh
        - -c
        - exec curl http://localhost:9093/-/ready
    failureThreshold: 4
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
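As a footnote to the verification above, the same check the probe runs can be exercised by hand; a minimal sketch, assuming the pod name alertmanager-main-0 and the alertmanager=main label that prometheus-operator puts on Alertmanager pods:

# Run the probe command manually inside the alertmanager container:
oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
  sh -c 'exec curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9093/-/ready'

# Confirm the kubelet is no longer restarting the containers:
oc -n openshift-monitoring get pods -l alertmanager=main \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount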