Bug 2037073
Summary: | Alertmanager container fails to start because the startup probe never succeeds | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Emilien Macchi <emacchi>
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.10 | CC: | amuller, anpicker, aos-bugs, emilien, erooth, jfajersk, wking
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | All | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-12 04:40:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Emilien Macchi
2022-01-04 20:36:48 UTC
Note that we run into a similar, if not the same, error in 4.9:

time="2022-01-04T11:16:16Z" level=error msg="Cluster operator monitoring Degraded is True with UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas"

Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/ipi-install-install/artifacts/.openshift_install.log

(In reply to Emilien Macchi from comment #0)
> kube-rbac-proxy-rules.log shows that 9093 is correctly listening:
>
> I0104 18:29:51.069593 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
> I0104 18:29:51.070637 1 main.go:181] Valid token audiences:
> I0104 18:29:51.070693 1 main.go:305] Reading certificate files
> I0104 18:29:51.070830 1 main.go:339] Starting TCP socket on 0.0.0.0:9093
> I0104 18:29:51.071260 1 main.go:346] Listening securely on 0.0.0.0:9093
>
> Source: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_thanos-querier-77668cc68f-ml9c7_kube-rbac-proxy-rules.log

Note that this is the log of the kube-rbac-proxy wrapping thanos-querier. In the alertmanager pod, port 9093 is served by alertmanager itself. The alertmanager container fails to start:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/24796/rehearse-24796-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1478421936122892288/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager_previous.log

The next step in the startup phase should be to "wait for gossip to settle".

@emilien thanks for the detailed report! Looking at the events, the kubelet restarts the alertmanager-main-* pods because they fail the startup probe, and as Jan pointed out, it looks like the alertmanager-main container never reaches the point where it starts listening on localhost:9093. Indeed, https://github.com/openshift/cluster-monitoring-operator/commit/49960048ab89409fbe70f0c7cbf3b0a53895c681 added the startup probe, but I'm confused as to why Alertmanager isn't binding to port 9093 in the first place. The startup probe has an initial delay of 20 seconds, which should be plenty of time. Looking at a previous run of the job (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896), it appears that there was already a long delay (about 24s) before Alertmanager started listening on localhost:9093.
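As a side note on the probe arithmetic (a rough sketch, not quoted from this report): the kubelet gives a container roughly initialDelaySeconds + failureThreshold x periodSeconds to pass its startup probe before restarting it. Using the post-fix values quoted in the verification comment further down (the pre-fix values are not shown in this report), the budget works out to about 60s, comfortably above the ~24s bind delay observed:

# Hypothetical illustration only; substitute the values rendered on your cluster.
initial_delay=20; period=10; failure_threshold=4
echo "startup window ~ $((initial_delay + period * failure_threshold))s"   # -> ~60s

# Inspect the probe actually rendered on the pod:
oc -n openshift-monitoring get pod alertmanager-main-0 \
  -o jsonpath='{.spec.containers[?(@.name=="alertmanager")].startupProbe}{"\n"}'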
From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-e2e-openstack-proxy/1468522995436752896/artifacts/e2e-openstack-proxy/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-1_alertmanager.log:

level=info ts=2021-12-08T10:38:56.326Z caller=main.go:221 msg="Starting Alertmanager" version="(version=0.22.2, branch=rhaos-4.10-rhel-8, revision=9e57964e63000a473388ad1d1e7d58e8e4385bea)"
level=info ts=2021-12-08T10:38:56.326Z caller=main.go:222 build_context="(go=go1.17.2, user=root@b543cf51da38, date=20211109-19:30:02)"
level=info ts=2021-12-08T10:39:20.703Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.751Z caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2021-12-08T10:39:20.753Z caller=main.go:514 msg=Listening address=127.0.0.1:9093

Looking at a standard openshift/origin e2e job run (on AWS), it takes approximately 5s between the start of the process and the bind operation, which is still longer than I would have expected (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1526/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-operator/1478568578566328320/artifacts/e2e-agnostic-operator/gather-extra/artifacts/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log).

Regarding comment 1 about a similar error in 4.9, it looks like an issue with a persistent volume that couldn't be attached to the pod, but I'm not sure why the issue doesn't resolve itself (according to https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/persistentvolumes.json, the PVs are ok).

From https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-e2e-openstack-proxy/1478308095686873088/artifacts/e2e-openstack-proxy/gather-extra/artifacts/events.json:

{
  "apiVersion": "v1",
  "count": 25,
  "eventTime": null,
  "firstTimestamp": "2022-01-04T10:43:38Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "prometheus-k8s-0",
    "namespace": "openshift-monitoring",
    "resourceVersion": "25576",
    "uid": "1841a9e8-0df4-4efe-a006-bb139eca3df3"
  },
  "kind": "Event",
  "lastTimestamp": "2022-01-04T11:18:30Z",
  "message": "AttachVolume.Attach failed for volume \"pvc-cef8ec82-d0e7-4c65-a299-b6cf14ca0801\" : status of volume 936074d2-627c-4719-8110-ae634c42fd17 is error",
  "metadata": {
    "creationTimestamp": "2022-01-04T10:43:38Z",
    "name": "prometheus-k8s-0.16c70b4ddbacd93a",
    "namespace": "openshift-monitoring",
    "resourceVersion": "36627",
    "uid": "1b1b5569-0d96-491c-816a-7b3a0827a4a5"
  },
  "reason": "FailedAttachVolume",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "attachdetach-controller"
  },
  "type": "Warning"
}
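As an aside on the 4.9 FailedAttachVolume event above, a quick triage sketch (the commands below are illustrative, not taken from this report): the PV object can look healthy while the backing Cinder volume sits in an "error" state on the OpenStack side, so it is worth checking both.

# List attach failures reported against the affected pod:
oc -n openshift-monitoring get events \
  --field-selector reason=FailedAttachVolume,involvedObject.name=prometheus-k8s-0

# Check the backing volume itself (ID taken from the event message above);
# the PV/PVC can be Bound even when the Cinder volume is in "error" state:
openstack volume show 936074d2-627c-4719-8110-ae634c42fd17 -c status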
Checked with 4.10.0-0.nightly-2022-01-13-061145 on an OpenStack http_proxy cluster: the startupProbe is updated as below and all pods are healthy.

# oc -n openshift-monitoring get alertmanager main -oyaml
...
- name: alertmanager
  resources: {}
  startupProbe:
    exec:
      command:
        - sh
        - -c
        - exec curl http://localhost:9093/-/ready
    failureThreshold: 4
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
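As a footnote to the verification above, the same check the probe runs can be exercised by hand; a minimal sketch, assuming the pod name alertmanager-main-0 and the alertmanager=main label that prometheus-operator puts on Alertmanager pods:

# Run the probe command manually inside the alertmanager container:
oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
  sh -c 'exec curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9093/-/ready'

# Confirm the kubelet is no longer restarting the containers:
oc -n openshift-monitoring get pods -l alertmanager=main \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount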