Bug 1963100 - thanos-ruler pods got failed to start up for "cannot unmarshal DNS message"

Summary: thanos-ruler pods got failed to start up for "cannot unmarshal DNS message"

Keywords:
Status:	CLOSED DUPLICATE of bug 1953518
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	s390x
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Simon Pasquier
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-21 12:32 UTC by Sanjaya
Modified:	2021-05-21 14:08 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-21 14:08:28 UTC
Target Upstream Version:
Embargoed:

Description Sanjaya 2021-05-21 12:32:27 UTC

Description of problem:
Hi,
I am trying to configure user workload monitoring on OCP 4.7.10, but the thanos-ruler pods fail to start with the error "cannot unmarshal DNS message".
Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.7.10
Server Version: 4.7.10
Kubernetes Version: v1.20.0+e3fdce4
# oc -n openshift-user-workload-monitoring get po
NAME                                 READY   STATUS             RESTARTS   AGE
prometheus-operator-f754fdb6-r7s4g   2/2     Running            0          3m9s
prometheus-user-workload-0           5/5     Running            1          3m4s
prometheus-user-workload-1           5/5     Running            1          3m4s
thanos-ruler-user-workload-0         2/3     CrashLoopBackOff   4          3m2s
thanos-ruler-user-workload-1         2/3     CrashLoopBackOff   4          3m2s

pod/thanos-ruler-user-workload-0 log
--------------------------------------------
level=warn ts=2021-05-19T09:04:11.681836334Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T09:04:11.681850165Z caller=grpc.go:123 component=rules service=gRPC/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T09:04:11.681873189Z caller=grpc.go:136 component=rules service=gRPC/server component=rule msg="gracefully stopping internal server"
level=info ts=2021-05-19T09:04:11.681917501Z caller=grpc.go:149 component=rules service=gRPC/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-05-19T09:04:11.681933626Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T09:04:11.681942583Z caller=http.go:65 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-05-19T09:04:12.086697314Z caller=rule.go:774 component=rules err="read query instant response: perform GET request against https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query: Post \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query\": context canceled" query="absent(up{job=\"prometheus-example-app\",namespace=\"mon-ns1\"} == 1)"
level=warn ts=2021-05-19T09:04:12.086746013Z caller=manager.go:598 component=rules group=example msg="Evaluating rule failed" rule="alert: prometheus-example-app-instance-down-alert\nexpr: absent(up{job=\"prometheus-example-app\",namespace=\"mon-ns1\"} == 1)\nlabels:\n  namespace: mon-ns1\n  severity: warning\nannotations:\n  message: Instance down alert triggered, prometheus-example-app instance was down.\n" err="no query API server reachable"
level=info ts=2021-05-19T09:04:12.182198137Z caller=http.go:84 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T09:04:12.182617499Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-05-19T09:04:12.182982177Z caller=main.go:157 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:157\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_s390x.s:779"

How reproducible:
Always

Steps to Reproduce:
1. Enable user workload monitoring (enableUserWorkload).
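
For reference, step 1 is typically done by setting enableUserWorkload: true in the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. The following is a minimal sketch, not the reporter's exact procedure. It assumes the ConfigMap does not exist yet (applying this manifest would replace any existing settings), and the file name cluster-monitoring-config.yaml is an arbitrary choice that does not come from the report.

```shell
# Write a minimal manifest that enables user workload monitoring.
# Assumption: the cluster has no existing cluster-monitoring-config ConfigMap;
# if it does, merge the enableUserWorkload key into it instead of replacing it.
cat > cluster-monitoring-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
```

The manifest can then be applied with `oc apply -f cluster-monitoring-config.yaml`, and the resulting pods checked with the `oc -n openshift-user-workload-monitoring get po` command used earlier in this report.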

Actual results:
The thanos-ruler pods fail to start with the error "cannot unmarshal DNS message".

Expected results:
The thanos-ruler pods should reach the Running state.

Additional info:

Comment 1 Damien Grisonnet 2021-05-21 14:08:28 UTC


*** This bug has been marked as a duplicate of bug 1953518 ***