Bug 1961158 - thanos-ruler pods failed to start up for "cannot unmarshal DNS message"
Summary: thanos-ruler pods failed to start up for "cannot unmarshal DNS message"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1957646
Blocks:
 
Reported: 2021-05-17 11:43 UTC by Prashant Balachandran
Modified: 2021-09-17 05:25 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1957646
Environment:
Last Closed: 2021-06-01 12:10:08 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/thanos pull 57 (open): Bug 1961158: Changing resolver to address thanos pod issues, last updated 2021-05-17 12:26:05 UTC
Red Hat Product Errata RHBA-2021:2100, last updated 2021-06-01 12:10:42 UTC
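
The PR title points at the DNS resolver as the culprit. Thanos ships an alternative resolver built on github.com/miekg/dns (selectable as "miekgdns") because Go's built-in resolver can fail to parse some responses with exactly this "cannot unmarshal DNS message" error. What follows is a minimal sketch, not the actual PR diff, of resolving the same SRV record with that library; the nameserver address is the cluster DNS service seen in the logs below.

package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	// Same SRV record Thanos Ruler looks up for Alertmanager discovery.
	m := new(dns.Msg)
	m.SetQuestion("_web._tcp.alertmanager-operated.openshift-monitoring.svc.", dns.TypeSRV)

	// Query over TCP so large answers are not truncated; 172.30.0.10:53 is
	// the cluster DNS service address from the error messages in this bug.
	c := &dns.Client{Net: "tcp"}
	resp, _, err := c.Exchange(m, "172.30.0.10:53")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("%s:%d\n", srv.Target, srv.Port)
		}
	}
}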

Description Prashant Balachandran 2021-05-17 11:43:01 UTC
+++ This bug was initially created as a clone of Bug #1957646 +++

The issue is fixed with 4.7.0-0.nightly-2021-05-07-004616:
# oc -n openshift-user-workload-monitoring get po --show-labels
NAME                                  READY   STATUS    RESTARTS   AGE    LABELS
prometheus-operator-8d4d69888-fc8k9   2/2     Running   0          3m8s   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=v0.44.1,pod-template-hash=8d4d69888
prometheus-user-workload-0            5/5     Running   1          3m4s   app=prometheus,controller-revision-hash=prometheus-user-workload-99c9d5494,operator.prometheus.io/name=user-workload,operator.prometheus.io/shard=0,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-0
prometheus-user-workload-1            5/5     Running   1          3m4s   app=prometheus,controller-revision-hash=prometheus-user-workload-99c9d5494,operator.prometheus.io/name=user-workload,operator.prometheus.io/shard=0,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-1
thanos-ruler-user-workload-0          3/3     Running   0          3m1s   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7bbdf8c4,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-0,thanos-ruler=user-workload
thanos-ruler-user-workload-1          3/3     Running   0          3m1s   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7bbdf8c4,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-1,thanos-ruler=user-workload

Comment 2 Junqi Zhao 2021-05-19 08:13:12 UTC
Reproduced with 4.6.0-0.nightly-2021-05-15-131411:
# oc -n openshift-user-workload-monitoring get po | grep thanos-ruler-user-workload
thanos-ruler-user-workload-0           2/3     CrashLoopBackOff   3          84s
thanos-ruler-user-workload-1           2/3     CrashLoopBackOff   3          84s

# oc -n openshift-user-workload-monitoring describe pod thanos-ruler-user-workload-0
  thanos-ruler:
   ...
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   cords \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-05-19T08:10:30.051226335Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.051238915Z caller=http.go:64 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551420564Z caller=http.go:83 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551515258Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-05-19T08:10:30.55161247Z caller=main.go:212 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374"
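
The failing call is Go's built-in SRV lookup, which Thanos's default resolver uses for Alertmanager service discovery. Below is a minimal sketch of that lookup, assuming it runs inside the cluster where the service name resolves; "cannot unmarshal DNS message" is the error Go's native resolver returns when it cannot parse the response it gets back.

package main

import (
	"fmt"
	"net"
)

func main() {
	// Equivalent to resolving _web._tcp.alertmanager-operated.openshift-monitoring.svc,
	// the record named in the error messages above.
	_, addrs, err := net.LookupSRV("web", "tcp", "alertmanager-operated.openshift-monitoring.svc")
	if err != nil {
		// On affected clusters this prints the same
		// "cannot unmarshal DNS message" error.
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}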

Comment 3 Junqi Zhao 2021-05-19 08:14:14 UTC
Tested with the not-yet-merged PR; the issue is fixed:
# oc -n openshift-user-workload-monitoring get po --show-labels
NAME                                   READY   STATUS    RESTARTS   AGE   LABELS
prometheus-operator-644fd69b76-pwgfk   2/2     Running   0          17m   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=v0.42.1,pod-template-hash=644fd69b76
prometheus-user-workload-0             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-0
prometheus-user-workload-1             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-1
thanos-ruler-user-workload-0           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-0,thanos-ruler=user-workload
thanos-ruler-user-workload-1           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-1,thanos-ruler=user-workload

Comment 8 errata-xmlrpc 2021-06-01 12:10:08 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.31 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2100

