Bug 1961158

Summary: thanos-ruler pods fail to start with "cannot unmarshal DNS message"
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.6
Target Release: 4.6.z
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: Regression
Reporter: Prashant Balachandran <pnair>
Assignee: Prashant Balachandran <pnair>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, aos-bugs, erooth, juzhao, kakkoyun, lcosic, openshift-bugzilla-robot, pkrupa, spasquie
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 1957646
Bug Depends On: 1957646
Last Closed: 2021-06-01 12:10:08 UTC

Description Prashant Balachandran 2021-05-17 11:43:01 UTC
+++ This bug was initially created as a clone of Bug #1957646 +++

The issue is fixed in 4.7.0-0.nightly-2021-05-07-004616:
# oc -n openshift-user-workload-monitoring get po --show-labels
NAME                                  READY   STATUS    RESTARTS   AGE    LABELS
prometheus-operator-8d4d69888-fc8k9   2/2     Running   0          3m8s   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=v0.44.1,pod-template-hash=8d4d69888
prometheus-user-workload-0            5/5     Running   1          3m4s   app=prometheus,controller-revision-hash=prometheus-user-workload-99c9d5494,operator.prometheus.io/name=user-workload,operator.prometheus.io/shard=0,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-0
prometheus-user-workload-1            5/5     Running   1          3m4s   app=prometheus,controller-revision-hash=prometheus-user-workload-99c9d5494,operator.prometheus.io/name=user-workload,operator.prometheus.io/shard=0,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-1
thanos-ruler-user-workload-0          3/3     Running   0          3m1s   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7bbdf8c4,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-0,thanos-ruler=user-workload
thanos-ruler-user-workload-1          3/3     Running   0          3m1s   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7bbdf8c4,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-1,thanos-ruler=user-workload

Comment 2 Junqi Zhao 2021-05-19 08:13:12 UTC
Reproduced with 4.6.0-0.nightly-2021-05-15-131411:
# oc -n openshift-user-workload-monitoring get po | grep thanos-ruler-user-workload
thanos-ruler-user-workload-0           2/3     CrashLoopBackOff   3          84s
thanos-ruler-user-workload-1           2/3     CrashLoopBackOff   3          84s

# oc -n openshift-user-workload-monitoring describe pod thanos-ruler-user-workload-0
  thanos-ruler:
   ...
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   cords \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-05-19T08:10:30.051226335Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.051238915Z caller=http.go:64 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551420564Z caller=http.go:83 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551515258Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-05-19T08:10:30.55161247Z caller=main.go:212 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374"
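The failing call in the logs above is Thanos Rule's SRV-based discovery of the Alertmanager endpoints: `_web._tcp.alertmanager-operated.openshift-monitoring.svc` is the standard `_service._proto.name` SRV query name, and "cannot unmarshal DNS message" is the error Go's built-in resolver returns when it cannot parse the DNS response. A minimal Go sketch of the equivalent lookup is below; the `srvName` helper is illustrative, not part of Thanos, and the live lookup only succeeds where the cluster DNS service (172.30.0.10 here) is reachable.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// srvName builds the "_service._proto.name" query name that
// net.LookupSRV sends for a given service, protocol, and domain.
// Thanos Rule issues the same query for its Alertmanager address.
func srvName(service, proto, name string) string {
	return fmt.Sprintf("_%s._%s.%s", service, proto, name)
}

func main() {
	// The name seen in the error message above.
	fmt.Println(srvName("web", "tcp", "alertmanager-operated.openshift-monitoring.svc"))

	// Roughly the discovery step Thanos performs; outside the
	// cluster this fails with a lookup error instead of resolving.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	_, addrs, err := net.DefaultResolver.LookupSRV(ctx, "web", "tcp",
		"alertmanager-operated.openshift-monitoring.svc")
	if err != nil {
		// Go's pure resolver surfaces unparseable responses as
		// "cannot unmarshal DNS message", the failure in this bug.
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```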

Comment 3 Junqi Zhao 2021-05-19 08:14:14 UTC
Tested with the not-yet-merged PR; the issue is fixed:
# oc -n openshift-user-workload-monitoring get po --show-labels
NAME                                   READY   STATUS    RESTARTS   AGE   LABELS
prometheus-operator-644fd69b76-pwgfk   2/2     Running   0          17m   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=v0.42.1,pod-template-hash=644fd69b76
prometheus-user-workload-0             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-0
prometheus-user-workload-1             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-1
thanos-ruler-user-workload-0           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-0,thanos-ruler=user-workload
thanos-ruler-user-workload-1           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-1,thanos-ruler=user-workload

Comment 8 errata-xmlrpc 2021-06-01 12:10:08 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.31 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2100