Bug 1961158
| Summary: | thanos-ruler pods failed to start up for "cannot unmarshal DNS message" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Prashant Balachandran <pnair> |
| Component: | Monitoring | Assignee: | Prashant Balachandran <pnair> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | alegrand, anpicker, aos-bugs, erooth, juzhao, kakkoyun, lcosic, openshift-bugzilla-robot, pkrupa, spasquie |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 4.6.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1957646 | Environment: | |
| Last Closed: | 2021-06-01 12:10:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1957646 | | |
| Bug Blocks: | | | |
Description

Prashant Balachandran 2021-05-17 11:43:01 UTC
Reproduced with 4.6.0-0.nightly-2021-05-15-131411:

```
# oc -n openshift-user-workload-monitoring get po | grep thanos-ruler-user-workload
thanos-ruler-user-workload-0   2/3   CrashLoopBackOff   3   84s
thanos-ruler-user-workload-1   2/3   CrashLoopBackOff   3   84s

# oc -n openshift-user-workload-monitoring describe pod thanos-ruler-user-workload-0
thanos-ruler:
  ...
  State:          Waiting
    Reason:       CrashLoopBackOff
  Last State:     Terminated
    Reason:       Error
    Message:      cords \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-05-19T08:10:30.051226335Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.051238915Z caller=http.go:64 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551420564Z caller=http.go:83 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-05-19T08:10:30.551515258Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
```
```
level=error ts=2021-05-19T08:10:30.55161247Z caller=main.go:212 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1374"
```

Tested with the not-yet-merged PR; the issue is fixed:

```
# oc -n openshift-user-workload-monitoring get po --show-labels
NAME                                   READY   STATUS    RESTARTS   AGE   LABELS
prometheus-operator-644fd69b76-pwgfk   2/2     Running   0          17m   app.kubernetes.io/component=controller,app.kubernetes.io/name=prometheus-operator,app.kubernetes.io/version=v0.42.1,pod-template-hash=644fd69b76
prometheus-user-workload-0             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-0
prometheus-user-workload-1             4/4     Running   1          17m   app=prometheus,controller-revision-hash=prometheus-user-workload-587d78bbdc,prometheus=user-workload,statefulset.kubernetes.io/pod-name=prometheus-user-workload-1
thanos-ruler-user-workload-0           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-0,thanos-ruler=user-workload
thanos-ruler-user-workload-1           3/3     Running   0          17m   app=thanos-ruler,controller-revision-hash=thanos-ruler-user-workload-7d4c766bc6,statefulset.kubernetes.io/pod-name=thanos-ruler-user-workload-1,thanos-ruler=user-workload
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.31 bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2100
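The failing lookup in the logs is Thanos' DNS service discovery resolving the SRV record `_web._tcp.alertmanager-operated.openshift-monitoring.svc` for the Alertmanager statically configured via a `dnssrv+` URL. As a rough, simplified sketch (not Thanos' actual code), the snippet below shows how such a URL maps to the name handed to a `net.LookupSRV`-style query; `srvLookupName` is a hypothetical helper invented here for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// srvLookupName derives the DNS SRV owner name implied by a
// Thanos-style "dnssrv+" URL. This is a simplified illustration only;
// the real logic lives in Thanos' DNS discovery package.
func srvLookupName(rawURL string) string {
	// Strip the "dnssrv+" discovery prefix, then the URL scheme.
	s := strings.TrimPrefix(rawURL, "dnssrv+")
	if i := strings.Index(s, "://"); i >= 0 {
		s = s[i+3:]
	}
	// Drop any path and port; what remains is the SRV name to resolve.
	s = strings.SplitN(s, "/", 2)[0]
	s = strings.SplitN(s, ":", 2)[0]
	return s
}

func main() {
	name := srvLookupName("dnssrv+http://_web._tcp.alertmanager-operated.openshift-monitoring.svc")
	fmt.Println(name)
	// Thanos then resolves this name via the Go resolver; when the
	// cluster DNS server (172.30.0.10:53 above) returns a response the
	// pure Go resolver cannot parse, the query fails with the
	// "cannot unmarshal DNS message" error seen in the pod logs.
}
```

Running this prints `_web._tcp.alertmanager-operated.openshift-monitoring.svc`, the exact name appearing in the failed lookups above.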