Bug 1953518 - thanos-ruler pods failed to start up with "cannot unmarshal DNS message"
Summary: thanos-ruler pods failed to start up with "cannot unmarshal DNS message"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard: UpdateRecommendationsBlocked
Duplicates: 1963100 1967514
Depends On:
Blocks: 1957646
 
Reported: 2021-04-26 09:50 UTC by Junqi Zhao
Modified: 2022-02-04 08:58 UTC
CC: 22 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:03:38 UTC
Target Upstream Version:
Embargoed:


Attachments
user-workload dump file (299.58 KB, application/gzip) - 2021-04-26 09:50 UTC, Junqi Zhao


Links
GitHub openshift/thanos pull 55 (closed): Bug 1953518: cmd/thanos: use miekgdns resolver as default (last updated 2021-06-07 11:53:22 UTC)
Red Hat Knowledge Base Solution 6092191 (last updated 2021-06-02 08:57:37 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:04:02 UTC)

Internal Links: 1970888 1970889

Description Junqi Zhao 2021-04-26 09:50:30 UTC
Created attachment 1775455 [details]
user-workload dump file

Description of problem:
After enabling UWM (user workload monitoring), the thanos-ruler pods fail to start up with "cannot unmarshal DNS message".
************************
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-26T09:12:43Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "281105"
  uid: 5c94bba4-1482-4dcd-931d-2e6af50f1723
************************
# oc -n openshift-user-workload-monitoring get po
NAME                                   READY   STATUS             RESTARTS   AGE
prometheus-operator-64b75455b6-54gbv   2/2     Running            0          33m
prometheus-user-workload-0             5/5     Running            1          33m
prometheus-user-workload-1             5/5     Running            1          33m
thanos-ruler-user-workload-0           2/3     CrashLoopBackOff   11         33m
thanos-ruler-user-workload-1           2/3     CrashLoopBackOff   11         33m

# oc -n openshift-user-workload-monitoring get po thanos-ruler-user-workload-0 -oyaml
...
  - containerID: cri-o://20b90fd8fd4d11d0090d749cf5ba203c6544ae11a864594d02c333890ad40f97
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:245581d21fb684b12536764c7b7d0e8e92556cded11b4b768ff910b80dff9fea
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:245581d21fb684b12536764c7b7d0e8e92556cded11b4b768ff910b80dff9fea
    lastState:
      terminated:
        containerID: cri-o://20b90fd8fd4d11d0090d749cf5ba203c6544ae11a864594d02c333890ad40f97
        exitCode: 1
        finishedAt: "2021-04-26T09:39:24Z"
        message: |
          cords \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=warn ts=2021-04-26T09:39:24.605130836Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.605137321Z caller=http.go:69 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.607301012Z caller=http.go:88 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.60733603Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=error ts=2021-04-26T09:39:24.607408597Z caller=main.go:156 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:156\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371"
        reason: Error
        startedAt: "2021-04-26T09:39:24Z"
    name: thanos-ruler
    ready: false
    restartCount: 10
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=thanos-ruler pod=thanos-ruler-user-workload-0_openshift-user-workload-monitoring(2fd1cac1-7bb8-44bf-973c-9c9d862c44df)
        reason: CrashLoopBackOff

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-25-183122

How reproducible:
always

Steps to Reproduce:
1. Enable user workload monitoring (UWM).

Actual results:
thanos-ruler pods failed to start up with "cannot unmarshal DNS message"

Expected results:
no error

Additional info:

Comment 1 Simon Pasquier 2021-04-26 12:21:13 UTC
I suspect that we're hitting the same issue that we had once with the Thanos querier, where we had to switch from the Go resolver to miekgdns. The root cause might be the same as described in https://github.com/golang/go/issues/36718; the TL;DR is that the Go resolver is too restrictive when the DNS response for SRV records is compressed.
The short-term fix would be to switch our downstream Thanos to use miekgdns by default (unfortunately the Prometheus operator doesn't allow controlling the DNS resolver type).
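If that's the case, the SRV records themselves should resolve fine with a full-featured DNS client, which would point at response parsing rather than missing data. A quick sanity check, as a sketch only, assuming a shell with dig available (e.g. a debug pod) and the default cluster.local cluster domain:

```
# Query the cluster DNS service (172.30.0.10 in the logs above) directly;
# dig copes with compressed SRV responses, unlike the Go stdlib resolver.
dig SRV _web._tcp.alertmanager-operated.openshift-monitoring.svc.cluster.local @172.30.0.10

# Forcing TCP sidesteps UDP size/compression behaviour entirely.
dig +tcp SRV _web._tcp.alertmanager-operated.openshift-monitoring.svc.cluster.local @172.30.0.10
```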

Comment 4 Junqi Zhao 2021-04-29 08:19:45 UTC
The issue is fixed with:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-29-063720   True        False         59m     Cluster version is 4.8.0-0.nightly-2021-04-29-063720

# oc -n openshift-user-workload-monitoring get po
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-6f6f968f79-h8m44   2/2     Running   0          5m59s
prometheus-user-workload-0             5/5     Running   1          5m52s
prometheus-user-workload-1             5/5     Running   1          5m52s
thanos-ruler-user-workload-0           3/3     Running   0          5m52s
thanos-ruler-user-workload-1           3/3     Running   0          5m52s

Comment 6 Damien Grisonnet 2021-05-21 14:08:25 UTC
*** Bug 1963100 has been marked as a duplicate of this bug. ***

Comment 7 wolfgang.voesch 2021-05-21 16:34:30 UTC
Hi Simon, 

We have seen the same issue in version 4.7.10 on s390x. Please see bug 1963100, which was closed as a duplicate.

Is it possible to backport the fix to 4.7?

Should we reopen this BZ to indicate that a backport is needed?

Thank you.

Comment 8 Simon Pasquier 2021-05-25 09:32:31 UTC
It's already been backported in 4.7.11 (see bug 1957646).

Comment 9 wolfgang.voesch 2021-05-25 11:40:43 UTC
Thank you Simon.

Comment 10 alef 2021-05-27 13:31:41 UTC
Hi Simon, 

We have seen the same issue after upgrading from 4.6.21 to 4.6.30.
Is there a workaround to resolve the issue?


Thank you.

Comment 11 Simon Pasquier 2021-05-27 14:51:58 UTC
The fix has been backported to 4.6 (see bug 1961158). It should be available in the next 4.6.z release (4.6.31), but I don't have a good workaround for now, unfortunately.

Comment 12 Daniel 2021-05-31 13:28:33 UTC
We have been experiencing the DNS issue with other Go applications since the update from OpenShift 4.7.9 to 4.7.11/4.7.12.

Comment 16 Peter Söderlind 2021-06-03 14:36:00 UTC
Hi Simon,

We have this problem after upgrading from version 4.6.29 to 4.6.30. But we can't upgrade past version 4.6.30 since the monitoring cluster operator is degraded. Is there a way to force the upgrade?

Comment 17 Peter Söderlind 2021-06-03 14:38:27 UTC
Hi Simon,

We have this problem after upgrading from version 4.6.29 to 4.6.30. But we can't upgrade to version 4.6.31 since the monitoring cluster operator is degraded. Is there a way to force the upgrade?

Comment 18 Simon Pasquier 2021-06-03 15:36:41 UTC
One workaround would be to disable user workload monitoring to get CMO back to Available, then upgrade to 4.6.31 and re-enable user workload monitoring.
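A rough sketch of that sequence, assuming the cluster-monitoring-config configmap only carries the enableUserWorkload setting as shown in the description (merge by hand if yours has more configuration):

```
# 1. Disable user workload monitoring so that CMO can report Available again.
oc -n openshift-monitoring apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: false
EOF

# 2. Upgrade to 4.6.31.

# 3. Re-enable user workload monitoring by setting enableUserWorkload back to
#    true, for example via:
oc -n openshift-monitoring edit configmap cluster-monitoring-config
```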

Comment 19 Peter Söderlind 2021-06-04 06:26:29 UTC
Simon,
Thanks for the reply. I actually tried that by removing "user workload monitoring" from the monitoring operator's related resources list. But something put it back again. I guess I have to do it somewhere else? Any suggestions?

Comment 20 Junqi Zhao 2021-06-04 08:03:37 UTC
(In reply to Peter Söderlind from comment #19)
> Simon,
> Thanks for the reply. I actually tried that by removing "user workload
> monitoring" from the monitoring operator's related resources list. But
> something put it back again. I guess I have to do it somewhere else? Any
> suggestions?

You should edit the configmap cluster-monitoring-config and remove "enableUserWorkload: true" or set it to false:
# oc -n openshift-monitoring edit configmap cluster-monitoring-config

Comment 21 Peter Söderlind 2021-06-04 08:35:59 UTC
(In reply to Junqi Zhao from comment #20)
> (In reply to Peter Söderlind from comment #19)
> > Simon,
> > Thanks for the reply. I actually tried that by removing "user workload
> > monitoring" from the monitoring operator's related resources list. But
> > something put it back again. I guess I have to do it somewhere else? Any
> > suggestions?
> 
> You should edit the configmap cluster-monitoring-config and remove
> "enableUserWorkload: true" or set it to false:
> # oc -n openshift-monitoring edit configmap cluster-monitoring-config

Thank you for the reply. We ended up forcing an upgrade to 4.7.13 since the only problem we had in the cluster was this bug, and the upgrade seems to have worked fine. It was interesting that we could have disabled "user workload monitoring" via the configmap. I'm new to OpenShift, so that didn't cross my mind. Thank you for that insight!

Comment 22 Lalatendu Mohanty 2021-06-04 14:20:26 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 23 Simon Pasquier 2021-06-04 14:54:06 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

The bug only impacts customers that have enabled user workload monitoring.
Customers on OCP 4.6.x prior to 4.6.30 are blocked when they try to upgrade to 4.6.30. 
From telemetry numbers, it looks like about 50% of supported clusters <= v4.6.30 have user workload monitoring enabled.

What is the impact?  Is it serious enough to warrant blocking edges?

CMO reports unavailable and the upgrade never completes.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Upgrading to 4.6.31 is the only way to solve the issue.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Yes, it's a regression due to a change in how the DNS responses for SRV records are returned. We haven't identified the root cause yet, though.

Comment 24 Simon Pasquier 2021-06-07 11:41:22 UTC
The root cause has been identified: it is due to [1], which enabled the bufsize plugin in the DNS server. This change shipped in 4.6.30.

On 4.7, a similar backport [2] shipped as part of 4.7.11. Because the fix for bug 1957646 shipped in the same z version, there's no regression to expect in 4.7.z.

[1] https://github.com/openshift/cluster-dns-operator/pull/272
[2] https://github.com/openshift/cluster-dns-operator/pull/267
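To check whether a given cluster already carries the bufsize change, the CoreDNS Corefile managed by the DNS operator can be inspected. A sketch only; the configmap and namespace names assume a default installation:

```
# The cluster-dns-operator renders the Corefile into this configmap; look for
# the bufsize plugin in the server block.
oc -n openshift-dns get configmap dns-default -o yaml | grep -n bufsize
```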

Comment 26 Tyler Lisowski 2021-07-21 17:30:28 UTC
We are still seeing the potential for this to occur in some IBM Cloud ROKS environments on 4.6.34.

Comment 27 Tyler Lisowski 2021-07-21 18:06:33 UTC
```
RouteHealthDegraded: failed to GET route (https://console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud/health): Get "https://console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud/health": dial tcp: lookup console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud on 172.21.0.10:53: cannot unmarshal DNS message
```

Comment 29 Simon Pasquier 2021-07-22 07:32:43 UTC
@Tyler that doesn't seem to be related to Thanos Ruler. You need to file a bug against the DNS component.

Comment 32 errata-xmlrpc 2021-07-27 23:03:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 33 W. Trevor King 2021-08-18 21:31:55 UTC
We did end up blocking * -> 4.6.30 over this [1], as suggested in the impact statement from comment 23.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/837

Comment 34 Scott Dodson 2021-08-20 15:53:26 UTC
*** Bug 1967514 has been marked as a duplicate of this bug. ***

