Created attachment 1775455 [details]
user-workload dump file

Description of problem:
Enabled UWM; the thanos-ruler pods failed to start up with "cannot unmarshal DNS message".
************************
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
kind: ConfigMap
metadata:
  creationTimestamp: "2021-04-26T09:12:43Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "281105"
  uid: 5c94bba4-1482-4dcd-931d-2e6af50f1723
************************
# oc -n openshift-user-workload-monitoring get po
NAME                                   READY   STATUS             RESTARTS   AGE
prometheus-operator-64b75455b6-54gbv   2/2     Running            0          33m
prometheus-user-workload-0             5/5     Running            1          33m
prometheus-user-workload-1             5/5     Running            1          33m
thanos-ruler-user-workload-0           2/3     CrashLoopBackOff   11         33m
thanos-ruler-user-workload-1           2/3     CrashLoopBackOff   11         33m

# oc -n openshift-user-workload-monitoring get po thanos-ruler-user-workload-0 -oyaml
...
  - containerID: cri-o://20b90fd8fd4d11d0090d749cf5ba203c6544ae11a864594d02c333890ad40f97
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:245581d21fb684b12536764c7b7d0e8e92556cded11b4b768ff910b80dff9fea
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:245581d21fb684b12536764c7b7d0e8e92556cded11b4b768ff910b80dff9fea
    lastState:
      terminated:
        containerID: cri-o://20b90fd8fd4d11d0090d749cf5ba203c6544ae11a864594d02c333890ad40f97
        exitCode: 1
        finishedAt: "2021-04-26T09:39:24Z"
        message: |
          cords \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=warn ts=2021-04-26T09:39:24.605130836Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.605137321Z caller=http.go:69 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.607301012Z caller=http.go:88 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=info ts=2021-04-26T09:39:24.60733603Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
          level=error ts=2021-04-26T09:39:24.607408597Z caller=main.go:156 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:156\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371"
        reason: Error
        startedAt: "2021-04-26T09:39:24Z"
    name: thanos-ruler
    ready: false
    restartCount: 10
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=thanos-ruler pod=thanos-ruler-user-workload-0_openshift-user-workload-monitoring(2fd1cac1-7bb8-44bf-973c-9c9d862c44df)
        reason: CrashLoopBackOff

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-25-183122

How reproducible:
always

Steps to Reproduce:
1. Enable UWM
2.
3.

Actual results:
thanos-ruler pods fail to start up with "cannot unmarshal DNS message".

Expected results:
No error.

Additional info:
I suspect that we're hitting the same issue that we had once with the Thanos querier, where we had to switch from the Go resolver to miekgdns. The root cause might be the same as described in https://github.com/golang/go/issues/36718; the TL;DR is that the Go resolver is too restrictive when the DNS response for SRV records is compressed. The short-term fix would be to switch our downstream Thanos to use miekgdns by default (the Prometheus operator unfortunately doesn't allow controlling the DNS resolver type).
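To illustrate the failure mode, here's a minimal sketch (not the actual Thanos code) that performs the same SRV lookup once with the Go standard resolver and once with github.com/miekg/dns; the nameserver address and SRV name are copied from the logs above. Against a server that compresses the SRV answer, the first lookup is expected to fail with "cannot unmarshal DNS message" while the second parses the response without complaint.

```
package main

import (
	"context"
	"fmt"
	"net"

	"github.com/miekg/dns"
)

func main() {
	// 1) Go standard resolver (what Thanos' default "golang" DNS provider relies on).
	//    This is the call that is expected to fail against a compressed SRV response.
	_, srvs, err := net.DefaultResolver.LookupSRV(context.Background(),
		"web", "tcp", "alertmanager-operated.openshift-monitoring.svc")
	fmt.Println("net.LookupSRV:", len(srvs), err)

	// 2) github.com/miekg/dns client: tolerates the compressed response and
	//    returns the SRV targets.
	m := new(dns.Msg)
	m.SetQuestion("_web._tcp.alertmanager-operated.openshift-monitoring.svc.", dns.TypeSRV)
	resp, _, err := new(dns.Client).Exchange(m, "172.30.0.10:53")
	if err != nil {
		fmt.Println("miekg/dns query failed:", err)
		return
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Println("miekg/dns SRV:", srv.Target, srv.Port)
		}
	}
}
```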
Issue is fixed with:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-29-063720   True        False         59m     Cluster version is 4.8.0-0.nightly-2021-04-29-063720

# oc -n openshift-user-workload-monitoring get po
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-6f6f968f79-h8m44   2/2     Running   0          5m59s
prometheus-user-workload-0             5/5     Running   1          5m52s
prometheus-user-workload-1             5/5     Running   1          5m52s
thanos-ruler-user-workload-0           3/3     Running   0          5m52s
thanos-ruler-user-workload-1           3/3     Running   0          5m52s
*** Bug 1963100 has been marked as a duplicate of this bug. ***
Hi Simon, we have seen the same issue in version 4.7.10 on s390x. Please see bug 1963100, which was closed as a duplicate. Is it possible to backport the fix to 4.7? Should we reopen this BZ to indicate that a backport is needed? Thank you.
It's already been backported in 4.7.11 (see bug 1957646).
Thank you Simon.
Hi Simon, we have seen the same issue after upgrading from 4.6.21 to 4.6.30. Is there a workaround to resolve the issue? Thank you.
The fix has been backported to 4.6 (see bug 1961158). It should be available in the next 4.6.z release (4.6.31), but unfortunately I don't have a good workaround for now.
We have been experiencing the DNS issue with other Go applications since the update from OpenShift 4.7.9 to 4.7.11/4.7.12.
Hi Simon, we have this problem after upgrading from version 4.6.29 to 4.6.30, but we can't upgrade to version 4.6.31 since the monitoring cluster operator is degraded. Is there a way to force the upgrade?
One workaround would be to disable user workload monitoring so that CMO goes back to Available, then upgrade to 4.6.31 and re-enable user workload monitoring.
Simon,
Thanks for the reply. I actually tried that by removing "user workload monitoring" from the monitoring operator's related resources list, but something put it back again. I guess I have to do it somewhere else? Any suggestions?
(In reply to Peter Söderlind from comment #19)
> Simon,
> Thanks for the reply. I actually tried that by removing "user workload
> monitoring" from the monitoring operator's related resources list, but
> something put it back again. I guess I have to do it somewhere else? Any
> suggestions?

You should edit the configmap cluster-monitoring-config and either remove enableUserWorkload: true or set it to false:

# oc -n openshift-monitoring edit configmap cluster-monitoring-config
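For reference, the edited ConfigMap would end up looking roughly like this (same object as shown in the bug description, with the flag flipped to false; removing the key entirely has the same effect):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: false
```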
(In reply to Junqi Zhao from comment #20)
> (In reply to Peter Söderlind from comment #19)
> > Simon,
> > Thanks for the reply. I actually tried that by removing "user workload
> > monitoring" from the monitoring operator's related resources list, but
> > something put it back again. I guess I have to do it somewhere else? Any
> > suggestions?
>
> You should edit the configmap cluster-monitoring-config and either remove
> enableUserWorkload: true or set it to false:
> # oc -n openshift-monitoring edit configmap cluster-monitoring-config

Thank you for the reply. We ended up forcing an upgrade to 4.7.13, since this bug was the only problem we had in the cluster, and the upgrade seems to have worked fine. It was interesting that we could have disabled user workload monitoring via the configmap; I'm new to OpenShift, so that didn't cross my mind. Thank you for that insight!
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or from 4.y.z to 4.y.z+1
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
The bug only impacts customers that have enabled user workload monitoring. Customers on OCP 4.6.x prior to 4.6.30 are blocked when they try to upgrade to 4.6.30. From telemetry numbers, it looks like about 50% of supported clusters <= v4.6.30 have user workload monitoring enabled.

What is the impact? Is it serious enough to warrant blocking edges?
CMO reports unavailable and the upgrade never completes.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Upgrading to 4.6.31 is the only way to solve the issue.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, it's a regression due to a change in the DNS responses for SRV records. We haven't identified the root cause though.
The root cause has been identified: it is due to [1], which enabled the bufsize plugin in the DNS server. This change shipped in 4.6.30. On 4.7, a similar backport [2] shipped as part of 4.7.11. Because bug 1957646 shipped in the same z version, there's no regression to expect in 4.7.z.

[1] https://github.com/openshift/cluster-dns-operator/pull/272
[2] https://github.com/openshift/cluster-dns-operator/pull/267
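For anyone who wants to check whether their cluster is running with the bufsize plugin enabled, inspecting the operator-managed Corefile should show it. This is a rough check that assumes the default dns-default ConfigMap in the openshift-dns namespace; the Corefile is managed by the cluster-dns-operator and is not meant to be edited by hand:

```
# Look for the bufsize plugin in the operator-managed Corefile
oc -n openshift-dns get configmap/dns-default -o yaml | grep bufsize
```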
We are still seeing the potential for this to occur in some IBM Cloud ROKS environments on 4.6.34
``` RouteHealthDegraded: failed to GET route (https://console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud/health): Get "https://console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud/health": dial tcp: lookup console-openshift-console.noprod-rhos-02-cd4fe78b28480af45445d9cd9f0cf84b-0000.eu-de.containers.appdomain.cloud on 172.21.0.10:53: cannot unmarshal DNS message ```
@Tyler that doesn't seem to be related to Thanos Ruler. You need to file a bug against the DNS component.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
We did end up blocking * -> 4.6.30 over this [1], as suggested in the impact statement from comment 23. [1]: https://github.com/openshift/cincinnati-graph-data/pull/837
*** Bug 1967514 has been marked as a duplicate of this bug. ***