Bug 1967890

Summary: Observability Thanos store shard crashing - cannot unmarshal DNS message
Product: Red Hat Advanced Cluster Management for Kubernetes Reporter: Martin Ouimet <mouimet>
Component: Core Services / Observability Assignee: Chunlin Yang <chuyang>
Status: CLOSED ERRATA QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: rhacm-2.2 CC: borazem
Target Milestone: --- Flags: ming: rhacm-2.2.z+
Target Release: rhacm-2.2.4   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: ---
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-16 19:28:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:

Description Martin Ouimet 2021-06-04 10:20:17 UTC
Description of the problem:

Since the upgrade to OpenShift 4.7.11, the Observability thanos-store-shard pods have been in CrashLoopBackOff state.

observability-observatorium-thanos-store-shard-0-0                0/1     CrashLoopBackOff   194        16h
observability-observatorium-thanos-store-shard-1-0                0/1     CrashLoopBackOff   195        16h
observability-observatorium-thanos-store-shard-2-0                0/1     CrashLoopBackOff   195        16h

The pods complain about the DNS request for the memcached service:

lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message

I found a similar issue, but I am not sure whether it is related:
https://bugzilla.redhat.com/show_bug.cgi?id=1953518

Release version: Advanced Cluster Management 2.2.3
OCP version: OpenShift 4.7.13


Steps to reproduce
1 - Install ACM operator
2 - Install MCM resource
3 - Install Observability resource

Thanks,

Comment 4 borazem 2021-06-14 15:28:05 UTC
Not sure if this is important, but I have experienced a similar situation with OCP version 4.6.31 and RHACM 2.2.3.

I reviewed the above-mentioned Bugzilla record https://bugzilla.redhat.com/show_bug.cgi?id=1953518 and the links referenced there, mainly:
https://bugzilla.redhat.com/show_bug.cgi?id=1966116 and 
https://access.redhat.com/solutions/5984291

With a slight modification of the patch deployment procedure described in https://access.redhat.com/solutions/5984291, I was able to get the observability-observatorium-thanos-store-shard-X-0 pods out of CrashLoopBackOff. The "unmarshal DNS message" errors, such as the following one, no longer appear:

level=error ts=2021-06-14T13:41:24.803494563Z caller=memcached_client.go:560 msg="failed to resolve addresses for memcached" addresses=dnssrv+_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc err="lookup SRV records \"_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc\": lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message"

I tried both settings (bufsize as well as force_tcp), but bufsize: 512 alone was enough to solve the issue in my case.
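
To check the SRV lookup from the log message independently of Thanos, a small Go program can issue the same query through Go's standard resolver, which is where the "cannot unmarshal DNS message" text originates. This is only a minimal sketch: the helper name splitSRVAddress and the 5-second timeout are my own assumptions, and it presumably needs to run from a pod inside the cluster so that the .svc name resolves against the cluster DNS.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// splitSRVAddress (hypothetical helper) breaks a "dnssrv+_service._proto.name"
// address, as seen in the Thanos log line, into the three arguments that
// the Go resolver's LookupSRV expects.
func splitSRVAddress(addr string) (service, proto, name string, err error) {
	addr = strings.TrimPrefix(addr, "dnssrv+")
	parts := strings.SplitN(addr, ".", 3)
	if len(parts) != 3 || !strings.HasPrefix(parts[0], "_") || !strings.HasPrefix(parts[1], "_") {
		return "", "", "", fmt.Errorf("not an SRV address: %q", addr)
	}
	// Drop the leading underscores; LookupSRV adds them back.
	return parts[0][1:], parts[1][1:], parts[2], nil
}

func main() {
	// Address copied from the error message in this bug report.
	const addr = "dnssrv+_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc"

	service, proto, name, err := splitSRVAddress(addr)
	if err != nil {
		fmt.Println(err)
		return
	}

	// Assumed timeout so the check does not hang when DNS is unreachable.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	_, records, err := net.DefaultResolver.LookupSRV(ctx, service, proto, name)
	if err != nil {
		// On an affected cluster this is where the unmarshal error would surface.
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, r := range records {
		fmt.Printf("%s:%d\n", r.Target, r.Port)
	}
}
```

If the program prints the memcached targets after the DNS settings are changed, the resolver path that Thanos depends on is answering again.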

Can you disclose how this issue will be solved in RHACM 2.2.4?

Comment 8 errata-xmlrpc 2021-06-16 19:28:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.2.4 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2461