Bug 1967890 - Observability Thanos store shard crashing - cannot unmarshal DNS message
Summary: Observability Thanos store shard crashing - cannot unmarshal DNS message
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Core Services / Observability
Version: rhacm-2.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: rhacm-2.2.4
Assignee: Chunlin Yang
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-04 10:20 UTC by Martin Ouimet
Modified: 2021-06-16 19:28 UTC (History)

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-16 19:28:30 UTC
Target Upstream Version:
Embargoed:
ming: rhacm-2.2.z+



Links
System ID Private Priority Status Summary Last Updated
Github open-cluster-management backlog issues 13071 0 None None None 2021-06-04 20:28:04 UTC
Red Hat Product Errata RHSA-2021:2461 0 None None None 2021-06-16 19:28:42 UTC

Internal Links: 1970888 1970889

Description Martin Ouimet 2021-06-04 10:20:17 UTC
Description of the problem:

Since the upgrade to OpenShift 4.7.11, the Observability thanos-store-shard pods have been in CrashLoopBackOff state.

observability-observatorium-thanos-store-shard-0-0                0/1     CrashLoopBackOff   194        16h
observability-observatorium-thanos-store-shard-1-0                0/1     CrashLoopBackOff   195        16h
observability-observatorium-thanos-store-shard-2-0                0/1     CrashLoopBackOff   195        16h

The pods complain about a failed DNS request for the memcached service:

lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message
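
To check the failing lookup outside of Thanos, a minimal Go sketch along these lines could be run from a pod inside the affected cluster. It issues the same SRV query through the standard library resolver. The helper name splitSRVName is hypothetical, and whether the error reproduces depends on the Go version and resolver in use, which this report does not state.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// splitSRVName (hypothetical helper) splits "_service._proto.domain" into
// the three arguments that net.LookupSRV expects.
func splitSRVName(full string) (service, proto, name string, err error) {
	parts := strings.SplitN(full, ".", 3)
	if len(parts) != 3 ||
		!strings.HasPrefix(parts[0], "_") || len(parts[0]) < 2 ||
		!strings.HasPrefix(parts[1], "_") || len(parts[1]) < 2 ||
		parts[2] == "" {
		return "", "", "", fmt.Errorf("not an SRV name: %q", full)
	}
	return parts[0][1:], parts[1][1:], parts[2], nil
}

func main() {
	// SRV name copied from the error message in this report.
	const target = "_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc"

	service, proto, name, err := splitSRVName(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// Uses the resolver configured for the process (in a pod, typically
	// the cluster DNS service listed in /etc/resolv.conf).
	_, addrs, err := net.LookupSRV(service, proto, name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```

If the lookup fails with the same "cannot unmarshal DNS message" text, the problem lies between the Go resolver and the cluster DNS rather than in the Thanos memcached client itself.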

I found a similar issue, not sure if this is related.
https://bugzilla.redhat.com/show_bug.cgi?id=1953518

Release version: Advanced Cluster Management 2.2.3
OCP version: Openshift 4.7.13


Steps to reproduce
1 - Install ACM operator
2 - Install MCM resource
3 - Install Observability resource

Thanks,

Comment 4 borazem 2021-06-14 15:28:05 UTC
Not sure if this is important, but I have experienced a similar situation with OCP version 4.6.31 and RHACM 2.2.3.

I reviewed the above-mentioned Bugzilla record https://bugzilla.redhat.com/show_bug.cgi?id=1953518 and the links referenced there, mainly:
https://bugzilla.redhat.com/show_bug.cgi?id=1966116 and
https://access.redhat.com/solutions/5984291

With a slight modification of the patch deployment procedure described in https://access.redhat.com/solutions/5984291, I was able to get the observability-observatorium-thanos-store-shard-X-0 pods out of CrashLoopBackOff. The "unmarshal DNS message" errors, such as the following one, no longer appear.

level=error ts=2021-06-14T13:41:24.803494563Z caller=memcached_client.go:560 msg="failed to resolve addresses for memcached" addresses=dnssrv+_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc err="lookup SRV records \"_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc\": lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message"

I tried both settings (bufsize as well as force_tcp), but setting only bufsize: 512 was enough to solve the issue in my case.

Can you disclose how this issue will be solved in RHACM 2.2.4?

Comment 8 errata-xmlrpc 2021-06-16 19:28:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.2.4 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2461

