1967890 – Observability Thanos store shard crashing - cannot unmarshal DNS message

Summary: Observability Thanos store shard crashing - cannot unmarshal DNS message

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	Core Services / Observability
Sub Component:
Version:	rhacm-2.2
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	rhacm-2.2.4
Assignee:	Chunlin Yang
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-04 10:20 UTC by Martin Ouimet
Modified:	2021-06-16 19:28 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	---
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-06-16 19:28:30 UTC
Target Upstream Version:
Embargoed:
Flags:	ming: rhacm-2.2.z+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	open-cluster-management backlog issues 13071	0	None	None	None	2021-06-04 20:28:04 UTC
Red Hat Product Errata	RHSA-2021:2461	0	None	None	None	2021-06-16 19:28:42 UTC

Internal Links: 1970888 1970889

Description Martin Ouimet 2021-06-04 10:20:17 UTC

Description of the problem:

Since the upgrade to OpenShift 4.7.11, the Observability thanos-store-shard pods have been in a CrashLoopBackOff state.

observability-observatorium-thanos-store-shard-0-0                0/1     CrashLoopBackOff   194        16h
observability-observatorium-thanos-store-shard-1-0                0/1     CrashLoopBackOff   195        16h
observability-observatorium-thanos-store-shard-2-0                0/1     CrashLoopBackOff   195        16h

They complain about the DNS request for the memcached service:

lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message

I found a similar issue, but I am not sure whether it is related:
https://bugzilla.redhat.com/show_bug.cgi?id=1953518

Release version: Advanced Cluster Management 2.2.3
OCP version: OpenShift 4.7.13


Steps to reproduce:
1 - Install the ACM operator
2 - Install the MCM resource
3 - Install the Observability resource

Thanks,

Comment 4 borazem 2021-06-14 15:28:05 UTC

Not sure if this is important, but I have experienced a similar situation with OCP version 4.6.31 and RHACM 2.2.3.

I reviewed the above-mentioned Bugzilla record https://bugzilla.redhat.com/show_bug.cgi?id=1953518 and the links referenced there, mainly:
https://bugzilla.redhat.com/show_bug.cgi?id=1966116 and
https://access.redhat.com/solutions/5984291

With a slight modification of the patch deployment procedure described in https://access.redhat.com/solutions/5984291, I was able to get the observability-observatorium-thanos-store-shard-X-0 pods out of CrashLoopBackOff, and errors about "cannot unmarshal DNS message" such as the following one no longer appear:

level=error ts=2021-06-14T13:41:24.803494563Z caller=memcached_client.go:560 msg="failed to resolve addresses for memcached" addresses=dnssrv+_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc err="lookup SRV records \"_client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc\": lookup _client._tcp.observability-observatorium-thanos-store-memcached.open-cluster-management-observability.svc on 172.30.0.10:53: cannot unmarshal DNS message"

I tried both settings (bufsize as well as force_tcp), but bufsize: 512 alone was enough to solve the issue in my case.
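
The fact that bufsize: 512 helps suggests the SRV response for the memcached service may be larger than the classic 512-byte DNS-over-UDP limit. This is an inference, not something confirmed in this report. A minimal Python sketch for checking this from a pod or node is shown here: it sends a plain SRV query (without EDNS0) over UDP and reports the response size and the truncation (TC) flag. The function names are hypothetical, and the cluster domain suffix `cluster.local` in the usage comment is an assumption; the resolver address 172.30.0.10 is taken from the error message above.

```python
import socket
import struct

DNS_TYPE_SRV = 33
DNS_CLASS_IN = 1
CLASSIC_UDP_LIMIT = 512  # RFC 1035 limit for a UDP DNS message without EDNS0


def build_srv_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (no EDNS0) asking for SRV records."""
    # Header: id, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", DNS_TYPE_SRV, DNS_CLASS_IN)


def summarize_response(packet: bytes) -> dict:
    """Report size, truncation flag, response code, and answer count."""
    _txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return {
        "size": len(packet),
        "truncated": bool(flags & 0x0200),  # TC bit
        "rcode": flags & 0x000F,
        "answers": ancount,
        "over_512": len(packet) > CLASSIC_UDP_LIMIT,
    }


def probe(name: str, server: str, timeout: float = 3.0) -> dict:
    """Send one SRV query over UDP to server:53 and summarize the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_srv_query(name), (server, 53))
        packet, _ = sock.recvfrom(65535)
    return summarize_response(packet)


# Usage (assumes the cluster domain is cluster.local):
# probe(
#     "_client._tcp.observability-observatorium-thanos-store-memcached"
#     ".open-cluster-management-observability.svc.cluster.local",
#     "172.30.0.10",
# )
```

If the summary shows over_512 as true with truncated false, the server is returning an oversized UDP reply to a client that did not advertise a larger buffer, which would be consistent with the bufsize workaround.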

Can you disclose how this issue will be solved in RHACM 2.2.4?

Comment 8 errata-xmlrpc 2021-06-16 19:28:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.2.4 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2461
