Bug 2043568

Summary: Error with Redis connection (connection refused)
Product: Red Hat Advanced Cluster Management for Kubernetes Reporter: Ryan Spagnola <rspagnol>
Component: Search / AnalyticsAssignee: Jorge Padilla <jpadilla>
Status: CLOSED DUPLICATE QA Contact: Xiang Yin <xiyin>
Severity: high Docs Contact: Mikela Dockery <mdockery>
Priority: unspecified    
Version: rhacm-2.4CC: ashafi, clasohm, jpadilla
Target Milestone: ---Flags: ashafi: qe_test_coverage-
Target Release: rhacm-2.5   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-09 20:57:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ryan Spagnola 2022-01-21 14:27:37 UTC
Description of the problem:
In the Web UI of RHACM it no longer shows the status icons in the lower left corner of the resources. In addition to the missing status icons, the "spinning" icon constantly showed next to "Last update" in the top right.

Checking the pod logs in namespace open-cluster-management showed the following errors:

pod/search-prod-1ff3f-search-aggregator-67977597b7-htzkh
E0117 19:02:42.528237       1 redisinterfacev2.go:47] Error fetching results from RedisGraph V2 : read tcp XX.XXX.XX.XX:54518->XXX.XX.X.XXX:6380: read: connection reset by peer
W0117 19:02:42.528248       1 clusterWatch.go:150] Error on UpdateByName() read tcp XX.XXX.XX.XX:54518->XXX.XX.X.XXX:6380: read: connection reset by peer
E0117 19:02:42.539686       1 pool.go:104] Error authenticating Redis client. Original error: read tcp XX.XXX.XX.XX:54520->XXX.XX.X.XXX:6380: read: connection reset by peer
W0117 19:02:42.539721       1 pool.go:107] Failed to close redis connection. Original error: read tcp XX.XXX.XX.XX:54520->XXX.XX.X.XXX:6380: read: connection reset by peer

pod/search-prod-1ff3f-search-api-bb9c56998-wbjjk
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] Redis Client connected.
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] Error with Redis connection.
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] The Redis connection has ended.

pod/search-prod-1ff3f-search-collector-d7d7cdd8-btkw7
I0117 19:02:36.117648       1 sender.go:184] Sending Resources { request: 479928, add: 23, update: 51, delete: 2 edge add: 106 edge delete: 8 }
W0117 19:02:36.172833       1 sender.go:169] Received error response [POST to: https://search-aggregator.open-cluster-management.svc:3010/aggregator/clusters/local-cluster/sync responded with error. StatusCode: 400  Message: 400 Bad Request] from Aggregator. Resending in 600000 ms after resetting config.

Checking the logs of pod search-redisgraph-0 showed messages like these:

19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: TLS accepted: new session negotiated
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: TLSv1.2 ciphersuite: ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Compression: null, expansion: null
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Deallocating application specific data for session connect address
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: failover: priority, starting at entry #0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: s_connect: connecting ::1:6379
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: s_connect: s_poll_wait ::1:6379: waiting 10 seconds
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=6 events=0x2001 revents=0x0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=9 events=0x2005 revents=0x1
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: s_connect: connect ::1:6379: Connection refused (111)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: s_connect: connecting 127.0.0.1:6379
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: s_connect: s_poll_wait 127.0.0.1:6379: waiting 10 seconds
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=6 events=0x2001 revents=0x0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=9 events=0x2005 revents=0x201D
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: s_connect: connect 127.0.0.1:6379: Connection refused (111)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: No more addresses to connect
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG5[593833]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Local descriptor (FD=3) closed
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Service [redis] finished (0 left)

The Redis server running in this container should have been listening on port 6379, but according to the log, stunnel (in the same container) was not able to connect to it.

Deleting the search-redisgraph-0 pod and waiting a few minutes for Redis to load its dataset in the newly created pod fixed the problem.

At the very least, the readinessProbe and livenessProbe should check if Redis is running in the container, not if stunnel is listening on port 6380.


Release version:
2.4.1

Operator snapshot version:

OCP version:

Browser Info:

Steps to reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Comment 2 Jorge Padilla 2022-01-25 18:55:39 UTC
We believe that the secret containing the redis password got updated and some pods are still using the old password. Redis is rejecting the connections with an authentication error, which is consistent with an incorrect password.

The search-operator logs could help us confirm if the `redisgraph-user-secret` was recreated or updated after the initial ACM install. We could also look at the secret itself for clues that it was modified after the initial installation.

Was this cluster upgraded from previous version?

Comment 5 Jorge Padilla 2022-01-27 18:53:46 UTC
- Compared to the secret creation date, could you provide a timeline of when ACM was first installed, when ACM was upgraded, and when this problem was noticed?

- Is there a reason why the search operator pod was created yesterday? This shouldn't be a problem, but I'm trying to understand if there's any condition in the cluster causing our pods to restart and getting in a bad state. One example could be if a node goes down, the pods get recreated in a different node and everything should recover.

Comment 7 Jorge Padilla 2022-02-09 20:57:18 UTC
This is caused by BZ 2046553 both issues were observed in the same ACM instance.

*** This bug has been marked as a duplicate of bug 2046553 ***

Comment 8 Red Hat Bugzilla 2023-09-15 01:18:59 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days