Description of the problem:

In the RHACM Web UI, the status icons in the lower left corner of the resources are no longer displayed. In addition to the missing status icons, the "spinning" icon is constantly shown next to "Last update" in the top right.

Checking the pod logs in namespace open-cluster-management showed the following errors:

pod/search-prod-1ff3f-search-aggregator-67977597b7-htzkh
E0117 19:02:42.528237 1 redisinterfacev2.go:47] Error fetching results from RedisGraph V2 : read tcp XX.XXX.XX.XX:54518->XXX.XX.X.XXX:6380: read: connection reset by peer
W0117 19:02:42.528248 1 clusterWatch.go:150] Error on UpdateByName() read tcp XX.XXX.XX.XX:54518->XXX.XX.X.XXX:6380: read: connection reset by peer
E0117 19:02:42.539686 1 pool.go:104] Error authenticating Redis client. Original error: read tcp XX.XXX.XX.XX:54520->XXX.XX.X.XXX:6380: read: connection reset by peer
W0117 19:02:42.539721 1 pool.go:107] Failed to close redis connection. Original error: read tcp XX.XXX.XX.XX:54520->XXX.XX.X.XXX:6380: read: connection reset by peer

pod/search-prod-1ff3f-search-api-bb9c56998-wbjjk
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] Redis Client connected.
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] Error with Redis connection.
[2022-01-17T19:02:59.521] [INFO] [search-api] [server] The Redis connection has ended.

pod/search-prod-1ff3f-search-collector-d7d7cdd8-btkw7
I0117 19:02:36.117648 1 sender.go:184] Sending Resources { request: 479928, add: 23, update: 51, delete: 2 edge add: 106 edge delete: 8 }
W0117 19:02:36.172833 1 sender.go:169] Received error response [POST to: https://search-aggregator.open-cluster-management.svc:3010/aggregator/clusters/local-cluster/sync responded with error. StatusCode: 400 Message: 400 Bad Request] from Aggregator. Resending in 600000 ms after resetting config.
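For reference, a small sketch of how logs like the ones above (and the redisgraph log below) can be pulled; the pod names in this report carry an instance-specific prefix (search-prod-1ff3f-...), so the sketch matches on "search" instead of hard-coding them:

NS=open-cluster-management

# List the search pods, their restart counts and ages
oc -n "$NS" get pods | grep search

# Pull the recent errors from every search pod in one pass
for p in $(oc -n "$NS" get pods -o name | grep search); do
  echo "===== $p ====="
  oc -n "$NS" logs "$p" --since=1h --all-containers | grep -iE 'error|refused|reset'
done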
Checking the logs of pod search-redisgraph-0 showed messages like these:

19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: TLS accepted: new session negotiated
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: TLSv1.2 ciphersuite: ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Compression: null, expansion: null
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Deallocating application specific data for session connect address
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: failover: priority, starting at entry #0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: s_connect: connecting ::1:6379
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: s_connect: s_poll_wait ::1:6379: waiting 10 seconds
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=6 events=0x2001 revents=0x0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=9 events=0x2005 revents=0x1
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: s_connect: connect ::1:6379: Connection refused (111)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG6[593833]: s_connect: connecting 127.0.0.1:6379
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: s_connect: s_poll_wait 127.0.0.1:6379: waiting 10 seconds
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=6 events=0x2001 revents=0x0
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: FD=9 events=0x2005 revents=0x201D
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: s_connect: connect 127.0.0.1:6379: Connection refused (111)
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG3[593833]: No more addresses to connect
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG5[593833]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Local descriptor (FD=3) closed
19:23:50 stunnel.1 | 2022.01.17 19:23:50 LOG7[593833]: Service [redis] finished (0 left)

The Redis server running in this container should have been listening on port 6379, but according to the log, stunnel (in the same container) was not able to connect to it.

Deleting the search-redisgraph-0 pod and waiting a few minutes for Redis to load its dataset in the newly created pod fixed the problem.

At the very least, the readinessProbe and livenessProbe should check whether Redis itself is running in the container, not whether stunnel is listening on port 6380 (a manual check along these lines is sketched below).

Release version: 2.4.1
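To illustrate the probe suggestion above: a minimal manual check, assuming the image ships a shell and redis-cli and that the Redis password is exposed inside the container as REDIS_PASSWORD (an assumed variable name; adjust to the actual pod spec). It asks Redis itself for a PONG on 6379, which is the kind of command an exec-based readinessProbe/livenessProbe could run instead of a TCP check on the stunnel port 6380:

# Ask Redis directly, bypassing stunnel; this would have failed in the state captured above
oc -n open-cluster-management exec search-redisgraph-0 -- \
  sh -c 'redis-cli -p 6379 -a "$REDIS_PASSWORD" ping'

# Wrapped for an exec probe, the same check would be something like:
# /bin/sh -c 'redis-cli -p 6379 -a "$REDIS_PASSWORD" ping | grep -q PONG'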
We believe that the secret containing the Redis password got updated and some pods are still using the old password. Redis is rejecting the connections with an authentication error, which is consistent with an incorrect password. The search-operator logs could help us confirm whether the `redisgraph-user-secret` was recreated or updated after the initial ACM install. We could also look at the secret itself for clues that it was modified after the initial installation. Was this cluster upgraded from a previous version?
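A hedged sketch of how this could be checked from the CLI; the secret name is the one mentioned above, everything else is standard Kubernetes metadata:

# Creation time of the secret, and the last time each client wrote to it (server-side apply managedFields)
oc -n open-cluster-management get secret redisgraph-user-secret \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
oc -n open-cluster-management get secret redisgraph-user-secret \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'

# Compare those timestamps with when the pods that authenticate against Redis last (re)started
oc -n open-cluster-management get pods --sort-by=.status.startTime | grep search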
- Compared to the secret creation date, could you provide a timeline of when ACM was first installed, when ACM was upgraded, and when this problem was noticed? (A sketch of commands to gather this follows below.)
- Is there a reason why the search operator pod was created yesterday? This shouldn't be a problem, but I'm trying to understand whether there's any condition in the cluster causing our pods to restart and get into a bad state. One example would be a node going down: the pods get recreated on a different node, and everything should recover.
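A minimal sketch, assuming the default ACM install layout (a MultiClusterHub resource, OLM CSVs, and a pod whose name starts with search-operator); names may differ per install:

# Initial install time (MultiClusterHub creation) and upgrade time (CSV creation)
oc -n open-cluster-management get multiclusterhub \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
oc -n open-cluster-management get csv \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'

# When and why the search-operator pod last started
SEARCH_OP=$(oc -n open-cluster-management get pods -o name | grep search-operator)
oc -n open-cluster-management describe "$SEARCH_OP" | grep -E 'Start Time|Restart Count|Reason'

# Recent events that could explain a reschedule (evictions, node NotReady, OOM kills, ...)
oc get events -A --sort-by=.lastTimestamp | grep -iE 'evict|notready|oomkill|search-operator'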
This is caused by BZ 2046553; both issues were observed in the same ACM instance.

*** This bug has been marked as a duplicate of bug 2046553 ***