Bug 2092863 - search-aggregator pod is continuously getting OOMkilled on the hub
Summary: search-aggregator pod is continuously getting OOMkilled on the hub
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Search / Analytics
Version: rhacm-2.4.z
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: rhacm-2.5.3
Assignee: Jorge Padilla
QA Contact: Atif
Docs Contact: Mikela Dockery
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-02 11:53 UTC by Mihir Lele
Modified: 2022-12-20 22:41 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-13 19:42:47 UTC
Target Upstream Version:
Embargoed:
ashafi: qe_test_coverage+
njean: rhacm-2.5.z+


Links
System ID Private Priority Status Summary Last Updated
Github stolostron backlog issues 22937 0 None None None 2022-06-02 16:19:33 UTC
Red Hat Product Errata RHSA-2022:6954 0 None None None 2022-10-13 19:42:54 UTC

Description Mihir Lele 2022-06-02 11:53:56 UTC
Description of the problem:

search-aggregator pod is continuously getting OOMkilled on the hub and throwing 504 Gateway Timeout errors, while managed clusters throw 503 Service Unavailable errors

Additional info:

search-aggregator pod is getting OOMkilled and restarted, while the pod throws 504 Gateway Timeout errors:
Running inside klusterlet. Aggregator URL: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses
E0602 10:03:09.808478       1 sender.go:253] Sync sender error. POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 504  Message: 504 Gateway Timeout
E0602 10:03:09.808487       1 sender.go:305] SEND ERROR: POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 504  Message: 504 Gateway Timeout
W0602 10:03:09.808495       1 sender.go:319] Backing off send interval because of error response from aggregator. Sleeping for 10m0s


On the other hand, the managed cluster throws 503 errors:
W0602 09:52:46.964045       1 sender.go:174] Received busy response from Aggregator. Resending in 105000 ms.
I0602 09:54:31.964363       1 sender.go:193] Sending Resources { request: 48149, add: 10664, update: 0, delete: 0 edge add: 5682 edge delete: 0 }
W0602 09:54:34.322656       1 sender.go:178] Received error response [POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 503  Message: 503 Service Unavailable] from Aggregator. Resending in 120000 ms after resetting config.
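
These log excerpts show the collector's retry behavior: on an error response from the aggregator it backs off its send interval and resends later. A minimal Go sketch of that backoff pattern follows; the constants, URL, and function names are illustrative assumptions, not the actual stolostron collector code:

package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	baseInterval = 5 * time.Second
	maxInterval  = 10 * time.Minute // matches the "Sleeping for 10m0s" cap seen above
)

// syncWithBackoff POSTs to the aggregator and, on an error response,
// doubles the send interval up to a cap -- mirroring the "Backing off
// send interval because of error response" log lines.
func syncWithBackoff(url string) {
	interval := baseInterval
	for {
		resp, err := http.Post(url, "application/json", nil)
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			interval = baseInterval // healthy response: back to normal cadence
		} else {
			if resp != nil {
				fmt.Printf("SEND ERROR: POST to %s responded with %d\n", url, resp.StatusCode)
				resp.Body.Close()
			}
			interval *= 2 // back off before the next attempt
			if interval > maxInterval {
				interval = maxInterval
			}
			fmt.Printf("Backing off send interval. Sleeping for %s\n", interval)
		}
		time.Sleep(interval)
	}
}

func main() {
	syncWithBackoff("https://aggregator.example/sync") // placeholder URL
}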

Comment 3 Mihir Lele 2022-06-02 15:02:34 UTC
Slight correction: both 500-level errors occur on the managed cluster, not on the hub cluster

Comment 15 Jorge Padilla 2022-09-28 02:23:51 UTC
Code fix has been merged for the next 2.6.z and 2.5.z releases.

Comment 16 Atif 2022-10-07 13:27:02 UTC
Created test case https://polarion.engineering.redhat.com/polarion/#/project/RHACM4K/workitem?id=RHACM4K-20159, verified on v2.5.3-FC0.

Comment 19 bot-tracker-sync 2022-10-12 21:12:28 UTC
G2Bsync 1276545670 comment 
 oafischer Wed, 12 Oct 2022 18:02:15 UTC 

Can someone provide a quick summary of the fix? Thanks!

Comment 20 bot-tracker-sync 2022-10-12 21:12:30 UTC
G2Bsync 1276565270 comment 
 jlpadilla Wed, 12 Oct 2022 18:21:38 UTC 

### Change Summary
The search aggregator logic was updated to avoid concurrent sync requests from a given managed cluster. This problem appeared on larger clusters, where the aggregator takes longer to sync the data.
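
Based on that summary, the fix amounts to a per-cluster guard on the aggregator: while one sync from a managed cluster is in flight, a second request from the same cluster is answered with a busy response instead of being processed concurrently, so the aggregator never holds several large payloads from one cluster in memory at once. A minimal Go sketch of such a guard, with a hypothetical handler and route (this is illustrative, not the actual search-aggregator code):

package main

import (
	"log"
	"net/http"
	"sync"
)

// inFlight tracks which managed clusters currently have a sync in progress.
var inFlight sync.Map

// handleSync rejects a second sync from the same cluster while one is
// still being processed. Illustrative only; the real API encodes the
// cluster name in the request path, not a query parameter.
func handleSync(w http.ResponseWriter, r *http.Request) {
	cluster := r.URL.Query().Get("cluster") // hypothetical parameter
	if _, busy := inFlight.LoadOrStore(cluster, true); busy {
		// Another sync for this cluster is running: answer busy so the
		// collector backs off instead of piling up large payloads here.
		http.Error(w, "sync already in progress", http.StatusServiceUnavailable)
		return
	}
	defer inFlight.Delete(cluster)

	processSync(cluster) // decode and index the payload (omitted)
	w.WriteHeader(http.StatusOK)
}

// processSync stands in for the real work of merging resources and edges
// into the search index.
func processSync(cluster string) {}

func main() {
	http.HandleFunc("/aggregator/sync", handleSync)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The 503 busy response here lines up with the "Received busy response from Aggregator. Resending in 105000 ms." lines in the collector logs above.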

Comment 21 bot-tracker-sync 2022-10-12 21:12:32 UTC
G2Bsync 1276568142 comment 
 oafischer Wed, 12 Oct 2022 18:24:32 UTC 

Thanks @jlpadilla! I appreciate the quick response.

Comment 24 errata-xmlrpc 2022-10-13 19:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.5.3 security fixes and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6954

