Description of the problem: The search-aggregator pod on the hub is continuously getting OOMKilled and throwing 504 Gateway Timeout errors, while managed clusters throw 503 Service Unavailable errors.

Additional info: The search-aggregator pod is getting OOMKilled and restarted, and the following 504 Gateway Timeout errors are logged:

```
Running inside klusterlet. Aggregator URL: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses
E0602 10:03:09.808478 1 sender.go:253] Sync sender error. POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 504 Message: 504 Gateway Timeout
E0602 10:03:09.808487 1 sender.go:305] SEND ERROR: POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 504 Message: 504 Gateway Timeout
W0602 10:03:09.808495 1 sender.go:319] Backing off send interval because of error response from aggregator. Sleeping for 10m0s
```

On the other hand, the managed cluster throws 503 errors:

```
W0602 09:52:46.964045 1 sender.go:174] Received busy response from Aggregator. Resending in 105000 ms.
I0602 09:54:31.964363 1 sender.go:193] Sending Resources { request: 48149, add: 10664, update: 0, delete: 0 edge add: 5682 edge delete: 0 }
W0602 09:54:34.322656 1 sender.go:178] Received error response [POST to: https://api.aws-emea.com:6443/apis/proxy.open-cluster-management.io/v1beta1/namespaces/cluster11/clusterstatuses/cluster11/aggregator/sync responded with error. StatusCode: 503 Message: 503 Service Unavailable] from Aggregator. Resending in 120000 ms after resetting config.
```
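The log lines above show the collector's reaction to the 503/504 responses: it grows its send interval and retries later. The following is a minimal Go sketch of that retry-with-backoff pattern under stated assumptions, not the actual sender.go implementation; the function names (postSync, sendWithBackoff), the example URL, and the initial/maximum wait values are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postSync sends one sync payload to the aggregator and returns the HTTP status.
// URL and payload handling are placeholders; the real collector builds these from
// its klusterlet config and resource diff.
func postSync(url string, payload []byte) (int, error) {
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// sendWithBackoff retries the sync, doubling the wait after each error or busy
// response (e.g. 503/504) up to maxWait, similar in spirit to the "Backing off
// send interval" / "Resending in ... ms" messages in the logs above.
func sendWithBackoff(url string, payload []byte, maxWait time.Duration) {
	wait := 30 * time.Second
	for {
		status, err := postSync(url, payload)
		if err == nil && status < 400 {
			fmt.Println("sync accepted")
			return
		}
		fmt.Printf("sync failed (status=%d err=%v), resending in %s\n", status, err, wait)
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	// Hypothetical aggregator sync endpoint; the real URL comes from the klusterlet config.
	sendWithBackoff("https://api.example.com:6443/.../aggregator/sync", []byte(`{}`), 10*time.Minute)
}
```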
Slight correction: both of the 5xx errors occur on the managed cluster, not on the hub cluster.
The code fix has been merged for the next 2.6.z and 2.5.z releases.
Created test case RHACM4K-20159 (https://polarion.engineering.redhat.com/polarion/#/project/RHACM4K/workitem?id=RHACM4K-20159); verified on v2.5.3-FC0.
G2Bsync 1276545670 comment oafischer Wed, 12 Oct 2022 18:02:15 UTC G2Bsync Can someone provide a quick summary of the fix? Thanks!
G2Bsync 1276565270 comment jlpadilla Wed, 12 Oct 2022 18:21:38 UTC G2Bsync
### Change Summary
The search aggregator logic was updated to avoid concurrent sync requests from a given managed cluster. This problem appeared in larger clusters, where the aggregator needs longer to sync the data.
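For illustration only, here is a minimal Go sketch of the kind of per-cluster guard described above, serializing sync requests from a given managed cluster. The type and handler names (syncGuard, syncHandler), the query-parameter cluster lookup, and the 429 status chosen for a duplicate request are all assumptions, not the actual aggregator code.

```go
package main

import (
	"net/http"
	"sync"
)

// syncGuard tracks which managed clusters currently have a sync in flight so a
// second, concurrent request from the same cluster is rejected instead of piling up.
type syncGuard struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

func newSyncGuard() *syncGuard {
	return &syncGuard{inFlight: make(map[string]bool)}
}

// tryAcquire returns false if the cluster already has a sync running.
func (g *syncGuard) tryAcquire(cluster string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight[cluster] {
		return false
	}
	g.inFlight[cluster] = true
	return true
}

func (g *syncGuard) release(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.inFlight, cluster)
}

// syncHandler wraps the sync processing: a duplicate request for a cluster that is
// still syncing gets an immediate "busy" response so the collector backs off and retries.
func syncHandler(guard *syncGuard, process func(cluster string, r *http.Request) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		cluster := r.URL.Query().Get("cluster") // placeholder; the real API derives this from the request path
		if !guard.tryAcquire(cluster) {
			http.Error(w, "sync already in progress for "+cluster, http.StatusTooManyRequests)
			return
		}
		defer guard.release(cluster)
		if err := process(cluster, r); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	guard := newSyncGuard()
	http.Handle("/aggregator/sync", syncHandler(guard, func(cluster string, r *http.Request) error {
		// The real aggregator would parse the payload and write it to the search index here.
		return nil
	}))
	http.ListenAndServe(":8080", nil)
}
```

The key point is that only one sync per cluster is processed at a time; any overlapping request is turned away quickly, which keeps the aggregator's memory from ballooning while a large cluster's sync is still in progress.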
G2Bsync 1276568142 comment oafischer Wed, 12 Oct 2022 18:24:32 UTC G2Bsync Thanks @jlpadilla! I appreciate the quick response.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.5.3 security fixes and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6954