Created attachment 1848318 [details]
Screenshot of situation

Description of problem:
After some workarounds and several restarts, the Application overview in the web UI stays empty; changing browsers etc. does not help. Chrome reports "GraphQL Error: Search service is unavailable" in the dev tools view.

Version-Release number of selected component (if applicable):
- OCP 4.8.18
- ACM 2.4

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Application view shows a blank page

Expected results:

Additional info:
From logs open-cluster-management/multicluster-operators-application-5968d497c5-grh4r/:

2021-12-29T09:04:14.766799176Z I1229 09:04:14.766648 1 event.go:282] Event(v1.ObjectReference{Kind:"Application", Namespace:"openshift-config", Name:"openshift-config", UID:"317b6fc0-15f6-43a4-b95a-919518f10917", APIVersion:"app.k8s.io/v1beta1", ResourceVersion:"1028375760", FieldPath:""}): type: 'Normal' reason: 'Update' The app annotations updated. App:openshift-config/openshift-config
2021-12-29T09:04:14.866829429Z I1229 09:04:14.866289 1 application_controller.go:199] Reconciling Application:openshift-gitops/gitops-operator with Get err:<nil>
2021-12-29T09:04:15.067522210Z I1229 09:04:15.067433 1 event.go:282] Event(v1.ObjectReference{Kind:"Application", Namespace:"openshift-gitops", Name:"gitops-operator", UID:"92030c52-f4ab-49ff-a22c-0e3b131b228b", APIVersion:"app.k8s.io/v1beta1", ResourceVersion:"1028375779", FieldPath:""}): type: 'Normal' reason: 'Update' The app annotations updated. App:openshift-gitops/gitops-operator
2021-12-29T09:04:15.078776542Z I1229 09:04:15.078713 1 application_controller.go:199] Reconciling Application:openshift-operators/gitops-operator-base with Get err:<nil>
2021-12-29T09:04:15.080274555Z I1229 09:04:15.079000 1 application_controller.go:199] Reconciling Application:harbor/harbor with Get err:<nil>
2021-12-29T09:04:15.080274555Z I1229 09:04:15.079118 1 application_controller.go:199] Reconciling Application:openshift-config/openshift-config with Get err:<nil>
Hi, is there anything new with this issue?
G2Bsync 1004900920 comment KevinFCormier Tue, 04 Jan 2022 15:21:19 UTC G2Bsync Please check the status of all pods in `open-cluster-management`. The error message you provided suggests the search pods may not be running properly. Usually the UI can tolerate this scenario and degrades gracefully, but it looks like you are seeing a completely blank screen. Can you provide the full console output from the browser dev tools? Can you also check whether clearing your browser cache helps?
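For reference, a minimal way to check those pods (a sketch; the namespace name is taken from this report):

    oc get pods -n open-cluster-management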
oc logs -l 'component in (search-ui,search-api,redisgraph)'

[2022-01-10T08:41:45.417] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:52.549] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.973] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.978] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.978] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.980] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.982] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:41:58.982] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:42:43.341] [INFO] [search-api] [server] Role configuration has changed. User RBAC cache has been deleted
[2022-01-10T08:44:01.558] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:01.559] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:01.560] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:01.562] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:03.302] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:03.303] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:03.312] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:03.313] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:03.314] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:01.571] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
[2022-01-10T08:44:04.111] [ERROR] [search-api] [server] Unable to resolve search request because RedisGraph is unavailable.
However, all pods are running:

oc get po
NAME                                                             READY   STATUS    RESTARTS   AGE
application-chart-a9e0f-applicationui-6978c7b79c-btfsf           1/1     Running   0          89m
application-chart-a9e0f-applicationui-6978c7b79c-cgjbc           1/1     Running   0          89m
application-chart-a9e0f-consoleapi-b79f7dbf7-4jwtc               1/1     Running   0          89m
application-chart-a9e0f-consoleapi-b79f7dbf7-g7wjf               1/1     Running   0          89m
cluster-curator-controller-85f9864cf8-5nmqj                      1/1     Running   0          89m
cluster-curator-controller-85f9864cf8-d8jrp                      1/1     Running   0          89m
cluster-manager-9847c6469-bwv7b                                  1/1     Running   0          89m
cluster-manager-9847c6469-m4vgq                                  1/1     Running   0          89m
cluster-manager-9847c6469-zgfnn                                  1/1     Running   0          89m
clusterclaims-controller-54657d4f4-qhvzf                         2/2     Running   0          89m
clusterclaims-controller-54657d4f4-sxjsv                         2/2     Running   0          89m
clusterlifecycle-state-metrics-v2-6f6f9d64bf-prt59               1/1     Running   4          89m
console-chart-f367a-console-v2-76b7b98fbb-l6lgv                  1/1     Running   0          89m
console-chart-f367a-console-v2-76b7b98fbb-wqc6h                  1/1     Running   0          89m
discovery-operator-65cf659f5b-dc956                              1/1     Running   0          89m
grc-3d889-grcui-6576675b94-4nljz                                 1/1     Running   0          89m
grc-3d889-grcui-6576675b94-stl9p                                 1/1     Running   0          89m
grc-3d889-grcuiapi-65f6cbf74b-8xhd7                              1/1     Running   0          89m
grc-3d889-grcuiapi-65f6cbf74b-94ktb                              1/1     Running   0          89m
grc-3d889-policy-propagator-6b989f5f4f-bzx24                     2/2     Running   0          89m
grc-3d889-policy-propagator-6b989f5f4f-ffvsk                     2/2     Running   0          89m
hive-operator-747df56cf8-c8xft                                   1/1     Running   0          89m
infrastructure-operator-6cd64dbd97-zbn2f                         1/1     Running   0          89m
klusterlet-addon-controller-v2-5bd775d554-jzn8x                  1/1     Running   0          88m
klusterlet-addon-controller-v2-5bd775d554-r785q                  1/1     Running   0          89m
managedcluster-import-controller-v2-6fd89db8f5-jnxz8             1/1     Running   0          88m
managedcluster-import-controller-v2-6fd89db8f5-wxskm             1/1     Running   0          88m
management-ingress-dc13c-546c5d798d-2pgtj                        2/2     Running   0          88m
management-ingress-dc13c-546c5d798d-xrdh6                        2/2     Running   0          88m
multicluster-observability-operator-7dd888d9bb-zvbdx             1/1     Running   0          88m
multicluster-operators-application-5968d497c5-dpdgt              4/4     Running   0          88m
multicluster-operators-channel-c996f46d-l5qx6                    1/1     Running   0          88m
multicluster-operators-hub-subscription-7c9ff6cfc8-wzk28         1/1     Running   0          88m
multicluster-operators-standalone-subscription-59db98c89-2tkhm   1/1     Running   0          88m
multiclusterhub-operator-75b9cc9858-sz4j8                        1/1     Running   0          88m
multiclusterhub-repo-86947c88c8-8h647                            1/1     Running   0          88m
ocm-controller-5cdd8568b-dkdb5                                   1/1     Running   0          88m
ocm-controller-5cdd8568b-pcj5q                                   1/1     Running   0          88m
ocm-proxyserver-7558bcd957-bj7rc                                 1/1     Running   0          88m
ocm-proxyserver-7558bcd957-cxmjv                                 1/1     Running   0          88m
ocm-webhook-79664f88fb-64jfb                                     1/1     Running   0          88m
ocm-webhook-79664f88fb-vhsl4                                     1/1     Running   0          88m
policyreport-d58bb-insights-client-6c84d88f4d-7h2jp              1/1     Running   0          88m
policyreport-d58bb-metrics-7c8ff59785-7t7ts                      2/2     Running   0          88m
provider-credential-controller-6c5547dc45-8fcbf                  2/2     Running   0          88m
search-operator-57549b59c7-gd8t9                                 1/1     Running   0          88m
search-prod-df6df-search-aggregator-6799d668b6-nl6td             1/1     Running   1          88m
search-prod-df6df-search-api-866f9dc-bvdmf                       1/1     Running   0          88m
search-prod-df6df-search-api-866f9dc-w9gzh                       1/1     Running   0          88m
search-prod-df6df-search-collector-6645d984b5-hhl9n              1/1     Running   0          88m
search-redisgraph-0                                              1/1     Running   0          88m
search-ui-d4cdd8554-8r6sx                                        1/1     Running   0          88m
search-ui-d4cdd8554-hcrss                                        1/1     Running   0          88m
submariner-addon-7cc684b685-vp4tn                                1/1     Running   0          88m

The customer has an issue with Redis: search-redisgraph-0 consumes ~40GB RAM.
Asked for details about this pod:

oc get pvc
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
trident-premium-search-redisgraph-0   Bound    pvc-2c662619-d0bf-4346-ab48-3ef6bf860330   99Gi       RWO            trident-premium   6d23

oc describe pvc trident-premium-search-redisgraph-0
Name:          trident-premium-search-redisgraph-0
Namespace:     open-cluster-management
StorageClass:  trident-premium
Status:        Bound
Volume:        pvc-2c662619-d0bf-4346-ab48-3ef6bf860330
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      99Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       search-redisgraph-0
Events:        <none>

The PVC is attached and 12GB is used:

sh-4.4$ df
Filesystem                                                                                            1K-blocks   Used       Available   Use%   Mounted on
overlay                                                                                               125293548   44352500   80941048    36%    /
tmpfs                                                                                                 65536       0          65536       0%     /dev
tmpfs                                                                                                 32844544    0          32844544    0%     /sys/fs/cgroup
shm                                                                                                   65536       12         65524       1%     /dev/shm
tmpfs                                                                                                 32844544    139944     32704600    1%     /etc/hostname
tmpfs                                                                                                 32844544    8          32844536    1%     /certs
/dev/sda4                                                                                             125293548   44352500   80941048    36%    /rg
fs-trident13-data.dst.tk-inline.net:/osmgmt_trident_premium_pvc_2c662619_d0bf_4346_ab48_3ef6bf860330  103809024   12846080   90962944    13%    /redis-data
tmpfs                                                                                                 32844544    28         32844516    1%     /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                 32844544    0          32844544    0%     /proc/acpi
tmpfs                                                                                                 32844544    0          32844544    0%     /proc/scsi
tmpfs                                                                                                 32844544    0          32844544    0%     /sys/firmware

sh-4.4$ cd /redis-data/
sh-4.4$ ls
dump.rdb
sh-4.4$ ls -lah
total 12G
drwxrwxrwx. 2 root  root 4.0K Jan 8 17:14 .
dr-xr-xr-x. 1 root  root   59 Jan 10 09:41 ..
-rw-r--r--. 1 redis root  12G Jan 8 17:08 dump.rdb

Details from web dev tools not sent.
G2Bsync 1009007731 comment KevinFCormier Mon, 10 Jan 2022 15:42:47 UTC G2Bsync Re-assigning to observability-usa to investigate.
The search service seems to be under scalability stress. It looks like the Application UI behaves differently when search is disabled vs. enabled but experiencing problems. We have some options, but first I want to understand this specific scenario.
- Could I get the must-gather data for this cluster?
- How many clusters are managed by this instance of ACM? A possible workaround is to disable search collection for some of the managed clusters.
- Another option is to completely disable the search service. This should restore the Application UI, but it will run in degraded mode.
Logs from web browser attached, but must-gather is too big...
There are 5 managed clusters with a total of around 150 nodes.
5 managed clusters and 150 nodes is not a large environment, so we'll need the logs from the search service pods to understand what is happening. Is it possible to extract only this data from the must-gather output? There should be 5 pods with names starting with `search-*`; these are in the open-cluster-management namespace. Also, could you please confirm that there's a similar error in the ACM Search UI?
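If it helps, a minimal sketch for pulling only the search-related pod logs directly from the cluster (assumes oc access to the hub; pods are matched by the search- prefix mentioned above):

    for p in $(oc get pods -n open-cluster-management -o name | grep 'search-'); do
      # write each pod's logs to a file named after the pod
      oc logs -n open-cluster-management "$p" --all-containers > "$(basename "$p").log"
    done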
I missed that the must-gather is not compressed. The attached must-gather is almost complete; the only parts missing are the logs in the open-cluster-management-observability namespace.
And the customer verified that the Search UI throws an error as well. The customer once fixed the issue by deleting the Redis PVC, but the error came back a few days later.
The problem comes back every 2-3 days; we need a workaround that lasts longer.
First, I want to confirm how much memory the redisgraph pod is actually using. I see that the memory limit is set to 24Gi, but Kubernetes doesn't guarantee that memory if it's needed by another process with higher priority. You can use the OCP console to monitor whether the pod is reaching the memory limit or being terminated before it gets there.

If the pod is being terminated before reaching the memory limit, the solution is to add a memory REQUEST so that the memory is guaranteed to be reserved for the pod.

If the pod is reaching the memory limit, then our next option is to disable the search collector for some of the managed clusters. Unfortunately, this means that we won't be able to show some data from those clusters. To disable search for a managed cluster, edit the KlusterletAddonConfig resource and set searchCollector enabled to false:

`oc edit klusterletaddonconfig <clusterName> -n <clusterName>`
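For the memory REQUEST option, a minimal sketch (this assumes the search-redisgraph StatefulSet can be patched directly; the search operator may reconcile manual changes back, and the 24Gi value is illustrative only):

    # reserve memory for the redisgraph pod so it is not evicted under node pressure
    oc set resources statefulset/search-redisgraph -n open-cluster-management --requests=memory=24Gi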
Thanks for the additional information. I was going in the wrong direction; this problem isn't caused by memory. We can confirm that because there are no restarts of the pod and the memory graph shows very low consumption.

Redis is in a bad state. I see several issues, but I need to investigate more to determine the root cause. The main suspect is a problem reading from the PVC. As reported in comment #12, deleting the PVC temporarily resolves the problem, but I couldn't find any details in the attached logs. A current copy of the logs could help.

Can I get clarification on when the problem appeared? From comment #3 the age of the ACM pods is only 89 minutes, so the problem appeared quickly the first time, but from comment #13 it took 2-3 days after the workaround of deleting the PVC. I'm trying to understand how we can recreate it.

Other notes:
1. There seems to be a discrepancy with the PVC size. From comment #3 the PVC capacity is 99Gi, with 12Gi used. However, I see storageSize set to 10Gi in the operator logs:
   {"persistence": true, "storageClass": "", "storageSize": "10Gi", "fallbackToEmptyDir": true}
2. The Redis log shows a high number of dropped connections:
   SSL_accept: Peer suddenly disconnected
3. The aggregator log suggests that Redis is stuck in a LOADING state:
   redisinterfacev2.go:47] Error fetching results from RedisGraph V2 : LOADING Redis is loading the dataset in memory
   clusterWatch.go:150] Error on UpdateByName() LOADING Redis is loading the dataset in memory
4. The API log shows frequent connection drops and reconnects:
   [2021-12-28T18:43:44.988] [INFO] [search-api] [server] Error with Redis connection.
   [2021-12-28T18:44:00.809] [INFO] [search-api] [server] Redis Client connected.
Adding to Jorge's comment #20.

(1) Notice from the logs that the queries to Redis stop working after a restart of the search-aggregator or search-api pod. Does the Redis connection problem occur ONLY after the search-redisgraph-0 pod restarts, OR do all functions work fine and then, all of a sudden, the search pods throw the Redis connection error [Error on UpdateByName() LOADING Redis is loading the dataset in memory]? Please clarify.

(2) The data in memory is periodically synchronized to the PVC. If the search-redisgraph-0 pod restarts, the data from the PVC is read back to rebuild the cache quickly. My impression is that the data in the PVC is somehow not in a good state when it is read back into memory. This theory holds only if you are seeing the problem after the search-redisgraph-0 pod is restarted.

(3) If we get into this situation again, may I request that you collect a must-gather before any of the pods are restarted.
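To help answer (1) and (2), a minimal sketch for checking whether search-redisgraph-0 has restarted and, if so, why (standard Kubernetes status fields; the namespace is assumed from this report):

    # prints the restart count and the reason for the last termination, e.g. "1 OOMKilled"
    oc get pod search-redisgraph-0 -n open-cluster-management \
      -o jsonpath='{.status.containerStatuses[0].restartCount}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'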
In the latest log, the search-aggregator pod was OOMKilled at 2022-01-19T16:36:32Z. I believe the aggregator was killed in the middle of a write and corrupted the data in Redis. Unfortunately, the logs for the search-redisgraph and search-aggregator pods around this time aren't in the must-gather data; the earliest log starts at 2022-01-22T11:16:17 for Redis and 2022-01-22T08:46:06 for the aggregator. A potential workaround is to increase the memory limit on the search-aggregator. We'll need a code fix if we confirm this is the real root cause.
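If we try that workaround, a hedged sketch of raising the search-aggregator memory (the deployment name includes the instance-specific prefix seen in the pod list above, the search operator may reconcile manual changes, and the values are illustrative only):

    oc set resources deployment/search-prod-df6df-search-aggregator \
      -n open-cluster-management --requests=memory=2Gi --limits=memory=4Gi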
The customer verified that the Redis pod consumes up to ~40GB of RAM before being OOMKilled. He has increased the RAM limit for this pod to 48GB.
Thank you @mheppler .

(1) We noticed that the Redis connection failed after the search-aggregator pod was OOMKilled, so along with the search-redisgraph pod, please increase the limit for the search-aggregator pod (try double the size for now).

(2) Fundamentally, search uses an in-memory database, whose memory requirement grows as the number of resources from the managed clusters goes up. We realize at this point that the customer's ACM has a lot of resources, pushing the redisgraph memory requirement as high as 48Gi, which is near our limits. At this point, can we suggest that the customer turn off search on some of the managed clusters? This will reduce the memory requirement; the downside is that search results from those managed clusters will not be available in the UI. To disable search for a managed cluster, edit the KlusterletAddonConfig resource and set searchCollector enabled to false:

`oc edit klusterletaddonconfig <clusterName> -n <clusterName>`

Please note that this is a workaround for now. We are working on replacing the in-memory database with a scalable database in future releases.
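A hedged non-interactive equivalent of the edit above (assumes the spec.searchCollector.enabled field described in the ACM documentation; replace <clusterName> with the managed cluster name):

    # disable the search collector add-on for one managed cluster
    oc patch klusterletaddonconfig <clusterName> -n <clusterName> \
      --type merge -p '{"spec":{"searchCollector":{"enabled":false}}}'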
This topic has more information about turning off the search add-on on managed clusters. https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.4/html/clusters/managing-your-clusters#modifying-the-klusterlet-add-ons-settings-of-your-cluster
The customer accepted the workaround, but he is not very happy; search is a huge benefit of ACM for him. Do you have a time plan for replacing the in-memory database, as this workaround is limiting for the customer?
Product management, Scott Berens, has offered to meet with the customer to discuss the Search replacement roadmap.
Thank you @mheppler . We are working to meet the customer's technical contact. In the meantime, we also want to suggest that the customer disable search persistence for now, which will make the system stop backing up the search data to the PVC. This will not affect search functionality. You can do this by creating a SearchCustomization CR. You can read more here:
https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.4/html/web_console/web-console#search-customization

Example:

apiVersion: search.open-cluster-management.io/v1alpha1
kind: SearchCustomization
metadata:
  name: searchcustomization
  namespace: open-cluster-management
spec:
  persistence: false
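A minimal way to apply the example above (assuming it is saved locally as searchcustomization.yaml):

    oc apply -f searchcustomization.yaml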
(In reply to Xavier from comment #29)

We had the same issue: the "Overview" page just showed "Search service is unavailable". I don't know if you have any applications, but after applying

> apiVersion: search.open-cluster-management.io/v1alpha1
> kind: SearchCustomization
> metadata:
>   name: searchcustomization
>   namespace: open-cluster-management
> spec:
>   persistence: false

our deployed application vanished from the "Applications" view, but happily only there.

Second, the volume became empty:

oc get pvc search-redisgraph-pvc-0
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
search-redisgraph-pvc-0   Bound    pvc-d8796b8f-b1e1-4b8c-a349-d9e5795377c4   10Gi       RWO            gp2            27d

which might explain why the application vanished from the view.
@dmitri.voronov The ACM Overview page and the ACM Applications table all leverage Search as the backend for their UI. If there is a problem with search/redisgraph, those pages cannot properly serve the request, even though in your case those applications really do exist.
@dmitri.voronov Removing the persistence setting also clears the redis cache, which explains why some data vanished. This should be temporary until the cache is repopulated. This process can take from a few minutes up to a few hours depending on the size of the managed clusters.
Thanks to all!

(In reply to Scott Berens from comment #32)
That's clear: if search is not working, it affects RHACM, at least the UI.

(In reply to Jorge Padilla from comment #33)
> This process can take from a few minutes up to a few hours depending on the size of the managed clusters.

That's clear too, but I think the prerequisite for cache repopulation is a working search function, which is not working after disabling persistence :-( Or is there any other way to force the cache repopulation?
After switching search persistence off, which shifted the issue to the search functionality, I switched search persistence back ON, but this time I also restarted the search-operator, search-aggregator and search-collector pods. The search feature now seems to have normalized; both the "overview" and "applications" views look good and search is working again. I'm not sure which restart caused the recovery, but it is finally working. The only question: what might have caused such behavior?
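For anyone hitting the same state, a hedged sketch of those restarts (the deployment names include the instance-specific prefix seen in the pod list earlier in this report; adjust to your environment):

    oc rollout restart deployment/search-operator -n open-cluster-management
    oc rollout restart deployment/search-prod-df6df-search-aggregator -n open-cluster-management
    oc rollout restart deployment/search-prod-df6df-search-collector -n open-cluster-management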
@dmitri.voronov After deleting the database, the search-collector detected the missing data and most likely hit BZ 2046553 during the resync. Restarting the pod got it out of the error state. A permanent fix for BZ 2046553 is included in ACM 2.4.2.