Description of problem:
The customer is seeing a 504 gateway error in the Kibana GUI after projects populate in the logging stack. The logging dumps did not show resources being an issue. This was reproducible on their 3.11 cluster as well.

Cluster Logging - 4.5.0-202008261951.p0
Elasticsearch Operator - 4.5.0-202008310950.p0

How reproducible:
Every time

Environment:
OCP version 4.5.6
Provider: Azure

Steps to Reproduce:
1. Install Elasticsearch Operator
2. Install Cluster Logging Operator
3. Create a Cluster Logging instance
4. Access the Kibana page
5. Create projects: for i in $(seq 1 50); do oc new-project app-$i; done
6. Refresh the Kibana page
7. Repeat step 5 if necessary, changing the sequence numbers (e.g. 51 100)

Actual results:
504 gateway error in the Kibana GUI

Expected results:
Functional GUI

Additional info:
This happens after the customer populates their EFK stack with more than 300 projects. If the projects are deleted, Kibana is accessible again. We are wondering if this is a resource issue or a bug. Below are the customer's resources allocated to their cluster:

spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      type: fluentd
  curation:
    curator:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      resources:
        limits:
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          cpu: "12"
          memory: 48Gi
        requests:
          cpu: "12"
          memory: 48Gi
      storage:
        size: 1024G
        storageClassName: managed-premium
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      proxy:
        resources:
          limits:
            memory: 16Gi
          requests:
            cpu: 4
            memory: 16Gi
      replicas: 1
      resources:
        limits:
          memory: 16Gi
        requests:
          cpu: 4
          memory: 16Gi
    type: kibana
This appears to be a duplicate of another issue already filed. Please provide a snapshot of the environment using [1].

[1] https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
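For reference, a minimal sketch of how such a snapshot is typically collected, assuming the cluster-logging-operator deployment lives in the openshift-logging namespace and its first container carries the operator image (adjust names to the actual environment):

  # Resolve the cluster-logging-operator image and run must-gather with it
  CLO_IMAGE=$(oc -n openshift-logging get deployment cluster-logging-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  oc adm must-gather --image="$CLO_IMAGE"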
Jeff, the requested snapshot has been provided. Could you please confirm whether this problem is the same as the one already mapped? If so, what is the expected date of the new release with this fix?
Moving this to the 4.7.0 target release, because looking at the must-gather, I notice the following:

- The cluster is not healthy:

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1600890027 19:40:27  elasticsearch yellow          3         3     15  13    0    5        6             0                  -                 57.7%

- Its indices are still not green:

health status index                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana                                          IhEIguVUSj26KB01IcW-Jg   1   1          9            0          0              0
yellow open   app-write                                        IThfRVJ2R-e9s8Gkm2g3Mg   5   1   43160171            0      44024          44024
green  open   .searchguard                                     SX29HbGNS_m4kFRtvwkG6Q   1   1          5            2         50             24
yellow open   infra-write                                      1V6PDsJEQUW0eUuZUA603g   5   1   73137548            0     105572         105572
yellow open   .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac YuKlCIlQTGqlljXR_f5U8w   1   1          2            0          0              0

- There are many unassigned shards:

app-write                                        3 r UNASSIGNED CLUSTER_RECOVERED
app-write                                        2 r UNASSIGNED CLUSTER_RECOVERED
app-write                                        0 r UNASSIGNED CLUSTER_RECOVERED
.kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac 0 r UNASSIGNED CLUSTER_RECOVERED
infra-write                                      1 r UNASSIGNED CLUSTER_RECOVERED
infra-write                                      4 r UNASSIGNED CLUSTER_RECOVERED

- It operates close to its limits regarding memory:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.22.19           34          93  31    6.23    5.87     5.85 mdi       *      elasticsearch-cdm-iyeh4ot5-2
10.128.8.16            17          94  35    4.96    5.23     5.22 mdi       -      elasticsearch-cdm-iyeh4ot5-1
10.130.22.10           52          92  35    7.07    7.74     7.67 mdi       -      elasticsearch-cdm-iyeh4ot5-3
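For anyone reproducing this analysis against their own cluster, a rough sketch of the queries behind the output above, assuming the es_util helper that ships in the OpenShift Elasticsearch image and an ES pod in openshift-logging (pod selection is illustrative):

  # Pick one Elasticsearch pod and run the standard _cat diagnostics through es_util
  pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=_cat/health?v
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=_cat/indices?v
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query="_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=_cat/nodes?v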
(In reply to Periklis Tsirakidis from comment #4)
> Moving this to the 4.7.0 target release, because looking at the must-gather,
> I notice the following:
> - The cluster is not healthy [...]
> - Its indices are still not green [...]

This is a clear indication that there was either an incorrect upgrade or a mismatch in the operators which were deployed. The CLO in 4.5 and greater ships logs by writing to an alias with a "-write" suffix which points to an underlying index. The 4.5 version of the EO configures ES clusters to block autocreation of any index with this name until the EO can set up the store. Either there was an upgrade where the EO was upgraded after the CLO, or a 4.4 EO was originally deployed along with a 4.5 CLO. There are safeguards in place which are supposed to block this scenario, but something did not work.

The workaround options:
* Undeploy and redeploy logging and remove the PVCs.
* Temporarily undeploy the log collector and reindex the existing indices to allow the operator to properly set up the store.
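As one hedged way to sanity-check this analysis (reusing the $pod variable from the sketch above): if the "-write" names resolve to concrete indices rather than aliases, the operator-managed data model was not set up as intended.

  # List aliases and check what "app-write" actually resolves to
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=_cat/aliases?v
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=_alias/app-write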
(In reply to dtarabor from comment #9)
> @Periklis Tsirakidis
>
> I am curious if your analysis still applies to the newest must-gather
> attached in comment #7?

Looking at the attachment from #7 makes more sense of what is happening than #c4, though I believe the analysis still applies. The ES deployment is running out of memory, which is evidenced by:

* A high number of indices and shards
* Long GC times

It is also interesting to note that many of the indices are from the previous data model (e.g. "project.*") and they don't appear to have been aliased to include "app" or "infra". I'm unable to find any runs of curator to clean up old indices, but they seem to be fairly new, so I'm not certain it would matter. This will clear up as old indices are rolled off and new logs are written to the new data model. In the interim, the following options are available (see the sketch below for the cleanup option):

* Add more resources (e.g. memory) to the ES nodes
* Add another ES node
* Evaluate the "project.*" and ".operations*" indices and remove those which are not required (oc exec -c elasticsearch $pod -- es_util --query=$index -XDELETE)
* Reduce the redundancy policy to "ZeroRedundancy"

We may need to also provide additional input if the cluster remains unable to land indices.
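To make the cleanup option concrete, a hedged sketch of removing legacy "project.*" indices in bulk, built around the es_util delete command given above; the pod selection and index filter are assumptions to adapt to the actual cluster, and the list should be reviewed before deleting anything:

  # Iterate over legacy project.* indices and delete them one by one
  pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
  for index in $(oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query="_cat/indices?h=index" | grep '^project\.'); do
    oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query=$index -XDELETE
  done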
(In reply to Bruno Furtado from comment #14)
> What catches my attention is the fact that after deleting a number x of
> namespaces, Kibana works again.

This makes sense, because a certain amount of RAM is necessary to maintain the mappings for each index. This is exactly the reason we moved to a new data model starting in 4.5, which modifies the structure to take advantage of Elasticsearch's strengths.
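A small hedged check of that correlation, reusing the $pod variable from the earlier sketches: compare the index count and node heap usage before and after deleting namespaces.

  # Count indices and show heap usage; rerun after deleting namespaces to compare
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query="_cat/indices?h=index" | wc -l
  oc -n openshift-logging exec -c elasticsearch $pod -- es_util --query="_cat/nodes?v&h=name,heap.percent,ram.percent"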
*** This bug has been marked as a duplicate of bug 1883357 ***