Bug 1881737 - 504 gateway error in Kibana after project data populated
Summary: 504 gateway error in Kibana after project data populated
Keywords:
Status: CLOSED DUPLICATE of bug 1883357
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-23 00:15 UTC by dtarabor
Modified: 2020-10-02 06:34 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 19:42:52 UTC
Target Upstream Version:
Embargoed:



Description dtarabor 2020-09-23 00:15:37 UTC
Description of problem:

The customer is seeing a 504 gateway error in the Kibana GUI after projects are populated in the logging stack.
Resources did not appear to be an issue in the logging dumps.
This was also reproducible on their 3.11 cluster.

Cluster Logging - 4.5.0-202008261951.p0
Elasticsearch Operator - 4.5.0-202008310950.p0

How reproducible: Every time

Environment: OCP version 4.5.6
Provider: Azure

Steps to reproduce: [1]
1. Install Elasticsearch Operator
2. Install Cluster Logging Operator
3. Create a Cluster Logging instance
4. Access Kibana page
5. Create projects: for i in $(seq 1 50); do oc new-project app-$i; done
6. Refresh Kibana page
7. Repeat step 5 if necessary, changing the sequence numbers (e.g. 51 100)
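
A condensed reproduction sketch under the same assumptions (project names and count are illustrative; the report notes the error appears with 300+ projects and clears once they are deleted):

# create ~300 projects to reach the scale where the 504 appears
for i in $(seq 1 300); do oc new-project "app-$i"; done

# deleting the projects restores Kibana access, per the report
for i in $(seq 1 300); do oc delete project "app-$i"; done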

Actual results:

504 gateway error in the Kibana GUI

Expected results:

A functional Kibana GUI

Additional info:

This happens after the customer populates their EFK stack with more than 300 projects. If the projects are deleted, Kibana is accessible again.

We are wondering whether this is a resource issue or a bug.
Below is the resource allocation from the customer's ClusterLogging instance:

spec:
  collection:
    logs:
      fluentd:
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
      type: fluentd
  curation:
    curator:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      resources:
        limits:
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
      schedule: 30 3 * * *
    type: curator
  logStore:
    elasticsearch:
      nodeCount: 3
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      redundancyPolicy: SingleRedundancy
      resources:
        limits:
          cpu: "12"
          memory: 48Gi
        requests:
          cpu: "12"
          memory: 48Gi
      storage:
        size: 1024G
        storageClassName: managed-premium
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
        type: logging
      proxy:
        resources:
          limits:
            memory: 16Gi
          requests:
            cpu: 4
            memory: 16Gi
      replicas: 1
      resources:
        limits:
          memory: 16Gi
        requests:
          cpu: 4
          memory: 16Gi
    type: kibana

Comment 1 Jeff Cantrill 2020-09-23 15:42:39 UTC
This appears to be a duplicate of another issue already filed. Please provide a snapshot of the environment using [1].

[1] https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
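
For reference, a typical way to run the logging must-gather against the cluster (the namespace and the jsonpath lookup below are assumptions based on a default installation; adjust if the operator is installed elsewhere):

# look up the image of the running cluster-logging-operator
IMAGE=$(oc -n openshift-logging get deployment cluster-logging-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}')

# collect the logging snapshot with that image
oc adm must-gather --image="$IMAGE"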

Comment 3 Bruno Furtado 2020-09-23 22:00:39 UTC
Jeff,

The requested snapshot has been provided. Please confirm whether this problem is the same as the one already mapped and, if so, what is the expected date of the new release with the fix?

Comment 4 Periklis Tsirakidis 2020-09-24 14:05:20 UTC
Moving this to the 4.7.0 target release, because looking at the must-gather I notice the following:

- The cluster is not healthy: 

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1600890027 19:40:27  elasticsearch yellow          3         3     15  13    0    5        6             0                  -                 57.7%

- Its indices are still not green:

health status index                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana                                          IhEIguVUSj26KB01IcW-Jg   1   1          9            0          0              0
yellow open   app-write                                        IThfRVJ2R-e9s8Gkm2g3Mg   5   1   43160171            0      44024          44024
green  open   .searchguard                                     SX29HbGNS_m4kFRtvwkG6Q   1   1          5            2         50             24
yellow open   infra-write                                      1V6PDsJEQUW0eUuZUA603g   5   1   73137548            0     105572         105572
yellow open   .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac YuKlCIlQTGqlljXR_f5U8w   1   1          2            0          0              0

- There are many unassigned shards:
app-write                                        3 r UNASSIGNED   CLUSTER_RECOVERED
app-write                                        2 r UNASSIGNED   CLUSTER_RECOVERED
app-write                                        0 r UNASSIGNED   CLUSTER_RECOVERED
.kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac 0 r UNASSIGNED   CLUSTER_RECOVERED
infra-write                                      1 r UNASSIGNED   CLUSTER_RECOVERED
infra-write                                      4 r UNASSIGNED   CLUSTER_RECOVERED

- It is operating close to its memory limits:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.22.19           34          93  31    6.23    5.87     5.85 mdi       *      elasticsearch-cdm-iyeh4ot5-2
10.128.8.16            17          94  35    4.96    5.23     5.22 mdi       -      elasticsearch-cdm-iyeh4ot5-1
10.130.22.10           52          92  35    7.07    7.74     7.67 mdi       -      elasticsearch-cdm-iyeh4ot5-3
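
For anyone reproducing these checks, the same information can be pulled with the es_util wrapper used later in this bug (the pod selector is an assumption; any elasticsearch pod in the cluster will do):

ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query='_cat/health?v'     # cluster status
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query='_cat/indices?v'    # per-index health
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query='_cat/shards?v' | grep UNASSIGNED
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query='_cat/nodes?v'      # heap/ram per node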

Comment 8 Jeff Cantrill 2020-09-24 17:50:55 UTC
(In reply to Periklis Tsirakidis from comment #4)
> [quoted cluster health and index status from comment #4 snipped]

This is a clear indication that there was either an incorrect upgrade or a mismatch in the operators which were deployed. The CLO in 4.5 and greater ships logs by writing to an alias with a "-write" suffix which points to an underlying index. The 4.5 version of the EO configures ES clusters to block autocreation of any index with this name until the EO can set up the store. Either the EO was upgraded after the CLO, or a 4.4 EO was originally deployed along with a 4.5 CLO. There are safeguards in place which are supposed to block this scenario, but something did not work. The workaround options (a rough sketch of the first one follows the list):

* Undeploy and redeploy logging and remove the PVCs.
* Temporarily undeploy the log collector and reindex the existing indices to allow the operator to properly set up the store.
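
A rough sketch of the first option, assuming the default instance name and namespace (note this discards the stored log data; verify the PVC names before deleting anything):

# remove the ClusterLogging instance (the operators themselves stay installed)
oc -n openshift-logging delete clusterlogging instance

# list the elasticsearch PVCs, then delete them so the store is recreated cleanly
oc -n openshift-logging get pvc
oc -n openshift-logging delete pvc <elasticsearch-pvc-name>

# recreate the ClusterLogging instance from the saved spec (file name is illustrative)
oc apply -f clusterlogging-instance.yaml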

Comment 10 Jeff Cantrill 2020-09-24 19:23:58 UTC
(In reply to dtarabor from comment #9)
> @Periklis Tsirakidis
> 
> I am curious if your analysis still applies to the newest must-gather
> attached in comment #7?
> 
> d

Looking at the attachment from #7 makes more sense of what is happening than #c4, though I believe the analysis still applies. The ES deployment is running out of memory, which is evidenced by:

* High number of indices and shards
* Long GC times

It is also interesting to note that many of the indices are from the previous data model (e.g. "project.*") and do not appear to have been aliased to include "app" or "infra". I am unable to find any runs of curator to clean up old indices, but they seem to be fairly new, so I am not certain it would matter. This will clear up as old indices are rolled off and new logs are written to the new data model. In the interim, the following options are available (a sketch of the index cleanup follows the list):

* Add more resources (e.g. memory) to the ES nodes
* Add another ES node
* Evaluate the "project.*" and ".operations*" indices and remove those which are not required (oc exec -c elasticsearch $pod -- es_util --query=$index -XDELETE)
* Reduce the redundancy policy to "ZeroRedundancy"
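
A hedged sketch of the index cleanup option, reusing the es_util command from the bullet above (the awk field position assumes the default _cat/indices column order; review the list before deleting anything):

ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)

# list indices left over from the old data model
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query=_cat/indices | awk '$3 ~ /^(project\.|\.operations)/ {print $3}'

# delete an index after confirming it is no longer needed
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query=<index-name> -XDELETE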

We may also need to provide additional input if the cluster remains unable to land indices.

Comment 15 Jeff Cantrill 2020-09-25 18:51:23 UTC
(In reply to Bruno Furtado from comment #14)
> What catches my attention is the fact that, after deleting a number x of namespaces,
> Kibana works again.
> 

This makes sense because a certain amount of RAM is necessary to maintain the mappings for each index. This is exactly the reason we moved to a new data model starting in 4.5, which modifies the structure to take advantage of Elasticsearch's strengths.
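
As a quick way to correlate index and shard count with that memory overhead (same es_util wrapper and pod selector assumption as above):

ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)

# each index/shard carries mapping and heap overhead on the data nodes
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query=_cat/indices | wc -l
oc -n openshift-logging exec -c elasticsearch $ES_POD -- es_util --query=_cat/shards | wc -l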

Comment 21 Jeff Cantrill 2020-09-29 19:42:52 UTC

*** This bug has been marked as a duplicate of bug 1883357 ***

