Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1892362

Summary: Elastic Search container fails at 15m mark with TLS error
Product: OpenShift Container Platform
Reporter: Gurenko Alex <agurenko>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED DUPLICATE
QA Contact: Anping Li <anli>
Severity: high
Priority: unspecified
Version: 4.6
CC: aos-bugs, periklis
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2020-11-09 10:35:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Gurenko Alex 2020-10-28 14:45:32 UTC
Description of problem: elasticsearch-cdm pods go into CrashLoopBackOff state about 15 minutes after deployment


Version-Release number of selected component (if applicable):
Fresh OCP 4.6 bare metal deployment


How reproducible: 2/2


Steps to Reproduce:
1. Deploy Cluster Logging according to docs
2. Wait 15 minutes

Actual results:

[kni@ocp-edge12 logging-template]$ oc get pods
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-55f44674f7-lgn95       1/1     Running            0          25m
curator-1603895400-h58bm                        0/1     Completed          0          13m
elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-2-8668f7889b-d2gzk   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-3-577c957945-87s55   1/2     CrashLoopBackOff   7          24m
elasticsearch-delete-app-1603895400-jc2r9       0/1     Completed          0          13m
elasticsearch-delete-audit-1603895400-5bx7w     0/1     Completed          0          13m
elasticsearch-delete-infra-1603895400-nb422     0/1     Completed          0          13m
elasticsearch-rollover-app-1603895400-4bdp9     0/1     Completed          0          13m
elasticsearch-rollover-audit-1603895400-lrwf4   0/1     Completed          0          13m
elasticsearch-rollover-infra-1603895400-gg6bm   0/1     Completed          0          13m
fluentd-6wlkp                                   1/1     Running            0          24m
fluentd-99mnw                                   1/1     Running            0          24m
fluentd-9z4hn                                   1/1     Running            0          24m
fluentd-fclhq                                   1/1     Running            0          24m
fluentd-kvxjj                                   1/1     Running            0          24m
fluentd-lltth                                   1/1     Running            0          24m
fluentd-mgnq9                                   1/1     Running            0          24m
fluentd-nm8kk                                   1/1     Running            0          24m
fluentd-nrshj                                   1/1     Running            0          24m
fluentd-pkvdg                                   1/1     Running            0          24m
fluentd-zgw5g                                   1/1     Running            0          24m
kibana-56c4d996c7-vkphq                         2/2     Running            0          24m
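
To pick the failing pods out of a listing like the one above, the text output of `oc get pods` can be filtered by its STATUS column. This is only a rough sketch: the function names are made up for illustration, the sample is a shortened copy of the listing above, and it assumes every column value is a single whitespace-free token (which holds for this output, but not necessarily for every `oc` version).

```python
# Sketch: list pods whose STATUS is CrashLoopBackOff from `oc get pods` text.
# parse_pods() and crashlooping() are hypothetical helper names.

SAMPLE = """\
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-55f44674f7-lgn95       1/1     Running            0          25m
elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-2-8668f7889b-d2gzk   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-3-577c957945-87s55   1/2     CrashLoopBackOff   7          24m
kibana-56c4d996c7-vkphq                         2/2     Running            0          24m
"""


def parse_pods(listing):
    """Turn the table into a list of dicts keyed by the header names."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumption: each value is one token, so a plain split() lines up
    # with the header columns.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def crashlooping(listing):
    """Return the names of pods currently in CrashLoopBackOff."""
    return [p["NAME"] for p in parse_pods(listing)
            if p["STATUS"] == "CrashLoopBackOff"]


if __name__ == "__main__":
    for name in crashlooping(SAMPLE):
        print(name)
```

In practice the listing would come from the real command output rather than the hard-coded sample.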


Expected results:

Pods stay in Running state.


Additional info:

In two deployments in a row, cluster logging installed and started running as expected, but the elasticsearch-cdm pods then went into CrashLoopBackOff state: after an unmeasured amount of time in the first deployment, and at the 15-minute mark in the second.

[kni@ocp-edge12 logging-template]$ oc logs elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd elasticsearch
Error from server: Get "https://10.46.57.23:10250/containerLogs/openshift-logging/elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd/elasticsearch": remote error: tls: internal error

Comment 1 Gurenko Alex 2020-10-28 17:15:39 UTC
I see that the proxy container inside the pod is getting OOMKilled:

  proxy:                                
    Container ID:  cri-o://06896fb7b4e996d4059d0201c8074bd094ccdba62e46d6d1788f0c3228e3b7be
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:5af0a2071c9779111b66be4a3c5e593a8192e1924795afc54bd06db8afb58722                                                                 
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:0dc5aff589aff9997339aa0d5021ffe5e781919490375987dfc1f59a703e3042                                                                 
    Ports:         60000/TCP, 60001/TCP        
    Host Ports:    0/TCP, 0/TCP     
    Args:                    
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key               
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}                                                                        
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}                                                                                    
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}                                                                    
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}                                               
      --auth-admin-role=admin_reader
      --auth-default-role=project_user             
    State:          Running
      Started:      Wed, 28 Oct 2020 19:02:04 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 28 Oct 2020 18:56:25 +0200                                                                                                                                                              
      Finished:     Wed, 28 Oct 2020 18:56:58 +0200
    Ready:          True                                                                                                                                                                                         
    Restart Count:  26
    Limits:                                                 
      memory:  64Mi                                                      
    Requests:                                                                                                                                                                                                    
      cpu:     100m                                                                                                                                                                                              
      memory:  64Mi

Is there a knob I can tweak specifically for the proxy container?
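
For reference, the same check can be done against the pod's JSON instead of reading the describe output by eye. The sketch below is an illustration only: the helper names are invented, and the sample pod is hand-written to mirror the proxy values shown above (the `elasticsearch` entry in it is a placeholder, since its status is not shown in this report). It relies on the standard `status.containerStatuses[].lastState.terminated` fields of a Pod object. Exit code 137 is 128 + 9, i.e. the container was ended by SIGKILL, which is what an OOM kill looks like.

```python
# Sketch: find containers whose last termination was an OOM kill.
# oom_killed_containers() and signal_from_exit_code() are hypothetical names.

SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {
                # Placeholder entry; not taken from the report.
                "name": "elasticsearch",
                "restartCount": 0,
                "lastState": {},
            },
            {
                # Values copied from the describe output above.
                "name": "proxy",
                "restartCount": 26,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
        ]
    }
}


def oom_killed_containers(pod):
    """Return (name, exit_code, restart_count) for OOM-killed containers."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            hits.append((cs["name"],
                         terminated.get("exitCode"),
                         cs.get("restartCount", 0)))
    return hits


def signal_from_exit_code(code):
    """Exit codes above 128 conventionally mean 'killed by signal code-128'."""
    return code - 128 if code > 128 else None


if __name__ == "__main__":
    for name, code, restarts in oom_killed_containers(SAMPLE_POD):
        print(name, code, signal_from_exit_code(code), restarts)
```

To run it against a live pod, the dict could be loaded with `json.load` from the output of `oc get pod <pod-name> -n openshift-logging -o json`.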

Comment 2 Periklis Tsirakidis 2020-11-09 10:35:10 UTC
@Gurenko

This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880960

*** This bug has been marked as a duplicate of bug 1880960 ***