Bug 1892362

Summary:	Elasticsearch container fails at 15m mark with TLS error
Product:	OpenShift Container Platform	Reporter:	Gurenko Alex <agurenko>
Component:	Logging	Assignee:	Jeff Cantrill <jcantril>
Status:	CLOSED DUPLICATE	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.6	CC:	aos-bugs, periklis
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	logging-exploration
Last Closed:	2020-11-09 10:35:10 UTC	Type:	Bug

Description Gurenko Alex 2020-10-28 14:45:32 UTC

Description of problem: The elasticsearch-cdm pods go into the CrashLoopBackOff state after about 15 minutes.


Version-Release number of selected component (if applicable):
Fresh OCP 4.6 bare metal deployment


How reproducible: 2/2


Steps to Reproduce:
1. Deploy Cluster Logging according to docs
2. Wait 15 minutes

Actual results:

[kni@ocp-edge12 logging-template]$ oc get pods
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-55f44674f7-lgn95       1/1     Running            0          25m
curator-1603895400-h58bm                        0/1     Completed          0          13m
elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-2-8668f7889b-d2gzk   1/2     CrashLoopBackOff   7          24m
elasticsearch-cdm-2ugld2co-3-577c957945-87s55   1/2     CrashLoopBackOff   7          24m
elasticsearch-delete-app-1603895400-jc2r9       0/1     Completed          0          13m
elasticsearch-delete-audit-1603895400-5bx7w     0/1     Completed          0          13m
elasticsearch-delete-infra-1603895400-nb422     0/1     Completed          0          13m
elasticsearch-rollover-app-1603895400-4bdp9     0/1     Completed          0          13m
elasticsearch-rollover-audit-1603895400-lrwf4   0/1     Completed          0          13m
elasticsearch-rollover-infra-1603895400-gg6bm   0/1     Completed          0          13m
fluentd-6wlkp                                   1/1     Running            0          24m
fluentd-99mnw                                   1/1     Running            0          24m
fluentd-9z4hn                                   1/1     Running            0          24m
fluentd-fclhq                                   1/1     Running            0          24m
fluentd-kvxjj                                   1/1     Running            0          24m
fluentd-lltth                                   1/1     Running            0          24m
fluentd-mgnq9                                   1/1     Running            0          24m
fluentd-nm8kk                                   1/1     Running            0          24m
fluentd-nrshj                                   1/1     Running            0          24m
fluentd-pkvdg                                   1/1     Running            0          24m
fluentd-zgw5g                                   1/1     Running            0          24m
kibana-56c4d996c7-vkphq                         2/2     Running            0          24m


Expected results:

The pods stay in the Running state.


Additional info:

In two consecutive deployments, Cluster Logging installed and started running as expected, but the Elasticsearch pods then went into the CrashLoopBackOff state: after an unmeasured amount of time in the first deployment, and at the 15-minute mark in the second.

[kni@ocp-edge12 logging-template]$ oc logs elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd elasticsearch
Error from server: Get "https://10.46.57.23:10250/containerLogs/openshift-logging/elasticsearch-cdm-2ugld2co-1-659f44c68c-qrggd/elasticsearch": remote error: tls: internal error
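
When `oc logs` fails like this, the pod status can still show why a container restarted. The following is a minimal sketch, not part of the original report: it assumes the output of `oc get pod <name> -o json` has been loaded into a dict, and the `SAMPLE` data and the helper name `last_terminations` are hypothetical, trimmed for illustration.

```python
import json


def last_terminations(pod: dict) -> list:
    """Return (container, reason, exit_code) for each container whose
    previous run was terminated, based on status.containerStatuses."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            results.append(
                (
                    status["name"],
                    terminated.get("reason", ""),
                    terminated.get("exitCode", -1),
                )
            )
    return results


# Hypothetical, trimmed stand-in for `oc get pod <name> -o json` output.
SAMPLE = json.loads(
    """
    {
      "status": {
        "containerStatuses": [
          {"name": "elasticsearch", "lastState": {}},
          {"name": "proxy",
           "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
        ]
      }
    }
    """
)

if __name__ == "__main__":
    for name, reason, code in last_terminations(SAMPLE):
        print(f"{name}: {reason} (exit code {code})")
```

A container that was never restarted has an empty `lastState`, so it is skipped rather than reported.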

Comment 1 Gurenko Alex 2020-10-28 17:15:39 UTC

I see that the proxy container inside the pod is getting OOMKilled:

  proxy:
    Container ID:  cri-o://06896fb7b4e996d4059d0201c8074bd094ccdba62e46d6d1788f0c3228e3b7be
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:5af0a2071c9779111b66be4a3c5e593a8192e1924795afc54bd06db8afb58722
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:0dc5aff589aff9997339aa0d5021ffe5e781919490375987dfc1f59a703e3042
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Wed, 28 Oct 2020 19:02:04 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 28 Oct 2020 18:56:25 +0200
      Finished:     Wed, 28 Oct 2020 18:56:58 +0200
    Ready:          True
    Restart Count:  26
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi

Is there a knob I can tweak specifically for the proxy container?
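
One possible knob, sketched here under stated assumptions rather than confirmed anywhere in this bug: the ClusterLogging custom resource is assumed to expose the proxy container's resources under `spec.logStore.elasticsearch.proxy.resources`. The helper name `proxy_memory_patch` and the `256Mi` value are hypothetical and only illustrate the shape of a JSON merge patch for that field.

```python
import json


def proxy_memory_patch(memory: str) -> dict:
    """Build a JSON merge patch that sets the proxy memory request and limit.

    Assumption: the ClusterLogging CR accepts proxy resources at
    spec.logStore.elasticsearch.proxy.resources.
    """
    resources = {
        "limits": {"memory": memory},
        "requests": {"memory": memory},
    }
    return {
        "spec": {
            "logStore": {
                "elasticsearch": {
                    "proxy": {"resources": resources},
                }
            }
        }
    }


if __name__ == "__main__":
    # 256Mi is an illustrative value, not a recommendation taken from this bug.
    print(json.dumps(proxy_memory_patch("256Mi")))
```

If the assumption holds, the printed JSON could be supplied to `oc patch` with `--type merge` against the ClusterLogging resource in the openshift-logging namespace. Whether the operator applies the change to the running Elasticsearch deployments should be verified on the cluster.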

Comment 2 Periklis Tsirakidis 2020-11-09 10:35:10 UTC

@Gurenko

This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1880960

*** This bug has been marked as a duplicate of bug 1880960 ***