Bug 1886856

Summary: The ES proxy container always restarts.
Product: OpenShift Container Platform
Component: Logging
Version: 4.6
Target Release: 4.7.0
Reporter: Periklis Tsirakidis <periklis>
Assignee: Periklis Tsirakidis <periklis>
QA Contact: Qiaoling Tang <qitang>
Docs Contact:
CC: aos-bugs
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-exploration
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Type: ---
Bug Blocks: 1880960    

Description Periklis Tsirakidis 2020-10-09 14:28:21 UTC
This bug was initially created as a copy of Bug #1880960

I am copying this bug because: 



Description of problem:

The proxy container restarts frequently:
$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-69c9d6f5bc-njb8q       1/1     Running     0          135m
elasticsearch-cdm-soj6vq70-1-688ddf8c9f-87vt9   2/2     Running     12         135m
elasticsearch-cdm-soj6vq70-2-64876b6bd-sld68    2/2     Running     12         140m
elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww   2/2     Running     12         146m
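
For reference, the last termination reason of each proxy container can be pulled in one shot (a sketch; the -l component=elasticsearch selector is an assumption about how the operator labels the ES pods):
$ oc get pods -n openshift-logging -l component=elasticsearch \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[?(@.name=="proxy")].lastState.terminated.reason}{"\n"}{end}'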

$ oc describe pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww 
  proxy:
    Container ID:  cri-o://9da3f8434af0659f409ba6690dfa1d6b3753cf10c8499e6a6b462381380a22b5
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:1d0e41e6c7ffe968149d6b9b90b56081af15a3370fe3de607a84b3c20677dbc9
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 21 Sep 2020 05:26:57 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 21 Sep 2020 05:15:06 -0400
      Finished:     Mon, 21 Sep 2020 05:26:56 -0400
    Ready:          True
    Restart Count:  12
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-99brw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  elasticsearch-elasticsearch-cdm-soj6vq70-3
    ReadOnly:   false
  certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch
    Optional:    false
  elasticsearch-metrics:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-metrics
    Optional:    false
  elasticsearch-token-99brw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-token-99brw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                                                           Message
  ----     ------                  ----                  ----                                                           -------
  Warning  FailedScheduling        144m                                                                                 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling        144m                                                                                 0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Normal   Scheduled               140m                                                                                 Successfully assigned openshift-logging/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww to qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal
  Normal   SuccessfulAttachVolume  140m                  attachdetach-controller                                        AttachVolume.Attach succeeded for volume "pvc-73b52ca6-bf96-4a3a-a12c-15275b34267c"
  Normal   AddedInterface          140m                  multus                                                         Add eth0 [10.129.2.5/23]
  Normal   Pulled                  140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Container image "registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:d083829ae9a4777f4f070acdd64298e1514e8b7895019186af22f8893656e475" already present on machine
  Normal   Created                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Created container elasticsearch
  Normal   Started                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Started container elasticsearch
  Warning  Unhealthy               140m (x3 over 140m)   kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
  Warning  Unhealthy               139m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
  Warning  BackOff                 24m (x4 over 116m)    kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Back-off restarting failed container
  Normal   Pulled                  115s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Container image "registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f" already present on machine
  Normal   Created                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Created container proxy
  Normal   Started                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Started container proxy 


No error messages in the proxy container logs:
$ oc logs -c proxy elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww 
time="2020-09-21T07:32:21Z" level=info msg="mapping path \"/\" => upstream \"https://localhost:9200/\""
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60001"
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60000"

The worker node has enough resources:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        1936m (55%)    1 (28%)
  memory                     10123Mi (72%)  8480Mi (61%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0
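
The snapshot above comes from describing the worker node, e.g. (node name is a placeholder):
$ oc describe node <worker-node-name> | grep -A 10 'Allocated resources:'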

Log store configuration:
    logStore:
      elasticsearch:
        nodeCount: 3
        redundancyPolicy: SingleRedundancy
        resources:
          requests:
            memory: 2Gi
        storage:
          size: 20Gi
          storageClassName: standard
      retentionPolicy:
        application:
          maxAge: 1d
        audit:
          maxAge: 2w
        infra:
          maxAge: 12h
      type: elasticsearch
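
As a possible workaround until the default is fixed, the proxy memory could be raised explicitly in the ClusterLogging CR; a sketch, assuming the logStore.elasticsearch spec accepts a proxy.resources stanza (the field name is an assumption):
    logStore:
      elasticsearch:
        proxy:
          resources:
            limits:
              memory: 256Mi
            requests:
              memory: 256Mi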

Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202009192030.p0

How reproducible:
It's 100% reproducible in some clusters; other clusters don't hit the issue at all.

Steps to Reproduce:
1. Deploy logging 4.6
2. Check the ES pods

Actual results:
The proxy container is repeatedly OOMKilled (exit code 137) and restarts because it runs into its 64Mi memory limit.

Expected results:
The proxy container does not restart; its default memory limit is large enough for normal operation.

Additional info:

Comment 2 Qiaoling Tang 2020-10-12 01:31:47 UTC
The fix doesn't work; the default memory request of the proxy container is still `64Mi`.
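A quick spot-check of the value the pods actually got (a sketch; the label selector is an assumption):
$ oc get pods -n openshift-logging -l component=elasticsearch \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[?(@.name=="proxy")].resources.requests.memory}{"\n"}{end}'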

$ oc get cl instance -oyaml
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 2w
      infra:
        maxAge: 3h
    type: elasticsearch

$ oc get es elasticsearch -oyaml
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi

Image: quay.io/openshift/origin-elasticsearch-operator@sha256:1641e7698155161b3fadc3079c0d4abdbdd7107cc89a4642bb8e58309b8d9700

Comment 3 Qiaoling Tang 2020-10-12 05:32:20 UTC
I manually set the memory to 256Mi; the proxy container still restarted after running for about 3 hours.
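
For reference, one way to set this by hand is to patch the Elasticsearch CR's nodeSpec.proxyResources; a sketch, assuming nodeSpec sits under .spec as in the `oc get es` output above, and noting the operator may reconcile the value back:
$ oc -n openshift-logging patch elasticsearch elasticsearch --type=merge \
    -p '{"spec":{"nodeSpec":{"proxyResources":{"limits":{"memory":"256Mi"},"requests":{"cpu":"100m","memory":"256Mi"}}}}}'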

  proxy:
    Container ID:  cri-o://90af77f73f240cfe3a1be7e3a840671a9cd8da932909aade141681574a9dfe89
    Image:         quay.io/openshift/origin-elasticsearch-proxy:4.7.0
    Image ID:      quay.io/openshift/origin-elasticsearch-proxy@sha256:f8aeb4d918dc4a6d77641af58586fc39ec339c1f95b167d1f62aec5164e5da1f
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 12 Oct 2020 01:05:56 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 11 Oct 2020 21:52:49 -0400
      Finished:     Mon, 12 Oct 2020 01:05:55 -0400
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  256Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-rfc2v (ro)

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-6b87bdfcd-2nzfc        1/1     Running     0          4h7m
elasticsearch-cdm-pwwtd9g6-1-564d87bdfd-9m7zh   2/2     Running     1          3h39m
elasticsearch-cdm-pwwtd9g6-2-98c975c59-2ptpp    2/2     Running     1          3h38m
elasticsearch-cdm-pwwtd9g6-3-599cdf5df6-9wcf9   2/2     Running     1          3h36m

Comment 5 Qiaoling Tang 2020-11-04 00:05:21 UTC
The ES pods have been running for 13 hours, and the proxy containers didn't restart.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-54d478856f-h2fnv       1/1     Running     0          13h
elasticsearch-cdm-djppeosj-1-5c9f6ff64-t7r8z    2/2     Running     0          13h
elasticsearch-cdm-djppeosj-2-7d75b56bff-xkbsg   2/2     Running     0          13h
elasticsearch-cdm-djppeosj-3-7f4f86bd86-s5t8r   2/2     Running     0          13h
elasticsearch-delete-app-1604448000-9wxpr       0/1     Completed   0          2m26s
elasticsearch-delete-audit-1604448000-crt2l     0/1     Completed   0          2m26s
elasticsearch-delete-infra-1604448000-xp2v2     0/1     Completed   0          2m26s

$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES   PHASE
clusterlogging.4.7.0-202011021919.p0           Cluster Logging          4.7.0-202011021919.p0              Succeeded
elasticsearch-operator.4.7.0-202011030448.p0   Elasticsearch Operator   4.7.0-202011030448.p0              Succeeded


The default memory request of the proxy container is now 256Mi.

    managementState: Managed
    nodeSpec:
      proxyResources:
        limits:
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 256Mi
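
The default can also be read back in one line (a sketch, assuming nodeSpec sits under .spec as in the YAML above):
$ oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.proxyResources.requests.memory}'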

So verified this bz.

Comment 10 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652