Bug 1746542

Summary: resources spec doesn't get updated by the cluster logging operator
Product: OpenShift Container Platform
Component: Logging
Version: 4.1.0
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Reporter: Jeff Cantrill <jcantril>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: anli, aos-bugs, christoph.obexer, rmeggins, rspazzol, stwalter
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Doc Type: If docs needed, set a value
Clone Of: 1732936
Last Closed: 2020-01-20 18:16:28 UTC
Bug Depends On: 1732936

Comment 1 Jeff Cantrill 2019-08-28 19:26:33 UTC
*** Bug 1730908 has been marked as a duplicate of this bug. ***

Comment 2 Vladimir Masarik 2019-08-29 19:57:05 UTC
After deploying logging with 3 ES nodes using `WORKDIR=/home/vmasarik/.openshift TEST_LOGGING=false BUILD_IMAGES=true PUSH_IMAGES=true DEPLOY_CLUSTER=false DEPLOY_LOGGING=true USE_OLM=false USE_CUSTOM_IMAGES=true hack/get-cluster-run-tests.sh`, there seems to be one pod whose name hash is shorter than the others', for example:
```
elasticsearch-cdm-wqyx76al-1-5dd95865f-8bt6r    2/2     Running   0          12m
elasticsearch-cdm-wqyx76al-2-5f748cf546-jv6r4   2/2     Running   0          74s
elasticsearch-cdm-wqyx76al-3-79fcb65bd9-s8jxg   2/2     Running   0          14m
```
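
Just as a note on the shorter suffix: the middle part of these names is the pod-template-hash that the ReplicaSet controller stamps on its pods, and its printed length can legitimately differ between ReplicaSets. A minimal way to confirm which ReplicaSet (and so which template revision) each pod belongs to, assuming the `cluster-name=elasticsearch` label that shows up in the describe output further down:
```
# ReplicaSets behind the ES deployments, with their labels (including pod-template-hash);
# the label selector is an assumption based on the pod labels shown in the describe below.
oc -n openshift-logging get rs -l cluster-name=elasticsearch --show-labels

# ...and the Deployments that own them.
oc -n openshift-logging get deployment -l cluster-name=elasticsearch
```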
In this instance I was able to reproduce the problem of resources not being recreated. Here is my command history:
```
12951  oc describe po elasticsearch-cdm-k8anfnxv-2-9bd4848d-2cv5j
Name:               elasticsearch-cdm-k8anfnxv-2-9bd4848d-2cv5j
Namespace:          openshift-logging
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-164-205.ec2.internal/10.0.164.205
Start Time:         Thu, 29 Aug 2019 15:55:42 +0200
Labels:             cluster-name=elasticsearch
                    component=elasticsearch
                    es-node-client=true
                    es-node-data=true
                    es-node-master=true
                    node-name=elasticsearch-cdm-k8anfnxv-2
                    pod-template-hash=9bd4848d
                    tuned.openshift.io/elasticsearch=true
Annotations:        openshift.io/scc: restricted
Status:             Running
IP:                 10.131.0.28
Controlled By:      ReplicaSet/elasticsearch-cdm-k8anfnxv-2-9bd4848d
Containers:
  elasticsearch:
    Container ID:   cri-o://e8fe27929e72b92d32c9bdbdfeb753a9239b5820187f8581185a684ebaafb309
    Image:          image-registry.openshift-image-registry.svc:5000/openshift/origin-logging-elasticsearch5:latest
    Image ID:       image-registry.openshift-image-registry.svc:5000/openshift/origin-logging-elasticsearch5@sha256:54a5df9a07f9d48e32c30d621e7fce062e4cb905275b2a30dae38700c351d6a0
    Ports:          9300/TCP, 9200/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 29 Aug 2019 15:55:55 +0200
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  4Gi
    Requests:
      cpu:      500m
      memory:   4Gi
    Readiness:  exec [/usr/share/elasticsearch/probe/readiness.sh] delay=10s timeout=30s period=5s #success=1 #failure=3
    Environment:
      DC_NAME:                  elasticsearch-cdm-k8anfnxv-2
      NAMESPACE:                openshift-logging (v1:metadata.namespace)
      KUBERNETES_TRUST_CERT:    true
      SERVICE_DNS:              elasticsearch-cluster
      CLUSTER_NAME:             elasticsearch
      INSTANCE_RAM:             4Gi
      HEAP_DUMP_LOCATION:       /elasticsearch/persistent/heapdump.hprof
      RECOVER_AFTER_TIME:       5m
      READINESS_PROBE_TIMEOUT:  30
      POD_LABEL:                cluster=elasticsearch
      IS_MASTER:                true
      HAS_DATA:                 true
    Mounts:
      /elasticsearch/persistent from elasticsearch-storage (rw)
      /etc/openshift/elasticsearch/secret from certificates (rw)
      /usr/share/java/elasticsearch/config from elasticsearch-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-n2vpt (ro)
  proxy:
    Container ID:  cri-o://7948624af558ebe0d6b01503107b203bbae5cd8211bbfa660b3438820720922b
    Image:         registry.redhat.io/openshift4/ose-oauth-proxy:v4.1.13-201908210601
    Image ID:      registry.redhat.io/openshift4/ose-oauth-proxy@sha256:434359f9a7e6aad4a049ffb862fbad5df2bda41c9ffe2fd1e82a49ad28df02f5
    Port:          60000/TCP
    Host Port:     0/TCP
    Args:
      --https-address=:60000
      --provider=openshift
      --upstream=https://127.0.0.1:9200
      --tls-cert=/etc/proxy/secrets/tls.crt
      --tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --openshift-service-account=elasticsearch
      -openshift-sar={"resource": "namespaces", "verb": "get"}
      -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
      --pass-user-bearer-token
      --cookie-secret=ija/R1W4NfluNTIuD+OqDA==
    State:          Running
      Started:      Thu, 29 Aug 2019 15:55:55 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-n2vpt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch
    Optional:    false
  elasticsearch-metrics:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-metrics
    Optional:    false
  elasticsearch-token-n2vpt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-token-n2vpt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                   Message
  ----     ------     ----                   ----                                   -------
  Normal   Scheduled  15m                    default-scheduler                      Successfully assigned openshift-logging/elasticsearch-cdm-k8anfnxv-2-9bd4848d-2cv5j to ip-10-0-164-205.ec2.internal
  Normal   Pulling    15m                    kubelet, ip-10-0-164-205.ec2.internal  Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/origin-logging-elasticsearch5:latest"
  Normal   Pulled     15m                    kubelet, ip-10-0-164-205.ec2.internal  Successfully pulled image "image-registry.openshift-image-registry.svc:5000/openshift/origin-logging-elasticsearch5:latest"
  Normal   Created    15m                    kubelet, ip-10-0-164-205.ec2.internal  Created container elasticsearch
  Normal   Started    15m                    kubelet, ip-10-0-164-205.ec2.internal  Started container elasticsearch
  Normal   Pulled     15m                    kubelet, ip-10-0-164-205.ec2.internal  Container image "registry.redhat.io/openshift4/ose-oauth-proxy:v4.1.13-201908210601" already present on machine
  Normal   Created    15m                    kubelet, ip-10-0-164-205.ec2.internal  Created container proxy
  Normal   Started    15m                    kubelet, ip-10-0-164-205.ec2.internal  Started container proxy
  Warning  Unhealthy  10m (x34 over 14m)     kubelet, ip-10-0-164-205.ec2.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
  Warning  Unhealthy  4m57s (x2 over 6m27s)  kubelet, ip-10-0-164-205.ec2.internal  Readiness probe errored: rpc error: code = Unknown desc = command error: command timed out, stdout: , stderr: , exit code -1
  Warning  Unhealthy  22s (x12 over 15m)     kubelet, ip-10-0-164-205.ec2.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
12953  oc delete po elasticsearch-cdm-k8anfnxv-2-9bd4848d-2cv5j

# Pod was redeployed but has short hash as well.

12958  oc delete rs elasticsearch-cdm-k8anfnxv-2-9bd4848d

# RS is not recreated
```
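
Given that the ReplicaSet was not recreated after deleting it, a couple of things might be worth checking next. This is only a sketch, using the names from the history above; I believe the elasticsearch-operator pauses the ES Deployments and drives rollouts itself, which may be relevant here:
```
# Does the Deployment that should recreate the deleted ReplicaSet still exist,
# and is it paused? (name taken from the command history above)
oc -n openshift-logging get deployment elasticsearch-cdm-k8anfnxv-2 \
  -o jsonpath='paused={.spec.paused}{"\n"}'

# What owns that Deployment? It should point back at the Elasticsearch CR
# managed by the elasticsearch-operator.
oc -n openshift-logging get deployment elasticsearch-cdm-k8anfnxv-2 \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'
```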

ESO logs:
```
time="2019-08-29T13:56:27Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-29T14:00:35Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 2 shards in preparation for cluster restart"
time="2019-08-29T14:01:54Z" level=info msg="Waiting for cluster to complete recovery: red / green"
time="2019-08-29T14:01:54Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-k8anfnxv-2: red / green"
time="2019-08-29T14:04:02Z" level=info msg="Waiting for cluster to complete recovery:  / green"
time="2019-08-29T14:04:48Z" level=info msg="Waiting for cluster to complete recovery:  / green"
time="2019-08-29T14:05:35Z" level=info msg="Waiting for cluster to complete recovery:  / green"
time="2019-08-29T14:10:44Z" level=info msg="Waiting for cluster to complete recovery:  / green"
time="2019-08-29T14:10:55Z" level=warning msg="Unknown type for low: <nil>"
time="2019-08-29T14:10:55Z" level=warning msg="Unknown type for high: <nil>"
time="2019-08-29T14:11:20Z" level=info msg="Waiting for cluster to complete recovery: red / green"
time="2019-08-29T14:11:39Z" level=info msg="Waiting for cluster to complete recovery: red / green"
time="2019-08-29T14:12:07Z" level=info msg="Waiting for cluster to complete recovery: red / green"


CLO logs:
```
time="2019-08-29T13:56:32Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-08-29T13:56:38Z" level=info msg="Elasticsearch image change found, updating elasticsearch"
time="2019-08-29T13:56:40Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-08-29T13:56:42Z" level=info msg="Kibana image(s) change found, updating \"kibana\""
time="2019-08-29T13:56:43Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-08-29T13:56:45Z" level=info msg="Curator image change found, updating \"curator\""
time="2019-08-29T13:56:50Z" level=info msg="Collector image change found, updating \"rsyslog\""
time="2019-08-29T13:56:51Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T13:57:04Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-08-29T13:57:11Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T13:57:30Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T13:57:43Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-08-29T13:57:50Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T13:58:03Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-08-29T13:58:09Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T13:58:17Z" level=info msg="Updating status of Elasticsearch"
time="2019-08-29T13:59:07Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:00:24Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:00:32Z" level=info msg="Updating status of Elasticsearch"
time="2019-08-29T14:00:44Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:01:03Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:01:23Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:01:42Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:04:05Z" level=info msg="Updating status of Elasticsearch"
time="2019-08-29T14:04:43Z" level=info msg="Updating status of Elasticsearch"
time="2019-08-29T14:08:27Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:08:35Z" level=info msg="Updating status of Elasticsearch"
time="2019-08-29T14:08:47Z" level=info msg="Updating status of Rsyslog"
time="2019-08-29T14:10:31Z" level=info msg="Updating status of Elasticsearch"
```
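
The ESO messages above are polling for the ES cluster health to return to green. If useful, that status can be queried directly from inside any ES container; this is only a sketch and assumes curl is available in the image and that the mounted `elasticsearch` secret contains admin-ca/admin-cert/admin-key (admin-ca at least is visible in the proxy args above):
```
# Query cluster health from inside an ES container; replace <es-pod> with any ES pod name.
# The admin-cert/admin-key file names are an assumption.
oc -n openshift-logging exec <es-pod> -c elasticsearch -- \
  curl -s \
    --cacert /etc/openshift/elasticsearch/secret/admin-ca \
    --cert /etc/openshift/elasticsearch/secret/admin-cert \
    --key /etc/openshift/elasticsearch/secret/admin-key \
    "https://localhost:9200/_cluster/health?pretty"
```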


I deleted the short-hash pod and it was recreated, but when I deleted any RS or Deployment of ES, none of them were recreated. Since I had done nothing but deploy logging and delete the short-hash pod and some RS/Deployments, I thought I could reproduce it with the CLO log level set to debug.

I redeployed logging and changed the log level. I again had 3 ES pods, one with a short-hash name. I deleted a pod with a long-hash name and it was recreated. There were no interesting entries in the CLO logs, just the 5 startup lines that are always printed (version and such). *SIGH* Now that I think about it, I never deleted any RS or Deployment, so there goes my debugging, and it makes all of this information fairly useless, sorry.
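
For whoever picks this up: a minimal way to exercise the symptom in the bug summary (resource spec changes not propagated) might look like the sketch below. The patch path follows the ClusterLogging CR schema and the 8Gi value is just an example, so treat this as an untested sketch rather than a verified reproducer:
```
# Bump the ES memory request/limit in the ClusterLogging CR (example value only)...
oc -n openshift-logging patch clusterlogging instance --type merge \
  -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"requests":{"memory":"8Gi"},"limits":{"memory":"8Gi"}}}}}}'

# ...then check whether the change reached the ES pod templates
# (label selector is an assumption based on the pod labels above).
oc -n openshift-logging get deployment -l cluster-name=elasticsearch \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}'
```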


A few other things I noticed: an ES pod reaches `Running 2/2` and is then immediately terminated without any error. Also, sometimes after starting the cluster, all the pods are terminated one by one and replaced, as shown below. In that case there is no ES pod with a short-hash name, and if I delete one it is properly replaced by a new pod.
```
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-84fbcff567-ghbkr       1/1     Running   0          4m24s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   1/2     Running   0          5m44s
elasticsearch-cdm-uirubjv6-2-cdfddc54f-svv4p    1/2     Running   0          4m42s
elasticsearch-cdm-uirubjv6-3-68b5c7857b-5l4pc   1/2     Running   0          2m7s
kibana-59d58d4b49-6w8f4                         2/2     Running   0          4m11s
rsyslog-bwjvf                                   1/1     Running   0          3m11s
rsyslog-c862w                                   1/1     Running   0          3m48s
rsyslog-gc4nh                                   1/1     Running   0          3m27s
rsyslog-s5vlz                                   1/1     Running   0          3m16s
rsyslog-sn9g6                                   1/1     Running   0          3m31s
rsyslog-sx6xj                                   1/1     Running   0          3m38s
elasticsearch-cdm-uirubjv6-2-cdfddc54f-svv4p   2/2   Running   0     4m55s
elasticsearch-cdm-uirubjv6-3-68b5c7857b-5l4pc   2/2   Running   0     2m20s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   2/2   Running   0     6m
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   2/2   Terminating   0     6m17s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   0/2   Terminating   0     6m18s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   0/2   Terminating   0     6m18s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   0/2   Terminating   0     6m26s
elasticsearch-cdm-uirubjv6-1-59ddb497c5-jwv9l   0/2   Terminating   0     6m26s
elasticsearch-cdm-uirubjv6-1-68948bb74d-whx4q   0/2   Pending   0     0s
elasticsearch-cdm-uirubjv6-1-68948bb74d-whx4q   0/2   Pending   0     0s
elasticsearch-cdm-uirubjv6-1-68948bb74d-whx4q   0/2   ContainerCreating   0     0s
elasticsearch-cdm-uirubjv6-1-68948bb74d-whx4q   1/2   Running   0     27s
rsyslog-s5vlz   0/1   OOMKilled   0     4m27s
rsyslog-s5vlz   1/1   Running   1     4m28s
```
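
For completeness, the trailing lines above look like watch output appended to the initial listing; something along these lines should reproduce that view and confirm why the collector restarted (a sketch, not the exact commands used):
```
# Watch the namespace to catch the one-by-one termination/replacement of the ES pods
oc -n openshift-logging get pods -w

# Confirm why rsyslog-s5vlz restarted (pod name taken from the listing above)
oc -n openshift-logging get pod rsyslog-s5vlz \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```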