Bug 1827032

Summary: ES pod couldn't start after upgrade from 4.4 to 4.5 when the CLO is upgraded first
Product: OpenShift Container Platform Reporter: Qiaoling Tang <qitang>
Component: Logging Assignee: IgorKarpukhin <ikarpukh>
Status: CLOSED ERRATA QA Contact: Qiaoling Tang <qitang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.5 CC: aos-bugs, jcantril, periklis
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Not enough resources in the cluster
Consequence: EO couldn't start
Fix: Set `resources.requests.memory` to 1.5 Gi
Result: EO starts normally
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:30:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1809511, 1882495    
Attachments:
Description Flags
CLO, EO and ES pod logs none

Description Qiaoling Tang 2020-04-23 05:37:40 UTC
Created attachment 1681014 [details]
CLO, EO and ES pod logs

Description of problem:
The ES pod is stuck in CrashLoopBackOff status after upgrading from 4.4 to 4.5 when the CLO is upgraded first.

$ oc get pod
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-7cbb957c67-5ztpd       1/1     Running            0          66m
curator-1587616200-jkz4q                        1/1     Running            0          58m
curator-1587618000-5rfxf                        1/1     Running            0          28m
elasticsearch-cdm-b1w2fana-1-57bc77596f-958hz   1/2     CrashLoopBackOff   17         64m
elasticsearch-delete-app-1587618900-qsv8m       0/1     Error              0          12m
elasticsearch-delete-audit-1587618900-zs6qj     0/1     Error              0          12m
elasticsearch-delete-infra-1587618900-b8rcv     0/1     Error              0          12m
elasticsearch-rollover-app-1587618900-bxbhv     0/1     Error              0          12m
elasticsearch-rollover-audit-1587618900-rpnfv   0/1     Error              0          12m
elasticsearch-rollover-infra-1587618900-lx492   0/1     Error              0          12m
fluentd-5htfq                                   1/1     Running            0          64m
fluentd-bd9lg                                   1/1     Running            0          65m
fluentd-ghwxz                                   1/1     Running            0          65m
fluentd-jzxxb                                   1/1     Running            0          64m
fluentd-kngdj                                   1/1     Running            0          65m
fluentd-vz4nl                                   1/1     Running            0          64m
kibana-56677488fd-w4qql                         2/2     Running            1          65m


$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                                    PHASE
clusterlogging.v4.5.0           Cluster Logging          4.5.0     clusterlogging.4.4.0-202004211517           Succeeded
elasticsearch-operator.v4.5.0   Elasticsearch Operator   4.5.0     elasticsearch-operator.4.4.0-202004211517   Succeeded


Version-Release number of selected component (if applicable):
Logging images are taken from payload 4.5.0-0.ci-2020-04-22-212726; the manifests are copied from the GitHub repo with the latest code.

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 4.4.
2. Upgrade to 4.5 (one way to drive this via OLM is sketched below).
3. Check the ES pod status.
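
For reference, a minimal sketch of one way step 2 can be driven via OLM channels. The reporter actually used manifests copied from the GitHub repo, so the subscription names, namespaces, and channel name below are only assumptions:

# Switch the Cluster Logging Operator subscription to the new channel first
$ oc patch subscription cluster-logging -n openshift-logging --type merge -p '{"spec":{"channel":"4.5"}}'

# Then switch the Elasticsearch Operator subscription
$ oc patch subscription elasticsearch-operator -n openshift-operators-redhat --type merge -p '{"spec":{"channel":"4.5"}}'

# Wait until both CSVs report Succeeded
$ oc get csv -n openshift-logging -w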

Actual results:
The ES pod is stuck in CrashLoopBackOff after the upgrade.

Expected results:
The ES pod starts and runs normally after the upgrade.

Additional info:

Comment 1 IgorKarpukhin 2020-04-24 16:03:17 UTC
Anping, when you test this upgrade, please add the following section under `spec` in the CLO's CR:
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi

So, 2Gi instead of 4Gi.

I managed to finish the upgrade by editing the CLO's CR resources and removing the ES deployment. Could you please try to do that?
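
For anyone trying the same workaround, a rough sketch with oc. The patch path assumes the ES resources live under spec.logStore.elasticsearch (as in the CR shown in the next comment), and the deployment name is inferred from the pod list in the description; both are illustrative:

# Lower the ES memory request/limit in the ClusterLogging CR
$ oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"1Gi"}}}}}}'

# Delete the ES deployment so the EO recreates it with the new resources
$ oc delete deployment elasticsearch-cdm-b1w2fana-1 -n openshift-logging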

Comment 2 Qiaoling Tang 2020-04-26 02:57:49 UTC
@IgorKarpukhin,

When I deployed logging 4.4, my clusterlogging CR was:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 1
      redundancyPolicy: "ZeroRedundancy"
      resources:
        requests:
          cpu: 1
          memory: "4Gi"
      storage:
        storageClassName: "gp2"
        size: "20Gi"
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "*/30 * * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}

> I managed to finish the upgrade by editing the CLO's CR resources and removing the ES deployment. Could you please try to do that?
What should I change in the CLO's CR instance? The Elasticsearch resource configuration? I thought the resource configuration shouldn't affect the upgrade unless the resources are not enough for the pod to start. Please correct me if I'm wrong.

I deleted the ES deployment and waited for the EO to create a new one; after that, the ES pod could start.

Comment 3 Qiaoling Tang 2020-04-26 03:05:37 UTC
After I deleted the ES deployment, the Kibana pod went into CrashLoopBackOff. I found some error logs in the kibana container:

$ oc logs -c kibana kibana-6f56fc95dc-g78nl 
#The following values dynamically added from environment variable overrides:
Using NODE_OPTIONS: '--max_old_space_size=368' Memory setting is in MB
{"type":"log","@timestamp":"2020-04-26T03:02:58Z","tags":["fatal","root"],"pid":120,"message":"Error: Index .kibana belongs to a version of Kibana that cannot be automatically migrated. Reset it or use the X-Pack upgrade assistant.\n    at assertIsSupportedIndex (/opt/app-root/src/src/server/saved_objects/migrations/core/elastic_index.js:246:15)\n    at Object.fetchInfo (/opt/app-root/src/src/server/saved_objects/migrations/core/elastic_index.js:52:12)"}

 FATAL  Error: Index .kibana belongs to a version of Kibana that cannot be automatically migrated. Reset it or use the X-Pack upgrade assistant.
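
For reference, the saved objects in the .kibana index can be inspected directly to see which Kibana version wrote them. A sketch, assuming the es_util helper that backs the indices script is available in the elasticsearch container:

$ oc exec -c elasticsearch elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -- es_util --query=".kibana/_search?pretty&size=5"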

$ oc exec elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -n openshift-logging' to see all of the containers in this pod.
Sun Apr 26 03:03:24 UTC 2020
health status index                                                          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.qitang.2fa1b0f3-bf11-4419-808b-0fe652635819.2020.04.26 3-c6nKUOTIC5-0kfOTC2eg   1   0        335            0          0              0
green  open   .operations.2020.04.26                                         rku-dZVMQ_O6xgCCD6mUiQ   1   0      30194            0         28             28
green  open   infra-000002                                                   LcIpMVvGTqClb6ewjNGTQg   1   0      34661            0         24             24
green  open   .security                                                      Tv6zJrlbSPmj7Qb0d9A7Vg   1   0          5            0          0              0
green  open   .searchguard                                                   X_FZTPSSQYyQ3JjHhWpo-g   1   0          5            0          0              0
green  open   app-000001                                                     CGbLQOX5Q1mHWaZntYuYeg   1   0        459            0          0              0
green  open   infra-000001                                                   0QyyxL6QSIiMNrJfMt81hw   1   0     151361            0        129            129
green  open   .kibana                                                        dCsGB9d3RkWawJ8WLuJsRA   1   0          1            0          0              0
green  open   audit-000001                                                   XvbY5oxOTFWwX8D8oVUw_g   1   0          0            0          0              0

Comment 4 IgorKarpukhin 2020-04-27 11:13:02 UTC
@Anping,

On the cluster that you gave me last time, I had to modify the CLO's CR because the ES pod couldn't start due to lack of resources:


ES POD:
--------------
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
--------------
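
To confirm this is a plain capacity problem, the allocatable capacity and current requests on the worker nodes can be compared, for example (<worker-node> is a placeholder):

$ oc describe node <worker-node> | grep -A 8 -e "Allocatable:" -e "Allocated resources:"
$ oc adm top nodes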

Comment 5 Qiaoling Tang 2020-05-13 02:09:32 UTC
The ES pods now upgrade to 4.5 successfully, but there are still other issues; I will file new bugs to track them.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-d857bb6db-2fjrr        1/1     Running     0          9m56s
curator-1589333400-7vl4w                        0/1     Completed   0          35m
curator-1589335200-rg5z2                        0/1     Error       0          5m41s
elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k   2/2     Running     0          9m4s
elasticsearch-cdm-m2j2lxw9-2-78787b64cf-vvwvk   2/2     Running     0          6m10s
elasticsearch-cdm-m2j2lxw9-3-74d7bd8db4-n9fbf   2/2     Running     0          4m16s
elasticsearch-delete-app-1589335200-2rjlz       0/1     Error       0          5m41s
elasticsearch-delete-audit-1589335200-xwl2h     0/1     Error       0          5m41s
elasticsearch-delete-infra-1589335200-nk5ls     0/1     Error       0          5m41s
elasticsearch-rollover-app-1589335200-sd5q5     0/1     Error       0          5m41s
elasticsearch-rollover-audit-1589335200-flpbd   0/1     Error       0          5m41s
elasticsearch-rollover-infra-1589335200-lxks9   0/1     Error       0          5m41s
fluentd-8dvgn                                   1/1     Running     0          7m8s
fluentd-q89wn                                   1/1     Running     0          8m34s
fluentd-r57cg                                   1/1     Running     0          6m28s
fluentd-sfvhg                                   1/1     Running     0          9m23s
fluentd-t2bdg                                   1/1     Running     0          7m51s
fluentd-xf9mm                                   1/1     Running     0          5m54s
kibana-6ff5c8d8f-s5tgq                          2/2     Running     0          45m


$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                                    PHASE
clusterlogging.v4.5.0           Cluster Logging          4.5.0     clusterlogging.4.4.0-202005120551           Succeeded
elasticsearch-operator.v4.5.0   Elasticsearch Operator   4.5.0     elasticsearch-operator.4.4.0-202005120551   Succeeded


$ oc exec elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k -n openshift-logging' to see all of the containers in this pod.
Wed May 13 02:07:22 UTC 2020
health status index                                                          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   infra-000001                                                   bUZeHMtHShuRwemjAk067Q   3   1     322228            0        818            397
green  open   audit-000001                                                   u87iPiYlSPWfRrNnoi0CwA   3   1          0            0          0              0
green  open   project.qitang.5682515f-a751-4bb1-a98c-a1ca0b381376.2020.05.13 LmyeqjBQTuOVO6-HEr03YA   3   1       2741            0          3              1
green  open   .operations.2020.05.13                                         Ob9V0iMiTQ6uTnuu5qOSGg   3   1    2124104            0       4594           2288
green  open   .kibana.a5f01f00ae88a880fd91ed1dbace3dff08f5c0b2               nJxuXyS4RiOFmzWmSvzI6Q   1   1          0            0          0              0
green  open   .security                                                      cWRUcXT-TcmIzaPFOiowsw   1   1          5            0          0              0
green  open   .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac               CvSIX5rNRVSCMFEIH9hyeQ   1   1          2            0          0              0
green  open   .searchguard                                                   UC4ukmZvTc61M5HNPLiqyg   1   1          5           52          0              0
green  open   app-000001                                                     iJwFUbNATx-R73r2YFITTg   3   1        233            0          1              0
green  open   .kibana                                                        1LLeP9ZRROKU0WyFaGfQfw   1   1          1            0          0              0

It seems the ES images are set in the EO now; there is no ES image in the CLO. I saw the ES pods restart only once.
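
One way to double-check which image the ES pods actually run (deployment name taken from the pod list above):

$ oc get deployment elasticsearch-cdm-m2j2lxw9-1 -n openshift-logging -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'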

Comment 6 Jeff Cantrill 2020-05-15 20:32:51 UTC
Given the comments in c#5, please mark this BZ verified or identify specifically what is left for this issue.

Comment 7 Qiaoling Tang 2020-05-18 00:48:44 UTC
The ES pod can start now, so moving to VERIFIED.

Comment 9 errata-xmlrpc 2020-07-13 17:30:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409