Bug 1905910 - Elasticsearch cluster can't be fully upgraded after upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
Summary: Elasticsearch cluster can't be fully upgraded after upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-core
Depends On:
Blocks: 1906641
 
Reported: 2020-12-09 10:13 UTC by Qiaoling Tang
Modified: 2021-02-24 11:23 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, because of a bug, the software did not find some certificates and regenerated them. Normally this triggers the Elasticsearch operator to perform a full cluster restart on the Elasticsearch cluster. However, if this happens while the operator is performing a rolling restart, it can cause mismatched certificates. The current release fixes this issue. Now the operator consistently reads and writes certificates to the same working directory and only regenerates the certificates if needed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1905910[*BZ#1905910*])
Clone Of:
Environment:
Last Closed: 2021-02-24 11:22:30 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (7.06 MB, application/gzip), 2020-12-09 10:13 UTC, Qiaoling Tang


Links
GitHub openshift/cluster-logging-operator pull 847 (closed): Bug 1905910: Correctly extract master-certs to workind directory (last updated 2021-02-12 21:42:31 UTC)
Red Hat Product Errata RHBA-2021:0652 (last updated 2021-02-24 11:23:16 UTC)

Description Qiaoling Tang 2020-12-09 10:13:59 UTC
Created attachment 1737865 [details]
must-gather

Description of problem:
Deploy logging 4.7 on an OCP 4.7 cluster, then upgrade logging to a newer 4.7 build. The ES cluster gets stuck in yellow status after one or two ES pods are upgraded, and the upgrade cannot proceed.
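
A quick way to confirm the stuck yellow state from both sides (a sketch; the pod name below is from this reproduction, and the Elasticsearch CR is assumed to be the default one named "elasticsearch"):

$ oc -n openshift-logging exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -c elasticsearch -- es_util --query=_cluster/health?pretty
$ oc -n openshift-logging get elasticsearch elasticsearch -o yaml   # the EO-reported cluster status should appear under .status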

cl/instance:
  spec:
    collection:
      logs:
        fluentd: {}
        type: fluentd
    logStore:
      elasticsearch:
        nodeCount: 3
        redundancyPolicy: SingleRedundancy
        resources:
          requests:
            memory: 2Gi
        storage:
          size: 20Gi
          storageClassName: standard
      retentionPolicy:
        application:
          maxAge: 60h
        audit:
          maxAge: 1d
        infra:
          maxAge: 3h
      type: elasticsearch
    managementState: Managed
    visualization:
      kibana:
        replicas: 1
      type: kibana


$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-7b8fb444cc-59sw6       1/1     Running     0          63m
elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks    2/2     Running     0          62m
elasticsearch-cdm-vmehqabi-2-678465b47b-wvwwx   2/2     Running     0          4h1m
elasticsearch-cdm-vmehqabi-3-d79796b57-ctkfw    2/2     Running     0          4h1m
elasticsearch-delete-app-1607504400-7tp98       0/1     Completed   0          63m
elasticsearch-delete-app-1607508000-4h4mq       0/1     Error       0          3m57s
elasticsearch-delete-audit-1607504400-g4sck     0/1     Completed   0          63m
elasticsearch-delete-audit-1607508000-7h2md     0/1     Error       0          3m57s
elasticsearch-delete-infra-1607504400-sqxgj     0/1     Completed   0          63m
elasticsearch-delete-infra-1607508000-86f2b     0/1     Error       0          3m57s
elasticsearch-rollover-app-1607504400-hs4x6     0/1     Completed   0          63m
elasticsearch-rollover-app-1607508000-677dd     0/1     Error       0          3m57s
elasticsearch-rollover-audit-1607504400-ffnqs   0/1     Completed   0          63m
elasticsearch-rollover-audit-1607508000-87vsj   0/1     Error       0          3m57s
elasticsearch-rollover-infra-1607504400-gnkz6   0/1     Completed   0          63m
elasticsearch-rollover-infra-1607508000-pjk2j   0/1     Error       0          3m57s
fluentd-b8nhc                                   1/1     Running     0          4h1m
fluentd-fkm2q                                   1/1     Running     0          4h1m
fluentd-qs25n                                   1/1     Running     0          4h1m
fluentd-sp4m4                                   1/1     Running     0          4h1m
fluentd-sr78d                                   1/1     Running     0          4h1m
fluentd-szq6d                                   1/1     Running     0          4h1m
kibana-74554f87c6-ln88p                         2/2     Running     0          61m


$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
infra-000004 1 p STARTED    
infra-000004 1 r UNASSIGNED NODE_LEFT
infra-000004 2 p STARTED    
infra-000004 2 r UNASSIGNED NODE_LEFT
infra-000004 0 p STARTED    
infra-000004 0 r STARTED    
infra-000010 1 r STARTED    
infra-000010 1 p STARTED    
infra-000010 2 p STARTED    
infra-000010 2 r UNASSIGNED NODE_LEFT
infra-000010 0 p STARTED    
infra-000010 0 r UNASSIGNED NODE_LEFT
infra-000009 1 p STARTED    
infra-000009 1 r UNASSIGNED NODE_LEFT
infra-000009 2 p STARTED    
infra-000009 2 r UNASSIGNED NODE_LEFT

$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.0.49            64         100  15    1.13    0.97     1.26 mdi       *      elasticsearch-cdm-vmehqabi-2
10.129.2.172           13          66  10    0.35    0.69     0.79 mdi       -      elasticsearch-cdm-vmehqabi-1
10.128.2.24            59         100  15    0.75    0.85     0.72 mdi       -      elasticsearch-cdm-vmehqabi-3

The EO keeps repeating the following error messages:
"error":{"cluster":"elasticsearch","msg":"failed to create index template","namespace":"openshift-logging","response_body":null,

"error":{"cluster":"elasticsearch","msg":"failed to get list of index templates","namespace":"openshift-logging","response_body":null,

More details are in the attached must-gather.

Version-Release number of selected component (if applicable):
upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0   

How reproducible:
100%

Steps to Reproduce:
1. Deploy logging 4.7.
2. Upgrade logging to a new 4.7 version (one way to drive this through OLM is sketched below).
3.
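
For step 2, one way to drive the in-channel upgrade through OLM once the newer bundle is in the catalog (a sketch only; subscription and install plan names vary, and the elasticsearch-operator is typically subscribed in openshift-operators-redhat):

$ oc -n openshift-logging get subscription,installplan
$ oc -n openshift-operators-redhat get subscription,installplan
$ oc -n openshift-logging patch installplan <pending-plan> --type merge -p '{"spec":{"approved":true}}'   # approve the pending plan if approval is set to Manual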

Actual results:


Expected results:


Additional info:

Comment 3 Anping Li 2020-12-17 01:55:16 UTC
Hit this issue too when upgrading the OCP cluster.
Step 1: deploy logging 4.7 on OCP 4.6.
Step 2: upgrade OCP 4.6 -> 4.7.
Result: hit this issue.

Comment 5 Qiaoling Tang 2021-01-11 03:59:37 UTC
Tested several upgrade paths; here are the details:

path 1: upgrade from clusterlogging.4.7.0-202101070834.p0 to clusterlogging.4.7.0-202101092121.p0: succeeded

path 2: upgrade from clusterlogging.4.6.0-202101090741.p0(latest 4.6, but not released yet) to clusterlogging.4.7.0-202101092121.p0: succeeded

path 3: upgrade from clusterlogging.4.6.0-202011221454.p0(latest released 4.6) to clusterlogging.4.7.0-202101092121.p0: failed, same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1906641.

Since we have https://bugzilla.redhat.com/show_bug.cgi?id=1906641 to track the issue in path 3, I am moving this bz to VERIFIED.

Comment 10 errata-xmlrpc 2021-02-24 11:22:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

