Elasticsearch cluster can't be fully upgraded after upgrading from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
* Previously, because of a bug, the software did not find some certificates and regenerated them. Normally, this triggers the Elasticsearch operator to perform a full cluster restart of the Elasticsearch cluster. However, if this happened while the operator was already performing a rolling restart, it could cause mismatched certificates. The current release fixes this issue: the operator now consistently reads and writes certificates to the same working directory and only regenerates them if needed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1905910[*BZ#1905910*])
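A quick way to confirm whether the certificates were regenerated across the upgrade is to compare their fingerprints before and after the operator reconciles (a minimal sketch, assuming the default "elasticsearch" secret in the openshift-logging namespace with an admin-cert key; adjust the names if your deployment differs):
$ oc -n openshift-logging get secret elasticsearch -o jsonpath='{.data.admin-cert}' | base64 -d | openssl x509 -noout -fingerprint -dates
A fingerprint that changes while a rolling restart is still in progress would match the mismatched-certificate condition described above.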
Created attachment 1737865 [details]
must-gather
Description of problem:
Deploy logging 4.7 on a 4.7 cluster, then upgrade logging to a new 4.7 version. The ES cluster gets stuck in yellow status after 1 or 2 ES pods have been upgraded, and the upgrade cannot proceed.
cl/instance:
spec:
  collection:
    logs:
      fluentd: {}
      type: fluentd
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 60h
      audit:
        maxAge: 1d
      infra:
        maxAge: 3h
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 1
    type: kibana
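(For reference, the spec above is the ClusterLogging custom resource; it can be dumped with the command below, assuming the default instance name and namespace.)
$ oc -n openshift-logging get clusterlogging instance -o yaml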
$ oc get pod
NAME READY STATUS RESTARTS AGE
cluster-logging-operator-7b8fb444cc-59sw6 1/1 Running 0 63m
elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks 2/2 Running 0 62m
elasticsearch-cdm-vmehqabi-2-678465b47b-wvwwx 2/2 Running 0 4h1m
elasticsearch-cdm-vmehqabi-3-d79796b57-ctkfw 2/2 Running 0 4h1m
elasticsearch-delete-app-1607504400-7tp98 0/1 Completed 0 63m
elasticsearch-delete-app-1607508000-4h4mq 0/1 Error 0 3m57s
elasticsearch-delete-audit-1607504400-g4sck 0/1 Completed 0 63m
elasticsearch-delete-audit-1607508000-7h2md 0/1 Error 0 3m57s
elasticsearch-delete-infra-1607504400-sqxgj 0/1 Completed 0 63m
elasticsearch-delete-infra-1607508000-86f2b 0/1 Error 0 3m57s
elasticsearch-rollover-app-1607504400-hs4x6 0/1 Completed 0 63m
elasticsearch-rollover-app-1607508000-677dd 0/1 Error 0 3m57s
elasticsearch-rollover-audit-1607504400-ffnqs 0/1 Completed 0 63m
elasticsearch-rollover-audit-1607508000-87vsj 0/1 Error 0 3m57s
elasticsearch-rollover-infra-1607504400-gnkz6 0/1 Completed 0 63m
elasticsearch-rollover-infra-1607508000-pjk2j 0/1 Error 0 3m57s
fluentd-b8nhc 1/1 Running 0 4h1m
fluentd-fkm2q 1/1 Running 0 4h1m
fluentd-qs25n 1/1 Running 0 4h1m
fluentd-sp4m4 1/1 Running 0 4h1m
fluentd-sr78d 1/1 Running 0 4h1m
fluentd-szq6d 1/1 Running 0 4h1m
kibana-74554f87c6-ln88p 2/2 Running 0 61m
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- shards
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
infra-000004 1 p STARTED
infra-000004 1 r UNASSIGNED NODE_LEFT
infra-000004 2 p STARTED
infra-000004 2 r UNASSIGNED NODE_LEFT
infra-000004 0 p STARTED
infra-000004 0 r STARTED
infra-000010 1 r STARTED
infra-000010 1 p STARTED
infra-000010 2 p STARTED
infra-000010 2 r UNASSIGNED NODE_LEFT
infra-000010 0 p STARTED
infra-000010 0 r UNASSIGNED NODE_LEFT
infra-000009 1 p STARTED
infra-000009 1 r UNASSIGNED NODE_LEFT
infra-000009 2 p STARTED
infra-000009 2 r UNASSIGNED NODE_LEFT
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cat/nodes?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -n openshift-logging' to see all of the containers in this pod.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.131.0.49 64 100 15 1.13 0.97 1.26 mdi * elasticsearch-cdm-vmehqabi-2
10.129.2.172 13 66 10 0.35 0.69 0.79 mdi - elasticsearch-cdm-vmehqabi-1
10.128.2.24 59 100 15 0.75 0.85 0.72 mdi - elasticsearch-cdm-vmehqabi-3
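All three nodes are listed in the cluster, yet the replica shards stay unassigned. The reason for the yellow status can be queried directly with the same es_util helper (a diagnostic sketch; with no request body, _cluster/allocation/explain reports on the first unassigned shard it finds):
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cluster/health?pretty
$ oc exec elasticsearch-cdm-vmehqabi-1-ccffdf6c5-fdwks -- es_util --query=_cluster/allocation/explain?pretty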
The EO keeps repeating the following error messages:
"error":{"cluster":"elasticsearch","msg":"failed to create index template","namespace":"openshift-logging","response_body":null,
"error":{"cluster":"elasticsearch","msg":"failed to get list of index templates","namespace":"openshift-logging","response_body":null,
More details are in the must-gather.
Version-Release number of selected component (if applicable):
upgrade from elasticsearch-operator.4.7.0-202012080225.p0 to elasticsearch-operator.4.7.0-202012082225.p0
How reproducible:
100%
Steps to Reproduce:
1. deploy logging 4.7
2. upgrade logging to a new 4.7 version
3.
Actual results:
Expected results:
Additional info:
Tested several upgrade paths; here are the details:
path 1: upgrade from clusterlogging.4.7.0-202101070834.p0 to clusterlogging.4.7.0-202101092121.p0: succeeded
path 2: upgrade from clusterlogging.4.6.0-202101090741.p0 (latest 4.6, but not released yet) to clusterlogging.4.7.0-202101092121.p0: succeeded
path 3: upgrade from clusterlogging.4.6.0-202011221454.p0 (latest released 4.6) to clusterlogging.4.7.0-202101092121.p0: failed, same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1906641.
Since we have https://bugzilla.redhat.com/show_bug.cgi?id=1906641 to track the issue in path 3, I am moving this BZ to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Errata Advisory for OpenShift Logging 5.0.0), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:0652