Bug 1886796 - The EO can't recreate ES deployments after upgrading the cluster from 4.5 to 4.6
Summary: The EO can't recreate ES deployments after upgrading the cluster from 4.5 to 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Periklis Tsirakidis
QA Contact: Giriyamma
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: 1880926
 
Reported: 2020-10-09 12:16 UTC by Periklis Tsirakidis
Modified: 2021-02-24 11:22 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift elasticsearch-operator pull 521 (closed): Bug 1886796: Fix in-place filtering missing status nodes (last updated 2021-02-08 09:59:36 UTC)
Red Hat Product Errata RHBA-2021:0652 (last updated 2021-02-24 11:22:11 UTC)

Description Periklis Tsirakidis 2020-10-09 12:16:30 UTC
This bug was initially created as a copy of Bug #1880926

I am copying this bug because: 



Description of problem:
1. deploy logging 4.5 on a 4.5 cluster
2. upgrade logging to 4.6
3. upgrade the cluster to 4.6; the ES deployments disappear after the cluster upgrade completes successfully

There are many errors like the following in the EO logs:
{"level":"error","ts":1600656185.4719505,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:213","msg":"Failed to progress update of unschedulable node","node":"elasticsearch-cdm-2wg9lezz-1","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}
{"level":"error","ts":1600656185.4720213,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:65","msg":"unable to progress unschedulable nodes","cluster":"elasticsearch","namespace":"openshift-logging","error":"Deployment.apps \"elasticsearch-cdm-2wg9lezz-1\" not found"}


I found that many secrets had been recreated:

$ oc get secrets
NAME                                       TYPE                                  DATA   AGE
builder-dockercfg-wws6m                    kubernetes.io/dockercfg               1      4h10m
builder-token-qqg2t                        kubernetes.io/service-account-token   4      4h10m
builder-token-xcd6k                        kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-dockercfg-f29rl   kubernetes.io/dockercfg               1      4h10m
cluster-logging-operator-token-kx45d       kubernetes.io/service-account-token   4      4h10m
cluster-logging-operator-token-lhvg8       kubernetes.io/service-account-token   4      4h10m
default-dockercfg-w45ql                    kubernetes.io/dockercfg               1      4h10m
default-token-7xxvz                        kubernetes.io/service-account-token   4      4h10m
default-token-mk62p                        kubernetes.io/service-account-token   4      4h10m
deployer-dockercfg-bf96k                   kubernetes.io/dockercfg               1      4h10m
deployer-token-4d9p4                       kubernetes.io/service-account-token   4      4h10m
deployer-token-cshkb                       kubernetes.io/service-account-token   4      4h10m
elasticsearch                              Opaque                                7      169m
elasticsearch-dockercfg-rbxv5              kubernetes.io/dockercfg               1      169m
elasticsearch-metrics                      kubernetes.io/tls                     2      169m
elasticsearch-token-lbxgf                  kubernetes.io/service-account-token   4      169m
elasticsearch-token-p594b                  kubernetes.io/service-account-token   4      169m
fluentd                                    Opaque                                3      169m
fluentd-metrics                            kubernetes.io/tls                     2      169m
kibana                                     Opaque                                3      169m
kibana-dockercfg-bn6dk                     kubernetes.io/dockercfg               1      169m
kibana-proxy                               Opaque                                3      169m
kibana-token-qqksm                         kubernetes.io/service-account-token   4      169m
kibana-token-zs7w2                         kubernetes.io/service-account-token   4      169m
logcollector-dockercfg-gbwqn               kubernetes.io/dockercfg               1      169m
logcollector-token-f5qxl                   kubernetes.io/service-account-token   4      169m
logcollector-token-gbvf8                   kubernetes.io/service-account-token   4      169m
master-certs                               Opaque                                2      169m

$ oc get pod
NAME                                            READY   STATUS                  RESTARTS   AGE
cluster-logging-operator-779f857c67-zkq6k       1/1     Running                 0          167m
elasticsearch-delete-app-1600665300-bq9lq       0/1     Error                   0          14m
elasticsearch-delete-audit-1600665300-5pnrd     0/1     Error                   0          14m
elasticsearch-delete-infra-1600665300-kbm6l     0/1     Error                   0          14m
elasticsearch-rollover-app-1600665300-7tlhh     0/1     Error                   0          14m
elasticsearch-rollover-audit-1600665300-27b5q   0/1     Error                   0          14m
elasticsearch-rollover-infra-1600665300-psf7m   0/1     Error                   0          14m
fluentd-7jhf7                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-fzg9m                                   0/1     Init:CrashLoopBackOff   30         169m
fluentd-k7zqc                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-l5ms4                                   0/1     Init:CrashLoopBackOff   31         169m
fluentd-m4829                                   0/1     Init:CrashLoopBackOff   29         169m
fluentd-zphfg                                   0/1     Init:CrashLoopBackOff   30         169m
kibana-549dff7bcd-ml6r7                         2/2     Running                 0          167m

Elasticsearch/elasticsearch:
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs.app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs.infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs.audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 1d
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 12h
        hot:
          actions:
            rollover:
              maxAge: 36m
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 2w
        hot:
          actions:
            rollover:
              maxAge: 2h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: 2wg9lezz
    nodeCount: 3
    proxyResources: {}
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: SingleRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  nodes:
  - conditions:
    - lastTransitionTime: "2020-09-21T02:38:19Z"
      message: '0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn''t tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-2wg9lezz-1
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ElasticsearchContainerTerminated
    - lastTransitionTime: "2020-09-21T02:40:26Z"
      reason: Error
      status: "True"
      type: ProxyContainerTerminated
    deploymentName: elasticsearch-cdm-2wg9lezz-2
    upgradeStatus:
      upgradePhase: controllerUpdated
  - conditions:
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ElasticsearchContainerWaiting
    - lastTransitionTime: "2020-09-21T02:40:29Z"
      reason: ContainerCreating
      status: "True"
      type: ProxyContainerWaiting
    deploymentName: elasticsearch-cdm-2wg9lezz-3
    upgradeStatus:
      upgradePhase: controllerUpdated
  pods:
    client:
      failed: []
      notReady: []
      ready: []
    data:
      failed: []
      notReady: []
      ready: []
    master:
      failed: []
      notReady: []
      ready: []
  shardAllocationEnabled: shard allocation unknown
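
Note that the status above still carries deploymentName entries for elasticsearch-cdm-2wg9lezz-1/2/3 even though the oc get deploy output below shows none of those Deployments exist anymore, which matches the "Deployment.apps ... not found" errors. The following is a minimal sketch, assuming a controller-runtime client, of how a reconciler could detect such stale references; the package name, the findMissingDeployments helper, and its signature are hypothetical, not the operator's actual code.

package example // hypothetical package; a sketch, not elasticsearch-operator code

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findMissingDeployments returns the deploymentName entries from a status
// block like the one above whose Deployment no longer exists in the given
// namespace. Entries like these are what the operator keeps trying to
// "progress", producing the "Deployment.apps ... not found" errors.
func findMissingDeployments(ctx context.Context, c client.Client, namespace string, names []string) ([]string, error) {
	var missing []string
	for _, name := range names {
		var dep appsv1.Deployment
		err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &dep)
		if apierrors.IsNotFound(err) {
			missing = append(missing, name) // status still references this node, but its Deployment is gone
			continue
		}
		if err != nil {
			return nil, err
		}
	}
	return missing, nil
}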

$ oc get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           4h12m
kibana                     1/1     1            1           171m


Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES                                       PHASE
clusterlogging.4.6.0-202009192030.p0           Cluster Logging          4.6.0-202009192030.p0   clusterlogging.4.5.0-202009161248.p0           Succeeded
elasticsearch-operator.4.6.0-202009192030.p0   Elasticsearch Operator   4.6.0-202009192030.p0   elasticsearch-operator.4.5.0-202009182238.p0   Succeeded


How reproducible:
I tried 3 times and hit it only once.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
must-gather: http://file.apac.redhat.com/~qitang/must-gather-0921.tar.gz

Comment 3 Anping Li 2020-10-14 07:41:11 UTC
Blocked as there are no 4.7 operator/bundle images.

Comment 4 Giriyamma 2020-11-04 12:48:16 UTC
Verified this on clusterlogging.4.7.0-202011021919.p0, elasticsearch-operator.4.7.0-202011030448.p0 and cluster 4.7.0-0.nightly-2020-10-27-051128.

Comment 9 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for OpenShift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

