Bug 1751320 - The Elasticsearch deployment disappears after cluster upgrade
Summary: The Elasticsearch deployment disappears after cluster upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.z
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1743194
Blocks:
 
Reported: 2019-09-11 17:39 UTC by Jeff Cantrill
Modified: 2019-10-16 18:08 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1743194
Environment:
Last Closed: 2019-10-16 18:07:59 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3004 None None None 2019-10-16 18:08:08 UTC
Github openshift elasticsearch-operator pull 187 'None' 'closed' 'Bug 1751320: Recreate nodes that have been deleted during upgrade' 2019-12-06 10:10:35 UTC

Description Jeff Cantrill 2019-09-11 17:39:14 UTC
+++ This bug was initially created as a clone of Bug #1743194 +++

Description of problem:
The elasticsearch deployment resources disappear during cluster upgrade. The elasticsearch custom resource is still present, but the elasticsearch-operator did not recreate the deployments from it.


Version-Release number of selected component (if applicable):
ocp 4.1.11-> ocp-4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. deploy logging on v4.1.11
2. upgrade cluster to v4.1.12 (note: logging wasn't upgraded; only the cluster was upgraded).
3. check the cluster logging status

$ oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m
$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n 
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
E0819 08:23:48.643256       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
E0819 08:24:21.219505       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"


Actual results:
There is no elasticsearch instance: all elasticsearch deployments have been deleted, and the elasticsearch-operator did not recreate them from the existing elasticsearch custom resource.


Expected results:
Cluster logging works well after upgrade.

--- Additional comment from Anping Li on 2019-08-19 12:09:04 UTC ---

Removing the regression keyword, as it cannot be reproduced.

--- Additional comment from Anping Li on 2019-08-26 09:13:09 UTC ---

Hit this issue again after the cluster was upgraded from v4.1.11 to 4.1.13.  

1) No ES deployment
oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch scheduledRedeploy is true.  The reason is 'ElasticsearchContainerWaiting'.

$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}

3) The elasticsearch-operator logs show it is waiting for the cluster to turn green.
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1:  / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2:  / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3:  / green"
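The stuck state shown in the status above can be summarized mechanically. A minimal sketch (a hypothetical helper, not elasticsearch-operator code) that flags nodes the operator still wants to redeploy even though their Deployment no longer exists:

```python
def stuck_nodes(status, existing_deployments):
    """Return deploymentNames the CR says need a redeploy but which
    no longer exist as Deployment objects (the state in this bug)."""
    return [
        node["deploymentName"]
        for node in status.get("nodes", [])
        if node.get("upgradeStatus", {}).get("scheduledRedeploy") == "True"
        and node["deploymentName"] not in existing_deployments
    ]

# Condensed from the `oc get elasticsearch -o json | jq '.status'` output above.
status = {
    "nodes": [
        {"deploymentName": "elasticsearch-cdm-g82ncdqr-1",
         "upgradeStatus": {"scheduledRedeploy": "True"}},
        {"deploymentName": "elasticsearch-cdm-g82ncdqr-2",
         "upgradeStatus": {"scheduledRedeploy": "True"}},
        {"deploymentName": "elasticsearch-cdm-g82ncdqr-3",
         "upgradeStatus": {"scheduledRedeploy": "True"}},
    ]
}

# `oc get deployment` above shows only these two deployments remain:
print(stuck_nodes(status, {"cluster-logging-operator", "kibana"}))
# -> all three elasticsearch-cdm-g82ncdqr-* names
```

This is the contradiction described in comment 10: every node is marked for redeploy, but there is no Deployment left for the operator to restart, so it waits on cluster health forever.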

--- Additional comment from Anping Li on 2019-08-26 09:16:38 UTC ---

Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

--- Additional comment from Anping Li on 2019-08-26 11:00:44 UTC ---

Share the cluster:
oc login --token=2mZfsNn0NnTjUdWkRcvWCRi1vbir_wlo-UaGO7qdgtI --server=https://api.qe-411413-upg.qe.devcluster.openshift.com:6443

--- Additional comment from Anping Li on 2019-08-26 13:46:03 UTC ---

The bug was reported on the shared cluster on which the whole OpenShift QE team works, but I couldn't reproduce it in my private clusters. Could this be caused by resource limitations?

--- Additional comment from Anping Li on 2019-08-26 14:00:54 UTC ---

Bug https://bugzilla.redhat.com/show_bug.cgi?id=1745509 was reported on the same cluster, and it couldn't be reproduced on my private clusters either.

--- Additional comment from  on 2019-09-09 21:24:21 UTC ---

Anping,

You said that the elasticsearch deployments were deleted, were the other component deployments deleted as well?
Did the Elasticsearch CR get deleted?
Did the Clusterlogging CR get deleted?

--- Additional comment from Anping Li on 2019-09-11 02:12:01 UTC ---

No, only the elasticsearch deployments were deleted.

--- Additional comment from Wang Haoran on 2019-09-11 02:29:31 UTC ---

We hit this problem on our OSD v4 customer cluster after it had been running for some time.

oc get clusterserviceversion         
NAME                                        AGE
elasticsearch-operator.4.1.4-201906271212   22d

This is the log from the operator:
oc logs cluster-logging-operator-7c58ddb664-z8sqh -n openshift-logging
time="2019-09-05T22:11:27Z" level=info msg="Go Version: go1.10.8"
time="2019-09-05T22:11:27Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-09-05T22:11:27Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-09-05T22:11:27Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-09-05T22:11:35Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-09-05T22:11:38Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-09-05T22:11:44Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:03Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:12Z" level=info msg="Updating status of Elasticsearch"
time="2019-09-05T22:12:47Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:13:46Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana route for \"instance\": the server is currently unable to handle the request"
ERROR: logging before flag.Parse: E0905 22:13:57.544994       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E0905 22:13:57.567773       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-09-05T22:14:13Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:16:28Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:17:06Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:21:55Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:22:14Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:25:07Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:39:58Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:40:17Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T17:45:22Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana deployment for \"instance\": etcdserver: request timed out"
time="2019-09-07T17:47:56Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update logstore for \"instance\": Failure creating Elasticsearch CR: etcdserver: request timed out"
time="2019-09-08T15:18:03Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
time="2019-09-08T15:21:13Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
ERROR: logging before flag.Parse: E0909 12:57:45.219640       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3772777, ErrCode=NO_ERROR, debug=""
ERROR: logging before flag.Parse: W0909 12:57:45.382760       1 reflector.go:341] github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[message:too old resource version: 11655411 (13168066) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[] status:Failure]}
time="2019-09-09T14:26:48Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update collection for \"instance\": Failure creating Log collector \"cluster-logging-metadata-reader\" cluster role binding: etcdserver: request timed out"
time="2019-09-09T14:27:05Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing Kibana service for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:26Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:47Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:47:04Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": the server is currently unable to handle the request"


The PVCs are not deleted:
oc get pvc -n openshift-logging
NAME                                         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-93d8j7yk-1   Bound     pvc-f8ec6f93-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-2   Bound     pvc-f90aa2b4-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-3   Bound     pvc-f92926d9-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d

We have a 600Gi quota, but I see these logs from the elasticsearch operator:

oc logs elasticsearch-operator-7c5cc4bff9-5rvps -n openshift-operators
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-1\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-2\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"
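The PVC errors above are plain quota admission arithmetic: the three existing 200Gi claims already consume the entire 600Gi limit, so when the operator tries to create a claim with the same name instead of adopting the existing one, the request is rejected. A sketch of the check, using the numbers from the log (a simplified illustration of ResourceQuota accounting, not the actual apiserver code):

```python
def quota_allows(requested_gi, used_gi, limit_gi):
    """ResourceQuota-style check: a new PVC is admitted only if
    used + requested stays within the storage limit."""
    return used_gi + requested_gi <= limit_gi

# From the log: requested 200Gi, used 600Gi, limit 600Gi -> forbidden.
print(quota_allows(200, 600, 600))  # False
# The create could only succeed if one existing claim were released first:
print(quota_allows(200, 400, 600))  # True
```

This is why the "Unable to create PVC" message repeats even though the claims still exist: the create path never matches the existing claim, and the quota never has headroom for a duplicate.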

--- Additional comment from Wang Haoran on 2019-09-11 02:38:07 UTC ---

image version: registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

--- Additional comment from Wang Haoran on 2019-09-11 14:42:49 UTC ---

oc get elasticsearch -n openshift-logging -o yaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    creationTimestamp: 2019-08-19T20:06:05Z
    generation: 32
    name: elasticsearch
    namespace: openshift-logging
    ownerReferences:
    - apiVersion: logging.openshift.io/v1
      controller: true
      kind: ClusterLogging
      name: instance
      uid: c3ab8626-c2bc-11e9-9112-0200846155f2
    resourceVersion: "10555029"
    selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
    uid: c67240c6-c2bc-11e9-af83-02e2d9b4437a
  spec:
    managementState: Managed
    nodeSpec:
      image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.1.4-201906271212
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      resources: {}
    nodes:
    - genUUID: 93d8j7yk
      nodeCount: 3
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 200Gi
        storageClassName: gp2
    redundancyPolicy: SingleRedundancy
  status:
    clusterHealth: cluster health unknown
    conditions: []
    nodes:
    - deploymentName: elasticsearch-cdm-93d8j7yk-1
      upgradeStatus:
        scheduledRedeploy: "True"
        underUpgrade: "True"
        upgradePhase: nodeRestarting
    - conditions:
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ElasticsearchContainerTerminated
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ProxyContainerTerminated
      deploymentName: elasticsearch-cdm-93d8j7yk-2
      upgradeStatus:
        scheduledRedeploy: "True"
    - deploymentName: elasticsearch-cdm-93d8j7yk-3
      upgradeStatus:
        scheduledRedeploy: "True"
    pods:
      client:
        failed: []
        notReady: []
        ready: []
      data:
        failed: []
        notReady: []
        ready: []
      master:
        failed: []
        notReady: []
        ready: []
    shardAllocationEnabled: shard allocation unknown
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

--- Additional comment from  on 2019-09-11 15:30:47 UTC ---

We were able to resolve this with the following:

1. Scale down EO
2. Update the elasticsearch CR to remove the status section
3. Scale EO back up

This worked because the EO was stuck waiting to perform an upgrade, yet for some reason the Elasticsearch deployments no longer existed.
It is unclear why they were removed, since the elasticsearch CR wasn't deleted (perhaps the CR was briefly removed during the upgrade, which triggered garbage collection of its dependents, and it was later restored?).

I will open a fix to address this, and also to address the repeated PVC creation message even though the PVC already exists.

--- Additional comment from Jeff Cantrill on 2019-09-11 15:35:20 UTC ---

Trying to get this into 4.2 before close

--- Additional comment from Wang Haoran on 2019-09-11 15:38:06 UTC ---

Here is a workaround:
1. scale down the elasticsearch operator
$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. remove the "status" section of the elasticsearch CR
$ oc edit elasticsearch -n openshift-logging
3. scale the elasticsearch operator back up
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators

Comment 2 Anping Li 2019-10-10 10:34:55 UTC
The Elasticsearch deployments are recreated after I upgraded to elasticsearch-operator 4.1.19

Comment 4 errata-xmlrpc 2019-10-16 18:07:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3004

