Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1743194

Summary: The Elasticsearch deployments disappear after cluster upgrade
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.z
CC: aos-bugs, cblecker, christoph.obexer, ewolinet, haowang, jcantril, rmeggins, wgordon
Target Milestone: ---
Keywords: Reopened
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1751320 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:36:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1751320

Description Anping Li 2019-08-19 10:13:16 UTC
Description of problem:
The elasticsearch deployment resources disappeared during the cluster upgrade. The Elasticsearch custom resource is still present, but the elasticsearch-operator did not recreate the deployments from it.


Version-Release number of selected component (if applicable):
OCP 4.1.11 -> OCP 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. deploy logging on v4.1.11
2. upgrade the cluster to v4.1.12 (note: logging was not upgraded, only the cluster was)
3. check the cluster logging status

$oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m
$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n 
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
E0819 08:23:48.643256       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
E0819 08:24:21.219505       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"


Actual results:
There is no Elasticsearch instance; all the elasticsearch deployments have been deleted. The elasticsearch-operator did not recreate the deployments from the existing Elasticsearch custom resource.


Expected results:
Cluster logging works well after upgrade.

Comment 1 Anping Li 2019-08-19 12:09:04 UTC
Removed the Regression keyword, as the issue cannot be reproduced.

Comment 2 Anping Li 2019-08-26 09:13:09 UTC
Hit this issue again after the cluster was upgraded from v4.1.11 to v4.1.13.

1) No ES deployment
oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch nodes have scheduledRedeploy set to true, with conditions such as 'ElasticsearchContainerWaiting'.

$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}
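The status above can be checked without reading the whole JSON by filtering for nodes that are still scheduled for redeploy. This is a small sketch using a trimmed, hypothetical sample of the CR status; against a live cluster you would pipe `oc get elasticsearch elasticsearch -n openshift-logging -o json | jq '.status'` into the same filter instead.

```shell
# Trimmed sample of the Elasticsearch CR status shown above (illustrative only).
status='{"nodes":[{"deploymentName":"elasticsearch-cdm-g82ncdqr-1","upgradeStatus":{"scheduledRedeploy":"True"}},{"deploymentName":"elasticsearch-cdm-g82ncdqr-2","upgradeStatus":{"scheduledRedeploy":"True"}}]}'

# Print the deployment names of nodes whose redeploy is still pending.
echo "$status" | jq -r '.nodes[] | select(.upgradeStatus.scheduledRedeploy=="True") | .deploymentName'
```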

3) The elasticsearch-operator keeps logging that it is waiting for the cluster to be green.
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1:  / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2:  / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3:  / green"

Comment 3 Anping Li 2019-08-26 09:16:38 UTC
Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

Comment 5 Anping Li 2019-08-26 13:46:03 UTC
The bug was reported on the shared cluster that the whole OpenShift QE team works on, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?

Comment 7 ewolinet 2019-09-09 21:24:21 UTC
Anping,

You said that the elasticsearch deployments were deleted; were the other component deployments deleted as well?
Did the Elasticsearch CR get deleted?
Did the Clusterlogging CR get deleted?

Comment 8 Anping Li 2019-09-11 02:12:01 UTC
No, only the elasticsearch deployments were deleted.

Comment 13 Jeff Cantrill 2019-09-11 15:35:20 UTC
Trying to get this into 4.2 before close

Comment 14 Wang Haoran 2019-09-11 15:38:06 UTC
Here is a workaround:
1. scale down the elasticsearch operator
$oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. remove the "status" section from the elasticsearch CR
$oc edit elasticsearch -n openshift-logging
3. scale the elasticsearch operator back up
$oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
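The three steps above can be sketched as one script. This is a hedged sketch, not the documented fix: it assumes the namespaces used in this report (operator in openshift-operators, CR in openshift-logging), and it replaces the interactive `oc edit` step with an `oc patch --type=json` removal of `/status`, which only works if the Elasticsearch CRD does not use the status subresource (plausible for the operator-sdk 0.0.7 vintage seen in the logs, but an assumption).

```shell
# Sketch of the workaround from comment 14. Namespaces and the
# patch-based status removal are assumptions; adjust for your cluster.
workaround() {
  # 1. Scale down the elasticsearch operator so it stops reconciling.
  oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
  # 2. Drop the stale status block from the Elasticsearch CR
  #    (the non-interactive equivalent of deleting "status" in `oc edit`;
  #    assumes the CRD has no status subresource).
  oc patch elasticsearch elasticsearch -n openshift-logging \
    --type=json -p '[{"op": "remove", "path": "/status"}]'
  # 3. Scale the operator back up; it should recreate the deployments.
  oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
}
```

Run `workaround` once the operator pod is the only thing still reconciling the CR, then watch `oc get deployment -n openshift-logging` for the recreated elasticsearch-cdm-* deployments.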

Comment 16 Anping Li 2019-09-16 09:58:26 UTC
I deleted the elasticsearch deployments manually while the operator was in 'Waiting for cluster to complete recovery'; after a while, the deployments were recreated, so we can say the fix works. Since we didn't figure out what or who caused the ES deployments to disappear, please feel free to reopen.


[anli@preserve-anli-slave 42]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-77b67d8b87-zsg85       1/1     Running     0          31m
curator-1568627400-b6d66                        0/1     Completed   0          3m14s
elasticsearch-cd-pjyicydj-1-d5f967777-rcx6d     1/2     Running     0          7m51s
elasticsearch-cdm-my5mbcvj-1-5bc946d955-4nhcr   2/2     Running     0          4m20s
elasticsearch-cdm-my5mbcvj-2-55457749f5-m6gr7   2/2     Running     0          3m19s
elasticsearch-cdm-my5mbcvj-3-598c6d6bc9-c56tb   2/2     Running     0          2m17s

time="2019-09-16T09:24:42Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:06Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:07Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:17Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-my5mbcvj-3: yellow / green"
time="2019-09-16T09:25:38Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:39Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:42Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:43Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:44Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:26:16Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:27:29Z" level=info msg="Timed out waiting for elasticsearch-cdm-my5mbcvj-3 to rejoin cluster"
time="2019-09-16T09:27:29Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Node elasticsearch-cdm-my5mbcvj-3 has not rejoined cluster elasticsearch yet"
time="2019-09-16T09:28:00Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:02Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:04Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"

Comment 17 errata-xmlrpc 2019-10-16 06:36:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922