Bug 1743194 - The Elasticsearch deployment disappears after cluster upgrade
Summary: The Elasticsearch deployment disappears after cluster upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1751320
 
Reported: 2019-08-19 10:13 UTC by Anping Li
Modified: 2019-10-16 06:36 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1751320 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:36:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID / Status / Summary / Last Updated
Github openshift elasticsearch-operator pull 186 | closed | Bug 1743194: For nodes that have been deleted during upgrade, recreate them | 2020-10-27 05:03:05 UTC
Red Hat Product Errata RHBA-2019:2922 | None | None | 2019-10-16 06:36:36 UTC

Description Anping Li 2019-08-19 10:13:16 UTC
Description of problem:
The elasticsearch deployment resources disappeared during a cluster upgrade. The elasticsearch custom resource (CR) is still present, but the elasticsearch-operator did not recreate the deployments from it.


Version-Release number of selected component (if applicable):
ocp 4.1.11 -> ocp 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. deploy logging on v4.1.11
2. upgrade the cluster to v4.1.12 (note: logging was not upgraded, only the cluster)
3. check the cluster logging status

$oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m
$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n 
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
E0819 08:23:48.643256       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
E0819 08:24:21.219505       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"


Actual results:
There is no Elasticsearch instance: all elasticsearch deployments have been deleted, and the elasticsearch-operator did not recreate them from the existing elasticsearch custom resource.


Expected results:
Cluster logging continues to work after the cluster upgrade.

Comment 1 Anping Li 2019-08-19 12:09:04 UTC
Removing the Regression keyword, as the issue cannot be reproduced.

Comment 2 Anping Li 2019-08-26 09:13:09 UTC
Hit this issue again after the cluster was upgraded from v4.1.11 to 4.1.13.  

1) No ES deployment
oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The Elasticsearch nodes have scheduledRedeploy set to True; the node conditions report ElasticsearchContainerWaiting (ContainerCreating) and container-terminated states. (A jq sketch that extracts these fields follows the operator log below.)

$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}

3) The elasticsearch-operator only prints "Waiting for cluster to be fully recovered" messages.
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1:  / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2:  / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3:  / green"

Comment 3 Anping Li 2019-08-26 09:16:38 UTC
Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

Comment 5 Anping Li 2019-08-26 13:46:03 UTC
The bug was reported on the shared cluster that the whole OpenShift QE team works on, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?

Comment 7 ewolinet 2019-09-09 21:24:21 UTC
Anping,

You said that the elasticsearch deployments were deleted; were the other component deployments deleted as well?
Did the Elasticsearch CR get deleted?
Did the Clusterlogging CR get deleted?

Comment 8 Anping Li 2019-09-11 02:12:01 UTC
No, only the elasticsearch deployments were deleted.

Comment 13 Jeff Cantrill 2019-09-11 15:35:20 UTC
Trying to get this into 4.2 before close

Comment 14 Wang Haoran 2019-09-11 15:38:06 UTC
Here is a workaround:
1. scale down the elasticsearch operator:
$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. remove the "status" section of the elasticsearch CR:
$ oc edit elasticsearch -n openshift-logging
3. scale the elasticsearch operator back up:
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
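
The same workaround as a scripted sketch; the oc patch call that drops the status stanza is an untested alternative to hand-editing the CR and assumes the default resource name elasticsearch:

$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
# drop the stale status so the operator rebuilds its view of the cluster on restart
$ oc patch elasticsearch/elasticsearch -n openshift-logging --type=json \
    -p '[{"op": "remove", "path": "/status"}]'
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators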

Comment 16 Anping Li 2019-09-16 09:58:26 UTC
I deleted the elasticsearch deployments manually while the operator was reporting 'Waiting for cluster to complete recovery'; after a while, the deployments were recreated, so the fix works. Since we never figured out what/who caused the ES deployments to disappear in the first place, please feel free to reopen if it happens again.
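
A minimal sketch of that verification; the deployment names are the ones from the output below and are cluster-specific:

$ oc delete deployment elasticsearch-cdm-my5mbcvj-1 elasticsearch-cdm-my5mbcvj-2 elasticsearch-cdm-my5mbcvj-3 -n openshift-logging
# the elasticsearch-operator should recreate the deployments within a few minutes
$ oc get deployment -n openshift-logging -w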


[anli@preserve-anli-slave 42]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-77b67d8b87-zsg85       1/1     Running     0          31m
curator-1568627400-b6d66                        0/1     Completed   0          3m14s
elasticsearch-cd-pjyicydj-1-d5f967777-rcx6d     1/2     Running     0          7m51s
elasticsearch-cdm-my5mbcvj-1-5bc946d955-4nhcr   2/2     Running     0          4m20s
elasticsearch-cdm-my5mbcvj-2-55457749f5-m6gr7   2/2     Running     0          3m19s
elasticsearch-cdm-my5mbcvj-3-598c6d6bc9-c56tb   2/2     Running     0          2m17s

time="2019-09-16T09:24:42Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:06Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:07Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:17Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-my5mbcvj-3: yellow / green"
time="2019-09-16T09:25:38Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:39Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:42Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:43Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:44Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:26:16Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:27:29Z" level=info msg="Timed out waiting for elasticsearch-cdm-my5mbcvj-3 to rejoin cluster"
time="2019-09-16T09:27:29Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Node elasticsearch-cdm-my5mbcvj-3 has not rejoined cluster elasticsearch yet"
time="2019-09-16T09:28:00Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:02Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:04Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"

Comment 17 errata-xmlrpc 2019-10-16 06:36:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

