Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1743194

Summary: The Elasticsearch deployments disappear after cluster upgrade
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.z
CC: aos-bugs, cblecker, christoph.obexer, ewolinet, haowang, jcantril, rmeggins, wgordon
Target Milestone: ---
Keywords: Reopened
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1751320 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:36:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1751320

Description Anping Li 2019-08-19 10:13:16 UTC
Description of problem:
The elasticsearch deployment resources disappeared during the cluster upgrade. The Elasticsearch custom resource is still present, but the elasticsearch-operator did not recreate the deployments from it.


Version-Release number of selected component (if applicable):
OCP 4.1.11 -> OCP 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. deploy logging on v4.1.11
2. upgrade the cluster to v4.1.12 (note: logging was not upgraded, only the cluster was)
3. check the cluster logging status

$oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m
$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n 
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
E0819 08:23:48.643256       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
E0819 08:24:21.219505       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"


Actual results:
There is no Elasticsearch instance; all the elasticsearch deployments have been deleted. The elasticsearch-operator did not recreate the deployments from the existing Elasticsearch custom resource.


Expected results:
Cluster logging works well after upgrade.

Comment 1 Anping Li 2019-08-19 12:09:04 UTC
Removed the Regression keyword, as the issue cannot be reproduced.

Comment 2 Anping Li 2019-08-26 09:13:09 UTC
Hit this issue again after the cluster was upgraded from v4.1.11 to v4.1.13.

1) No ES deployment
oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch nodes have scheduledRedeploy set to true, with conditions such as 'ElasticsearchContainerWaiting'.

$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}
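The status above can be checked without reading the whole JSON by filtering for nodes that are still scheduled for redeploy. This is a small sketch using a trimmed, hypothetical sample of the CR status; against a live cluster you would pipe `oc get elasticsearch elasticsearch -n openshift-logging -o json | jq '.status'` into the same filter instead.

```shell
# Trimmed sample of the Elasticsearch CR status shown above (illustrative only).
status='{"nodes":[{"deploymentName":"elasticsearch-cdm-g82ncdqr-1","upgradeStatus":{"scheduledRedeploy":"True"}},{"deploymentName":"elasticsearch-cdm-g82ncdqr-2","upgradeStatus":{"scheduledRedeploy":"True"}}]}'

# Print the deployment names of nodes whose redeploy is still pending.
echo "$status" | jq -r '.nodes[] | select(.upgradeStatus.scheduledRedeploy=="True") | .deploymentName'
```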

3) The elasticsearch-operator keeps logging that it is waiting for the cluster to be green.
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1:  / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2:  / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3:  / green"

Comment 3 Anping Li 2019-08-26 09:16:38 UTC
Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

Comment 5 Anping Li 2019-08-26 13:46:03 UTC
The bug was reported on the shared cluster that the whole OpenShift QE team works on, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?

Comment 7 ewolinet 2019-09-09 21:24:21 UTC
Anping,

You said that the elasticsearch deployments were deleted; were the other component deployments deleted as well?
Did the Elasticsearch CR get deleted?
Did the Clusterlogging CR get deleted?

Comment 8 Anping Li 2019-09-11 02:12:01 UTC
No, only the elasticsearch deployments were deleted.

Comment 13 Jeff Cantrill 2019-09-11 15:35:20 UTC
Trying to get this into 4.2 before close

Comment 14 Wang Haoran 2019-09-11 15:38:06 UTC
Here is a workaround:
1. scale down the elasticsearch operator
$oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. remove the "status" section from the elasticsearch CR
$oc edit elasticsearch -n openshift-logging
3. scale the elasticsearch operator back up
$oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
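The three steps above can be sketched as one script. This is a hedged sketch, not the documented fix: it assumes the namespaces used in this report (operator in openshift-operators, CR in openshift-logging), and it replaces the interactive `oc edit` step with an `oc patch --type=json` removal of `/status`, which only works if the Elasticsearch CRD does not use the status subresource (plausible for the operator-sdk 0.0.7 vintage seen in the logs, but an assumption).

```shell
# Sketch of the workaround from comment 14. Namespaces and the
# patch-based status removal are assumptions; adjust for your cluster.
workaround() {
  # 1. Scale down the elasticsearch operator so it stops reconciling.
  oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
  # 2. Drop the stale status block from the Elasticsearch CR
  #    (the non-interactive equivalent of deleting "status" in `oc edit`;
  #    assumes the CRD has no status subresource).
  oc patch elasticsearch elasticsearch -n openshift-logging \
    --type=json -p '[{"op": "remove", "path": "/status"}]'
  # 3. Scale the operator back up; it should recreate the deployments.
  oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
}
```

Run `workaround` once the operator pod is the only thing still reconciling the CR, then watch `oc get deployment -n openshift-logging` for the recreated elasticsearch-cdm-* deployments.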

Comment 16 Anping Li 2019-09-16 09:58:26 UTC
I deleted the elasticsearch deployments manually while the operator was in 'Waiting for cluster to complete recovery'; after a while, the deployments were recreated, so we can say the fix works. Since we didn't figure out what or who caused the ES deployments to disappear, please feel free to reopen.


[anli@preserve-anli-slave 42]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-77b67d8b87-zsg85       1/1     Running     0          31m
curator-1568627400-b6d66                        0/1     Completed   0          3m14s
elasticsearch-cd-pjyicydj-1-d5f967777-rcx6d     1/2     Running     0          7m51s
elasticsearch-cdm-my5mbcvj-1-5bc946d955-4nhcr   2/2     Running     0          4m20s
elasticsearch-cdm-my5mbcvj-2-55457749f5-m6gr7   2/2     Running     0          3m19s
elasticsearch-cdm-my5mbcvj-3-598c6d6bc9-c56tb   2/2     Running     0          2m17s

time="2019-09-16T09:24:42Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:06Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:07Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:17Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-my5mbcvj-3: yellow / green"
time="2019-09-16T09:25:38Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:39Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:42Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:43Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:44Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:26:16Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:27:29Z" level=info msg="Timed out waiting for elasticsearch-cdm-my5mbcvj-3 to rejoin cluster"
time="2019-09-16T09:27:29Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Node elasticsearch-cdm-my5mbcvj-3 has not rejoined cluster elasticsearch yet"
time="2019-09-16T09:28:00Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:02Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:04Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"

Comment 17 errata-xmlrpc 2019-10-16 06:36:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922