Bug 1751320

Summary: The Elasticsearch deployment disappears after cluster upgrade
Product: OpenShift Container Platform
Reporter: Jeff Cantrill <jcantril>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: urgent
Priority: urgent
Version: 4.1.z
CC: anli, aos-bugs, cblecker, ewolinet, haowang, jcantril, rmeggins, wgordon
Keywords: Reopened
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 1743194
Bug Depends On: 1743194
Last Closed: 2019-10-16 18:07:59 UTC

Description Jeff Cantrill 2019-09-11 17:39:14 UTC
+++ This bug was initially created as a clone of Bug #1743194 +++

Description of problem:
The elasticsearch deployment resources disappear during cluster upgrade. The elasticsearch custom resource is still present, but the elasticsearch-operator does not recreate the deployments from it.


Version-Release number of selected component (if applicable):
ocp 4.1.11 -> ocp 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. Deploy logging on v4.1.11.
2. Upgrade the cluster to v4.1.12 (note: only the cluster was upgraded; Logging was not upgraded).
3. Check the cluster logging status.

$oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m
$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n openshift-operators
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
E0819 08:23:48.643256       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652       1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101       1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"
E0819 08:24:21.219505       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712       1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955       1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815       1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773       1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690       1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1:  / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2:  / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3:  / green"


Actual results:
There is no Elasticsearch instance; all elasticsearch deployments have been deleted. The elasticsearch-operator did not recreate the elasticsearch deployments from the existing elasticsearch custom resource.


Expected results:
Cluster logging works well after upgrade.

--- Additional comment from Anping Li on 2019-08-19 12:09:04 UTC ---

Removing the Regression keyword, as it cannot be reproduced.

--- Additional comment from Anping Li on 2019-08-26 09:13:09 UTC ---

Hit this issue again after the cluster was upgraded from v4.1.11 to 4.1.13.  

1) No ES deployment
oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch scheduledRedeploy is True. The node condition type is 'ElasticsearchContainerWaiting'.

$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}

3) The elasticsearch-operator prints messages that it is waiting for the cluster to recover to green.
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1:  / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2:  / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3:  / green"

--- Additional comment from Anping Li on 2019-08-26 09:16:38 UTC ---

Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

--- Additional comment from Anping Li on 2019-08-26 11:00:44 UTC ---

Share the cluster:
oc login --token=2mZfsNn0NnTjUdWkRcvWCRi1vbir_wlo-UaGO7qdgtI --server=https://api.qe-411413-upg.qe.devcluster.openshift.com:6443

--- Additional comment from Anping Li on 2019-08-26 13:46:03 UTC ---

The bug was reported on the shared cluster on which the whole OpenShift QE team works, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?

--- Additional comment from Anping Li on 2019-08-26 14:00:54 UTC ---

https://bugzilla.redhat.com/show_bug.cgi?id=1745509 was reported on the same cluster, and it also could not be reproduced on my private clusters.

--- Additional comment from  on 2019-09-09 21:24:21 UTC ---

Anping,

You said that the elasticsearch deployments were deleted; were the other component deployments deleted as well?
Did the Elasticsearch CR get deleted?
Did the Clusterlogging CR get deleted?
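
(For reference, these checks can be run directly; hypothetical commands, not taken from this report, assuming the default openshift-logging namespace used elsewhere in this bug:)

$ oc get deployment -n openshift-logging
$ oc get elasticsearch -n openshift-logging
$ oc get clusterlogging -n openshift-logging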

--- Additional comment from Anping Li on 2019-09-11 02:12:01 UTC ---

No, only the elasticsearch deployments were deleted.

--- Additional comment from Wang Haoran on 2019-09-11 02:29:31 UTC ---

We hit this problem on our OSD v4 customer cluster after it had been running for some time.

oc get clusterserviceversion         
NAME                                        AGE
elasticsearch-operator.4.1.4-201906271212   22d

This is the log from the cluster-logging-operator:
oc logs cluster-logging-operator-7c58ddb664-z8sqh -n openshift-logging
time="2019-09-05T22:11:27Z" level=info msg="Go Version: go1.10.8"
time="2019-09-05T22:11:27Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-09-05T22:11:27Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-09-05T22:11:27Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-09-05T22:11:35Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-09-05T22:11:38Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-09-05T22:11:44Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:03Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:12Z" level=info msg="Updating status of Elasticsearch"
time="2019-09-05T22:12:47Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:13:46Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana route for \"instance\": the server is currently unable to handle the request"
ERROR: logging before flag.Parse: E0905 22:13:57.544994       1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E0905 22:13:57.567773       1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-09-05T22:14:13Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:16:28Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:17:06Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:21:55Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:22:14Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:25:07Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:39:58Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:40:17Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T17:45:22Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana deployment for \"instance\": etcdserver: request timed out"
time="2019-09-07T17:47:56Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update logstore for \"instance\": Failure creating Elasticsearch CR: etcdserver: request timed out"
time="2019-09-08T15:18:03Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
time="2019-09-08T15:21:13Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
ERROR: logging before flag.Parse: E0909 12:57:45.219640       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3772777, ErrCode=NO_ERROR, debug=""
ERROR: logging before flag.Parse: W0909 12:57:45.382760       1 reflector.go:341] github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[message:too old resource version: 11655411 (13168066) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[] status:Failure]}
time="2019-09-09T14:26:48Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update collection for \"instance\": Failure creating Log collector \"cluster-logging-metadata-reader\" cluster role binding: etcdserver: request timed out"
time="2019-09-09T14:27:05Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing Kibana service for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:26Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:47Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:47:04Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": the server is currently unable to handle the request"


The PVCs are not deleted:
oc get pvc -n openshift-logging
NAME                                         STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-93d8j7yk-1   Bound     pvc-f8ec6f93-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-2   Bound     pvc-f90aa2b4-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-3   Bound     pvc-f92926d9-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d

We have a 600Gi storage quota, but I see these logs from the elasticsearch-operator:

oc logs elasticsearch-operator-7c5cc4bff9-5rvps -n openshift-operators
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-1\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-2\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"

--- Additional comment from Wang Haoran on 2019-09-11 02:38:07 UTC ---

image version: registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

--- Additional comment from Wang Haoran on 2019-09-11 14:42:49 UTC ---

oc get elasticsearch -n openshift-logging -o yaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    creationTimestamp: 2019-08-19T20:06:05Z
    generation: 32
    name: elasticsearch
    namespace: openshift-logging
    ownerReferences:
    - apiVersion: logging.openshift.io/v1
      controller: true
      kind: ClusterLogging
      name: instance
      uid: c3ab8626-c2bc-11e9-9112-0200846155f2
    resourceVersion: "10555029"
    selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
    uid: c67240c6-c2bc-11e9-af83-02e2d9b4437a
  spec:
    managementState: Managed
    nodeSpec:
      image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.1.4-201906271212
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      resources: {}
    nodes:
    - genUUID: 93d8j7yk
      nodeCount: 3
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 200Gi
        storageClassName: gp2
    redundancyPolicy: SingleRedundancy
  status:
    clusterHealth: cluster health unknown
    conditions: []
    nodes:
    - deploymentName: elasticsearch-cdm-93d8j7yk-1
      upgradeStatus:
        scheduledRedeploy: "True"
        underUpgrade: "True"
        upgradePhase: nodeRestarting
    - conditions:
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ElasticsearchContainerTerminated
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ProxyContainerTerminated
      deploymentName: elasticsearch-cdm-93d8j7yk-2
      upgradeStatus:
        scheduledRedeploy: "True"
    - deploymentName: elasticsearch-cdm-93d8j7yk-3
      upgradeStatus:
        scheduledRedeploy: "True"
    pods:
      client:
        failed: []
        notReady: []
        ready: []
      data:
        failed: []
        notReady: []
        ready: []
      master:
        failed: []
        notReady: []
        ready: []
    shardAllocationEnabled: shard allocation unknown
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

--- Additional comment from  on 2019-09-11 15:30:47 UTC ---

We were able to resolve this with the following:

1. Scale down EO
2. Update the elasticsearch CR to remove the status section
3. Scale EO back up

The reason this worked is that the EO was stuck wanting to perform an upgrade, but there were no Elasticsearch deployments for some reason.
It is unclear why they were removed, since the elasticsearch CR wasn't deleted (perhaps timing between K8S and the upgrade caused the CR to be removed temporarily, which triggered K8S reaping of its dependents, and it was later restored?).

I will open a fix to address this for the future, as well as the repeated message about creating the PVC even though it already exists.
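
(If this happens again, one way to probe the reaping theory above, using hypothetical diagnostic commands not taken from this report, is to confirm the owner chain on the CR is intact and look for deletion-related events on the deployments:)

$ oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{.metadata.ownerReferences}'
$ oc get events -n openshift-logging --field-selector involvedObject.kind=Deployment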

--- Additional comment from Jeff Cantrill on 2019-09-11 15:35:20 UTC ---

Trying to get this into 4.2 before close

--- Additional comment from Wang Haoran on 2019-09-11 15:38:06 UTC ---

Here is a workaround:
1. Scale down the elasticsearch-operator:
$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. Remove the "status" section of the elasticsearch CR (a non-interactive alternative is sketched below):
$ oc edit elasticsearch -n openshift-logging
3. Scale the elasticsearch-operator back up:
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
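
(Step 2 can also be done non-interactively. This is a hypothetical equivalent, assuming the elasticsearch CRD in this version does not expose status as a subresource, so a JSON patch against the main resource can drop it:)

$ oc patch elasticsearch elasticsearch -n openshift-logging --type=json -p '[{"op": "remove", "path": "/status"}]'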

Comment 2 Anping Li 2019-10-10 10:34:55 UTC
The Elasticsearch deployments were recreated after I upgraded to elasticsearch-operator 4.1.19.
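
(A hedged way to verify, with a hypothetical check not part of the original comment: list the deployments again and confirm the elasticsearch-cdm-* entries are back.)

$ oc get deployment -n openshift-logging | grep elasticsearch-cdm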

Comment 4 errata-xmlrpc 2019-10-16 18:07:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3004