Description of problem:
The elasticsearch deployments disappear during a cluster upgrade. The elasticsearch CR is still present, but the elasticsearch-operator did not recreate the deployments from it.

Version-Release number of selected component (if applicable):
OCP 4.1.11 -> OCP 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. Deploy logging on v4.1.11.
2. Upgrade the cluster to v4.1.12 (note: logging was not upgraded, only the cluster).
3. Check the cluster logging status (a consolidated set of diagnostic commands is sketched after this comment).

$ oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m

$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684 1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
E0819 08:23:48.643256 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652 1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516 1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
E0819 08:24:21.219505 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"

Actual results:
There is no elasticsearch instance; all elasticsearch deployments have been deleted. The elasticsearch-operator did not recreate the elasticsearch deployments from the existing elasticsearch CR.

Expected results:
Cluster logging works well after the upgrade.
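For anyone hitting the same state, a minimal set of checks to confirm it (a sketch only; it assumes the operator runs in openshift-operators, as in the workaround further down, and that the only CR is named elasticsearch):

$ oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{.status.nodes[*].deploymentName}'
# lists the deployments the CR expects, e.g. elasticsearch-cdm-...-1/2/3

$ oc get deployment -n openshift-logging | grep elasticsearch-cdm
# returns nothing when the deployments have disappeared

$ oc logs deployment/elasticsearch-operator -n openshift-operators --tail=100 | grep -E 'Waiting for cluster|memcache'
# shows whether the operator is stuck in the "fully recovered before restarting" loop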
Removing the regression keyword, as it cannot be reproduced.
Hit this issue again after the cluster was upgraded from v4.1.11 to v4.1.13.

1) No ES deployments:

$ oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch scheduledRedeploy is "True". The reason is 'ElasticsearchContainerWaiting'. (A jq one-liner summarizing this per node is sketched after the operator log below.)

$ oc get elasticsearch elasticsearch -o json | jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}

3) The elasticsearch-operator prints the "waiting for green" message:

[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1: / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2: / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3: / green"
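A shorter way to see the per-node redeploy state from the CR status (a convenience sketch; the openshift-logging namespace is assumed):

$ oc get elasticsearch elasticsearch -n openshift-logging -o json \
    | jq '.status.nodes[] | {deploymentName, upgradeStatus, conditions: [.conditions[].type]}'
# prints one object per node with its deploymentName, upgradeStatus.scheduledRedeploy,
# and the condition types (ElasticsearchContainerWaiting, ProxyContainerTerminated, ...)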
Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?
The bug was reported on the shared cluster on which the whole OpenShift QE team works, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?
Anping, you said that the elasticsearch deployments were deleted; were the other component deployments deleted as well? Did the Elasticsearch CR get deleted? Did the ClusterLogging CR get deleted?
No, only the elasticsearch deployments were deleted.
Trying to get this into 4.2 before close
Here is a workaround:

1. Scale down the elasticsearch operator:
$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators

2. Remove the "status" section of the elasticsearch CR:
$ oc edit elasticsearch -n openshift-logging

3. Scale the elasticsearch operator back up:
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
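If an interactive oc edit is not convenient, the same workaround can be scripted. This is an untested sketch and assumes the Elasticsearch CRD in this release does not serve status as a subresource; if it does, the patch below will not clear .status and step 2 above with oc edit is still needed:

$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
$ oc patch elasticsearch elasticsearch -n openshift-logging --type=json \
    -p '[{"op": "remove", "path": "/status"}]'
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators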
I deleted the elasticsearch deployments manually while the operator was reporting 'Waiting for cluster to complete recovery'; after a while, these deployments were recreated, so we can say the fix works. As we didn't figure out what/who caused the ES deployments to disappear, please feel free to reopen. (The commands used for this check are sketched after the log excerpt below.)

[anli@preserve-anli-slave 42]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-77b67d8b87-zsg85       1/1     Running     0          31m
curator-1568627400-b6d66                        0/1     Completed   0          3m14s
elasticsearch-cd-pjyicydj-1-d5f967777-rcx6d     1/2     Running     0          7m51s
elasticsearch-cdm-my5mbcvj-1-5bc946d955-4nhcr   2/2     Running     0          4m20s
elasticsearch-cdm-my5mbcvj-2-55457749f5-m6gr7   2/2     Running     0          3m19s
elasticsearch-cdm-my5mbcvj-3-598c6d6bc9-c56tb   2/2     Running     0          2m17s

time="2019-09-16T09:24:42Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:06Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:07Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:17Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-my5mbcvj-3: yellow / green"
time="2019-09-16T09:25:38Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:39Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:42Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:43Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:44Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:26:16Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:27:29Z" level=info msg="Timed out waiting for elasticsearch-cdm-my5mbcvj-3 to rejoin cluster"
time="2019-09-16T09:27:29Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Node elasticsearch-cdm-my5mbcvj-3 has not rejoined cluster elasticsearch yet"
time="2019-09-16T09:28:00Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:02Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:04Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922