+++ This bug was initially created as a clone of Bug #1743194 +++

Description of problem:
The Elasticsearch deployment resources disappear during a cluster upgrade. The elasticsearch custom resource is still present, but the elasticsearch-operator did not recreate the deployments from it.

Version-Release number of selected component (if applicable):
ocp 4.1.11 -> ocp 4.1.12
registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

How reproducible:
One time. I am trying to reproduce it.

Steps to Reproduce:
1. Deploy logging on v4.1.11.
2. Upgrade the cluster to v4.1.12 (note: logging was not upgraded, only the cluster was upgraded).
3. Check the cluster logging status.

$ oc get deployment -n openshift-logging
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           92m
kibana                     1/1     1            1           79m

$ oc get elasticsearch -n openshift-logging
NAME            AGE
elasticsearch   80m

$ oc logs elasticsearch-operator-6656d85bc6-4fjgq -n
time="2019-08-19T08:19:18Z" level=info msg="Go Version: go1.10.8"
time="2019-08-19T08:19:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-19T08:19:18Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-19T08:19:18Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
E0819 08:19:18.586684 1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
E0819 08:20:18.639105 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.643926 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:20:18.670411 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: red / green"
time="2019-08-19T08:21:24Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: red / green"
time="2019-08-19T08:21:25Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: red / green"
time="2019-08-19T08:21:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:21:57Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:22:00Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
time="2019-08-19T08:22:35Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:22:38Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:22:48Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
time="2019-08-19T08:23:13Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:23:26Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
E0819 08:23:48.643256 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:50.500312 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:53.573652 1 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:56.643711 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:23:59.715101 1 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.786923 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:02.790516 1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:10Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"
E0819 08:24:21.219505 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:24.291712 1 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:27.363955 1 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:30.434815 1 memcache.go:147] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:33.506773 1 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E0819 08:24:36.578690 1 memcache.go:147] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2019-08-19T08:24:46Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-1: / green"
time="2019-08-19T08:24:49Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-2: / green"
time="2019-08-19T08:24:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-p1ds3ygh-3: / green"

Actual results:
There is no Elasticsearch instance running; all Elasticsearch deployments have been deleted. The elasticsearch-operator did not recreate the Elasticsearch deployments from the existing elasticsearch custom resource.

Expected results:
Cluster logging works well after the upgrade.

--- Additional comment from Anping Li on 2019-08-19 12:09:04 UTC ---

Removing the regression keyword, as it cannot be reproduced.

--- Additional comment from Anping Li on 2019-08-26 09:13:09 UTC ---

Hit this issue again after the cluster was upgraded from v4.1.11 to 4.1.13.

1) No ES deployments:

oc get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator   1/1     1            1           91m
kibana                     1/1     1            1           23m

2) The elasticsearch status shows scheduledRedeploy: "True" for every node. The reason is 'ElasticsearchContainerWaiting'.
$ oc get elasticsearch elasticsearch -o json | jq '.status'
{
  "clusterHealth": "cluster health unknown",
  "conditions": [],
  "nodes": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-1",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ElasticsearchContainerTerminated"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:59Z",
          "reason": "Error",
          "status": "True",
          "type": "ProxyContainerTerminated"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-2",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ElasticsearchContainerWaiting"
        },
        {
          "lastTransitionTime": "2019-08-26T08:40:40Z",
          "reason": "ContainerCreating",
          "status": "True",
          "type": "ProxyContainerWaiting"
        }
      ],
      "deploymentName": "elasticsearch-cdm-g82ncdqr-3",
      "upgradeStatus": {
        "scheduledRedeploy": "True"
      }
    }
  ],
  "pods": {
    "client": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "data": {
      "failed": [],
      "notReady": [],
      "ready": []
    },
    "master": {
      "failed": [],
      "notReady": [],
      "ready": []
    }
  },
  "shardAllocationEnabled": "shard allocation unknown"
}

3) The elasticsearch-operator keeps printing "waiting for green" messages:

[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1: / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2: / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3: / green"

--- Additional comment from Anping Li on 2019-08-26 09:16:38 UTC ---

Questions (a hedged investigation sketch follows after the next few comments):
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?

--- Additional comment from Anping Li on 2019-08-26 11:00:44 UTC ---

Share the cluster:
oc login --token=2mZfsNn0NnTjUdWkRcvWCRi1vbir_wlo-UaGO7qdgtI --server=https://api.qe-411413-upg.qe.devcluster.openshift.com:6443

--- Additional comment from Anping Li on 2019-08-26 13:46:03 UTC ---

The bug was reported on the shared cluster on which the whole OpenShift QE team works, but I couldn't reproduce it in my private clusters. Could it be caused by resource limitations?

--- Additional comment from Anping Li on 2019-08-26 14:00:54 UTC ---

It was reported on the same cluster as https://bugzilla.redhat.com/show_bug.cgi?id=1745509, and that bug couldn't be reproduced on my private clusters either.
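A hedged investigation sketch for the two questions above. The resource and namespace names are taken from the outputs earlier in this report; the checks themselves are generic and are not a confirmed root-cause procedure:

# Which nodes does the operator still think it has to redeploy?
$ oc get elasticsearch elasticsearch -n openshift-logging -o json \
    | jq '.status.nodes[] | {deploymentName, upgradeStatus}'

# Were the deployments garbage-collected via an ownerReference? Any surviving
# logging deployment shows who owns it.
$ oc get deployment -n openshift-logging -o json \
    | jq '.items[] | {name: .metadata.name, owners: .metadata.ownerReferences}'

# Recent events sometimes record who deleted the deployments or why their
# recreation failed.
$ oc get events -n openshift-logging --sort-by=.lastTimestamp | tail -n 30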
--- Additional comment from on 2019-09-09 21:24:21 UTC ---

Anping, you said that the elasticsearch deployments were deleted; were the other component deployments deleted as well? Did the Elasticsearch CR get deleted? Did the ClusterLogging CR get deleted?

--- Additional comment from Anping Li on 2019-09-11 02:12:01 UTC ---

No, only the elasticsearch deployments were deleted.

--- Additional comment from Wang Haoran on 2019-09-11 02:29:31 UTC ---

We met this problem on our OSD v4 customer cluster after it had been running for some time.

oc get clusterserviceversion
NAME                                        AGE
elasticsearch-operator.4.1.4-201906271212   22d

This is the log from the operator:

oc logs cluster-logging-operator-7c58ddb664-z8sqh -n openshift-logging
time="2019-09-05T22:11:27Z" level=info msg="Go Version: go1.10.8"
time="2019-09-05T22:11:27Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-09-05T22:11:27Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-09-05T22:11:27Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-09-05T22:11:35Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-09-05T22:11:38Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-09-05T22:11:44Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:03Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:12:12Z" level=info msg="Updating status of Elasticsearch"
time="2019-09-05T22:12:47Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:13:46Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana route for \"instance\": the server is currently unable to handle the request"
ERROR: logging before flag.Parse: E0905 22:13:57.544994 1 memcache.go:147] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E0905 22:13:57.567773 1 memcache.go:147] couldn't get resource list for packages.operators.coreos.com/v1: the server is currently unable to handle the request
time="2019-09-05T22:14:13Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:16:28Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:17:06Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:21:55Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:22:14Z" level=info msg="Updating status of Fluentd"
time="2019-09-05T22:25:07Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:39:58Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T10:40:17Z" level=info msg="Updating status of Fluentd"
time="2019-09-07T17:45:22Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure creating Kibana deployment for \"instance\": etcdserver: request timed out"
time="2019-09-07T17:47:56Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update logstore for \"instance\": Failure creating Elasticsearch CR: etcdserver: request timed out"
time="2019-09-08T15:18:03Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
time="2019-09-08T15:21:13Z" level=error msg="Failed to check for ClusterLogging object: the server was unable to return a response in the time allotted, but may still be processing the request"
ERROR: logging before flag.Parse: E0909 12:57:45.219640 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3772777, ErrCode=NO_ERROR, debug=""
ERROR: logging before flag.Parse: W0909 12:57:45.382760 1 reflector.go:341] github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:91: watch of *unstructured.Unstructured ended with: unexpected object: &{map[message:too old resource version: 11655411 (13168066) reason:Gone code:410 kind:Status apiVersion:v1 metadata:map[] status:Failure]}
time="2019-09-09T14:26:48Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update collection for \"instance\": Failure creating Log collector \"cluster-logging-metadata-reader\" cluster role binding: etcdserver: request timed out"
time="2019-09-09T14:27:05Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing Kibana service for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:26Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:45:47Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": etcdserver: request timed out"
time="2019-09-10T12:47:04Z" level=error msg="error syncing key (openshift-logging/instance): Unable to create or update visualization for \"instance\": Failure constructing kibana-proxy oauthclient for \"instance\": the server is currently unable to handle the request"

The PVCs are not deleted:

oc get pvc -n openshift-logging
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-93d8j7yk-1   Bound    pvc-f8ec6f93-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-2   Bound    pvc-f90aa2b4-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d
elasticsearch-elasticsearch-cdm-93d8j7yk-3   Bound    pvc-f92926d9-c2be-11e9-af83-02e2d9b4437a   200Gi      RWO            gp2            22d

We have a 600Gi quota, but I see these logs from the elasticsearch operator:

oc logs elasticsearch-operator-7c5cc4bff9-5rvps -n openshift-operators
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-1\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"
time="2019-09-10T19:30:19Z" level=error msg="Unable to create PersistentVolumeClaim: Unable to create PVC: persistentvolumeclaims \"elasticsearch-elasticsearch-cdm-93d8j7yk-2\" is forbidden: exceeded quota: logging-storage-quota, requested: requests.storage=200Gi, used: requests.storage=600Gi, limited: requests.storage=600Gi"
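A hedged way to confirm the quota accounting problem shown in the errors above: the three existing 200Gi PVCs already consume the entire 600Gi limit, so the operator's attempts to recreate PVCs that are still Bound are rejected. The quota name logging-storage-quota and the namespace come from the logs; the commands are generic checks:

# Show how much of the storage quota is already consumed and by what.
$ oc describe resourcequota logging-storage-quota -n openshift-logging

# The PVCs the operator keeps trying to create already exist and are Bound.
$ oc get pvc -n openshift-logging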
--- Additional comment from Wang Haoran on 2019-09-11 02:38:07 UTC ---

Image version: registry.redhat.io/openshift4/ose-elasticsearch-operator:v4.1.4-201906271212

--- Additional comment from Wang Haoran on 2019-09-11 14:42:49 UTC ---

oc get elasticsearch -n openshift-logging -o yaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    creationTimestamp: 2019-08-19T20:06:05Z
    generation: 32
    name: elasticsearch
    namespace: openshift-logging
    ownerReferences:
    - apiVersion: logging.openshift.io/v1
      controller: true
      kind: ClusterLogging
      name: instance
      uid: c3ab8626-c2bc-11e9-9112-0200846155f2
    resourceVersion: "10555029"
    selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
    uid: c67240c6-c2bc-11e9-af83-02e2d9b4437a
  spec:
    managementState: Managed
    nodeSpec:
      image: registry.redhat.io/openshift4/ose-logging-elasticsearch5:v4.1.4-201906271212
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      resources: {}
    nodes:
    - genUUID: 93d8j7yk
      nodeCount: 3
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 200Gi
        storageClassName: gp2
    redundancyPolicy: SingleRedundancy
  status:
    clusterHealth: cluster health unknown
    conditions: []
    nodes:
    - deploymentName: elasticsearch-cdm-93d8j7yk-1
      upgradeStatus:
        scheduledRedeploy: "True"
        underUpgrade: "True"
        upgradePhase: nodeRestarting
    - conditions:
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ElasticsearchContainerTerminated
      - lastTransitionTime: 2019-09-05T22:09:28Z
        message: The container could not be located when the pod was terminated
        reason: ContainerStatusUnknown
        status: "True"
        type: ProxyContainerTerminated
      deploymentName: elasticsearch-cdm-93d8j7yk-2
      upgradeStatus:
        scheduledRedeploy: "True"
    - deploymentName: elasticsearch-cdm-93d8j7yk-3
      upgradeStatus:
        scheduledRedeploy: "True"
    pods:
      client:
        failed: []
        notReady: []
        ready: []
      data:
        failed: []
        notReady: []
        ready: []
      master:
        failed: []
        notReady: []
        ready: []
    shardAllocationEnabled: shard allocation unknown
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

--- Additional comment from on 2019-09-11 15:30:47 UTC ---

We were able to resolve this with the following:
1. Scale down the EO.
2. Update the elasticsearch CR to remove the status section.
3. Scale the EO back up.

This worked because the EO was stuck wanting to perform an upgrade, yet there were no Elasticsearch deployments for some reason. It is unclear why they were removed, since the elasticsearch CR wasn't deleted (perhaps timing between Kubernetes and the upgrade caused the CR to be removed temporarily, which triggered garbage collection of the deployments, before the CR was later restored?).

I will open a fix to address this for the future, as well as address the repeated message about the PVC even though it already exists.

--- Additional comment from Jeff Cantrill on 2019-09-11 15:35:20 UTC ---

Trying to get this into 4.2 before close.

--- Additional comment from Wang Haoran on 2019-09-11 15:38:06 UTC ---

Here is a workaround:
1. Scale down the elasticsearch operator:
$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. Remove the "status" part of the elasticsearch CR:
$ oc edit elasticsearch -n openshift-logging
3. Scale the elasticsearch operator back up:
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators
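For scripting the workaround above, step 2 can likely be done non-interactively with a JSON patch instead of oc edit. This is a sketch and assumes the Elasticsearch CRD in this release does not serve status as a separate subresource (which appears to be the case, since editing .status through oc edit works here); otherwise the patch of .status may be ignored:

$ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
# Drop the stale status block so the operator stops waiting for a green cluster
# that no longer exists.
$ oc patch elasticsearch elasticsearch -n openshift-logging --type=json \
    -p '[{"op":"remove","path":"/status"}]'
$ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators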
The Elasticsearch deployments were recreated after I upgraded to elasticsearch-operator 4.1.19.
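A hedged set of checks to confirm the recovery described above; the openshift-operators namespace for the operator CSV is assumed from the workaround commands earlier in this report:

$ oc get clusterserviceversion -n openshift-operators | grep elasticsearch-operator
$ oc get deployment -n openshift-logging
$ oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{.status.clusterHealth}'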
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3004