Bug 1743194
| Summary: | The Elasticsearch deployments disappear after cluster upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.z | CC: | aos-bugs, cblecker, christoph.obexer, ewolinet, haowang, jcantril, rmeggins, wgordon |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1751320 (view as bug list) | Environment: | |
| Last Closed: | 2019-10-16 06:36:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1751320 | | |

Description
Anping Li
2019-08-19 10:13:16 UTC
Removing the regression keyword, as it cannot be reproduced. Hit this issue again after the cluster was upgraded from v4.1.11 to v4.1.13.
1) There are no ES deployments:
oc get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
cluster-logging-operator 1/1 1 1 91m
kibana 1/1 1 1 23m
2) The elasticsearch scheduledRedeploy is "True" for every node; the reported reason is 'ElasticsearchContainerWaiting'.
$ oc get elasticsearch elasticsearch -o json |jq '.status'
{
"clusterHealth": "cluster health unknown",
"conditions": [],
"nodes": [
{
"conditions": [
{
"lastTransitionTime": "2019-08-26T08:40:40Z",
"reason": "ContainerCreating",
"status": "True",
"type": "ElasticsearchContainerWaiting"
},
{
"lastTransitionTime": "2019-08-26T08:40:40Z",
"reason": "ContainerCreating",
"status": "True",
"type": "ProxyContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-g82ncdqr-1",
"upgradeStatus": {
"scheduledRedeploy": "True"
}
},
{
"conditions": [
{
"lastTransitionTime": "2019-08-26T08:40:59Z",
"reason": "Error",
"status": "True",
"type": "ElasticsearchContainerTerminated"
},
{
"lastTransitionTime": "2019-08-26T08:40:59Z",
"reason": "Error",
"status": "True",
"type": "ProxyContainerTerminated"
}
],
"deploymentName": "elasticsearch-cdm-g82ncdqr-2",
"upgradeStatus": {
"scheduledRedeploy": "True"
}
},
{
"conditions": [
{
"lastTransitionTime": "2019-08-26T08:40:40Z",
"reason": "ContainerCreating",
"status": "True",
"type": "ElasticsearchContainerWaiting"
},
{
"lastTransitionTime": "2019-08-26T08:40:40Z",
"reason": "ContainerCreating",
"status": "True",
"type": "ProxyContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-g82ncdqr-3",
"upgradeStatus": {
"scheduledRedeploy": "True"
}
}
],
"pods": {
"client": {
"failed": [],
"notReady": [],
"ready": []
},
"data": {
"failed": [],
"notReady": [],
"ready": []
},
"master": {
"failed": [],
"notReady": [],
"ready": []
}
},
"shardAllocationEnabled": "shard allocation unknown"
}
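For quick triage, the node conditions in a status like the one above can be summarized with a short script. This is a minimal sketch (the helper name is mine, not part of any operator) run against an abbreviated copy of the status:

```python
def nodes_pending_redeploy(status):
    """Return (deploymentName, [true condition types]) for every node
    whose upgradeStatus.scheduledRedeploy is "True"."""
    pending = []
    for node in status.get("nodes", []):
        if node.get("upgradeStatus", {}).get("scheduledRedeploy") == "True":
            reasons = [c["type"] for c in node.get("conditions", [])
                       if c.get("status") == "True"]
            pending.append((node["deploymentName"], reasons))
    return pending

# Abbreviated sample mirroring the status shown above:
status = {
    "nodes": [
        {"deploymentName": "elasticsearch-cdm-g82ncdqr-1",
         "conditions": [{"type": "ElasticsearchContainerWaiting", "status": "True"}],
         "upgradeStatus": {"scheduledRedeploy": "True"}},
        {"deploymentName": "elasticsearch-cdm-g82ncdqr-2",
         "conditions": [{"type": "ElasticsearchContainerTerminated", "status": "True"}],
         "upgradeStatus": {"scheduledRedeploy": "True"}},
    ],
}
for name, reasons in nodes_pending_redeploy(status):
    print(name, reasons)
```

In this bug all three nodes are scheduled for redeploy, but no deployment objects exist for the operator to roll.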
3) The elasticsearch-operator logs show it waiting for the cluster to reach green:
[anli@preserve-anli-slave u41]$ oc logs elasticsearch-operator-6656d85bc6-s2xft
time="2019-08-26T08:49:44Z" level=info msg="Go Version: go1.10.8"
time="2019-08-26T08:49:44Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-08-26T08:49:44Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-08-26T08:49:44Z" level=info msg="Watching logging.openshift.io/v1, Elasticsearch, , 5000000000"
time="2019-08-26T08:49:53Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-1: / green"
time="2019-08-26T08:50:16Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-2: / green"
time="2019-08-26T08:50:19Z" level=info msg="Waiting for cluster to be fully recovered before restarting elasticsearch-cdm-g82ncdqr-3: / green"
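When triaging a flood of these messages, the operator's logrus-formatted lines can be split mechanically into time/level/message fields. A minimal sketch (the regex and helper are illustrative, not operator code):

```python
import re

# Matches lines of the form: time="..." level=info msg="..."
LOG_RE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>[^"]*)"')

def parse_line(line):
    """Return a dict with time/level/msg, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

rec = parse_line('time="2019-08-26T08:49:53Z" level=info '
                 'msg="Waiting for cluster to be fully recovered before '
                 'restarting elasticsearch-cdm-g82ncdqr-1: / green"')
print(rec["time"], rec["level"])
```

Note the health field in the messages above is empty (": / green"), consistent with the "cluster health unknown" status: the operator cannot query a cluster whose deployments do not exist.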
Questions:
1) Why were the elasticsearch deployment resources deleted during the cluster upgrade?
2) Why couldn't the cluster-logging-operator recreate the elasticsearch deployment resources?
The bug was reported on the shared cluster that the whole OpenShift QE team works on, but I couldn't reproduce it in my private clusters. Could it be caused by a resource limitation?

Anping, you said that the elasticsearch deployments were deleted; were the other component deployments deleted as well? Did the Elasticsearch CR get deleted? Did the ClusterLogging CR get deleted?

No, only the elasticsearch deployments were deleted.

Trying to get this into 4.2 before close.

Here is a workaround:
1. Scale down the elasticsearch operator:
   $ oc scale deployment/elasticsearch-operator --replicas=0 -n openshift-operators
2. Remove the "status" part of the elasticsearch CR:
   $ oc edit elasticsearch -n openshift-logging
3. Scale the elasticsearch operator back up:
   $ oc scale deployment/elasticsearch-operator --replicas=1 -n openshift-operators

I deleted elasticsearch manually during 'Waiting for cluster to complete recovery'; after a while, the deployments were recreated, so we can say the fix works. As we didn't figure out what/who caused the ES deployments to disappear, please feel free to reopen.
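Step 2 of the workaround amounts to deleting the CR's status stanza before the operator is scaled back up, which clears the stale scheduledRedeploy bookkeeping. A minimal illustration of that edit on a CR object as returned by `oc get elasticsearch -o json` (the helper is hypothetical; in practice the edit is done by hand in `oc edit`):

```python
import copy

def clear_status(cr):
    """Return a copy of the CR with its 'status' stanza removed,
    mimicking what the manual `oc edit` in step 2 accomplishes."""
    stripped = copy.deepcopy(cr)
    stripped.pop("status", None)
    return stripped

cr = {"apiVersion": "logging.openshift.io/v1", "kind": "Elasticsearch",
      "metadata": {"name": "elasticsearch"},
      "status": {"clusterHealth": "cluster health unknown"}}
print(sorted(clear_status(cr)))
```

With the status gone, the restarted operator reconciles from scratch instead of waiting on the recorded (and now impossible) green-health condition.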
[anli@preserve-anli-slave 42]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-77b67d8b87-zsg85       1/1     Running     0          31m
curator-1568627400-b6d66                        0/1     Completed   0          3m14s
elasticsearch-cd-pjyicydj-1-d5f967777-rcx6d     1/2     Running     0          7m51s
elasticsearch-cdm-my5mbcvj-1-5bc946d955-4nhcr   2/2     Running     0          4m20s
elasticsearch-cdm-my5mbcvj-2-55457749f5-m6gr7   2/2     Running     0          3m19s
elasticsearch-cdm-my5mbcvj-3-598c6d6bc9-c56tb   2/2     Running     0          2m17s

time="2019-09-16T09:24:42Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:05Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:06Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:07Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:13Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:17Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:38Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-my5mbcvj-3: yellow / green"
time="2019-09-16T09:25:38Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Cluster not in green state before beginning upgrade: yellow"
time="2019-09-16T09:25:39Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:40Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:42Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:43Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:25:44Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:26:16Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 9 shards in preparation for cluster restart"
time="2019-09-16T09:27:29Z" level=info msg="Timed out waiting for elasticsearch-cdm-my5mbcvj-3 to rejoin cluster"
time="2019-09-16T09:27:29Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-my5mbcvj-3: Node elasticsearch-cdm-my5mbcvj-3 has not rejoined cluster elasticsearch yet"
time="2019-09-16T09:28:00Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:01Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:02Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"
time="2019-09-16T09:28:04Z" level=info msg="Waiting for cluster to complete recovery: yellow / green"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922