Bug 1753568
| Summary: | transient.cluster.routing.allocation.enable is none sometimes after cluster upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.2.0 | CC: | aos-bugs, ewolinet, jcantril, rmeggins |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1761080 1789261 (view as bug list) | Environment: | |
| Last Closed: | 2020-01-23 11:06:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1761080, 1789261 | | |
| Attachments: | | | |
Description
Anping Li
2019-09-19 10:28:10 UTC
Created attachment 1616669 [details]
The Shards and nodes status
Are you able to confirm whether the PVCs and PVs associated with the pods before the upgrade are the same as those afterwards? Maybe investigate the UIDs associated with each resource.

From the elasticsearch deployment, the PVCs are assigned correctly. I think the pods have been associated with the correct volumes too.

[anli@preserve-anli-slave s42]$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
elasticsearch-elasticsearch-cdm-ubj1bxgq-1 Bound pvc-4d17966a-d9e0-11e9-b1bd-02ce9ff33326 19Gi RWO gp2 32h
elasticsearch-elasticsearch-cdm-ubj1bxgq-2 Bound pvc-4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326 19Gi RWO gp2 32h
elasticsearch-elasticsearch-cdm-ubj1bxgq-3 Bound pvc-4d5434e6-d9e0-11e9-b1bd-02ce9ff33326 19Gi RWO gp2 32h
$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-1
4d17966a-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d17966a-d9e0-11e9-b1bd-02ce9ff33326
4ed02a8f-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-2
4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326
538c4bc8-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-3
4d5434e6-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d5434e6-d9e0-11e9-b1bd-02ce9ff33326
573c61d2-d9e0-11e9-b1bd-02ce9ff33326
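A hedged cross-check (a sketch assuming the openshift-logging namespace is the current project; the loop and variable names are illustrative): compare each PVC UID with the claimRef UID recorded on its bound PV, which should match if the bindings survived the upgrade.
# Sketch: print PVC UID, bound PV name, and the PV's claimRef UID for every ES PVC.
for pvc in $(oc get pvc -o name | grep elasticsearch); do
  vol=$(oc get "$pvc" -o jsonpath='{.spec.volumeName}')
  echo "$pvc uid=$(oc get "$pvc" -o jsonpath='{.metadata.uid}') pv=$vol claimRef.uid=$(oc get pv "$vol" -o jsonpath='{.spec.claimRef.uid}')"
done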
Hit a similar issue again on a fresh cluster. Neither the cluster nor ES was updated (the ages of the elasticsearch-cdm-**** pods and the cluster-logging-operator pod are almost the same).
There are three nodes in the ES cluster. All started shards are on the same node (elasticsearch-cdm-f5d0ea4e-3-874467b4d-rb2bs). There are no shards on the other two nodes (elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l and elasticsearch-cdm-f5d0ea4e-2-7fb9468597-czh6t).
# A fragment of the unassigned shards
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 2 r UNASSIGNED NODE_LEFT
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 1 r UNASSIGNED NODE_LEFT
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 0 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 2 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 1 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 0 r UNASSIGNED NODE_LEFT
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 1 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 1 r UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 2 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 2 r UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 0 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 0 r UNASSIGNED INDEX_CREATED
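To see why a particular shard stays unassigned, the allocation explain API can be queried through the same es_util wrapper used elsewhere in this bug (a hedged diagnostic sketch; with no request body it explains the first unassigned shard it finds):
oc exec -c elasticsearch elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l -- es_util --query=_cluster/allocation/explain | python -m json.tool
If allocation has been disabled by a cluster setting, the deciders in the response should say so explicitly.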
[anli@preserve-anli-slave 42]$ oc get pods
NAME READY STATUS RESTARTS AGE
cluster-logging-operator-55f9cfb648-fcjwd 1/1 Running 0 15h
curator-1569815400-8b25k 0/1 Completed 0 3m3s
elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l 2/2 Running 0 15h
elasticsearch-cdm-f5d0ea4e-2-7fb9468597-czh6t 2/2 Running 0 14h
elasticsearch-cdm-f5d0ea4e-3-874467b4d-rb2bs 2/2 Running 0 15h
fluentd-dzml6 1/1 Running 0 3h50m
The persistent volumes are mounted. The volume type is gce-pd, and I can write to it. See the attachment for all files under /elasticsearch/persistent.
sh-4.2$ ls -lh /elasticsearch/persistent
total 26M
drwxrwsr-x. 4 1008180000 1008180000 4.0K Sep 29 06:51 elasticsearch
-rw-r--r--. 1 1008180000 1008180000 26M Sep 30 04:12 elasticsearch.tar
drwxrws---. 2 root 1008180000 16K Sep 29 06:50 lost+found
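A hedged sketch of the write test mentioned above (the pod and file names are illustrative):
oc exec -c elasticsearch elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l -- sh -c 'touch /elasticsearch/persistent/.write-test && rm /elasticsearch/persistent/.write-test && echo writable'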
$ oc get pvc elasticsearch-elasticsearch-cdm-f5d0ea4e-1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    volume.kubernetes.io/selected-node: juzhao-8knd7-w-f-wqkwn.c.openshift-qe.internal
  creationTimestamp: "2019-09-29T06:50:35Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: elasticsearch-elasticsearch-cdm-f5d0ea4e-1
  namespace: openshift-logging
  resourceVersion: "269481"
  selfLink: /api/v1/namespaces/openshift-logging/persistentvolumeclaims/elasticsearch-elasticsearch-cdm-f5d0ea4e-1
  uid: 7044b100-e285-11e9-b4cd-42010a000003
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20G
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-7044b100-e285-11e9-b4cd-42010a000003
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 19Gi
  phase: Bound
Created attachment 1621136 [details]
Elasticsearch persistent files on the unassigned node
The root cause is transient.cluster.routing.allocation.enable=none. It seems the elasticsearch-operator forgets to re-enable allocation in some cases. Turning allocation back on fixes this bug.
+ oc exec -c elasticsearch elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l -- es_util --query=_cluster/settings
+ python -m json.tool
{
    "persistent": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "all"
                }
            }
        },
        "discovery": {
            "zen": {
                "minimum_master_nodes": "2"
            }
        }
    },
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "none"
                }
            }
        }
    }
}
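The workaround, re-enabling allocation, can go through the same wrapper (a hedged sketch assuming es_util forwards extra curl arguments such as -XPUT and -d to the request):
oc exec -c elasticsearch elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l -- es_util --query=_cluster/settings -XPUT -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
Afterwards, re-running the settings query above should show the transient enable value back at "all", and _cat/shards can be used to watch the replicas recover.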
Can you provide the elasticsearch operator logs as well? Any time it does a restart it will set allocation to "none", but it should set it back to "all" when it's done.

The cluster is shut down. I will provide the logs the next time I hit this.

It appears again, so I am promoting the Severity to Urgent. The elasticsearch-operator logs have been updated, but I don't see any clear evidence of the cause in the bug.

Note: only the cluster was upgraded to 4.2; logging stayed on 4.1. The Elasticsearch pods were evicted during the cluster upgrade.
Logging: v4.1.20-201910102034
Cluster upgrade: 4.1.18 -> 4.2.rc.5

Created attachment 1624949 [details]
Shard_Unassinged Logs
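For pulling the operator logs requested above, a hedged sketch (assuming the elasticsearch-operator runs in the openshift-operators-redhat namespace, which can differ per install; the grep filter is only illustrative):
oc logs -n openshift-operators-redhat deployment/elasticsearch-operator | grep -i allocation
Any lines mentioning the allocation setting around the restart window would show whether the operator disabled it and failed to turn it back on.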
Turning this down to high, as the workaround is easy and there is no data loss. This bug has to be added to the 4.2 release notes if we can't fix it in 4.1.

(In reply to Anping Li from comment #13)
> It appears again, so I am promoting the Severity to Urgent. The elasticsearch-operator logs have been updated, but I don't see any clear evidence of the cause in the bug.
>
> Note: only the cluster was upgraded to 4.2; logging stayed on 4.1. The Elasticsearch pods were evicted during the cluster upgrade.
> Logging: v4.1.20-201910102034
> Cluster upgrade: 4.1.18 -> 4.2.rc.5

This information, along with the discovery that 'transient.cluster.routing.allocation.enable=none' is what is keeping the ES cluster from reforming, is concerning to me. Neither the CLO nor the EO has any understanding of when the underlying cluster is being upgraded and nodes are going away. Neither has any proactive way of executing a setting change so that shards are not reallocated. This means something else in the environment has modified the allocation setting of Elasticsearch.

That is mysterious. I didn't change transient.cluster.routing.allocation.enable to none manually, and neither did the CLO nor the EO. Are there any other components that can modify ES? Curator? ES itself?

(In reply to Anping Li from comment #17)
> That is mysterious. I didn't change transient.cluster.routing.allocation.enable to none manually, and neither did the CLO nor the EO. Are there any other components that can modify ES? Curator? ES itself?

There are no other components that perform this task, and we don't have any form of audit logging to know who might be executing these queries. I can only imagine this scenario if someone tried to upgrade clusterlogging, which has a feature to modify the allocation.

Turning down the Severity, as we think someone changed it. I will add more detail if I hit it again. If we don't hit it within one month, I will close it.

@anping I am closing this issue for now to remove it from the 4.3 list. Please reopen if you see it again.

It appears again. This time it happened on a logging upgrade rather than a cluster upgrade.

+ oc exec -c elasticsearch elasticsearch-cdm-xkenfw5r-1-85bb58879c-94l6m -- es_util --query=_cat/shards
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 p STARTED    32 171.7kb 10.128.2.84  elasticsearch-cdm-xkenfw5r-2
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 p STARTED    30 151.4kb 10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 p STARTED    37 57.2kb  10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 r UNASSIGNED

The pod and operation logs are attached.

Created attachment 1633609 [details]
Operation logs
Created attachment 1633610 [details]
The operator pod logs
Created attachment 1638099 [details]
The elasticsearch resource
It appears again: OCP 4.2.7 -> OCP 4.3, and then logging 4.2.8 -> 4.3.0-201911161914.
No such issue during testing in the last week, so moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062