Bug 1753568
| Summary: | transient.cluster.routing.allocation.enable is none sometimes after cluster upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Anping Li <anli> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.2.0 | CC: | aos-bugs, ewolinet, jcantril, rmeggins |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1761080, 1789261 (view as bug list) | Environment: | |
| Last Closed: | 2020-01-23 11:06:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1761080, 1789261 | | |
| Attachments: | | | |
Description
Anping Li
2019-09-19 10:28:10 UTC
Created attachment 1616669 [details]
The Shards and nodes status
Are you able to confirm whether the PVC and PV associated with the pods before the upgrade are the same as those afterwards? Maybe investigate the UIDs associated with each resource.

From the Elasticsearch deployment, the PVCs are assigned correctly. I think the pods are associated with the correct volumes too.

```
[anli@preserve-anli-slave s42]$ oc get pvc
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-ubj1bxgq-1   Bound    pvc-4d17966a-d9e0-11e9-b1bd-02ce9ff33326   19Gi       RWO            gp2            32h
elasticsearch-elasticsearch-cdm-ubj1bxgq-2   Bound    pvc-4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326   19Gi       RWO            gp2            32h
elasticsearch-elasticsearch-cdm-ubj1bxgq-3   Bound    pvc-4d5434e6-d9e0-11e9-b1bd-02ce9ff33326   19Gi       RWO            gp2            32h

$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-1
4d17966a-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d17966a-d9e0-11e9-b1bd-02ce9ff33326
4ed02a8f-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-2
4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326
538c4bc8-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pvc -o jsonpath={.metadata.uid} elasticsearch-elasticsearch-cdm-ubj1bxgq-3
4d5434e6-d9e0-11e9-b1bd-02ce9ff33326
$ oc get pv -o jsonpath={.metadata.uid} pvc-4d5434e6-d9e0-11e9-b1bd-02ce9ff33326
573c61d2-d9e0-11e9-b1bd-02ce9ff33326
```

Hit a similar issue again on a fresh cluster. Neither the cluster nor ES was updated (the ages of elasticsearch-cdm-**** and the cluster-logging-operator are almost the same). There are three nodes in the ES cluster. All started shards are on the same node (elasticsearch-cdm-f5d0ea4e-3-874467b4d-rb2bs).
No shards are on the other nodes (elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l and elasticsearch-cdm-f5d0ea4e-2-7fb9468597-czh6t).

A fragment of the unassigned shards:

```
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 2 r UNASSIGNED NODE_LEFT
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 1 r UNASSIGNED NODE_LEFT
project.ezv5d.f0fc7e14-e292-11e9-ba88-42010a000002.2019.09.29 0 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 2 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 1 r UNASSIGNED NODE_LEFT
project.egawc.bde510c4-e299-11e9-a54a-42010a000005.2019.09.29 0 r UNASSIGNED NODE_LEFT
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 1 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 1 r UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 2 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 2 r UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 0 p UNASSIGNED INDEX_CREATED
project.h4.7695630e-e32b-11e9-bd3d-42010a000003.2019.09.30 0 r UNASSIGNED INDEX_CREATED
```

```
[anli@preserve-anli-slave 42]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-55f9cfb648-fcjwd       1/1     Running     0          15h
curator-1569815400-8b25k                        0/1     Completed   0          3m3s
elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l   2/2     Running     0          15h
elasticsearch-cdm-f5d0ea4e-2-7fb9468597-czh6t   2/2     Running     0          14h
elasticsearch-cdm-f5d0ea4e-3-874467b4d-rb2bs    2/2     Running     0          15h
fluentd-dzml6                                   1/1     Running     0          3h50m
```

The persistent volumes are mounted. The volume type is gce-pd and I can write to it. See the attachment for all files under /elasticsearch/persistent.

```
sh-4.2$ ls -lh /elasticsearch/persistent
total 26M
drwxrwsr-x. 4 1008180000 1008180000 4.0K Sep 29 06:51 elasticsearch
-rw-r--r--. 1 1008180000 1008180000  26M Sep 30 04:12 elasticsearch.tar
drwxrws---. 2 root       1008180000  16K Sep 29 06:50 lost+found
```

```
$ oc get pvc elasticsearch-elasticsearch-cdm-f5d0ea4e-1 -o yaml
```

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    volume.kubernetes.io/selected-node: juzhao-8knd7-w-f-wqkwn.c.openshift-qe.internal
  creationTimestamp: "2019-09-29T06:50:35Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: elasticsearch-elasticsearch-cdm-f5d0ea4e-1
  namespace: openshift-logging
  resourceVersion: "269481"
  selfLink: /api/v1/namespaces/openshift-logging/persistentvolumeclaims/elasticsearch-elasticsearch-cdm-f5d0ea4e-1
  uid: 7044b100-e285-11e9-b4cd-42010a000003
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20G
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-7044b100-e285-11e9-b4cd-42010a000003
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 19Gi
  phase: Bound
```

Created attachment 1621136 [details]
Elasticsearch persistent files on the unassigned node
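The binding check above can be expressed mechanically. The sketch below is illustrative only (not part of the report): it verifies that each claim's bound volume name embeds the claim's own UID, which is how dynamically provisioned PVs are named (`pvc-<claim UID>`). The sample data is copied from the `oc get pvc` output above.

```python
# Sketch of the PVC/PV binding check described above, run against sample
# objects shaped like `oc get pvc -o json` output. UIDs and volume names
# are copied from the report; on a live cluster they would come from the API.
pvcs = {
    "elasticsearch-elasticsearch-cdm-ubj1bxgq-1": {
        "uid": "4d17966a-d9e0-11e9-b1bd-02ce9ff33326",
        "volumeName": "pvc-4d17966a-d9e0-11e9-b1bd-02ce9ff33326",
    },
    "elasticsearch-elasticsearch-cdm-ubj1bxgq-2": {
        "uid": "4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326",
        "volumeName": "pvc-4d35b9b9-d9e0-11e9-b1bd-02ce9ff33326",
    },
    "elasticsearch-elasticsearch-cdm-ubj1bxgq-3": {
        "uid": "4d5434e6-d9e0-11e9-b1bd-02ce9ff33326",
        "volumeName": "pvc-4d5434e6-d9e0-11e9-b1bd-02ce9ff33326",
    },
}

def bound_correctly(pvc):
    # A dynamically provisioned PV is named "pvc-<UID of the claim>", so a
    # correctly bound claim's volumeName must embed the claim's own UID.
    return pvc["volumeName"] == "pvc-" + pvc["uid"]

mismatches = [name for name, pvc in pvcs.items() if not bound_correctly(pvc)]
```

For the claims above `mismatches` is empty, consistent with the reporter's conclusion that the PVCs and PVs were matched correctly and storage binding was not the cause.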
The root cause is transient.cluster.routing.allocation.enable=none. It seems the elasticsearch-operator forgets to re-enable allocation in some cases. Turning allocation back on fixes this bug.

```
+ oc exec -c elasticsearch elasticsearch-cdm-f5d0ea4e-1-7595949bc8-ps89l -- es_util --query=_cluster/settings
+ python -m json.tool
```

```json
{
    "persistent": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "all"
                }
            }
        },
        "discovery": {
            "zen": {
                "minimum_master_nodes": "2"
            }
        }
    },
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "none"
                }
            }
        }
    }
}
```

Can you provide the elasticsearch-operator logs as well? Anytime it does a restart it will set allocation to "none", but it should set it back to "all" when it's done.

The cluster is shut down. I will provide the logs the next time I hit it.

It appears again, so I am promoting the Severity to Urgent. The elasticsearch-operator logs are updated, but I don't see any evidence in the bug.

Note: only the cluster was upgraded to 4.2; logging stays at 4.1. The Elasticsearch pods were evicted during the cluster upgrade.
Logging: v4.1.20-201910102034
Cluster upgrade: 4.1.18 -> 4.2.rc.5

Created attachment 1624949 [details]
Shard_Unassigned Logs
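Given the settings dump above, the stuck state can be detected and cleared programmatically. This is only a sketch of the workaround logic, not the operator's implementation: `allocation_disabled` walks a parsed `_cluster/settings` response, and `REENABLE_BODY` is the request body one would PUT back to `_cluster/settings` (for example via `es_util` or an authenticated curl from inside an ES pod) to turn allocation back on.

```python
import json

def allocation_disabled(settings):
    """True if the transient shard-allocation setting in a parsed
    _cluster/settings response is 'none' (the stuck state in this bug)."""
    enable = (settings.get("transient", {})
                      .get("cluster", {})
                      .get("routing", {})
                      .get("allocation", {})
                      .get("enable"))
    return enable == "none"

# Request body to re-enable allocation; PUT this to _cluster/settings.
# Elasticsearch accepts the flattened key form in a settings update.
REENABLE_BODY = json.dumps(
    {"transient": {"cluster.routing.allocation.enable": "all"}}
)

# Sample response mirroring the dump above: persistent allocation is "all",
# but the transient override is "none", so no shard can be allocated.
sample = {
    "persistent": {
        "cluster": {"routing": {"allocation": {"enable": "all"}}},
        "discovery": {"zen": {"minimum_master_nodes": "2"}},
    },
    "transient": {
        "cluster": {"routing": {"allocation": {"enable": "none"}}},
    },
}
```

Transient settings override persistent ones, which is why the cluster stays stuck even though the persistent value is `all`.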
Turning down to high, as the workaround is easy and there is no data loss. This bug has to be added to the 4.2 release notes if we can't fix it in 4.1.

(In reply to Anping Li from comment #13)
> It appears again, so I am promoting the Severity to Urgent. The
> elasticsearch-operator logs are updated, but I don't see any evidence in
> the bug.
>
> Note: only the cluster was upgraded to 4.2; logging stays at 4.1. The
> Elasticsearch pods were evicted during the cluster upgrade.
> Logging: v4.1.20-201910102034
> Cluster upgrade: 4.1.18 -> 4.2.rc.5

This information, along with the discovery that 'transient.cluster.routing.allocation.enable=none' is what is keeping the ES cluster from reforming, is concerning to me. Neither the CLO nor the EO has any understanding of when the underlying cluster is being upgraded and nodes are going away. Neither has any proactive way of executing a setting change so that shards are not reallocated. This means something else in the environment has modified the allocation setting of Elasticsearch.

That is mysterious. I didn't change transient.cluster.routing.allocation.enable to none manually, and neither did the CLO or EO. Are there any other components that can modify ES? Curator? ES itself?

(In reply to Anping Li from comment #17)
> That is mysterious. I didn't change
> transient.cluster.routing.allocation.enable to none manually, and neither
> did the CLO or EO. Are there any other components that can modify ES?
> Curator? ES itself?

There are no other components that perform this task, and we don't have any form of audit logging to know who might be executing these queries. I can only imagine this scenario if someone tried to upgrade clusterlogging, which has a feature to modify the allocation.

Turning down the Severity, as we think someone changed it. I will add more detail if I hit it again. If we don't see it within one month, I will close it.

@anping I am closing this issue for now to remove it from the 4.3 list. Please reopen if you see it again.

It appears again.
This time it happened on a logging upgrade rather than a cluster upgrade.

```
+ oc exec -c elasticsearch elasticsearch-cdm-xkenfw5r-1-85bb58879c-94l6m -- es_util --query=_cat/shards
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 p STARTED    32 171.7kb 10.128.2.84  elasticsearch-cdm-xkenfw5r-2
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 p STARTED    30 151.4kb 10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 p STARTED    37 57.2kb  10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 r UNASSIGNED
```

The pod and operator logs are attached.

Created attachment 1633609 [details]
Operation logs
Created attachment 1633610 [details]
The operator pod logs
Created attachment 1638099 [details]
The elasticsearch resource
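Listings like the `_cat/shards` output above can be summarized mechanically. The sketch below is an illustrative helper (not part of the report or the operator) that counts UNASSIGNED shards per index; the sample rows are copied from the listing above, where the column order is index, shard, prirep (p/r), state, then optional size and node fields for assigned shards.

```python
from collections import Counter

# Sample `_cat/shards` rows copied from the report.
CAT_SHARDS = """\
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 p STARTED 32 171.7kb 10.128.2.84 elasticsearch-cdm-xkenfw5r-2
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 1 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 p STARTED 30 151.4kb 10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 2 r UNASSIGNED
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 p STARTED 37 57.2kb 10.130.2.129 elasticsearch-cdm-xkenfw5r-1
project.devexp-jenkins-gitlab.7a026ab0-0125-11ea-a85c-42010a000004.2019.11.07 0 r UNASSIGNED
"""

def unassigned_per_index(cat_shards):
    # Field 0 is the index name, field 3 is the shard state; unassigned
    # rows have no size/ip/node columns, so only check the first four.
    counts = Counter()
    for line in cat_shards.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            counts[fields[0]] += 1
    return counts
```

Run against the sample above, this reports 3 unassigned shards for the index, matching the pattern in the report: every primary STARTED, every replica UNASSIGNED, which is exactly what a cluster with allocation disabled looks like after a node restart.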
Appears again: OCP 4.2.7 -> OCP 4.3, and then logging 4.2.8 -> 4.3.0-201911161914.
No such issue during testing in the last week, so moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062