Bug 1466695
| Summary: | Running the logging playbooks resets elasticsearch-storage volume for es-ops | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Johan Swensson <jswensso> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED ERRATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.0 | CC: | anli, aos-bugs, bleanhar, ewolinet, juzhao, rmeggins, tibrahim, xiazhao |
| Target Milestone: | --- | | |
| Target Release: | 3.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text:

> Cause: The storage path was not being evaluated correctly for ops clusters.
> Consequence: Rerunning the logging playbooks reset the elasticsearch-storage volume for es-ops.
> Fix: Use the information from the logging facts to determine if a volume should be reused.
> Result: Existing storage volumes are preserved when the logging playbooks are rerun.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-31 17:00:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | inventory file used for logging deployment; es dc info | | |
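The fix described in the Doc Text can be pictured as an Ansible fragment like the one below. This is an illustrative sketch only, not the actual openshift-ansible change: the fact paths and variable names (`openshift_logging_facts.elasticsearch_ops.volumes`, `elasticsearch_storage_type`) are hypothetical.

```yaml
# Hypothetical sketch of the Doc Text fix: consult the gathered logging
# facts and reuse an existing volume definition instead of falling back
# to an emptyDir default. All names here are assumptions for illustration.
- name: Look up the volumes already attached to the es-ops DC
  set_fact:
    existing_es_ops_volumes: "{{ openshift_logging_facts.elasticsearch_ops.volumes | default({}) }}"

- name: Reuse the existing elasticsearch-storage volume when one is present
  set_fact:
    elasticsearch_storage_type: >-
      {{ 'hostpath' if 'elasticsearch-storage' in existing_es_ops_volumes else 'emptydir' }}
```

The point of the design is that the playbook no longer templates the DC from inventory defaults alone; it first inspects the running cluster (the "logging facts") so a rerun cannot silently replace an attached volume.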
Description
Johan Swensson
2017-06-30 09:19:22 UTC
Triaging via Upcoming release as it's not a blocker. Not ideal, but the workaround is to reattach the volume on the dc.

Tested with openshift-ansible-playbooks-3.5.101-1.git.0.0107544.el7.noarch. After binding es and es-ops to HostPath storage, I reran the logging playbook (without any storage-related parameters set in the inventory file) and described the es pods; the HostPath PV was successfully persisted:

    Volumes:
      elasticsearch-storage:
        Type: HostPath (bare host directory volume)
        Path: /usr/local/es-storage

But the es/es-ops pods are no longer runnable after the redeploy because the es node is locked. Is this the expected behavior?

Test steps in detail:

1. Deploy logging without a PV (the playbook can be found in the attachment), then manually bind es to a HostPath PV, and make sure it is running fine at this point:

        # oc get po
        NAME                              READY     STATUS    RESTARTS   AGE
        logging-curator-1-t42db           1/1       Running   0          3m
        logging-curator-ops-1-nzm6f       1/1       Running   0          3m
        logging-es-0glaw65l-3-jmtcf       1/1       Running   0          12s
        logging-es-ops-j8xjf9l3-3-n7v5t   1/1       Running   0          9s
        logging-fluentd-dlp8b             1/1       Running   0          3m
        logging-fluentd-fx1wv             1/1       Running   0          3m
        logging-kibana-1-s29kr            2/2       Running   0          3m
        logging-kibana-ops-1-9q9k2        2/2       Running   0          3m

2. Rerun the logging deployment playbook with the same playbook as in the attachment, without uninstalling the logging stacks (is this a real use case?). The es nodes are locked:

        # oc get po
        NAME                              READY     STATUS             RESTARTS   AGE
        logging-curator-1-t42db           1/1       Running            0          6m
        logging-curator-ops-1-nzm6f       1/1       Running            0          6m
        logging-es-0glaw65l-4-kfl16       0/1       CrashLoopBackOff   3          1m
        logging-es-ops-j8xjf9l3-4-43pc7   0/1       CrashLoopBackOff   3          1m
        logging-fluentd-bb8x6             1/1       Running            0          1m
        logging-fluentd-phjx0             1/1       Running            0          1m
        logging-kibana-1-s29kr            2/2       Running            0          6m
        logging-kibana-ops-1-9q9k2        2/2       Running            0          6m

        # oc logs logging-es-ops-j8xjf9l3-4-43pc7
        [2017-07-25 09:29:55,904][INFO ][container.run ] Begin Elasticsearch startup script
        [2017-07-25 09:29:55,913][INFO ][container.run ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
        [2017-07-25 09:29:55,913][INFO ][container.run ] Inspecting the maximum RAM available...
        [2017-07-25 09:29:55,917][INFO ][container.run ] ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms4096m -Xmx4096m'
        Exception in thread "main" java.lang.IllegalStateException: Failed to obtain node lock, is the following location writable?: [/elasticsearch/persistent/logging-es-ops/data/logging-es-ops]
        Likely root cause: java.nio.file.AccessDeniedException: /elasticsearch/persistent/logging-es-ops/data/logging-es-ops/nodes/0/node.lock
            at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
            at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
            at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
            at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
            at java.nio.channels.FileChannel.open(FileChannel.java:287)
            at java.nio.channels.FileChannel.open(FileChannel.java:335)
            at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:113)
            at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
            at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
            at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:174)
            at org.elasticsearch.node.Node.<init>(Node.java:165)
            at org.elasticsearch.node.Node.<init>(Node.java:140)
            at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
            at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
            at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
            at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
        Refer to the log for complete error details.

Created attachment 1304143 [details]
inventory file used for logging deployment
@Eric, any ideas or thoughts on why this would behave differently for an OPS instance?

@Xia, can you provide additional details:
1. The DC for the ES pods
2. Permissions and owner info of the storage directory
3. Steps to reproduce the issue? Set up the hostmount, then rerun the playbook?

I don't think this has anything to do with ops vs. non-ops. At first glance, based on the logs, it looks like the locks weren't able to be released. It also looks like both the ops and non-ops ES pods are having issues, since they're both in "CrashLoopBackOff" status.

We labeled the es-pv-node with nodeSelector logging-es-node: "1". The nodeSelector is present in the ES dc file after attaching the hostPath PV to the es and es-ops pods, but after rerunning the playbook, nodeSelector logging-es-node: "1" is missing from the dc; this may be what causes the CrashLoopBackOff of the es pods.

1. ES DC: see the attached file.
2. Permissions of the storage directory are 777:

        mkdir -p /usr/local/es-storage; chmod 777 /usr/local/es-storage

3. Reproduce steps:

1) Followed https://docs.openshift.com/container-platform/3.5/install_config/aggregate_logging.html to attach the hostPath PV:

        # oc label node ${es-pv-node-name} logging-es-node=1 --overwrite
        # oadm policy add-scc-to-user privileged \
            system:serviceaccount:logging:aggregated-logging-elasticsearch
        # for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
            oc scale $dc --replicas=0
            oc patch $dc \
              -p '{"spec":{"template":{"spec":{"containers":[{"name":"elasticsearch","securityContext":{"privileged": true}}]}}}}'
          done
        # for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
            oc patch $dc -p '{"spec":{"template":{"spec":{"nodeSelector":{"logging-es-node":"1"}}}}}'
          done
        # for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
            oc set volume $dc \
              --add --overwrite --name=elasticsearch-storage \
              --type=hostPath --path=/usr/local/es-storage
            oc rollout latest $dc
            oc scale $dc --replicas=1
          done

2) Rerun the playbook; it is the same playbook used when deploying logging without a PV. See xiazhao's attached file: inventory file used for logging deployment.

Created attachment 1304590 [details]
es dc info
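Putting the attach steps above together, the relevant portion of the resulting es DC should look roughly like the fragment below. This is an abbreviated sketch reconstructed from the commands and logs in this report, not the attached dc dump; in particular the mountPath is inferred from the node-lock path in the logs.

```yaml
# Abbreviated sketch of the es DC after the hostPath attach steps above.
spec:
  template:
    spec:
      nodeSelector:
        logging-es-node: "1"                   # added by the oc patch step
      containers:
      - name: elasticsearch
        securityContext:
          privileged: true                     # added by the oc patch step
        volumeMounts:
        - name: elasticsearch-storage
          mountPath: /elasticsearch/persistent # inferred from the node-lock path
      volumes:
      - name: elasticsearch-storage
        hostPath:
          path: /usr/local/es-storage          # set by oc set volume
```

A playbook rerun that drops the nodeSelector or replaces the hostPath volume with an emptyDir would reproduce both symptoms described here: colocated ES pods fighting over the same node lock, and storage silently reset.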
I think this issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1478771, where the nodeSelector is not being preserved. My suspicion is that two instances of ES end up on the same infra node and try to access the same storage. There is an open debate about whether we should preserve these nodeSelectors for the case described here, or instead use pod affinity/anti-affinity to ensure the pods are spread properly across nodes.

Can you confirm the original issue was resolved with the original PR? openshift/openshift-ansible/pull/4661

The issue for which this was reopened should be addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1478771

Tested with openshift-ansible-playbooks-3.6.173.0.5-5.git.0.0538d97.el7.noarch. The issue described in comment #3 is resolved, but the original issue described in comment #0 still reproduces. Here are the pod status and logs after rerunning the playbook; the storage for both the es and es-ops pods was changed to EmptyDir:

    # oc describe po logging-es-ops-data-master-m99iw2v9-4-bx6bx
    ...
    elasticsearch-storage:
      Type:   EmptyDir (a temporary directory that shares a pod's lifetime)
      Medium:
    ...

This means rerunning the playbook reset the es storage volumes.

    # oc get po
    NAME                                          READY     STATUS    RESTARTS   AGE
    logging-curator-1-bp7bz                       1/1       Running   0          8m
    logging-curator-ops-1-6z2zp                   1/1       Running   0          8m
    logging-es-data-master-5cdy4ncl-4-br95q       1/1       Running   0          4m
    logging-es-ops-data-master-m99iw2v9-4-bx6bx   1/1       Running   0          4m
    logging-fluentd-68jrj                         1/1       Running   0          8m
    logging-fluentd-xf9dn                         1/1       Running   0          8m
    logging-kibana-1-zzmd5                        2/2       Running   0          9m
    logging-kibana-ops-1-9lhqw                    2/2       Running   0          9m

    # oc logs logging-es-ops-data-master-m99iw2v9-4-bx6bx
    [2017-08-21 03:36:16,554][INFO ][container.run ] Begin Elasticsearch startup script
    [2017-08-21 03:36:16,565][INFO ][container.run ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
    [2017-08-21 03:36:16,566][INFO ][container.run ] Inspecting the maximum RAM available...
    [2017-08-21 03:36:16,570][INFO ][container.run ] ES_HEAP_SIZE: '4096m'
    [2017-08-21 03:36:16,571][INFO ][container.run ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
    [2017-08-21 03:36:16,577][INFO ][container.run ] Checking if Elasticsearch is ready on https://localhost:9200
    [2017-08-21 03:36:17,662][INFO ][node ] [logging-es-ops-data-master-m99iw2v9] version[2.4.4], pid[1], build[b3c4811/2017-01-18T03:01:12Z]
    [2017-08-21 03:36:17,663][INFO ][node ] [logging-es-ops-data-master-m99iw2v9] initializing ...
    [2017-08-21 03:36:18,647][INFO ][plugins ] [logging-es-ops-data-master-m99iw2v9] modules [reindex, lang-expression, lang-groovy], plugins [openshift-elasticsearch, cloud-kubernetes], sites []
    [2017-08-21 03:36:18,744][INFO ][env ] [logging-es-ops-data-master-m99iw2v9] using [1] data paths, mounts [[/elasticsearch/persistent (/dev/mapper/rhel-root)]], net usable_space [7.4gb], net total_space [9.9gb], spins? [possibly], types [xfs]
    [2017-08-21 03:36:18,744][INFO ][env ] [logging-es-ops-data-master-m99iw2v9] heap size [3.9gb], compressed ordinary object pointers [true]
    [2017-08-21 03:36:19,661][INFO ][http ] [logging-es-ops-data-master-m99iw2v9] Using [org.elasticsearch.http.netty.NettyHttpServerTransport] as http transport, overridden by [search-guard2]
    [2017-08-21 03:36:19,948][INFO ][transport ] [logging-es-ops-data-master-m99iw2v9] Using [com.floragunn.searchguard.transport.SearchGuardTransportService] as transport service, overridden by [search-guard2]
    [2017-08-21 03:36:19,948][INFO ][transport ] [logging-es-ops-data-master-m99iw2v9] Using [com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] as transport, overridden by [search-guard-ssl]

    [root@host-8-241-33 ~]# oc logs logging-es-data-master-5cdy4ncl-4-br95q
    [2017-08-21 03:36:14,786][WARN ][com.floragunn.searchguard.configuration.PrivilegesEvaluator] .kibana does not exist in cluster metadata
    [2017-08-21 03:36:14,794][INFO ][cluster.metadata ] [logging-es-data-master-5cdy4ncl] [.kibana] creating index, cause [api], templates [], shards [1]/[0], mappings [config]
    [2017-08-21 03:36:14,814][INFO ][cluster.routing.allocation] [logging-es-data-master-5cdy4ncl] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana][0]] ...]).

Please ignore the above comment #12, since it was actually based on a 3.6 ansible package.

Tested with openshift-ansible-playbooks-3.5.110-1.git.0.6f1f193.el7.noarch; the original bug is fixed: the hostPath PV settings for es-ops are successfully preserved after rerunning the logging playbooks. Please feel free to move back to ON_QA status for closure, thanks.

More details:

Test env:

    # openshift version
    openshift v3.5.5.31.19
    kubernetes v1.5.2+43a9be4
    etcd 3.1.0

After rerunning the logging playbooks again, the es stacks and their volumes are fine:

    # oc get po
    NAME                              READY     STATUS    RESTARTS   AGE
    logging-curator-1-n9fl9           1/1       Running   0          23m
    logging-curator-ops-1-bkwcs       1/1       Running   0          23m
    logging-es-hr0p3wv9-3-b0p4c       1/1       Running   0          18m
    logging-es-ops-73oh3fg6-3-sr1v5   1/1       Running   0          18m
    logging-fluentd-9ljkt             1/1       Running   0          23m
    logging-fluentd-h3ljj             1/1       Running   0          23m
    logging-kibana-1-dm6k5            2/2       Running   0          23m
    logging-kibana-ops-1-gvsw8        2/2       Running   0          23m

    # oc describe po logging-es-ops-73oh3fg6-3-sr1v5
    ...
    Volumes:
      elasticsearch-storage:
        Type: HostPath (bare host directory volume)
        Path: /usr/local/es-storage
    ...

Also checked the es logs; they contain no errors.

Set to VERIFIED according to the test result in comment #13.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1828
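For reference, the pod anti-affinity alternative debated above (spreading ES pods across nodes without per-node nodeSelector labels) could look roughly like the DC fragment below. This is a sketch only: the label selector is an assumption, and on OCP 3.5 (Kubernetes 1.5) pod anti-affinity was still an alpha feature, so this is illustrative rather than something the 3.5 playbooks actually emitted.

```yaml
# Hypothetical sketch: keep two ES pods off the same node via anti-affinity
# instead of labeling individual nodes. Label key/values are assumptions.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                logging-infra: elasticsearch
            topologyKey: kubernetes.io/hostname
```

With this in place, the scheduler itself guarantees the spread, so a playbook rerun that drops a nodeSelector could no longer colocate two ES instances on the same storage.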