1466695 – Running the logging playbooks resets elasticsearch-storage volume for es-ops

Bug 1466695 - Running the logging playbooks resets elasticsearch-storage volume for es-ops

Summary: Running the logging playbooks resets elasticsearch-storage volume for es-ops

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.5.z
Assignee:	Jeff Cantrill
QA Contact:	Xia Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-06-30 09:19 UTC by Johan Swensson
Modified:	2018-09-20 21:23 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The storage path was need being evaluated correctly for ops clusters Consequence: Fix: Use the information from the logging facts to determine if a volume should be reused. Result:
Clone Of:
Environment:
Last Closed:	2017-08-31 17:00:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
inventory file used for logging deployment (745 bytes, text/plain) 2017-07-25 09:58 UTC, Xia Zhao	no flags	Details
es dc info (22.39 KB, text/plain) 2017-07-26 05:38 UTC, Junqi Zhao	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 4661	0	'None'	closed	use the storage path for logging from facts	2021-01-13 15:14:28 UTC
Red Hat Product Errata	RHBA-2017:1828	0	normal	SHIPPED_LIVE	OpenShift Container Platform 3.5, 3.4, and 3.3 bug fix update	2017-08-31 20:59:56 UTC

Description Johan Swensson 2017-06-30 09:19:22 UTC

Description of problem:
Every time when running the new openshift-logging playbook the volume elasticsearch-storage gets reset to a emptyDir, meaning logs will get lost until the volume is reconfigured again.

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Configure EFK stack with ops instance
2. Configure the es-ops deploymentconfigs with persistent storage, for example hostPath on dedicated nodes
3. Re-run the openshift-logging playbook

Actual results:
The elastichsearch-storage volume is reset to emptyDir

Expected results:
The configured volume should not be changed.

Additional info:

Comment 1 Jeff Cantrill 2017-06-30 16:25:06 UTC

Triaging via Upcoming release as its not a blocker.  Not ideal but the workaround is to reattach the volume on the dc.

Comment 3 Xia Zhao 2017-07-25 09:42:24 UTC

Tested with openshift-ansible-playbooks-3.5.101-1.git.0.0107544.el7.noarch

After bounding es and es-ops with HostPath storage, rerun the logging playbook (without any parameters related storage set in inventory file), describe es pods, the HostPath PV was successfully persisted:

Volumes
  elasticsearch-storage:
    Type:	HostPath (bare host directory volume)
    Path:	/usr/local/es-storage

But the es/es-ops pods are no longer runable after the redeploying due to es node is locked, is this the expected behavior? 

Test steps in detail:

1. Deploy logging without PV (the playbook cna be found in attachment), then manually bound es with HostPath PV, and make sure it's running fine at this moment:
# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-t42db           1/1       Running   0          3m
logging-curator-ops-1-nzm6f       1/1       Running   0          3m
logging-es-0glaw65l-3-jmtcf       1/1       Running   0          12s
logging-es-ops-j8xjf9l3-3-n7v5t   1/1       Running   0          9s
logging-fluentd-dlp8b             1/1       Running   0          3m
logging-fluentd-fx1wv             1/1       Running   0          3m
logging-kibana-1-s29kr            2/2       Running   0          3m
logging-kibana-ops-1-9q9k2        2/2       Running   0          3m

2. Rerun logging deployment playbook with the same playbook as in the attachment without uninstalling logging stacks (is this a real use case?), the es nodes are locked:

# oc get po
NAME                              READY     STATUS             RESTARTS   AGE
logging-curator-1-t42db           1/1       Running            0          6m
logging-curator-ops-1-nzm6f       1/1       Running            0          6m
logging-es-0glaw65l-4-kfl16       0/1       CrashLoopBackOff   3          1m
logging-es-ops-j8xjf9l3-4-43pc7   0/1       CrashLoopBackOff   3          1m
logging-fluentd-bb8x6             1/1       Running            0          1m
logging-fluentd-phjx0             1/1       Running            0          1m
logging-kibana-1-s29kr            2/2       Running            0          6m
logging-kibana-ops-1-9q9k2        2/2       Running            0          6m


# oc logs logging-es-ops-j8xjf9l3-4-43pc7
[2017-07-25 09:29:55,904][INFO ][container.run            ] Begin Elasticsearch startup script
[2017-07-25 09:29:55,913][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2017-07-25 09:29:55,913][INFO ][container.run            ] Inspecting the maximum RAM available...
[2017-07-25 09:29:55,917][INFO ][container.run            ] ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms4096m -Xmx4096m'
Exception in thread "main" java.lang.IllegalStateException: Failed to obtain node lock, is the following location writable?: [/elasticsearch/persistent/logging-es-ops/data/logging-es-ops]
Likely root cause: java.nio.file.AccessDeniedException: /elasticsearch/persistent/logging-es-ops/data/logging-es-ops/nodes/0/node.lock
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
	at java.nio.channels.FileChannel.open(FileChannel.java:287)
	at java.nio.channels.FileChannel.open(FileChannel.java:335)
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:113)
	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:174)
	at org.elasticsearch.node.Node.<init>(Node.java:165)
	at org.elasticsearch.node.Node.<init>(Node.java:140)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
Refer to the log for complete error details.

Comment 4 Xia Zhao 2017-07-25 09:58:59 UTC

Created attachment 1304143 [details]
inventory file used for logging deployment

Comment 5 Jeff Cantrill 2017-07-25 13:15:22 UTC

@Eric,

Any ideas or thoughts why this would behave differently for an OPS instance?

Comment 6 Jeff Cantrill 2017-07-25 13:32:38 UTC

@Xia,

Can you provide additional details:

1. The DC for the ES pods
2. Permissions and owner info of the storage directory
3. Steps to produce the issue?  setup hostmount->rerun playbook?

Comment 7 ewolinet 2017-07-25 15:31:56 UTC

I don't think this has anything to do with ops vs non-ops. At first glance, based on the logs it looks like the locks weren't able to be released. It also looks like both the ops and non-ops ES pod is having issues since they're both in "CrashLoopBackOff" status

Comment 8 Junqi Zhao 2017-07-26 05:38:28 UTC

labeled es-pv-node with nodeSelector:logging-es-node: "1", we can find the nodeSelector in ES dc file after attaching hostpath pv to es and es-ops pod, but if we re-run the playbook, nodeSelector:logging-es-node: "1"  is missed from dc, this maybe caused the CrashLoopBackOff of es pods.

1. ES DC see the attached file
2. Permissions of the storage directory is 777
mkdir -p /usr/local/es-storage; chmod 777 /usr/local/es-storage
3. Reproduce steps:
1) followed https://docs.openshift.com/container-platform/3.5/install_config/aggregate_logging.html to attach hostpath pv
######################################################################
# oc label node ${es-pv-node-name} logging-es-node=1 --overwrite

# oadm policy add-scc-to-user privileged system:serviceaccount:logging:aggregated-logging-elasticsearch

# for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
oc scale $dc --replicas=0
oc patch $dc \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"elasticsearch","securityContext":{"privileged": true}}]}}}}'
done

# for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do 
oc patch $dc -p '{"spec":{"template":{"spec":{"nodeSelector":{"logging-es-node":"1"}}}}}'
done

# for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
    oc set volume $dc \
          --add --overwrite --name=elasticsearch-storage \
          --type=hostPath --path=/usr/local/es-storage
    oc rollout latest $dc
    oc scale $dc --replicas=1
  done
######################################################################
2) Re run the playbook, the playbook is the same when deployed logging without PV. see xiazhao's attached file:inventory file used for logging deployment

Comment 9 Junqi Zhao 2017-07-26 05:38:56 UTC

Created attachment 1304590 [details]
es dc info

Comment 10 Jeff Cantrill 2017-08-16 19:26:52 UTC

I think this issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1478771 where the nodeselector is not being preserved.  My suspicion is that 2 instance of ES end up on the same infra nodes and are trying to access the same storage.  There is an open debate if we should be preserving these nodeselectors for the case described here or if we should be using pod affinity/anti-affinity to ensure the pods are spread properly across nodes.

Comment 11 Jeff Cantrill 2017-08-18 18:11:11 UTC

Can you confirm the original issue was resolved with the original PR? openshift/openshift-ansible/pull/4661  The issue for which this is reopened should be addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1478771

Comment 12 Xia Zhao 2017-08-21 03:43:49 UTC

Tested with openshift-ansible-playbooks-3.6.173.0.5-5.git.0.0538d97.el7.noarch, the issue described in comment #3 was resolved

but the original issue described in comment #0 is still reproduced:

Here are the pod status and logs after re-running the playbook, both es and es-ops pods storage is changed to EmptyDir:

# oc describe po logging-es-ops-data-master-m99iw2v9-4-bx6bx
...
  elasticsearch-storage:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
...

This means re-running the playbook reseted es storage volumes.

# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-bp7bz                       1/1       Running   0          8m
logging-curator-ops-1-6z2zp                   1/1       Running   0          8m
logging-es-data-master-5cdy4ncl-4-br95q       1/1       Running   0          4m
logging-es-ops-data-master-m99iw2v9-4-bx6bx   1/1       Running   0          4m
logging-fluentd-68jrj                         1/1       Running   0          8m
logging-fluentd-xf9dn                         1/1       Running   0          8m
logging-kibana-1-zzmd5                        2/2       Running   0          9m
logging-kibana-ops-1-9lhqw                    2/2       Running   0          9m

# oc logs logging-es-ops-data-master-m99iw2v9-4-bx6bx
[2017-08-21 03:36:16,554][INFO ][container.run            ] Begin Elasticsearch startup script
[2017-08-21 03:36:16,565][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2017-08-21 03:36:16,566][INFO ][container.run            ] Inspecting the maximum RAM available...
[2017-08-21 03:36:16,570][INFO ][container.run            ] ES_HEAP_SIZE: '4096m'
[2017-08-21 03:36:16,571][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2017-08-21 03:36:16,577][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
[2017-08-21 03:36:17,662][INFO ][node                     ] [logging-es-ops-data-master-m99iw2v9] version[2.4.4], pid[1], build[b3c4811/2017-01-18T03:01:12Z]
[2017-08-21 03:36:17,663][INFO ][node                     ] [logging-es-ops-data-master-m99iw2v9] initializing ...
[2017-08-21 03:36:18,647][INFO ][plugins                  ] [logging-es-ops-data-master-m99iw2v9] modules [reindex, lang-expression, lang-groovy], plugins [openshift-elasticsearch, cloud-kubernetes], sites []
[2017-08-21 03:36:18,744][INFO ][env                      ] [logging-es-ops-data-master-m99iw2v9] using [1] data paths, mounts [[/elasticsearch/persistent (/dev/mapper/rhel-root)]], net usable_space [7.4gb], net total_space [9.9gb], spins? [possibly], types [xfs]
[2017-08-21 03:36:18,744][INFO ][env                      ] [logging-es-ops-data-master-m99iw2v9] heap size [3.9gb], compressed ordinary object pointers [true]
[2017-08-21 03:36:19,661][INFO ][http                     ] [logging-es-ops-data-master-m99iw2v9] Using [org.elasticsearch.http.netty.NettyHttpServerTransport] as http transport, overridden by [search-guard2]
[2017-08-21 03:36:19,948][INFO ][transport                ] [logging-es-ops-data-master-m99iw2v9] Using [com.floragunn.searchguard.transport.SearchGuardTransportService] as transport service, overridden by [search-guard2]
[2017-08-21 03:36:19,948][INFO ][transport                ] [logging-es-ops-data-master-m99iw2v9] Using [com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] as transport, overridden by [search-guard-ssl]
[root@host-8-241-33 ~]# oc logs logging-es-data-master-5cdy4ncl-4-br95q
[2017-08-21 03:36:14,786][WARN ][com.floragunn.searchguard.configuration.PrivilegesEvaluator] .kibana does not exist in cluster metadata
[2017-08-21 03:36:14,794][INFO ][cluster.metadata         ] [logging-es-data-master-5cdy4ncl] [.kibana] creating index, cause [api], templates [], shards [1]/[0], mappings [config]
[2017-08-21 03:36:14,814][INFO ][cluster.routing.allocation] [logging-es-data-master-5cdy4ncl] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana][0]] ...]).

Comment 13 Xia Zhao 2017-08-21 06:43:44 UTC

Please kindly ignore the above comment #12 since it's actually based on a 3.6 ansible package.

Tested with openshift-ansible-playbooks-3.5.110-1.git.0.6f1f193.el7.noarch, the original bug was fixed: The hostPath PV settings to es-ops can be successfully preserved after rerunning the logging playbooks. Please feel free to transfer back to ON_QA status for closure, thanks.

More details:

Test env: 
# openshift version
openshift v3.5.5.31.19
kubernetes v1.5.2+43a9be4
etcd 3.1.0

After rerunning the logging playbooks again, es stacks and it's volumes are fine:

# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-n9fl9           1/1       Running   0          23m
logging-curator-ops-1-bkwcs       1/1       Running   0          23m
logging-es-hr0p3wv9-3-b0p4c       1/1       Running   0          18m
logging-es-ops-73oh3fg6-3-sr1v5   1/1       Running   0          18m
logging-fluentd-9ljkt             1/1       Running   0          23m
logging-fluentd-h3ljj             1/1       Running   0          23m
logging-kibana-1-dm6k5            2/2       Running   0          23m
logging-kibana-ops-1-gvsw8        2/2       Running   0          23m

# oc describe po logging-es-ops-73oh3fg6-3-sr1v5
...
Volumes:
  elasticsearch-storage:
    Type:	HostPath (bare host directory volume)
    Path:	/usr/local/es-storage
...

Also checked the es logs, contains no error.

Comment 14 Xia Zhao 2017-08-22 02:07:53 UTC

Set to verified according to test result in comment #13

Comment 16 errata-xmlrpc 2017-08-31 17:00:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1828

Note You need to log in before you can comment on or make changes to this bug.