Description of problem: Engineering recently ran into a problem with a deployment where scaling down Elasticsearch does not shut it down cleanly. ES leaves a lock file behind in its node directories, which causes a pod that scales back up to see the EBS volume as "in-use" by another node; it then creates a new node directory and another copy of the data, essentially stranding the old data and triggering new relocation operations. This can happen any time ES nodes are scaled down or fail.

Additional info: The output below indicates that this cluster has hit the issue twice previously:

sh-4.2$ ls /elasticsearch/persistent/logging-es/data/logging-es/nodes/
0  1  2
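To make the symptom easier to spot from inside the ES pod, something like the following can be used (a sketch; it assumes the standard ES data layout where each numbered node directory holds its own node.lock):

# with the ES pod stopped, any node directory still containing a node.lock
# was not shut down cleanly and will be skipped on the next start
sh-4.2$ ls /elasticsearch/persistent/logging-es/data/logging-es/nodes/*/node.lock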
Given that this is a current problem, does it change the recommendation we provide in our documentation [0]? Keep in mind that this recommendation has become the go-to recommendation for most changes that need to be made to the EFK stack.

[0] https://docs.openshift.com/enterprise/3.2/install_config/upgrading/manual_upgrades.html#manual-upgrading-efk-logging-stack
We can probably recommend increasing terminationGracePeriodSeconds in the Elasticsearch pod spec (within the DC). The default is 30 seconds; if ES can't finish shutting down within that window it is sent a SIGKILL, and if it never gets to release its node locks it will create these other directories on the next start.
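As a sketch of what that change might look like (the DC name logging-es-example and the 600-second value are placeholders; adjust for the actual deployment):

$ oc patch dc/logging-es-example -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":600}}}}'

The next deployment of the pod then gets the longer grace period before SIGKILL is sent.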
I think Eric W. has covered all the right steps to take here. Any word on how this worked out in the field?
Did the change in https://github.com/openshift/origin-aggregated-logging/pull/227 get into a release yet? If so, we should probably attach this bug to an errata or close it. I don't see any more helpful fix for this coming along.
I didn't see it in there, but I'll sync it over now for the 3.3 and 3.4 deployer images.
Verified with this image; the issue has been fixed:

registry.ops.openshift.com/openshift3/logging-deployer   3.3.1   1e85b37518ba   14 hours ago   761.6 MB

Scaled the ES cluster down 3 times; only 1 node directory was created on the PV:

$ ls /elasticsearch/persistent/logging-es/data/logging-es/nodes/
0

# openshift version
openshift v3.3.1.3
kubernetes v1.3.0+52492b4
etcd 2.3.0+git
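For anyone re-running the verification, the loop is roughly the following (a sketch; logging-es-example stands in for the actual logging-es-<unique> DC and pod names):

$ oc scale dc/logging-es-example --replicas=0
$ oc scale dc/logging-es-example --replicas=1
$ oc rsh <logging-es pod> ls /elasticsearch/persistent/logging-es/data/logging-es/nodes/

A clean result shows only the original 0 directory after each cycle.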
Hey Eric Jones, do we have a kbase on what it looks like when node.lock is left behind in ES storage and what to do about it?
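In the meantime, a rough way to see how much data each of the stranded node directories is holding (a sketch, using the same paths shown above):

sh-4.2$ du -sh /elasticsearch/persistent/logging-es/data/logging-es/nodes/*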
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:2085