Customer has tried to use two separate methods, [0] and [1], but both options end with them having corrupted data. Customer is running 3.X with the logging pods all using image ___. Both methods appear to create errors similar to [2] in their elasticsearch logs. Customer is using openshift v3.2.1.13-1-gc2a90e1. Will attach RCs from each component shortly.

[0]
# oc new-app logging-deployer-template \
    ... \
    --param MODE=stop
# oc new-app logging-deployer-template \
    ... \
    --param MODE=start

[1] Staggered scaling of ES (sketched below)

[2] org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20/1/index/_8t.cfs"))
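For reference, [1] was roughly along these lines; the exact DC names, wait times, and ordering depend on the customer's cluster, so this is only a sketch:

# oc get dc -n logging | grep logging-es
# oc scale dc/logging-es-<suffix> --replicas=0
    (wait for the ES pod to terminate)
# oc scale dc/logging-es-<suffix> --replicas=1
    (wait for the node to rejoin and recover before repeating for the next ES DC)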
Looking around, this could have been caused by an OOME or by the JVM being in a bad state when ES went down. Is it possible to find out from the customer whether they ran into a situation like this? We may also want to investigate increasing the `terminationGracePeriodSeconds` value for the ES pods (this can be updated in the DC, under the pod template `spec` and above the `containers` definition; see the sketch below). A value of 600 should be enough. Just an FYI, methods [0] and [1] above both do staggered scaling.
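A minimal sketch of where that would sit in the ES DC (container name and surrounding fields shown as examples, trimmed for brevity):

    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 600
          containers:
          - name: elasticsearch
            ...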
If an index is corrupted it's going to stay that way until you do something about it. Normally ES has enough time to put its affairs in order when it's going down, but there may always be situations where it is killed in the middle of writing something and the index ends up corrupted. I don't know of a method for recovering the index. You should be able to delete it, though. Choose one of the ES instances (do they have more than one ES?) and shell into it with oc rsh. Then:

curl --key /etc/elasticsearch/keys/admin-key \
     --cert /etc/elasticsearch/keys/admin-cert \
     --cacert /etc/elasticsearch/keys/admin-ca \
     -XDELETE https://localhost:9200/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20
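For the shell-in step, something like the following should work (the pod name is an example; pick whichever logging-es pod `oc get pods` shows), then run the curl above against https://localhost:9200 from inside the pod:

# oc get pods -n logging | grep logging-es
# oc rsh logging-es-<deployment-suffix>-<pod-id>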
@Eric, what would be the best method to check whether ES was in that state when it went down?
Unfortunately, I think the best method would be to check the logs... but given that the pod was already shut down, that might be difficult to do. Since ES writes its logs to stdout, it may still be possible to check the container logs for that pod. To check whether ES wasn't able to shut down properly, we could also look at the volume mounts for the ES nodes and see whether there are any extra directories under the 'nodes' directory. That doesn't guarantee it was an OOME or JVM-related issue, but it does tell us ES probably didn't have enough time to shut down.
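Concretely, something like this (pod names are examples; the data path is taken from the exception in [2] above):

# oc logs --previous logging-es-<deployment-suffix>-<pod-id>
    (look for OutOfMemoryError or other fatal JVM messages from the last run)
# oc rsh logging-es-<deployment-suffix>-<pod-id> ls /elasticsearch/persistent/logging-es/data/logging-es/nodes
    (anything besides '0' here suggests ES didn't shut down cleanly)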
@Eric, the customer claims to have removed the /logging-es/data/logging-es/nodes/0/node.lock files from both of the two ES persistent storage volumes. Is that what you mean? Or are you referring to something even lower down the stack, at /logging-es/data/logging-es/nodes?
@Eric, I simply meant: were there extra directories under /logging-es/data/logging-es/nodes/ other than '0' for their clusters? But since they removed the node.lock files from '0', that didn't seem to be the case :)
Yeah, I just confirmed with the user: there was only '0' in that dir on both ES pods. Is there anything else we should check? Or is the thought just that it scaled down too quickly, and increasing that terminationGracePeriodSeconds value _should_ resolve the problem?
The terminationGracePeriodSeconds change could help if that is the cause, but I'm not certain it is. An OOME or other JVM-related issue could also be the cause of this... looking around, there isn't much regarding this issue and the ES 1.5.2 that is provided with this image. There is some suggestion that the Lucene index was modified after it was created. What does their data storage backend look like? Is it possible they are running out of disk space or are otherwise unable to write to disk for some reason (e.g. NFS locks)?

There appear to be two options for fixing the corrupt index, based on this Google Groups thread [1]. The first is to attempt to fix it with the low-level Lucene utility 'CheckIndex'; I have never used this before. The second is to delete the index in question using the ES index delete API which Luke mentioned above [2], running from within the ES pod so the admin cert can be used [3].

[1] https://groups.google.com/forum/#!topic/graylog2/x9rTz0ufglg
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-delete-index.html
[3] curl --key /etc/elasticsearch/keys/admin-key --cert /etc/elasticsearch/keys/admin-cert --cacert /etc/elasticsearch/keys/admin-ca -XDELETE https://localhost:9200/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20
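For the disk-space question and the CheckIndex option, roughly the following from inside the ES pod. The lucene-core jar location is a guess for this image (adjust to wherever it actually lives), back up the index directory first, and make sure ES is not running against that index while CheckIndex runs; note that -fix drops any corrupt segments, losing the documents in them:

$ df -h /elasticsearch/persistent
$ ls /usr/share/elasticsearch/lib/lucene-core-*.jar
$ java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
    org.apache.lucene.index.CheckIndex \
    /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20/1/index \
    -fix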
Customer case has been closed.