Customer has tried to use two separate methods, [0] and [1], but both options end with them having corrupted data. Customer is running 3.X with the logging pods all using image ___. Both methods appear to create errors similar to [2] in their elasticsearch logs. Customer is using openshift v3.2.1.13-1-gc2a90e1. Will attach RCs from each component shortly.

[0]
# oc new-app logging-deployer-template \
    ... \
    --param MODE=stop
# oc new-app logging-deployer-template \
    ... \
    --param MODE=start

[1] Staggered scaling of ES (sketched below)

[2] org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20/1/index/_8t.cfs"))
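For reference, [1] was roughly along these lines; the exact DC names, wait times, and ordering depend on the customer's cluster, so this is only a sketch:

# oc get dc -n logging | grep logging-es
# oc scale dc/logging-es-<suffix> --replicas=0
    (wait for the ES pod to terminate)
# oc scale dc/logging-es-<suffix> --replicas=1
    (wait for the node to rejoin and recover before repeating for the next ES DC)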
Looking around, this could have been caused by an OOME or by the JVM being in a bad state when ES went down. Is it possible to find out from the customer whether they ran into a situation like this? We may also want to investigate increasing the `terminationGracePeriodSeconds` value for the ES pods (this can be updated in the DC, under the pod template `spec` and above the `containers` definition; see the sketch below). A value of 600 should be enough. Just an FYI, methods [0] and [1] above both do staggered scaling.
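A minimal sketch of where that would sit in the ES DC (container name and surrounding fields shown as examples, trimmed for brevity):

    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 600
          containers:
          - name: elasticsearch
            ...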
If an index is corrupted it's going to stay that way until you do something about it. Normally ES has enough time to put its affairs in order when it's going down, but there may always be situations where it is killed in the middle of writing something and the index ends up corrupted. I don't know of a method for recovering the index. You should be able to delete it, though. Choose one of the ES instances (do they have more than one ES?) and shell into it with oc rsh. Then:

curl --key /etc/elasticsearch/keys/admin-key \
     --cert /etc/elasticsearch/keys/admin-cert \
     --cacert /etc/elasticsearch/keys/admin-ca \
     -XDELETE https://localhost:9200/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20
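For the shell-in step, something like the following should work (the pod name is an example; pick whichever logging-es pod `oc get pods` shows), then run the curl above against https://localhost:9200 from inside the pod:

# oc get pods -n logging | grep logging-es
# oc rsh logging-es-<deployment-suffix>-<pod-id>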
@Eric, what would be the best method to check whether ES was in that state when it went down?
Unfortunately, I think the best method would be to check the logs... but given that the pod was already shut down, that might be difficult to do. Since ES writes its logs to stdout, it may still be possible to check the container logs for that pod. To check whether ES wasn't able to shut down properly, we could also look at the volume mounts for the ES nodes and see whether there are any extra directories under the 'nodes' directory. That doesn't guarantee it was an OOME or JVM-related issue, but it does tell us ES probably didn't have enough time to shut down.
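Concretely, something like this (pod names are examples; the data path is taken from the exception in [2] above):

# oc logs --previous logging-es-<deployment-suffix>-<pod-id>
    (look for OutOfMemoryError or other fatal JVM messages from the last run)
# oc rsh logging-es-<deployment-suffix>-<pod-id> ls /elasticsearch/persistent/logging-es/data/logging-es/nodes
    (anything besides '0' here suggests ES didn't shut down cleanly)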
@Eric, the customer claims to have removed the /logging-es/data/logging-es/nodes/0/node.lock files from both of the two ES persistent storage volumes. Is that what you mean? Or are you referring to something even lower down the stack, at /logging-es/data/logging-es/nodes?
@Eric, I simply meant: were there extra directories under /logging-es/data/logging-es/nodes/ other than '0' for their clusters? But since they removed the node.lock files from '0', that didn't seem to be the case :)
Yeah, I just confirmed with the user: there was only '0' in that dir on both ES pods. Is there anything else we should check? Or is the thought just that it scaled down too quickly, and increasing that terminationGracePeriodSeconds value _should_ resolve the problem?
The terminationGracePeriodSeconds change could help if that is the cause, but I'm not certain it is. An OOME or other JVM-related issue could also be the cause of this... looking around, there isn't much regarding this issue and the ES 1.5.2 that is provided with this image. There is some suggestion that the Lucene index was modified after it was created. What does their data storage backend look like? Is it possible they are running out of disk space or are otherwise unable to write to disk for some reason (e.g. NFS locks)?

There appear to be two options for fixing the corrupt index, based on this Google Groups thread [1]. The first is to attempt to fix it with the low-level Lucene utility 'CheckIndex'; I have never used this before. The second is to delete the index in question using the ES index delete API which Luke mentioned above [2], running from within the ES pod so the admin cert can be used [3].

[1] https://groups.google.com/forum/#!topic/graylog2/x9rTz0ufglg
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-delete-index.html
[3] curl --key /etc/elasticsearch/keys/admin-key --cert /etc/elasticsearch/keys/admin-cert --cacert /etc/elasticsearch/keys/admin-ca -XDELETE https://localhost:9200/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20
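For the disk-space question and the CheckIndex option, roughly the following from inside the ES pod. The lucene-core jar location is a guess for this image (adjust to wherever it actually lives), back up the index directory first, and make sure ES is not running against that index while CheckIndex runs; note that -fix drops any corrupt segments, losing the documents in them:

$ df -h /elasticsearch/persistent
$ ls /usr/share/elasticsearch/lib/lucene-core-*.jar
$ java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
    org.apache.lucene.index.CheckIndex \
    /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/logging.b4d1f006-54d2-11e6-bb44-02003e6b002a.2016.08.20/1/index \
    -fix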
Customer case has been closed.