Description of problem:

Upgraded Infra nodes from 3.6 to 3.7, which resulted in 2 of 3 ES nodes in a crash loop throwing a CorruptStateException:

logging-es-data-master-dd2iwji6-2-pxws2   0/1   Running            0   23m   192.168.1.176   server171
logging-es-data-master-n3hrld40-2-lm5x5   0/1   CrashLoopBackOff   2   1m    192.168.1.186   server180
logging-es-data-master-qunqsokb-2-k5d5p   0/1   CrashLoopBackOff   2   1m    192.168.1.199   server170

[root@server100 ~]# oc logs logging-es-data-master-qunqsokb-2-k5d5p
[2018-03-01 23:20:30,904][INFO ][container.run ] Begin Elasticsearch startup script
[2018-03-01 23:20:30,913][INFO ][container.run ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2018-03-01 23:20:30,915][INFO ][container.run ] Inspecting the maximum RAM available...
[2018-03-01 23:20:30,919][INFO ][container.run ] ES_HEAP_SIZE: '16384m'
[2018-03-01 23:20:30,921][INFO ][container.run ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2018-03-01 23:20:30,923][INFO ][container.run ] Checking if Elasticsearch is ready on https://localhost:9200
Exception in thread "main" ElasticsearchException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: IOException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: CorruptStateException[codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))];
Likely root cause: org.elasticsearch.gateway.CorruptStateException: codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))
	at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:418)
	at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:330)
	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
	at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:177)
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:299)
	at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:119)
	at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:87)
	at org.elasticsearch.gateway.GatewayMetaState.loadMetaState(GatewayMetaState.java:99)
	at org.elasticsearch.gateway.GatewayMetaState.pre20Upgrade(GatewayMetaState.java:225)
	at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at <<<guice>>>
	at org.elasticsearch.node.Node.<init>(Node.java:213)
	at org.elasticsearch.node.Node.<init>(Node.java:140)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)

Upgrading the logging pods to the 3.7 images resulted in the same issues.

Version-Release number of selected component (if applicable):
3.7.14

How reproducible:
One time

Actual results:
Corrupted logging indexes.

Expected results:

Additional info:
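For reference, the state files the exception points at can be inspected directly. The path and pod names below come from the output above; the container name and the GlusterFS mount location are assumptions and will differ per environment, so treat this as a sketch rather than exact commands:

# On the ES pod that is still running, list the cluster state files
# Elasticsearch tries to load at startup (container name assumed to be "elasticsearch"):
oc exec logging-es-data-master-dd2iwji6-2-pxws2 -c elasticsearch -- \
  ls -l /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/

# For the CrashLoopBackOff pods, check the same path directly on the underlying
# persistent volume (mount point below is hypothetical):
ls -l /mnt/glusterfs/<volume>/logging-es/data/logging-es/nodes/0/_state/

A zero-length or truncated global-*.st file there would be consistent with the "codec footer mismatch (file truncated?)" message in the stack trace.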
Can you please provide additional information about the persistent volumes you are using? Also, please consider running https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh to gather additional details about the environment.
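For anyone following along, a minimal sketch of fetching and running the dump script from a host with oc access and a login that can read the logging project (the exact invocation and output location are assumptions; check the script's own usage text):

# download the dump script referenced above and make it executable
curl -O https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh
chmod +x logging-dump.sh

# run it while logged in to the cluster; it collects logging pod logs and
# cluster state that can then be attached to this bug
./logging-dump.sh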
Possible duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1379568
Shouldn't this be owned by the GlusterFS team to determine why GlusterFS block storage is showing corruption, when we have not seen this error on AWS EBS storage or local disks?
Created attachment 1405523 [details]
logging-dump.sh output