Bug 1552257

Summary: [GSS] Logging Corruption After OCP 3.7 Upgrade
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Matthew Robson <mrobson>
Component: gluster-block
Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: high
Version: cns-3.6
CC: akhakhar, annair, aos-bugs, bgoyal, bkunal, bugs, ccustine, hchiramm, jarrpa, kramdoss, madam, mrobson, nbhatt, pkarampu, pportant, pprakash, prasanna.kalever, rhs-bugs, rmeggins, rreddy, rtalur, sankarshan, steven.barre, tkatarki, vbellur, vinug, xiubli
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-23 12:51:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1573420, 1622458, 1641915, 1642792    

Description Matthew Robson 2018-03-06 20:27:39 UTC
Description of problem:

Upgraded Infra nodes from 3.6 to 3.7, which resulted in 2 of 3 ES nodes going into a crash loop and throwing a CorruptStateException:

logging-es-data-master-dd2iwji6-2-pxws2   0/1       Running            0          23m       192.168.1.176   server171
logging-es-data-master-n3hrld40-2-lm5x5   0/1       CrashLoopBackOff   2          1m        192.168.1.186   server180
logging-es-data-master-qunqsokb-2-k5d5p   0/1       CrashLoopBackOff   2          1m        192.168.1.199   server170
[root@server100 ~]# oc logs logging-es-data-master-qunqsokb-2-k5d5p
[2018-03-01 23:20:30,904][INFO ][container.run            ] Begin Elasticsearch startup script
[2018-03-01 23:20:30,913][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2018-03-01 23:20:30,915][INFO ][container.run            ] Inspecting the maximum RAM available...
[2018-03-01 23:20:30,919][INFO ][container.run            ] ES_HEAP_SIZE: '16384m'
[2018-03-01 23:20:30,921][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2018-03-01 23:20:30,923][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
Exception in thread "main" ElasticsearchException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: IOException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: CorruptStateException[codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))];
Likely root cause: org.elasticsearch.gateway.CorruptStateException: codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:418)
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:330)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
        at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:177)
        at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:299)
        at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:119)
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:87)
        at org.elasticsearch.gateway.GatewayMetaState.loadMetaState(GatewayMetaState.java:99)
        at org.elasticsearch.gateway.GatewayMetaState.pre20Upgrade(GatewayMetaState.java:225)
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at <<<guice>>>
        at org.elasticsearch.node.Node.<init>(Node.java:213)
        at org.elasticsearch.node.Node.<init>(Node.java:140)
        at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
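
For reference, the "expected footer" value -1071082520 in the exception is Lucene's codec footer magic (0xC02893E8), so the tail of global-5.st does not end in a valid footer. A minimal sketch of how the file could be inspected, assuming a debug pod can be started from the affected deployment config (name taken from the pod listing above) and that the aggregated logging project is named "logging":

  oc debug dc/logging-es-data-master-qunqsokb -n logging
  # inside the debug shell, look at the state file named in the exception:
  ls -l /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/
  # a healthy file ends with a 16-byte footer whose first 4 bytes are c0 28 93 e8:
  tail -c 16 /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st | od -An -tx1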

Upgrading the logging pods to the 3.7 images resulted in the same issue.


Version-Release number of selected component (if applicable):

3.7.14


How reproducible:
One time

Actual results:
Corrupted logging indexes.

Expected results:


Additional info:

Comment 2 Jeff Cantrill 2018-03-06 20:41:13 UTC
Can you please provide additional information about the persistent volumes you are using? Also, please consider running https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh to provide additional information about the environment.
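
A rough sketch of how that information could be collected (assuming the aggregated logging project is named "logging"; see the script itself for its supported arguments):

  # PVCs used by the logging pods and the gluster-block PVs they are bound to
  oc get pvc -n logging
  oc get pv $(oc get pvc -n logging -o jsonpath='{.items[*].spec.volumeName}') -o yaml
  # fetch and run the dump script referenced above, capturing its output
  curl -O https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh
  bash logging-dump.sh > logging-dump.out 2>&1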

Comment 5 Jeff Cantrill 2018-03-06 21:33:54 UTC
Possible duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1379568

Comment 7 Peter Portante 2018-03-07 03:19:15 UTC
Shouldn't this be owned by the GlusterFS team to determine why GlusterFS block storage is showing corruption, when we have not seen this error on AWS EBS storage or local disks?

Comment 9 Steven Barre 2018-03-07 19:57:23 UTC
Created attachment 1405523 [details]
logging-dump.sh output