Bug 1552257 - [GSS] Logging Corruption After OCP 3.7 Upgrade
Summary: [GSS] Logging Corruption After OCP 3.7 Upgrade
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-block
Version: cns-3.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Prasanna Kumar Kalever
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1573420 1622458 OCS-3.11.1-devel-triage-done 1642792
 
Reported: 2018-03-06 20:27 UTC by Matthew Robson
Modified: 2018-11-06 09:46 UTC
CC: 27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-23 12:51:03 UTC
Target Upstream Version:



Description Matthew Robson 2018-03-06 20:27:39 UTC
Description of problem:

Upgraded infra nodes from OCP 3.6 to 3.7, which resulted in 2 of the 3 Elasticsearch (ES) nodes entering a crash loop, throwing a CorruptStateException:

logging-es-data-master-dd2iwji6-2-pxws2   0/1       Running            0          23m       192.168.1.176   server171
logging-es-data-master-n3hrld40-2-lm5x5   0/1       CrashLoopBackOff   2          1m        192.168.1.186   server180
logging-es-data-master-qunqsokb-2-k5d5p   0/1       CrashLoopBackOff   2          1m        192.168.1.199   server170
[root@server100 ~]# oc logs logging-es-data-master-qunqsokb-2-k5d5p
[2018-03-01 23:20:30,904][INFO ][container.run            ] Begin Elasticsearch startup script
[2018-03-01 23:20:30,913][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2018-03-01 23:20:30,915][INFO ][container.run            ] Inspecting the maximum RAM available...
[2018-03-01 23:20:30,919][INFO ][container.run            ] ES_HEAP_SIZE: '16384m'
[2018-03-01 23:20:30,921][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2018-03-01 23:20:30,923][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
Exception in thread "main" ElasticsearchException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: IOException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: CorruptStateException[codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))];
Likely root cause: org.elasticsearch.gateway.CorruptStateException: codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:418)
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:330)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
        at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:177)
        at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:299)
        at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:119)
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:87)
        at org.elasticsearch.gateway.GatewayMetaState.loadMetaState(GatewayMetaState.java:99)
        at org.elasticsearch.gateway.GatewayMetaState.pre20Upgrade(GatewayMetaState.java:225)
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at <<<guice>>>
        at org.elasticsearch.node.Node.<init>(Node.java:213)
        at org.elasticsearch.node.Node.<init>(Node.java:140)
        at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
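
The failure above is Lucene's footer validation (CodecUtil.validateFooter / checkFooter) rejecting the global cluster-state file: every state file ends in a fixed magic marker plus a checksum, and a truncated or partially written file leaves other bytes where that footer should be, which is why the "actual footer" value in the error is garbage. A minimal Python sketch of the idea follows; the MAGIC constant and the 8-byte footer layout here are illustrative only, not Lucene's actual on-disk format:

```python
import struct
import zlib

# Illustrative footer marker -- NOT Lucene's real codec magic value.
MAGIC = 0x3FD76C17

def write_state(path, payload):
    # Body followed by an 8-byte footer: magic marker + CRC32 of the body.
    with open(path, "wb") as f:
        f.write(payload)
        f.write(struct.pack(">II", MAGIC, zlib.crc32(payload) & 0xFFFFFFFF))

def read_state(path):
    with open(path, "rb") as f:
        data = f.read()
    body, footer = data[:-8], data[-8:]
    magic, expected_crc = struct.unpack(">II", footer)
    if magic != MAGIC:
        # A truncated file leaves payload bytes where the footer should be,
        # so the marker check fails first: "codec footer mismatch".
        raise IOError("codec footer mismatch (file truncated?): "
                      f"actual footer={magic} vs expected footer={MAGIC}")
    if zlib.crc32(body) & 0xFFFFFFFF != expected_crc:
        raise IOError("checksum failed: state file is corrupt")
    return body
```

An intact file round-trips cleanly; truncating it mid-body makes the magic check fail, mirroring the error in the log. Either failure mode points to an incomplete or corrupted write on the underlying storage, which is why attention turned to the gluster-block volume rather than to Elasticsearch itself.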

Upgrading the logging pods to the 3.7 images resulted in the same issue.


Version-Release number of selected component (if applicable):

3.7.14


How reproducible:
One time

Actual results:
Corrupted logging indexes.

Expected results:
Elasticsearch pods start cleanly after the upgrade, with no index corruption.

Additional info:

Comment 2 Jeff Cantrill 2018-03-06 20:41:13 UTC
Can you please provide additional information about the persistent volumes you are using? Also, please consider running https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh to gather further details about the environment.

Comment 5 Jeff Cantrill 2018-03-06 21:33:54 UTC
Possible duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1379568

Comment 7 Peter Portante 2018-03-07 03:19:15 UTC
Shouldn't this be owned by the GlusterFS team, to determine why GlusterFS block storage is showing corruption when we have not seen this error on AWS EBS storage or local disks?

Comment 9 Steven Barre 2018-03-07 19:57:23 UTC
Created attachment 1405523 [details]
logging-dump.sh output

