Bug 1552257

Summary: [GSS] Logging Corruption After OCP 3.7 Upgrade
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Matthew Robson <mrobson>
Component: gluster-block
Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Docs Contact:
Priority: high
Version: cns-3.6
CC: akhakhar, annair, aos-bugs, bgoyal, bkunal, bugs, ccustine, hchiramm, jarrpa, kramdoss, madam, mrobson, nbhatt, pkarampu, pportant, pprakash, prasanna.kalever, rhs-bugs, rmeggins, rreddy, rtalur, sankarshan, steven.barre, tkatarki, vbellur, vinug, xiubli
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-23 12:51:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1573420, 1622458, 1641915, 1642792    

Description Matthew Robson 2018-03-06 20:27:39 UTC
Description of problem:

Upgraded Infra nodes from 3.6 to 3.7, which resulted in 2 of 3 ES nodes going into a crash loop and throwing a CorruptStateException:

logging-es-data-master-dd2iwji6-2-pxws2   0/1       Running            0          23m       192.168.1.176   server171
logging-es-data-master-n3hrld40-2-lm5x5   0/1       CrashLoopBackOff   2          1m        192.168.1.186   server180
logging-es-data-master-qunqsokb-2-k5d5p   0/1       CrashLoopBackOff   2          1m        192.168.1.199   server170
[root@server100 ~]# oc logs logging-es-data-master-qunqsokb-2-k5d5p
[2018-03-01 23:20:30,904][INFO ][container.run            ] Begin Elasticsearch startup script
[2018-03-01 23:20:30,913][INFO ][container.run            ] Comparing the specified RAM to the maximum recommended for Elasticsearch...
[2018-03-01 23:20:30,915][INFO ][container.run            ] Inspecting the maximum RAM available...
[2018-03-01 23:20:30,919][INFO ][container.run            ] ES_HEAP_SIZE: '16384m'
[2018-03-01 23:20:30,921][INFO ][container.run            ] Setting heap dump location /elasticsearch/persistent/heapdump.hprof
[2018-03-01 23:20:30,923][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
Exception in thread "main" ElasticsearchException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: IOException[failed to read [id:5, legacy:false, file:/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st]]; nested: CorruptStateException[codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))];
Likely root cause: org.elasticsearch.gateway.CorruptStateException: codec footer mismatch (file truncated?): actual footer=1869505397 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st")))
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:418)
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:330)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:451)
        at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:177)
        at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:299)
        at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:119)
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:87)
        at org.elasticsearch.gateway.GatewayMetaState.loadMetaState(GatewayMetaState.java:99)
        at org.elasticsearch.gateway.GatewayMetaState.pre20Upgrade(GatewayMetaState.java:225)
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at <<<guice>>>
        at org.elasticsearch.node.Node.<init>(Node.java:213)
        at org.elasticsearch.node.Node.<init>(Node.java:140)
        at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
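
For reference, the "expected footer" value -1071082520 in the exception is Lucene's codec footer magic (0xC02893E8), so the tail of global-5.st does not end in a valid footer. A minimal sketch of how the file could be inspected, assuming a debug pod can be started from the affected deployment config (name taken from the pod listing above) and that the aggregated logging project is named "logging":

  oc debug dc/logging-es-data-master-qunqsokb -n logging
  # inside the debug shell, look at the state file named in the exception:
  ls -l /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/
  # a healthy file ends with a 16-byte footer whose first 4 bytes are c0 28 93 e8:
  tail -c 16 /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/_state/global-5.st | od -An -tx1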

Upgrading the logging pods to the 3.7 images resulted in the same issue.


Version-Release number of selected component (if applicable):

3.7.14


How reproducible:
One time

Actual results:
Corrupted logging indexes.

Expected results:


Additional info:

Comment 2 Jeff Cantrill 2018-03-06 20:41:13 UTC
Can you please provide additional information about the persistent volumes you are using? Also, please consider running https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh to provide additional information about the environment.
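
A rough sketch of how that information could be collected (assuming the aggregated logging project is named "logging"; see the script itself for its supported arguments):

  # PVCs used by the logging pods and the gluster-block PVs they are bound to
  oc get pvc -n logging
  oc get pv $(oc get pvc -n logging -o jsonpath='{.items[*].spec.volumeName}') -o yaml
  # fetch and run the dump script referenced above, capturing its output
  curl -O https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh
  bash logging-dump.sh > logging-dump.out 2>&1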

Comment 5 Jeff Cantrill 2018-03-06 21:33:54 UTC
Possible duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1379568

Comment 7 Peter Portante 2018-03-07 03:19:15 UTC
Shouldn't this be owned by the GlusterFS team to determine why GlusterFS block storage is showing corruption, when we have not seen this error on AWS EBS storage or local disks?

Comment 9 Steven Barre 2018-03-07 19:57:23 UTC
Created attachment 1405523 [details]
logging-dump.sh output