Bug 1379568 - [Tracker-RHGS-#1698601] Elasticsearch get CorruptIndexException errors when running with GlusterFS persistent storage [depends on ctime support in RHGS - coming in 3.5]
Summary: [Tracker-RHGS-#1698601] Elasticsearch get CorruptIndexException errors when r...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhgs-server-container
Version: cns-1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Raghavendra Talur
QA Contact: Prasanth
URL:
Whiteboard: gluster-stale-cache, ES-on-gluster-fu...
Depends On: 1698601
Blocks: RHGS-3.4-GSS-proposed-tracker 1622458
 
Reported: 2016-09-27 07:26 UTC by Jonas Nordell
Modified: 2021-12-10 14:45 UTC (History)
CC List: 40 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1390050 1698601 (view as bug list)
Environment:
Last Closed: 2019-11-12 21:26:50 UTC
Embargoed:
jnordell: needinfo-



Description Jonas Nordell 2016-09-27 07:26:01 UTC
Description of problem:
After switching from NFS-based persistent storage (not supported for Elasticsearch, but it did work) to FUSE-mounted GlusterFS persistent storage, Elasticsearch keeps getting CorruptIndexExceptions.

We did not migrate any data, so the volume was clean from the beginning. After about five minutes we get a lot of these error messages:

[2016-09-08 04:48:12,014][WARN ][cluster.action.shard     ] [Turner D. Century] [.operations.2016.09.08][2] received shard failed for [.operations.2016.09.08][2], node[870JdnoTQB2Mao7kKKyAgg], [P], s[INITIALIZING], indexUUID [7ctrYnQvSteWQVeffVaOiw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[.operations.2016.09.08][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[.operations.2016.09.08][2] Preexisting corrupted index [corrupted__8st9rreSmyJlTiv1-fXQw] caused by: CorruptIndexException[codec footer mismatch: actual footer=33885192 vs expected footer=-1071082520 (resource: _87_Lucene41_0.pos)]
org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=33885192 vs expected footer=-1071082520 (resource: _87_Lucene41_0.pos)
        at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
        at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:101)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:434)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
        at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
        at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
        at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
        at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
        at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

and 

[2016-09-08 04:46:42,044][WARN ][cluster.action.shard     ] [Turner D. Century] [.operations.2016.09.08][2] received shard failed for [.operations.2016.09.08][2], node[870JdnoTQB2Mao7kKKyAgg], [P], s[INITIALIZING], indexUUID [7ctrYnQvSteWQVeffVaOiw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[.operations.2016.09.08][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[.operations.2016.09.08][2] Preexisting corrupted index [corrupted__8st9rreSmyJlTiv1-fXQw] caused by: CorruptIndexException[codec footer mismatch: actual footer=33885192 vs expected footer=-1071082520 (resource: _87_Lucene41_0.pos)]
org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=33885192 vs expected footer=-1071082520 (resource: _87_Lucene41_0.pos)
       	at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
       	at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
       	at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:101)
       	at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:434)
       	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
       	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
       	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
       	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
       	at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
       	at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
       	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
       	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
       	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
       	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
       	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
       	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
       	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
       	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       	at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
       	at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
       	at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
       	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       	at java.lang.Thread.run(Thread.java:745)
]; ]] 

Version-Release number of selected component (if applicable):
3.2.1

How reproducible:
Every time in this environment when using GlusterFS volumes.

Steps to Reproduce:
1. Give an empty GlusterFS volume to Elasticsearch.
2. Start Elasticsearch.
3. After about five minutes, corruption happens.

Actual results:


Expected results:


Additional info:

Comment 1 Luke Meyer 2016-10-12 13:28:12 UTC
I need an assist from the storage team on this one. Am I right in thinking that the GlusterFS FUSE client is still doing an NFS mount behind the scenes to provide the FS?

If so, it's strange that we get such disparate results between the FUSE client and the NFS one.

Comment 2 Bradley Childs 2016-10-12 19:17:40 UTC
I'm pretty sure that GlusterFS FUSE mounts do not use NFS as the transport behind the scenes.

An example of why this wouldn't work: striped volumes have pieces of files on multiple servers. If it were NFS behind the scenes, NFS wouldn't know when to switch servers for the next piece of a file.

Most likely there's an NFS server running backed by a glusterfs server.

You may want to check the consistency of the backing Gluster filesystem. An (old) blog on such things is here: http://blog.gluster.org/2014/02/gluster-and-hadoop-using-fuse-options-to-defy-the-cap-theorem/

I'm not sure of Elasticsearch's requirements for filesystem consistency, but if a file write is supposed to be visible instantly on any node reading the file, you want strict consistency. There's a performance hit, but some applications require it.
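
For reference (an illustrative sketch, not from the original comment): the blog above is about tightening the FUSE client's kernel caching at mount time. attribute-timeout, entry-timeout and negative-timeout are standard glusterfs mount options; <server> and <volname> are placeholders, and whether this actually helps Elasticsearch here is exactly what is under investigation:

  $ mount -t glusterfs -o attribute-timeout=0,entry-timeout=0,negative-timeout=0 <server>:/<volname> /mnt/glusterfs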

Comment 16 Raghavendra G 2016-10-28 04:48:28 UTC
There are errors which possibly indicate a wrong fstat (note the "file truncated?" in the output):

org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=NIOFSIndexInput(path="\
/mnt/ptop/node153/customerissue/nodes/0/indices/node142/2/index/_b1.cfs") [slice=_b1_Lucene50_0.pos])

There is a similar issue being discussed upstream at:
http://nongnu.13855.n7.nabble.com/Issue-about-the-size-of-fstat-is-less-than-the-really-size-of-the-syslog-file-td216522.html
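
As a crude illustration (not part of the original triage) of what such a wrong fstat looks like from the mount, one can compare the size stat reports with the bytes actually readable; the path below is the one from the log line above:

  $ f=/mnt/ptop/node153/customerissue/nodes/0/indices/node142/2/index/_b1.cfs
  $ stat --format='reported size: %s' "$f"
  $ wc -c < "$f"    # bytes actually readable; should match the reported size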

Comment 18 Raghavendra G 2016-10-31 03:22:21 UTC
Just had a discussion with Pranith. Write-behind doesn't add lookups and readdirp (both of which send iatt information in their cbks) to its dependency checks on cached writes. So it is possible that lookups/readdirp bypassed the write-behind cache and returned stale iatt information. I'll be working on a patch to fix these issues in write-behind and see whether it helps.

Another thing: there are other fops (like readlink, link, etc.) which are not directly dependent on writev but which return post-op iatt in their responses; write-behind doesn't strive to maintain the ordering of these fops with cached writes.

Thanks to Pranith for all the hard work that went into gathering this data.
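
A quick way to check which performance translators are active on a volume while testing, and to take a suspect one out of the graph (a sketch; <volname> is a placeholder):

  $ gluster volume get <volname> all | grep performance.
  $ gluster volume set <volname> performance.write-behind off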

Comment 19 Raghavendra G 2016-10-31 03:23:34 UTC
Will be working on readdirp and quick-read related issues once I am done with write-behind.

Comment 20 Raghavendra G 2016-10-31 05:36:04 UTC
(In reply to Raghavendra G from comment #18)
> Just had a discussion with Pranith. Write-behind doesn't add lookups and
> readdirp (both do send iatt information in their cbks) for dependency checks
> on cached writes. So, it might be a possibility that lookups/readdirp
> bypassed write-behind cache and returned stale iatt information. I'll be
> working on a patch to fix these issues in write-behind and see whether it
> helps.

An upstream patch can be found at:
http://review.gluster.org/#/c/15757/

Comment 21 Raghavendra G 2016-10-31 11:18:24 UTC
(In reply to Pranith Kumar K from comment #17)
> I have been doing runs with different sets of xlators and found that
> whenever the following performance xlators are present in the graph it is
> leading to index-corruption
> 1) Quick-read

It's strange that just Quick-read resulted in the failure. Quick-read returns a cached stat only in readv_cbk. Note that stat is cached in md-cache and the kernel. Though md-cache uses the stat in readv_cbk to refresh its cache, fuse-bridge doesn't send the stat in the readv reply. So, for a stale stat to be seen we need at least md-cache along with quick-read. Since that is not the case here, I suspect there is another issue, apart from the stale stat, that is resulting in the failure of Elasticsearch.

Comment 22 Raghavendra G 2016-10-31 11:19:51 UTC
(In reply to Raghavendra G from comment #21)
> (In reply to Pranith Kumar K from comment #17)
> > I have been doing runs with different sets of xlators and found that
> > whenever the following performance xlators are present in the graph it is
> > leading to index-corruption
> > 1) Quick-read
> 
> It's strange that just Quick-read resulted in the failure. Quick-read
> returns a cached stat only in readv_cbk. Note that stat is cached in

s/cached stat/stat it has cached in qr_inode/. Other than readv, it acts as a pass-through for stat in all calls, and hence it cannot be the originator of the stale stat.

> md-cache and kernel. Though md-cache uses the stat in readv_cbk to refresh
> its cache, fuse-bridge doesn't send the stat in readv reply. So, for a stale
> stat to be seen we need at least md-cache along with quick-read. Since, it
> is not the case, I am suspecting there is another issue apart from
> stale-state which is resulting in failure of Elasticsearch.

Comment 23 Pranith Kumar K 2016-11-01 02:26:19 UTC
With Raghavendra's http://review.gluster.org/#/c/15757/2 patch I don't see any corruption with just write-behind.

Comment 24 Pranith Kumar K 2016-11-01 02:35:42 UTC
(In reply to Raghavendra G from comment #21)
> (In reply to Pranith Kumar K from comment #17)
> > I have been doing runs with different sets of xlators and found that
> > whenever the following performance xlators are present in the graph it is
> > leading to index-corruption
> > 1) Quick-read
> 
> It's strange that just Quick-read resulted in the failure. Quick-read
> returns a cached stat only in readv_cbk. Note that stat is cached in
> md-cache and kernel. Though md-cache uses the stat in readv_cbk to refresh
> its cache, fuse-bridge doesn't send the stat in readv reply. So, for a stale
> stat to be seen we need at least md-cache along with quick-read. Since, it
> is not the case, I am suspecting there is another issue apart from
> stale-state which is resulting in failure of Elasticsearch.

I think there is a race between qr_writev() and qr_lookup_cbk() which can result in quick-read caching old data, i.e.:
1) qr_lookup() is wound; it got some values from posix and qr_lookup_cbk() is about to be executed.
2) qr_writev() is wound after qr_inode_prune().
3) qr_lookup_cbk() from 1) above updates the content and statbuf with the old values and will serve that as readv's output. Which is stale, right?

Comment 25 Pranith Kumar K 2016-11-01 03:08:56 UTC
(In reply to Pranith Kumar K from comment #24)
> (In reply to Raghavendra G from comment #21)
> > (In reply to Pranith Kumar K from comment #17)
> > > I have been doing runs with different sets of xlators and found that
> > > whenever the following performance xlators are present in the graph it is
> > > leading to index-corruption
> > > 1) Quick-read
> > 
> > It's strange that just Quick-read resulted in the failure. Quick-read
> > returns a cached stat only in readv_cbk. Note that stat is cached in
> > md-cache and kernel. Though md-cache uses the stat in readv_cbk to refresh
> > its cache, fuse-bridge doesn't send the stat in readv reply. So, for a stale
> > stat to be seen we need at least md-cache along with quick-read. Since, it
> > is not the case, I am suspecting there is another issue apart from
> > stale-state which is resulting in failure of Elasticsearch.
> 
> I think there is a race between qr_writev() and qr_lookup_cbk() which can
> result in quick-read caching old data? i.e.
> 1) qr_lookup() is wound it got some values from posix and qr_lookup_cbk() is
> going to be executed.
> 2) qr_writev() is wound after qr_inode_prune()
> 3) qr_lookup_cbk() from 1) above updates the content and statbuf with old
> values and will serve that as readv's output. Which is stale right?

Just to expand on the case:
1) We have a file with content 'abc'.
2) A lookup comes in, along with a writev which replaces 'abc' with 'def'. The lookup fetches 'abc' but has yet to update the cache, and then the writev is immediately wound, which zeroes out the cache. Now lookup_cbk updates the buffer with 'abc' even though on disk it is 'def'. Then the writev completes and returns to the application.
3) The application does a readv, which is served from quick-read as 'abc'.

Comment 26 Raghavendra G 2016-11-01 03:32:49 UTC
Yes, it's a bug. I will be sending a fix (mostly, using generation numbers in quick-read to figure out stale data), as we discussed on chat.

Comment 29 Atin Mukherjee 2016-11-07 08:49:50 UTC
upstream patch http://review.gluster.org/15759 posted for review.

Comment 41 Krutika Dhananjay 2017-02-06 10:38:47 UTC
Found one more bug.

So this is with write-behind enabled. It took me just 8 runs before I encountered the CorruptIndexException error with the following volume configuration:

Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 193984aa-65cb-4f5b-a3c1-5c61939dde69
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: xxxxxxxxx:/bricks/1
Brick2: xxxxxxxxx:/bricks/2
Brick3: xxxxxxxxx:/bricks/3
Brick4: xxxxxxxxx:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on
performance.client-io-threads: off
performance.io-cache: off
performance.quick-read: off
performance.read-ahead: off
performance.open-behind: on
performance.write-behind: on
transport.address-family: inet
nfs.disable: on

So before running the test I manually edited the volfile to load the trace translator above and below write-behind. The file that got the CorruptIndexException was "/nodes/0/indices/HGWz8soOTCeZgk3QJqc3Gw/0/index/_1b_Lucene50_0.pos".


From the mount logs, the following sequence of entries indicates a bug due to write-behind:

1. [2017-02-03 06:51:42.636232] I [trace.c:1730:trace_lookup] 0-dis-rep-trace-1: 5038732: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 path=/nodes/0/indices/HGWz8soOTCeZgk3QJqc3Gw/0/index/_1b_Lucene50_0.pos

2. [2017-02-03 06:51:42.636336] I [trace.c:1730:trace_lookup] 0-dis-rep-trace-2: 5038732: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 path=/nodes/0/indices/HGWz8soOTCeZgk3QJqc3Gw/0/index/_1b_Lucene50_0.pos

3. [2017-02-03 06:51:42.637617] I [trace.c:2324:trace_writev] 0-dis-rep-trace-1: 5038734: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 fd=0x7f363801fa5c, count=1,  offset=3769062 flags=08001 write_size=8

4. [2017-02-03 06:51:42.637692] I [trace.c:258:trace_writev_cbk] 0-dis-rep-trace-1: 5038734: (op_ret=8, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=N/A mtime=N/A ctime=N/A atime_sec=0, atime_nsec=0, mtime_sec=0, mtime_nsec=0, ctime_sec=0, ctime_nsec=0}, *postbuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=N/A mtime=N/A ctime=N/A atime_sec=0, atime_nsec=0, mtime_sec=0, mtime_nsec=0, ctime_sec=0, ctime_nsec=0})
[2017-02-03 06:51:42.637783] I [trace.c:2324:trace_writev] 0-dis-rep-trace-2: 0: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 fd=0x7f363801fa5c, count=1,  offset=3769062 flags=08001 write_size=8

5. [2017-02-03 06:51:42.638907] I [trace.c:626:trace_lookup_cbk] 0-dis-rep-trace-2: 5038732: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 (op_ret=0 *buf {gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 ino=13256560029748328626, mode=100644, nlink=1, uid=996, gid=993, size=3769062, blocks=7362, atime=2017-02-03-06:51:38 mtime=2017-02-03-06:51:42 ctime=2017-02-03-06:51:42 atime_sec=1486104698, atime_nsec=845399892, mtime_sec=1486104702, mtime_nsec=627399887, ctime_sec=1486104702, ctime_nsec=627399887}, *postparent {gfid=f168f144-b280-4729-9fa9-ab4ddb5cebf9 ino=11504915073975774201, mode=40755, nlink=2, uid=996, gid=993, size=4096, blocks=8, atime=2017-02-03-06:51:41 mtime=2017-02-03-06:51:38 ctime=2017-02-03-06:51:38 atime_sec=1486104701, atime_nsec=993399888, mtime_sec=1486104698, mtime_nsec=901399892, ctime_sec=1486104698, ctime_nsec=901399892}

6. [2017-02-03 06:51:42.638967] I [trace.c:626:trace_lookup_cbk] 0-dis-rep-trace-1: 5038732: gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 (op_ret=0 *buf {gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 ino=13256560029748328626, mode=100644, nlink=1, uid=996, gid=993, size=3769062, blocks=7362, atime=2017-02-03-06:51:38 mtime=2017-02-03-06:51:42 ctime=2017-02-03-06:51:42 atime_sec=1486104698, atime_nsec=845399892, mtime_sec=1486104702, mtime_nsec=627399887, ctime_sec=1486104702, ctime_nsec=627399887}, *postparent {gfid=f168f144-b280-4729-9fa9-ab4ddb5cebf9 ino=11504915073975774201, mode=40755, nlink=2, uid=996, gid=993, size=4096, blocks=8, atime=2017-02-03-06:51:41 mtime=2017-02-03-06:51:38 ctime=2017-02-03-06:51:38 atime_sec=1486104701, atime_nsec=993399888, mtime_sec=1486104698, mtime_nsec=901399892, ctime_sec=1486104698, ctime_nsec=901399892}

7. [2017-02-03 06:51:42.639940] I [trace.c:258:trace_writev_cbk] 0-dis-rep-trace-2: 0: (op_ret=8, *prebuf = {gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 ino=13256560029748328626, mode=100644, nlink=1, uid=996, gid=993, size=3769062, blocks=7362, atime=2017-02-03-06:51:38 mtime=2017-02-03-06:51:42 ctime=2017-02-03-06:51:42 atime_sec=1486104698, atime_nsec=845399892, mtime_sec=1486104702, mtime_nsec=627399887, ctime_sec=1486104702, ctime_nsec=627399887}, *postbuf = {gfid=3d13bad2-094b-443b-b7f8-c31c5cca78b2 ino=13256560029748328626, mode=100644, nlink=1, uid=996, gid=993, size=3769070, blocks=7362, atime=2017-02-03-06:51:38 mtime=2017-02-03-06:51:42 ctime=2017-02-03-06:51:42 atime_sec=1486104698, atime_nsec=845399892, mtime_sec=1486104702, mtime_nsec=637399887, ctime_sec=1486104702, ctime_nsec=637399887})



Basically, there was a lookup on _1b_Lucene50_0.pos, very quickly followed by a writev on its fd at offset=3769062 and size=8. Write-behind lied to the application at (4), indicating that the write was complete. After this, the lookup that was originally wound and this pending write were both wound over the network; the lookup hit the disk first and returned the size of the inode as "3769062" in its response, when it should in fact have been 3769062 + 8 = 3769070, given that from the application's POV the write was already complete. Shortly after that, when the writev *actually* hit the disk, it returned the correct size of 3769070 in its poststat.

This is also in line with the nature of the error thrown in the elasticsearch logs:

...
...
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/glusterfs/elasticsearch/nodes/0/indices/HGWz8soOTCeZgk3QJqc3Gw/0/index/_1b_Lucene50_0.pos")))
...
...


Maybe write-behind should be flushing outstanding writes on a file before serving the fops LOOKUP, STAT, FSTAT and READDIRP.
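
(Illustrative sketch only, not from the original analysis: the kind of probe this observation suggests. Append to a file on the FUSE mount and stat it immediately after the write returns; with the stale-stat behaviour described above, the reported size can occasionally lag. Assumes a mount at /mnt/glusterfs, ideally mounted with attribute-timeout=0 so the stat actually reaches the client stack.)

  f=/mnt/glusterfs/size-probe
  : > "$f"
  for i in $(seq 1 1000); do
      printf 'ABCDEFGH' >> "$f"              # 8-byte append, mirroring the writev above
      sz=$(stat --format=%s "$f")            # size seen right after write() returned
      [ "$sz" -eq $((i * 8)) ] || echo "stale size at iteration $i: got $sz, expected $((i * 8))"
  done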

I will go through the code myself (*even* if it is going to take me some time to understand write-behind internals), figure out the fix, and send out the patch.

-Krutika


PS: In the tests that I ran with my readdir-ahead fix @ https://review.gluster.org/#/c/16419/ (with md-cache and readdir-ahead being the only perf xlators enabled), I did not encounter any CorruptIndexException errors even after 670+ runs.

Comment 48 Csaba Henk 2017-03-28 06:11:14 UTC
I briefly describe my testing methodology.

The basic test run is loading a Mediawiki index as described in

https://www.elastic.co/blog/loading-wikipedia

(we actually use wikiquote).

This is scripted up here: https://gist.github.com/csabahenk/1389b608b27881782d9e63cc4c14b9e1

However, a single run will hardly reproduce the issue. Depending on configuration, we might need 50-100, or 500-1000, runs to hit an error. (NB: it's good to clarify what "to hit an error" actually means. We don't get to see anything on the GlusterFS side; from GlusterFS' point of view a silent corruption occurs and GlusterFS gives no indication of it. It's the Elasticsearch service that detects an irregular situation; however, it doesn't crash on it, it just prints/logs a warning, dumps some stack traces, and then churns on happily. So "hitting an error" boils down to monitoring the stdout/log of Elasticsearch and hunting for failure warnings in it. The test script referred to above performs this check, and when an error is hit, it stops Elasticsearch and saves the relevant assets of the test to a permanent location.)

Therefore we have another script that runs the above script in a loop until certain conditions apply (in particular, until disk space becomes too low). Here we provide the script (testrun.sh) as base64-encoded gzipped cpio (extract with "base64 -d | zcat | cpio -i"):

H4sIAIMS0lgAA3VTX08UMRBfgk/l3ee5dcnpwW45jEqOrHrCJRABDUhCYkwo2+5uuf13bZdTzzPxe/im34WvwpcwcXZv0UO
kD+2085vJ/OY3vRytWD/v/fp2tXi1uGBZ1o+T+50lPL8vGqGNKjNPx5b1oEXPZEZ1TIgWBlxBiAzhPdjO1t6gf3D81gYfuv
BhE0wsMgKgS56DjsENoF0MZZLAObtg6AAICqDCBFQkTBsZaMFUEN+8eZ/SZOLlSkar0zpGpeCqEOgFUzTJA5Zc46siaQdom
hkaJaU2Qt1w/QO7FU8rT6dNQkkI6YEzeXXYP9ja6fkpq3JNyUMScPg6n4dE0kCIDOLaUqLuiBszxaEqGfvkzLKQR4RwZlip
pG87k27PjY0pdI9SXqaF9sZyKFPBJUOqEc2xdYoGUqlSN00JSqUEMhNZBR2VuRHu+lr3abe7vuHOI90gzwwivXOdZ170eWo
THcvQ1JS2d49eHx8Ntvf7Jz1/49mUFJKHMhE+NWkxz8xDByEiiHNwHHgOTgOEmeorqPo4ru69TeA5CsOlHpZacP+Uh+DGQO
ELGCYTcLtosfEQ2pNCycyA82TaxifEQlvTZUrbpxhfz5BznQXcSIAzV+3cPAHUZdnbiIVSM0SKj4EQXN8kOF1eRYc0Mos8u
45rKqvtMyXYEK1K7WZGnZ03+4O7RsZTZ6hsIyEOe2PZaL6ssithLljiOy9mXFoVm9kbfpERrN0mUGWFEFuEdMfSxHWxoA0z
pb6O9f5XayJEAY8JzzNUAb/DX20uR9aftdCcS/X+7rC/uzc4bLVa1p3rNxR3oyMABAAA

The technical details of the test setup are given in a footnote.¹ After having run it for a while and collected some errored cases, we can engage in forensics.

I will now describe a way of uncovering the snag Elasticsearch hits. As a further prerequisite, we need a GlusterFS installation with the following change applied: https://review.gluster.org/16422. We also need to enable fuse dumping in the underlying GlusterFS instance and pass the "--tracefiles=<path-to-fusedump>" option to testrun.sh. That will leave us with a large (typically a few gigabytes in size) gzipped fuse dump in the save directory.

We want to extract a trailing segment of it, say of approximately 50 megabytes. Due to change 16422, the "GLUSTER\xF5" pattern will be present in every message in the dump at a fixed offset (25). So we can read in the trailing 50 MB of the dump, look for "GLUSTER\xF5" in it, go back 25 bytes, and write the buffer out to a file from that point on, which will then be a consistent fuse dump segment. The following one-liner does this (assuming the original gzipped dump is called pop.fuse.gz):

  $ pv < pop.fuse.gz | zcat | ruby -e 'ba=Array.new 5; loop { ba.shift; ba << (STDIN.read(10<<20)||""); ba.last.size < 10<<20 and break}; b=ba.map(&:b).join; i=b.index("GLUSTER\xF5".b); STDERR.puts i; i or raise "pattern not found"; i<25 and (b=b[i+1..-1]; i=b.index("GLUSTER\xF5".b); STDERR.puts i; (!i or i<25) and raise "bad data"); print b[i-25..-1]' > pop.fuse.tail-50
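
Roughly the same extraction can be done with standard tools (a sketch; unlike the ruby one-liner above, it does not handle the corner case where the first match sits within the first 25 bytes of the buffer):

  $ zcat pop.fuse.gz | tail -c $((50 * 1024 * 1024)) > pop.fuse.tail-raw
  $ off=$(LC_ALL=C grep -abo $'GLUSTER\xf5' pop.fuse.tail-raw | head -n1 | cut -d: -f1)
  $ tail -c +$((off - 25 + 1)) pop.fuse.tail-raw > pop.fuse.tail-50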

Now, in this manageably sized dump, we shall look for the history of the file with which things went awry. The Elasticsearch log (symlinked to elastic.log in the save directory) carries a mention of this file, similar to:

  [2017-02-15T04:51:45,311][WARN ][o.e.i.e.Engine           ] [chenk-centos7-elastic-9] [en.wikiquote.org][4] failed engine [writeIndexingBuffer failed]
  org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=51577862 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/gluster/elastictest/elastictest/120/nodes/0/indices/EiKpUZ6uT3OXhrjYecEfXg/4/index/_f_Lucene50_0.doc")))

So the file we are to investigate is _f_Lucene50_0.doc. The first thing we need is the basic tool to investigate fuse dumps, parsefuse: https://github.com/csabahenk/parsefuse. Compile it following the instructions (against the appropriate FUSE protocol headers). We offer another utility fusenodequery.rb, which can extract operations on a certain file from a (JSON converted) fuse dump.  Here it is as base64'd gzipped cpio:

H4sIANJj2VgAA9VXTW7bRhRWEKAo2E17gzEdwJJBMVa6C6IYbm23bp3YlR2gACEII3FoMSaHNH8UKaIWvUSvlFUP0Btk0wN
0UaDvDWeGFGW5CIouasAiOfPm/X7vZz7cffb5r7/98cvHxx8fP2q1Wr///Nf4K3j++aWXp4xHLrvLWbKwk3GrtbvzNE+Tp2
OfP2V8RpJ8vDCMnAcsTclRktCFHbJsGrkjl3k+Z+4hee76N4ZBSBi5ecDIcflFCBAQ2CK3FtmfixVCUhZ4zu3Q9nmaUT5hI
zajAXEjuU2I75G5zcI4WxzqtfKY/mRBymp7KELzh13uGutv+DT0YxJQZYnY9fkkyN1S7QbR9zSdbqMRPw+7JYtG06ZE7Rix
Weo5JX2yXCml6WQK/iDFrTUrtFFT8BlQzYymjdMNC8VPAgH1E0b2UpbtGbv6820a8T3DkJE6heC/huD/hMHH8/MMTpfONko
1w/QmpDHx3DyMhajlalvsBI2N+i9JwSrdc1Cc2RCltvkqvTE7NpzKWTqiWbtnPevY4C53WdACvU36/T6hK8d8w32ApDnUXN
oCOfmwKPrOsENevCAKA6satLQ3tE/QiBuWhYxnfsRTQsEcwCOnIUsFdchYBhrGSTQhoDi1xgWhpEva+DPuAPtdJAsY9VJFh
/GZF+UGIfMDa96DLe4HFpmr1RwEBrAHJpF5T60idX/tswefNii1BJ8R7QNGDsFpABJKnsPHyvYCmkF01EG0TomX3BQohKaj
hE3q2jKLTiYqJhOaMu2+d1PGa+kgOIkoojfuioqdc4c8htrf4qBOkercSMQXT8/qp2frp9eSGHbqEa3gDRtGfWUtCgJmNQn
MIs5QSUgmiFudWA7G3QaIsEnWTK8QyBAFjmDlzIaWQIfCHlSksFmRPFqpv2aJEFsma6gXsyRvmgZPqyQebmAV6zEBJ67BVL
wBV6fUTeMB96K4D+QixQ4skWRWr2PoSEdl5op4mecXFz++uTQtYn47ODm6PjENBVdRygT3prXloiyCh6Qp62ur2+uQKCEJy
/KEbwRR0feUbpZ5kiRRAqXgPUuiw42jwgH9jXPPrIOOpFj4LHAFXc00sOf4+Gxwef7mytwiGlnYusTWi9QD5m84oKxmz4TN
HIomuaf5aCtKYq35hu7rfaqJBQhy8p9h4buT66Pr68E2Z/1jnIRy98bJMo9gz1wPF5IbjYyRcG/XTOxgYvNC+ckpWam81k4
ijYzZXuHHuDACRR/sA5wDxRXLbM7e6WP1goaI0TVD693UGthAJQOtsbH6bkNvqYwqTtsUwievlEDiRs1yYKhyVLGxyGw4VF
JsOVlUwwAUsNHo9Oz8ZDTCZvTkAHYbE0FtYUHDYG0hirOYJinbQ4efYlAuYkTfUkg0Q5pNpiw1Sf9l1W5CiCaMDrrCBrrol
YfA6GTx8JH0pnEG/f3gEYopMWwECPbBm0/2IT4YHiqio7sQlbkO3Ra6N7RaapB1qQjaT5KKB/611NR//2m2bpdaTl5XyLDz
CRpE8YPy8f2+IqM5rER6wkQIiqpu6SF4UqW5FyWAnXJugk/AWObDPFmRX8SYGZeIvQSzUugRxYWsc3bE22a3izJMzD4pDJu
ttFPRxNjuul0pQRBX0u6h90p6qe4pfIn5qCNnGm3GrHGuK006NTWpsrEkxfQUubRjaHdU7cgrLZOtSGWafcsgTur2c3V9fD
IY2HGepcTM+S2P3vGSEdldeitVytncz0iv0VU0a0dMqiamPlqKGW8Oq/ZWKm1sF1caBfLEixSqRcoGpu5M65cMQ94TBAjuv
0mAzLPXegINCjHYI5R/uLp4bQcRdduBQrKc+UvraFlUG1eWsS6193WJJ/sIvTwbRXlWn5ej8VsLuwvNg6yo2qc0vSjUXq2b
CjdK/8eJzzMCTLAa40aNTrhd0aFjJRmuN9tjQn2Qap69ury4ujr75lyOa9j+iLYb6ru+nklTHNFmVGiNGk/d1ZodBW+hEpH
rYVnqyeW9H7fRm0vEqgao4w1hZiarjhyvlyVbhLxdu+WuanKW6n7Xf3m7gotzcsN22lM1LUiUyhTV0qe1i+N+qWmnAi2WMQ
puwXlsYwJVXpkq1OtuKZ2I/60Pdy3990g+vxC/14MjaJ+DnZ2d1v/y729WCHY/ABIAAA==

See this command:

  $  ./parsefuse -format json pop.fuse.tail-50 | ./fusenodequery.rb -f op,size _f_Lucene50_0.doc

This extracts the FUSE operation name and size from those fuse messages in which attributes of _f_Lucene50_0.doc are contained:

  {"Unique":29507361,"op":"CREATE","size":[0]}
  {"Unique":29508577,"op":"READDIRPLUS","size":[204800]}
  {"Unique":29509685,"op":"LOOKUP","size":[209380]}

The size of the file reported in the READDIRPLUS reply and the one reported in the later LOOKUP reply differ. Is that OK or not? Let's pull out more info with fusenodequery. In its default mode of operation it reports only those fuse messages which include all of the requested fields (i.e. the ones listed with "-f"). This can be relaxed with the --partial option:

  $  ./parsefuse -format json pop.fuse.tail-50 | ./fusenodequery.rb --partial -f op,size _f_Lucene50_0.doc | cat -n

thus getting the complete list of messages affecting _f_Lucene50_0.doc:

     1	{"Unique":29507359,"op":"LOOKUP"}
     2	{"Unique":29507360,"op":"LOOKUP"}
     3	{"Unique":29507361,"op":"CREATE","size":[0]}
     4	{"Unique":29507531,"op":"GETXATTR"}
     5	{"Unique":29507532,"op":"WRITE"}
     6	{"Unique":29507586,"op":"GETXATTR"}
     7	{"Unique":29507587,"op":"WRITE"}
    ... (a series of GETXATTR and WRITE messages) ...
    56	{"Unique":29508546,"op":"GETXATTR"}
    57	{"Unique":29508547,"op":"WRITE"}
    58	{"Unique":29508552,"op":"FLUSH"}
    59	{"Unique":29508553,"op":"RELEASE"}
    60	{"Unique":29508577,"op":"READDIRPLUS","size":[204800]}
    61	{"Unique":29509439,"op":"OPEN"}
    62	{"Unique":29509445,"op":"FLUSH"}
    63	{"Unique":29509447,"op":"READ"}
    64	{"Unique":29509476,"op":"READ"}
    65	{"Unique":29509521,"op":"RELEASE"}
    66	{"Unique":29509685,"op":"LOOKUP","size":[209380]}
    67	{"Unique":29509686,"op":"UNLINK"}
    68	{"Unique":29509687,"op":"FORGET"}

As one can see, the file is not open for writing in between the READDIRPLUS and the LOOKUP messages, so it is incorrect that the sizes reported by these two operations differ. Continuing the investigation further² reveals that the total number of bytes WRITE-n to the file amounts to the size reported by LOOKUP, so the bogus size value is the one reported by READDIRPLUS.

* * *

¹ Prerequisites (beyond GlusterFS and Elasticsearch) are: strace, ruby and pv (http://www.ivarch.com/programs/pv.shtml). Also, the process has been performed with Elasticsearch 5.x, so it's recommended to use such a version. The GlusterFS volume should be mounted at /mnt/gluster/elastictest, and /etc/elasticsearch/elasticsearch.yml should be created with content similar to:

---
cluster.name: ./gluster-test-00
node.name: ${HOSTNAME}
network.host: 172.16.53.71
network.publish_host: 172.16.53.71
path.data: /mnt/gluster/elastictest/elastictest/00
path.logs: /var/local/elastic/test00.logs

("${HOSTNAME}" should be kept literal, but change 172.16.53.71 to your IP address -- it's not tested whether ${HOSTNAME} can be used there).

Clone the gist referred to above into ~/elastictest:

  $ git clone https://gist.github.com/csabahenk/1389b608b27881782d9e63cc4c14b9e1 elastictest

Get a dump of wikiquote as described in the referenced article, say enwikiquote-20170102-cirrussearch-content.json.gz.
Now you can run testrun.sh with it:

  $ ./testrun.sh enwikiquote-20170102-cirrussearch-content.json.gz

This saves the results to numbered directories under /var/local/elastictest. Usually just a minimal test log is retained (test.log), but if an error is detected, then the whole Elasticsearch index (under /mnt/gluster/elastictest/elastictest/<directory of same index>) and other associated logs/dumps are kept as well. The script runs until it detects that there is not enough space to accommodate the artifacts of yet another run ("enough space" is a dummy heuristic; you might change the actual values). If you want to terminate the testing process, the suggested way is just to rm ~/elastictest/elastictest.rb -- this will allow the current test run to finish and clean up, and only the next turn will be halted on detecting the lack of the elastictest.rb script (on the next command-line invocation of testrun.sh, elastictest.rb will be restored from git, so don't worry about it). If you want to drop all previous results, call testrun.sh with CLEANUP=1 in the environment.

² Using 1) the "-lim 0" option for fusedump so as not to truncate the WRITE payloads, 2) the "--dump" option for fusenodequery to display the complete messages that involve _f_Lucene50_0.doc, and 3) some scripting to extract and summarize the sizes of the WRITE buffers from the dumped YAML stream.

Comment 50 Pranith Kumar K 2017-05-10 16:47:10 UTC
(In reply to Yong from comment #49)
> +1
> Seems I have the same issue with glusterfs 3.10.1, here're the related
> topics:
> 
> https://discuss.elastic.co/t/elasticsearch-index-health-turned-red/83805/10
> https://discuss.elastic.co/t/shard-repeat-to-be-unassigned/83975/3
> https://forums.rancher.com/t/glusterfs-and-elasticsearch/2293

Yong,
     Could you try the following?

Create a plain distribute volume with just one brick. Set the following options:

gluster volume set <volname> performance.quick-read off
gluster volume set <volname> performance.io-cache off
gluster volume set <volname> performance.write-behind off
gluster volume set <volname> performance.stat-prefetch off
gluster volume set <volname> performance.read-ahead off
gluster volume set <volname> performance.readdir-ahead off
gluster volume set <volname> performance.open-behind off
gluster volume set <volname> performance.client-io-threads off
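
For reference, the "plain distribute volume with just one brick" above can be created like this (host and brick path are placeholders; "force" is only needed if the brick sits on the root filesystem):

gluster volume create <volname> <host>:/bricks/es-test force
gluster volume start <volname>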

When I last tested Elasticsearch this way it finished successfully. It may be a bit slow, but it worked and completed.

Let me know your results as well.

Comment 54 Oonkwee Lim 2017-09-27 17:11:15 UTC
Hello,

Any idea when gluster-block-backed RWO volumes to support Elasticsearch will be out?

Will the block storage in RHGS 3.3 be able to support Elasticsearch?

Comment 61 Jeff Cantrill 2018-03-06 21:34:20 UTC
Possible duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1552257

Comment 72 Raghavendra G 2019-04-10 04:29:44 UTC
http://hrkscribbles.blogspot.com/2018/11/elastic-search-on-gluster.html is the blog I referred to in comment #71.

