Bug 762547 (GLUSTER-815)

Summary: quick-read and replicate self-heal interaction results in empty reads
Product: [Community] GlusterFS Reporter: Vikas Gorur <vikas>
Component: replicate Assignee: Pavan Vilas Sondur <pavan>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: low    
Version: mainline CC: amarts, fharshav, gluster-bugs, ian.rogers, rabhat, sac, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Regression: RTP Mount Type: ---
Attachments:
Description Flags
unit test for automation none

Description Vikas Gorur 2010-04-08 23:35:00 UTC
When quick-read and replicate are both loaded, the following sequence of events can happen:

1) Server2 goes down.
2) A bunch of files are created.
3) Server2 comes back up.
4) Entry self-heal is triggered on the directory and all the empty files are created.
5) "cat" is run on one of the files. Replicate triggers self-heal but unwinds
the lookup before the self-heal is complete (background self-heal). This lookup
returns the xattr to quick-read, which is of course empty. quick-read caches this
empty content indefinitely and keeps serving the empty file.
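
The sequence above can be sketched as a toy model. This is only an illustration in Python, not GlusterFS code: the dictionaries standing in for the two servers, the `QuickReadCache` class, and the `lookup_from_sink` / `lookup_from_source` helpers are all hypothetical names. It assumes that the cache keeps whatever content the first lookup returned and never revalidates it.

```python
class QuickReadCache:
    """Toy cache: stores the content returned by the first lookup, forever."""

    def __init__(self, lookup):
        self.lookup = lookup
        self.cache = {}

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.lookup(name)
        return self.cache[name]


def make_cluster():
    # server1 (source) holds the data; server2 (sink) only has the empty
    # file created by entry self-heal.
    source = {"a.txt": b"hello, world\n"}
    sink = {"a.txt": b""}
    return source, sink


def lookup_from_sink(source, sink, pending):
    # Problem case: self-heal is queued in the background and the lookup
    # reply carries the content of the not-yet-healed copy.
    def lookup(name):
        pending.append(name)
        return sink[name]
    return lookup


def lookup_from_source(source, sink, pending):
    # Assumed behaviour after the fix: the reply carries the content
    # of the source copy.
    def lookup(name):
        pending.append(name)
        return source[name]
    return lookup


def run_background_heal(source, sink, pending):
    # Background self-heal completes some time after the lookup returned.
    while pending:
        name = pending.pop()
        sink[name] = source[name]


def simulate(make_lookup):
    source, sink = make_cluster()
    pending = []
    cache = QuickReadCache(make_lookup(source, sink, pending))
    first = cache.read("a.txt")        # the "cat" that triggers self-heal
    run_background_heal(source, sink, pending)
    second = cache.read("a.txt")       # a later "cat", after the heal
    return first, second, sink["a.txt"]
```

In this model, `simulate(lookup_from_sink)` keeps returning empty content from the cache even after the sink copy has been healed, while `simulate(lookup_from_source)` returns the real content on both reads.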

Comment 1 Ian Rogers 2010-04-15 20:36:05 UTC
Please please (pretty please :-) just add all the caching types - lookup, quick-read, stat-prefetch - into io-cache and selectable by options, or this sort of thing is going to crop up all the time...

Comment 2 Vikas Gorur 2010-04-15 20:39:12 UTC
Ian,

Having a unified caching translator is an option we are currently exploring.

Comment 3 Anand Avati 2010-04-20 05:51:05 UTC
PATCH: http://patches.gluster.com/patch/3138 in master (cluster/afr: Send xattr in lookup from the source subvolume.)

Comment 4 Anand Avati 2010-04-20 05:51:22 UTC
PATCH: http://patches.gluster.com/patch/3137 in release-3.0 (cluster/afr: Send xattr in lookup from the source subvolume.)

Comment 5 Amar Tumballi 2010-05-04 08:06:13 UTC
This patch has been reverted in the release-3.0 branch. It needs a proper NULL check.

Comment 6 Amar Tumballi 2010-05-04 08:07:00 UTC
*** Bug 869 has been marked as a duplicate of this bug. ***

Comment 7 Pavan Vilas Sondur 2010-05-21 03:54:32 UTC
Created attachment 206
Makefile showing overrides needed

Comment 8 Pavan Vilas Sondur 2010-05-21 03:55:15 UTC
A patch has been sent for review. It was tested with the attached program (test.c) as a unit test, which can also be used in automation.
The same test can also be used to trigger an entry self-heal plus data self-heal, to verify that case as well.

Comment 9 Anand Avati 2010-05-21 04:32:34 UTC
PATCH: http://patches.gluster.com/patch/3296 in release-3.0 (cluster/afr: Check before accessing xattrs in data self heal.)

Comment 10 Pavan Vilas Sondur 2010-05-21 18:36:20 UTC
Need to check whether triggering entry self-heal results in quick-read storing the file as a zero-byte file, because we do not set xattrs in sh_missing_entries_lookup_cbk and self-heal runs in the background.

Comment 11 Vikas Gorur 2010-05-21 18:54:19 UTC
Entry self-heal is not a problem because entry self-heal is done on directories. Quick-read will have no clue about files inside that directory when entry self-heal happens. The missing entries lookup is triggered inside replicate and quick-read will never see it.

Comment 12 Anand Avati 2010-06-01 04:24:09 UTC
PATCH: http://patches.gluster.com/patch/3358 in master (cluster/afr: Check before accessing xattrs in data self heal.)

Comment 13 Raghavendra Bhat 2010-06-15 06:04:03 UTC
Checked with glusterfs-3.0.5rc6 and glusterfs-3.0.4 clients, both mounted. Brought the second server down, then ran the attached unit test for some time (it keeps writing "hello, world" to a file). Stopped the test, brought server2 back up, and ran "cat <file written by the test>" on both clients. On 3.0.4 the file was empty, whereas on 3.0.5rc6 it worked properly and the contents of the file were visible. Hence moving to the verified state.
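
For anyone repeating this verification without the attachment, a rough stand-in for the writer can be sketched in Python. This is not the attached test.c, whose contents are not shown in this report; the function names, the append-and-fsync behaviour, and the newline after each "hello, world" are assumptions made for illustration.

```python
import os
import time


def write_repeatedly(path, iterations, delay=0.0):
    """Append "hello, world" lines to path and return the bytes written."""
    line = "hello, world\n"
    with open(path, "a") as f:
        for _ in range(iterations):
            f.write(line)
            f.flush()
            os.fsync(f.fileno())   # push each write through to the mount
            if delay:
                time.sleep(delay)
    return iterations * len(line)


def reads_empty(path):
    """Return True if reading the file yields no data (the symptom of this bug)."""
    with open(path, "rb") as f:
        return f.read() == b""
```

The idea would be to run `write_repeatedly` against a file on the mount while server2 is down, bring server2 back up, and then call `reads_empty` on the same file from each client.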

Comment 14 shishir gowda 2010-07-27 06:38:05 UTC
*** Bug 1056 has been marked as a duplicate of this bug. ***