Bug 767585 - [7235e5b1af090ffc9d87ac59daadf7926433b495] dbench errors out with open failed due to io error
Summary: [7235e5b1af090ffc9d87ac59daadf7926433b495] dbench errors out with open failed...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact: Raghavendra Bhat
URL:
Whiteboard:
Depends On:
Blocks: 853691
 
Reported: 2011-12-14 12:48 UTC by Rahul C S
Modified: 2014-04-17 11:37 UTC (History)

Fixed In Version: glusterfs-3.5.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 853691 (view as bug list)
Environment:
Last Closed: 2014-04-17 11:37:58 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
attached client & brick logs (7.14 MB, application/x-compressed-tar)
2011-12-14 12:48 UTC, Rahul C S

Description Rahul C S 2011-12-14 12:48:13 UTC
Created attachment 546712
attached client & brick logs

Description of problem:
dbench errors out when an open fails with an Input/output error; dbench then reports that the handle was not found.
...
  10       302     1.52 MB/sec  warmup  73 sec  latency 579.520 ms
  10       308     1.52 MB/sec  warmup  74 sec  latency 765.454 ms
  10       314     1.53 MB/sec  warmup  75 sec  latency 628.735 ms
[349] open ./clients/client0/~dmtmp/PARADOX/COURSES.PX failed for handle 9978 (Input/output error)
(350) ERROR: handle 9978 was not found
Child failed with status 1

client log:
[2011-12-14 16:49:00.116017] W [client3_1-fops.c:899:client3_1_getxattr_cbk] 0-vol-client-5: remote operation failed: No data available. Path: (null)
[2011-12-14 16:49:00.123303] W [client3_1-fops.c:899:client3_1_getxattr_cbk] 0-vol-client-3: remote operation failed: No data available. Path: (null)
[2011-12-14 16:49:00.135147] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.135189] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.135203] W [afr-common.c:1153:afr_detect_self_heal_by_iatt] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid different on subvolume
[2011-12-14 16:49:00.135223] I [afr-common.c:1297:afr_launch_self_heal] 0-vol-replicate-2: background  meta-data data missing-entry self-heal triggered. path: /clients/client0/~dmtmp/PARADOX/COURSES.PX, reason: lookup detected pending operations
[2011-12-14 16:49:00.136646] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.136893] I [afr-self-heal-common.c:967:afr_sh_missing_entries_done] 0-vol-replicate-2: split brain found, aborting selfheal of /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.136917] E [afr-self-heal-common.c:2057:afr_self_heal_completion_cbk] 0-vol-replicate-2: background  meta-data data missing-entry self-heal failed on /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.136942] W [fuse-bridge.c:279:fuse_entry_cbk] 0-glusterfs-fuse: 504406: LOOKUP() /clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)
[2011-12-14 16:49:00.318888] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.318933] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.318948] W [afr-common.c:1153:afr_detect_self_heal_by_iatt] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid different on subvolume
[2011-12-14 16:49:00.318970] I [afr-common.c:1297:afr_launch_self_heal] 0-vol-replicate-2: background  meta-data data missing-entry self-heal triggered. path: /clients/client0/~dmtmp/PARADOX/COURSES.PX, reason: lookup detected pending operations
....
[2011-12-14 16:49:00.414083] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.414988] I [afr-self-heal-common.c:967:afr_sh_missing_entries_done] 0-vol-replicate-2: split brain found, aborting selfheal of /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.415016] E [afr-self-heal-common.c:2057:afr_self_heal_completion_cbk] 0-vol-replicate-2: background  meta-data data missing-entry self-heal failed on /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.415043] W [fuse-bridge.c:279:fuse_entry_cbk] 0-glusterfs-fuse: 504482: LOOKUP() /clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)
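
The repeating pattern in the client log (gfid mismatch warnings, then "split brain found", then LOOKUP() failing with Input/output error) can be picked out mechanically when scanning the full attachment. The following is a minimal sketch, assuming the log line layout shown in the excerpt above; the function name summarize_afr_log, the PATTERNS table, and the regular expressions are illustrative and not part of GlusterFS.

```python
import re
from collections import Counter

# Patterns are derived only from the message texts visible in the log
# excerpt above; other GlusterFS versions may word these messages differently.
PATTERNS = {
    "gfid_differs": re.compile(
        r"afr_conflicting_iattrs\] \S+ (?P<path>.+?): gfid differs on subvolume \d+"
    ),
    "split_brain": re.compile(
        r"split brain found, aborting selfheal of (?P<path>.+)$"
    ),
    "lookup_eio": re.compile(
        r"LOOKUP\(\) (?P<path>.+?) => -1 \(Input/output error\)"
    ),
}


def summarize_afr_log(lines):
    """Count gfid-mismatch, split-brain and failed-LOOKUP events per path."""
    summary = {key: Counter() for key in PATTERNS}
    for line in lines:
        line = line.strip()
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[key][match.group("path")] += 1
    return summary


# Four lines copied from the client log in this report.
SAMPLE = [
    "[2011-12-14 16:49:00.414083] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0",
    "[2011-12-14 16:49:00.136646] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1",
    "[2011-12-14 16:49:00.414988] I [afr-self-heal-common.c:967:afr_sh_missing_entries_done] 0-vol-replicate-2: split brain found, aborting selfheal of /clients/client0/~dmtmp/PARADOX/COURSES.PX",
    "[2011-12-14 16:49:00.415043] W [fuse-bridge.c:279:fuse_entry_cbk] 0-glusterfs-fuse: 504482: LOOKUP() /clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)",
]

if __name__ == "__main__":
    for event, counts in summarize_afr_log(SAMPLE).items():
        for path, count in counts.items():
            print(event, count, path)
```

A path that shows up under all three keys is one where the gfid mismatch led to an aborted self-heal and an EIO returned to the application, which is the failure dbench reports.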

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a distributed-replicate volume.
2. Run dbench -s -F -S -x  --one-byte-write-fix --stat-check 10
3. If dbench does not error out, bring down a brick process and start dbench again.
4. While dbench is running, bring the brick back online, then run gluster volume heal <volume> to start self-healing on the volume.
  
Actual results:
dbench errors out.

Expected results:
dbench should complete without any errors.

Additional info:

Comment 2 M S Vishwanath Bhat 2012-04-23 12:21:41 UTC
So I was able to hit this issue with the following steps.

1. Create and start a 2*2 dist-rep volume.
2. From a fuse mount, start running fs_mark, and from another terminal start dbench.
3. After some time, pkill two of the replicate legs from different subvolumes.
4. Bring them back up after some time.
5. Repeat steps 3 and 4 one more time.

Then dbench errors out.

Comment 3 Pranith Kumar K 2012-05-28 17:40:01 UTC
Will be fixed post 3.3.0 release.

Comment 4 Jeff Darcy 2012-10-31 14:14:30 UTC
http://review.gluster.org/2670 posted for this.

Comment 5 Anand Avati 2013-05-07 15:36:17 UTC
COMMIT: http://review.gluster.org/2670 committed in master by Vijay Bellur (vbellur) 
------
commit 273a42a421a7deeb3cde9865cfe4bab4826fdb7f
Author: Pranith Kumar K <pkarampu>
Date:   Fri Mar 1 15:05:04 2013 +0530

    cluster/afr: Club missing entry, missing gfid self-heals
    
    Problem:
    gfid-self-heal always assigns the gfid(GFID-1) it gets from lookup.
    Between the time of lookup to triggering the gfid-self-heal the
    entry could have changed. Now let's say there is a case where
    one of the files of the replica subvolumes already has a gfid
    (GFID-2) and the other does not. In that case healing should
    happen with GFID-2 instead of GFID-1.
    
    Fix:
    Missing-entry-self-heal already handles all these cases. So removed
    separate handling of gfid-self-heal.
    
    Change-Id: Ie96261e9036c8f3cb4cad89347f9bf7b681cdc1a
    BUG: 767585
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/2670
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
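
The race described in the commit message can be illustrated with a small model. This is only a sketch under stated assumptions: replicas are modeled as a list of optional gfid strings, and pick_heal_gfid and SplitBrain are hypothetical names, not functions or types in the AFR code. It shows the rule stated above: if a replica already carries a gfid at heal time (GFID-2), that value should be used rather than the one cached from the earlier lookup (GFID-1). Treating two different existing gfids as split brain is an assumption taken from the client log in this report.

```python
from typing import Optional, Sequence


class SplitBrain(Exception):
    """Replicas already hold different gfids for the same entry."""


def pick_heal_gfid(lookup_gfid: str,
                   replica_gfids: Sequence[Optional[str]]) -> str:
    """Choose the gfid to heal with, preferring what is already on disk.

    lookup_gfid   -- gfid obtained at lookup time (may be stale by now)
    replica_gfids -- gfid currently set on each replica, None if missing
    """
    existing = {gfid for gfid in replica_gfids if gfid is not None}
    if len(existing) > 1:
        # Assumption: conflicting gfids cannot be healed automatically.
        raise SplitBrain(f"conflicting gfids: {sorted(existing)}")
    if existing:
        # One replica already has a gfid: heal the others to match it.
        return next(iter(existing))
    # No replica has a gfid yet, so the lookup value is safe to assign.
    return lookup_gfid


if __name__ == "__main__":
    # The case from the commit message: one replica has GFID-2, one has none.
    print(pick_heal_gfid("GFID-1", ["GFID-2", None]))
```

Always assigning lookup_gfid, as the old gfid-self-heal path did, would give the second replica GFID-1 in this case and create exactly the "gfid differs" state seen in the client log.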

Comment 6 Niels de Vos 2014-04-17 11:37:58 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

