| Summary: | [7235e5b1af090ffc9d87ac59daadf7926433b495] dbench errors out with open failed due to io error | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Rahul C S <rahulcs> | ||||
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Raghavendra Bhat <rabhat> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | mainline | CC: | gluster-bugs, jdarcy, rwheeler, vbellur, vbhat | ||||
| Target Milestone: | --- | Keywords: | Triaged | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | glusterfs-3.5.0 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 853691 | Environment: | |||||
| Last Closed: | 2014-04-17 11:37:58 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 853691 | ||||||
| Attachments: |
So I was able to hit this issue with the following steps.
1. Create and start a 2x2 distributed-replicate volume.
2. From a FUSE mount, start running fs_mark, and from another terminal start dbench.
3. After some time, pkill two of the replicate legs from different subvolumes.
4. Bring them back up after some time.
5. Repeat steps 3 and 4 one more time.
Then dbench errors out.

Will be fixed post 3.3.0 release.

http://review.gluster.org/2670 posted for this.

COMMIT: http://review.gluster.org/2670 committed in master by Vijay Bellur (vbellur)
------
commit 273a42a421a7deeb3cde9865cfe4bab4826fdb7f
Author: Pranith Kumar K <pkarampu>
Date: Fri Mar 1 15:05:04 2013 +0530

cluster/afr: Club missing entry, missing gfid self-heals

Problem:
gfid-self-heal always assigns the gfid (GFID-1) it gets from lookup. Between the time of lookup and the triggering of the gfid-self-heal, the entry could have changed. Now let's say there is a case where one of the files of the replica subvolumes already has a gfid (GFID-2) and the other does not. In that case healing should happen with GFID-2 instead of GFID-1.

Fix:
Missing-entry-self-heal already handles all these cases. So removed separate handling of gfid-self-heal.

Change-Id: Ie96261e9036c8f3cb4cad89347f9bf7b681cdc1a
BUG: 767585
Signed-off-by: Pranith Kumar K <pkarampu>
Reviewed-on: http://review.gluster.org/2670
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Vijay Bellur <vbellur>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
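
For illustration, the gfid selection rule described in the commit message can be sketched in a few lines of Python. This is not AFR code: `choose_heal_gfid` and its arguments are hypothetical names, and raising an error when the replicas carry two different gfids is an assumption modeled on the "split brain found, aborting selfheal" message seen in the client log.

```python
from typing import Optional, Sequence


def choose_heal_gfid(lookup_gfid: str,
                     replica_gfids: Sequence[Optional[str]]) -> str:
    """Pick the gfid to assign to replicas that are missing one.

    Hypothetical helper, for illustration only.
    replica_gfids holds the gfid currently stored on each replica,
    or None where the replica has no gfid for the entry.
    """
    # Distinct gfids that already exist on some replica.
    present = {g for g in replica_gfids if g is not None}

    if len(present) > 1:
        # Assumption: conflicting gfids cannot be healed automatically.
        raise ValueError("gfid differs between replicas")

    if present:
        # A replica already has a gfid (GFID-2): heal with that one,
        # even if the earlier lookup returned something else (GFID-1).
        return next(iter(present))

    # No replica has a gfid yet, so the gfid from lookup is safe to use.
    return lookup_gfid


print(choose_heal_gfid("GFID-1", ["GFID-2", None]))
print(choose_heal_gfid("GFID-1", [None, None]))
```

The point of the sketch is the ordering of the checks: the gfid obtained at lookup time is only a fallback, because the entry may have changed between lookup and heal.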
Created attachment 546712
attached client & brick logs

Description of problem:
dbench errors out with open failed due to input/output error because the handle was not found.

...
10 302 1.52 MB/sec warmup 73 sec latency 579.520 ms
10 308 1.52 MB/sec warmup 74 sec latency 765.454 ms
10 314 1.53 MB/sec warmup 75 sec latency 628.735 ms
[349] open ./clients/client0/~dmtmp/PARADOX/COURSES.PX failed for handle 9978 (Input/output error)
(350) ERROR: handle 9978 was not found
Child failed with status 1

client log:
[2011-12-14 16:49:00.116017] W [client3_1-fops.c:899:client3_1_getxattr_cbk] 0-vol-client-5: remote operation failed: No data available. Path: (null)
[2011-12-14 16:49:00.123303] W [client3_1-fops.c:899:client3_1_getxattr_cbk] 0-vol-client-3: remote operation failed: No data available. Path: (null)
[2011-12-14 16:49:00.135147] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.135189] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.135203] W [afr-common.c:1153:afr_detect_self_heal_by_iatt] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid different on subvolume
[2011-12-14 16:49:00.135223] I [afr-common.c:1297:afr_launch_self_heal] 0-vol-replicate-2: background meta-data data missing-entry self-heal triggered. path: /clients/client0/~dmtmp/PARADOX/COURSES.PX, reason: lookup detected pending operations
[2011-12-14 16:49:00.136646] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.136893] I [afr-self-heal-common.c:967:afr_sh_missing_entries_done] 0-vol-replicate-2: split brain found, aborting selfheal of /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.136917] E [afr-self-heal-common.c:2057:afr_self_heal_completion_cbk] 0-vol-replicate-2: background meta-data data missing-entry self-heal failed on /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.136942] W [fuse-bridge.c:279:fuse_entry_cbk] 0-glusterfs-fuse: 504406: LOOKUP() /clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)
[2011-12-14 16:49:00.318888] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.318933] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 1
[2011-12-14 16:49:00.318948] W [afr-common.c:1153:afr_detect_self_heal_by_iatt] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid different on subvolume
[2011-12-14 16:49:00.318970] I [afr-common.c:1297:afr_launch_self_heal] 0-vol-replicate-2: background meta-data data missing-entry self-heal triggered. path: /clients/client0/~dmtmp/PARADOX/COURSES.PX, reason: lookup detected pending operations
....
[2011-12-14 16:49:00.414083] W [afr-common.c:1376:afr_conflicting_iattrs] 0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: gfid differs on subvolume 0
[2011-12-14 16:49:00.414988] I [afr-self-heal-common.c:967:afr_sh_missing_entries_done] 0-vol-replicate-2: split brain found, aborting selfheal of /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.415016] E [afr-self-heal-common.c:2057:afr_self_heal_completion_cbk] 0-vol-replicate-2: background meta-data data missing-entry self-heal failed on /clients/client0/~dmtmp/PARADOX/COURSES.PX
[2011-12-14 16:49:00.415043] W [fuse-bridge.c:279:fuse_entry_cbk] 0-glusterfs-fuse: 504482: LOOKUP() /clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a distributed replicate volume.
2. Run dbench -s -F -S -x --one-byte-write-fix --stat-check 10
3. If dbench does not error out, bring down a brick process and start again.
4. While dbench is running, bring the brick back online, and then run gluster volume heal <volume> to enable self-healing on the volume.

Actual results:
dbench errors out.

Expected results:
dbench should complete without any errors.

Additional info:
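
When triaging a client log like the one above, it can help to collect, per path, which subvolumes were reported with a differing gfid. The following is a minimal Python sketch, assuming the log line layout shown in this report; the regular expressions and the `gfid_mismatches` helper are hypothetical and not part of GlusterFS.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the lines quoted in this report:
# [timestamp] LEVEL [file.c:line:function] xlator-name: message
LOG_LINE = re.compile(
    r"\[(?P<ts>[\d\- :.]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): (?P<msg>.*)"
)
# Message emitted by afr_conflicting_iattrs in the quoted log.
GFID_DIFFERS = re.compile(
    r"(?P<path>\S+): gfid differs on subvolume (?P<subvol>\d+)"
)


def gfid_mismatches(lines):
    """Map each path to the set of subvolume indexes reported as differing."""
    result = defaultdict(set)
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if not parsed:
            continue  # Not a log line in the assumed layout.
        hit = GFID_DIFFERS.match(parsed.group("msg"))
        if hit:
            result[hit.group("path")].add(int(hit.group("subvol")))
    return dict(result)


sample = [
    "[2011-12-14 16:49:00.135147] W [afr-common.c:1376:afr_conflicting_iattrs] "
    "0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: "
    "gfid differs on subvolume 0",
    "[2011-12-14 16:49:00.136646] W [afr-common.c:1376:afr_conflicting_iattrs] "
    "0-vol-replicate-2: /clients/client0/~dmtmp/PARADOX/COURSES.PX: "
    "gfid differs on subvolume 1",
    "[2011-12-14 16:49:00.136942] W [fuse-bridge.c:279:fuse_entry_cbk] "
    "0-glusterfs-fuse: 504406: LOOKUP() "
    "/clients/client0/~dmtmp/PARADOX/COURSES.PX => -1 (Input/output error)",
]
mismatches = gfid_mismatches(sample)
print(mismatches)
```

A path that shows up with more than one subvolume index, as COURSES.PX does here, is a candidate for the gfid split brain that makes the LOOKUP fail with an input/output error.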