Bug 1408785 - With granular-entry-self-heal enabled, a gfid mismatch occurs and the VM goes to a paused state after migrating to another host
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Assignee: Krutika Dhananjay
Depends On: 1408426 1408712 1408786
Blocks: 1400057
 
Reported: 2016-12-27 08:04 UTC by Krutika Dhananjay
Modified: 2017-03-08 10:23 UTC (History)

Fixed In Version: glusterfs-3.9.1
Clone Of: 1408712
Last Closed: 2017-03-08 10:23:37 UTC



Description Krutika Dhananjay 2016-12-27 08:04:45 UTC
+++ This bug was initially created as a clone of Bug #1408712 +++

+++ This bug was initially created as a clone of Bug #1408426 +++

Description of problem:
A VM is created while one of the data bricks is down. Once the brick is brought back up, some entries never get healed, and when the VM is migrated to another node it goes to a paused state, with the following errors logged in the mount logs.

[2016-12-23 09:14:16.481519] W [MSGID: 108008] [afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673 6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and 55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 09:14:16.482442] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard 1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063 [Input/output error]
[2016-12-23 09:14:16.482474] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 11280842: READ => -1 gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380210 (Input/output error)
[2016-12-23 10:08:41.956330] W [MSGID: 108008] [afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673 6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and 55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 10:08:41.957422] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard 1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063 [Input/output error]
[2016-12-23 10:08:41.957444] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 11427307: READ => -1 gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380328 (Input/output error)
[2016-12-23 10:45:10.609600] W [MSGID: 108008] [afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673 6e17b733-b8a4-4563-bc3d-f659c9a46c2a on engine-client-1 and 55648f43-7e09-4e62-b7d2-16fe1ff7b23e on engine-client-0
[2016-12-23 10:45:10.610550] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 0-engine-shard: Lookup on shard 1673 failed. Base file gfid = f735902d-12fa-4e4d-88c9-1b8ba06e3063 [Input/output error]
[2016-12-23 10:45:10.610574] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 11526955: READ => -1 gfid=f735902d-12fa-4e4d-88c9-1b8ba06e3063 fd=0x7faeda380184 (Input/output error)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-9.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install an HC (hyperconverged) setup with three nodes.
2. Create an arbiter volume and enable all the options using gdeploy.
3. Bring down the first brick in the arbiter volume and create a VM.
4. Once the VM creation is complete, bring the brick back up and wait for self-heal to happen.
5. Migrate the VM to another host.

Actual results:
I have seen two issues:
1) Some entries on the node remain unhealed even after a long time.
2) Once the VM is migrated, it goes to a paused state.

Expected results:
The VM should not go to a paused state after migration, and no entries should remain pending in the volume heal info output.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-12-23 05:56:11 EST ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from RamaKasturi on 2016-12-23 05:59:44 EST ---

As suggested by Pranith, I disabled granular entry self-heal on the volume and I do not see the issue.


--- Additional comment from Krutika Dhananjay on 2016-12-26 05:41:04 EST ---

Resuming from https://bugzilla.redhat.com/show_bug.cgi?id=1400057#c11 to explain why there would be a gfid mismatch; please go through that comment first.

... the pending xattrs on .shard are erased at this point. Now when the brick that was down comes back online, another MKNOD on this shard's name, triggered by a shard readv fop whenever that happens, gets EEXIST from the bricks that were already online, while on the brick that was previously offline the creation of the shard succeeds, but with a new gfid. This leads to the gfid mismatch.
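
For illustration, here is a minimal standalone C sketch (not GlusterFS source; the replica_t type is an assumption made for this example, and the gfid strings are copied from the mount log above) of the condition afr_selfheal_name_gfid_mismatch_check flags: the same shard name resolving to two different gfids on two replicas.

/* Minimal standalone sketch (NOT GlusterFS source): models the situation the
 * log lines above describe -- the same shard name resolving to different
 * gfids on the two data bricks after the previously-offline brick re-created
 * the shard with a fresh gfid. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *client; /* e.g. "engine-client-0" */
    const char *gfid;   /* gfid this brick returns for the shard name */
} replica_t;

/* Returns 1 (and logs a warning) if any two replicas disagree on the gfid. */
static int gfid_mismatch(const char *name, const replica_t *r, int n)
{
    for (int i = 1; i < n; i++) {
        if (strcmp(r[0].gfid, r[i].gfid) != 0) {
            fprintf(stderr, "GFID mismatch for %s: %s on %s and %s on %s\n",
                    name, r[i].gfid, r[i].client, r[0].gfid, r[0].client);
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    /* Brick 0 was down during the first MKNOD of the shard; when it came
     * back, the retried MKNOD succeeded only there and assigned a new gfid. */
    const replica_t replicas[] = {
        { "engine-client-0", "55648f43-7e09-4e62-b7d2-16fe1ff7b23e" },
        { "engine-client-1", "6e17b733-b8a4-4563-bc3d-f659c9a46c2a" },
    };

    if (gfid_mismatch("f735902d-12fa-4e4d-88c9-1b8ba06e3063.1673", replicas, 2))
        return 1; /* the real client fails the shard lookup with EIO */
    return 0;
}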

--- Additional comment from Worker Ant on 2016-12-26 12:15:12 EST ---

REVIEW: http://review.gluster.org/16286 (cluster/afr: Fix missing name indices due to EEXIST error) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-12-27 01:34:58 EST ---

REVIEW: http://review.gluster.org/16286 (cluster/afr: Fix missing name indices due to EEXIST error) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 1 Worker Ant 2016-12-27 08:05:39 UTC
REVIEW: http://review.gluster.org/16293 (cluster/afr: Fix missing name indices due to EEXIST error) posted (#1) for review on release-3.8 by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2016-12-27 08:10:28 UTC
REVIEW: http://review.gluster.org/16294 (cluster/afr: Fix missing name indices due to EEXIST error) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)

Comment 3 Worker Ant 2016-12-28 09:06:51 UTC
COMMIT: http://review.gluster.org/16294 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu) 
------
commit 544f6ce9e7a249360166e98dd7df1b09f91717a9
Author: Krutika Dhananjay <kdhananj>
Date:   Mon Dec 26 21:08:03 2016 +0530

    cluster/afr: Fix missing name indices due to EEXIST error
    
            Backport of: http://review.gluster.org/16286
    
    PROBLEM:
    Consider a volume with  granular-entry-heal and sharding enabled. When
    a replica is down and a shard is created as part of a write, the name
    index is correctly created under indices/entry-changes/<dot-shard-gfid>.
    Now when a read on the same region triggers another MKNOD, the fop
    fails on the online bricks with EEXIST. By virtue of this being a
    symmetric error, the failed_subvols[] array is reset to all zeroes.
    Because of this, before post-op, the GF_XATTROP_ENTRY_OUT_KEY will be
    set, causing the name index, which was created in the previous MKNOD
    operation, to be wrongly deleted in THIS MKNOD operation.
    
    FIX:
    The ideal fix would have been for a transaction to delete the name
    index ONLY if it knows it is the one that created the index in the first
    place. This would involve gathering information as to whether THIS xattrop
    created the index from individual bricks, aggregating their responses and
    based on the various possible combinations of responses, decide whether to
    delete the index or not. This is rather complex. A simpler fix would be
    for post-op to examine local->op_ret in the event of no failed_subvols
    to figure out whether to delete the name index or not. This can occasionally
    lead to creation of stale name indices but they won't be affecting the IO path
    or mess with pending changelogs in any way and self-heal in its crawl of
    "entry-changes" directory would take care to delete such indices.
    
    Change-Id: I8c5c08b7a208e840b5970fe5699dabdaf751a150
    BUG: 1408785
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/16294
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
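
To make the failure mode described above concrete, here is a minimal standalone decision-model sketch in C (not the actual cluster/afr code; txn_t, failed_subvols[] and op_ret below are simplified stand-ins for the real AFR transaction state). It contrasts the old post-op rule, which marked the name index for deletion whenever no subvolume was recorded as failed, with the fixed rule, which additionally requires the fop itself to have succeeded, so that a symmetric EEXIST on a retried MKNOD no longer wipes out the index.

/* Minimal decision-model sketch (NOT the actual cluster/afr code): txn_t,
 * failed_subvols[] and op_ret are simplified stand-ins for the real AFR
 * transaction state. */
#include <errno.h>
#include <stdio.h>

#define CHILD_COUNT 3

typedef struct {
    int failed_subvols[CHILD_COUNT]; /* 1 = this brick failed the fop */
    int op_ret;                      /* overall fop result, -1 on failure */
    int op_errno;                    /* e.g. EEXIST */
} txn_t;

static int any_failed(const txn_t *t)
{
    for (int i = 0; i < CHILD_COUNT; i++)
        if (t->failed_subvols[i])
            return 1;
    return 0;
}

/* Old rule: no failed subvols => delete the name index (set the
 * GF_XATTROP_ENTRY_OUT_KEY), even though the MKNOD itself failed. */
static int delete_name_index_old(const txn_t *t)
{
    return !any_failed(t);
}

/* Fixed rule: additionally require that the fop actually succeeded. */
static int delete_name_index_new(const txn_t *t)
{
    return !any_failed(t) && t->op_ret >= 0;
}

int main(void)
{
    /* Retried MKNOD: every online brick returned EEXIST -- a "symmetric"
     * error -- so failed_subvols[] ends up all zeroes. */
    txn_t retried_mknod = {
        .failed_subvols = { 0, 0, 0 },
        .op_ret = -1,
        .op_errno = EEXIST,
    };

    printf("old post-op deletes the name index: %d (the bug)\n",
           delete_name_index_old(&retried_mknod));
    printf("new post-op deletes the name index: %d (index survives for "
           "self-heal to process)\n",
           delete_name_index_new(&retried_mknod));
    return 0;
}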

Comment 4 Kaushal 2017-03-08 10:23:37 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/

