
Bug 1191537

Summary:	With afrv2 + ext4, lookups on directories with large offsets could result in duplicate/missing entries
Product:	[Community] GlusterFS
Component:	core
Version:	3.6.2
Status:	CLOSED CURRENTRELEASE
Severity:	high
Priority:	high
Reporter:	Pranith Kumar K <pkarampu>
Assignee:	bugs <bugs>
CC:	bugs, hchiramm, jbyers, jhoffman, pkarampu, rabhat, skoduri
Keywords:	Triaged
Hardware:	x86_64
OS:	Linux
Fixed In Version:	glusterfs-v3.6.3
Doc Type:	Bug Fix
Type:	Bug
Clone Of:	1163161
Bug Depends On:	1163161
Bug Blocks:	1184460
Last Closed:	2016-02-04 15:21:33 UTC

Description Pranith Kumar K 2015-02-11 13:23:28 UTC

+++ This bug was initially created as a clone of Bug #1163161 +++

Description of problem:

'ext4' uses large directory offsets, which may occupy the bits that GlusterFS uses to encode the brick-id. As a result, some offsets can be modified by the time they are handed back to the filesystem, which leads to missing files and similar discrepancies.

Avati has proposed a solution to this issue, based on the assumption that "both EXT4/XFS are tolerant in terms of the accuracy of the value presented back in seekdir(). i.e, a seekdir(val) actually seeks to the entry which has the "closest" true offset." For more info, please check http://review.gluster.org/#/c/4711/.

But now that afr uses the same itransform/deitransform logic, the brick-id stored in afr_global_d_off gets zeroed out when the offset is re-encoded in dht. This happens only when the offsets are huge (i.e. with the ext4 filesystem): in such cases the low n bits are replaced with the brick-id, which in turn gets replaced with afr_subvol_id when re-encoded in dht, where
       n = log2(N)
       N = no. of DHT/AFR subvolumes.
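
The loss described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the GlusterFS implementation: it always treats the offset as "huge", ignores any flag bits the real itransform/deitransform code may use, and the helper names (id_bits, roundtrip) and the sample offset are hypothetical.

```python
def id_bits(n_subvols: int) -> int:
    """Low bits needed to hold a subvolume id (n = log2(N), rounded up)."""
    return (n_subvols - 1).bit_length()


def itransform(d_off: int, subvol_id: int, n_subvols: int) -> int:
    """Overwrite the low n bits of a huge backend d_off with subvol_id."""
    mask = (1 << id_bits(n_subvols)) - 1
    return (d_off & ~mask) | subvol_id


def deitransform(d_off: int, n_subvols: int) -> tuple:
    """Split an encoded d_off into (offset with low n bits cleared, subvol_id)."""
    mask = (1 << id_bits(n_subvols)) - 1
    return d_off & ~mask, d_off & mask


def roundtrip(backend_off, afr_id, n_afr, dht_id, n_dht):
    """Encode in AFR, then in DHT, then decode in the reverse order."""
    afr_encoded = itransform(backend_off, afr_id, n_afr)
    dht_encoded = itransform(afr_encoded, dht_id, n_dht)   # overwrites AFR's bits
    to_afr, dht_seen = deitransform(dht_encoded, n_dht)
    to_brick, afr_seen = deitransform(to_afr, n_afr)
    return dht_seen, afr_seen, to_brick


# Hypothetical 63-bit ext4-style offset; AFR child 1 of 2, DHT subvolume 2 of 4.
dht_seen, afr_seen, to_brick = roundtrip(
    0x7A3F1C559D2E4B17, afr_id=1, n_afr=2, dht_id=2, n_dht=4
)
print(dht_seen, afr_seen, hex(to_brick))
```

In this model DHT recovers its own id (2), but AFR decodes 0 instead of the 1 it stored, because both layers wrote into the same low bits. The offset that reaches the brick also differs from the original in its low bits, which is the inaccuracy the seekdir() tolerance assumption is meant to absorb.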

--- Additional comment from Anand Avati on 2014-11-12 07:55:16 EST ---

REVIEW: http://review.gluster.org/8201 (dht/afr: Modify itransform/deitransform to prevent loss of brick-id incase of both dht & afr involved.) posted (#4) for review on master by soumya k (skoduri)

--- Additional comment from Anand Avati on 2014-12-23 13:23:55 EST ---

REVIEW: http://review.gluster.org/9332 (afr: stop encoding subvolume id in readdir d_off) posted (#1) for review on master by Anand Avati (avati)

--- Additional comment from Anand Avati on 2014-12-26 04:59:57 EST ---

REVIEW: http://review.gluster.org/9332 (afr: stop encoding subvolume id in readdir d_off) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2014-12-26 09:21:49 EST ---

COMMIT: http://review.gluster.org/9332 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 7926fe6f7df664bbe5e050a8e66240dd67155eec
Author: Anand Avati <avati>
Date:   Tue Dec 23 10:04:00 2014 -0800

    afr: stop encoding subvolume id in readdir d_off
    
    The purpose of encoding d_off in AFR is to indicate the
    selected subvolume for the first readdir, and continue all
    further readdirs of the session on the same subvolume. This is
    required because, unlike files, dir d_offs are specific to the
    backend and cannot be re-used on another subvolume. The d_off
    transformation encodes the subvolume id and prevents such
    invalid use of d_offs on other servers.
    
    However, this approach could be quite wasteful of precious d_off
    bit-space. Unlike DHT, where server id can change from entry to
    entry and thus encoding the server id in the transformed d_off
    is necessary, we could take a slightly relaxed approach in AFR.
    The approach is to save the subvolume where the last readdir
    request was sent in the fd_ctx. This consumes constant space (i.e
    no per-entry cache), and serves the purpose of avoiding d_off
    "misuse" (i.e using d_off from one server on another).
    
    The compromise here is NFS resuming readdir from a non-0 cookie
    after an extended delay (either anonymous FD has been reclaimed,
    or server has restarted). In such cases a subvolume is picked
    freshly. To make this fresh picking more deterministic (i.e, to
    pick the same subvolume whenever possible, even after reboots),
    the function afr_hash_child (used by afr_read_subvol_select_by_policy)
    is modified to skip all dynamic inputs (i.e PID) for the case
    of directories.
    
    Change-Id: I46ad95feaeb21fb811b7e8d772866a646330c9d8
    BUG: 1163161
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/9332
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
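
The approach in the commit message above can be sketched roughly as follows. This is an illustrative model, not the actual AFR code: the FdCtx class, the pick_subvol and readdir_subvol helpers, and the use of CRC32 over the gfid are all assumptions made for the example. It shows the two ideas only: remember the readdir subvolume in per-fd state, and make a fresh pick for a directory depend on static inputs alone.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FdCtx:
    """Constant-size per-fd state: the subvolume used by the last readdir."""
    readdir_subvol: Optional[int] = None


def pick_subvol(gfid: bytes, n_children: int, pid: int, is_dir: bool) -> int:
    """Hash-based pick; for directories the dynamic input (PID) is skipped."""
    key = gfid if is_dir else gfid + pid.to_bytes(8, "little", signed=True)
    return zlib.crc32(key) % n_children


def readdir_subvol(ctx: FdCtx, offset: int, gfid: bytes,
                   n_children: int, pid: int) -> int:
    """Pick a subvolume on the first readdir, then stay on it for the session."""
    if offset == 0 or ctx.readdir_subvol is None:
        # New session, or the saved state is gone (e.g. fd reclaimed, restart).
        ctx.readdir_subvol = pick_subvol(gfid, n_children, pid, is_dir=True)
    return ctx.readdir_subvol


gfid = bytes(range(16))  # hypothetical 16-byte directory gfid
ctx = FdCtx()
first = readdir_subvol(ctx, 0, gfid, n_children=2, pid=100)
resumed = readdir_subvol(ctx, 12345, gfid, n_children=2, pid=999)
after_restart = readdir_subvol(FdCtx(), 12345, gfid, n_children=2, pid=4242)
print(first, resumed, after_restart)
```

All three calls return the same subvolume in this model: the second because of the saved state, the third because the fresh pick for a directory does not depend on the caller's PID.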

--- Additional comment from Anand Avati on 2015-02-11 07:58:03 EST ---

REVIEW: http://review.gluster.org/9638 (afr: stop encoding subvolume id in readdir d_off) posted (#1) for review on release-3.6 by Pranith Kumar Karampuri (pkarampu)

Comment 1 Anand Avati 2015-02-11 13:24:07 UTC

REVIEW: http://review.gluster.org/9638 (afr: stop encoding subvolume id in readdir d_off) posted (#2) for review on release-3.6 by Pranith Kumar Karampuri (pkarampu)

Comment 2 Anand Avati 2015-03-04 07:38:47 UTC

COMMIT: http://review.gluster.org/9638 committed in release-3.6 by Raghavendra Bhat (raghavendra) 
------
commit f396e475417aa52daf49e4564c67628cc8f0e598
Author: Anand Avati <avati>
Date:   Tue Dec 23 10:04:00 2014 -0800

    afr: stop encoding subvolume id in readdir d_off
    
            Backport of http://review.gluster.org/9332
    
    The purpose of encoding d_off in AFR is to indicate the
    selected subvolume for the first readdir, and continue all
    further readdirs of the session on the same subvolume. This is
    required because, unlike files, dir d_offs are specific to the
    backend and cannot be re-used on another subvolume. The d_off
    transformation encodes the subvolume id and prevents such
    invalid use of d_offs on other servers.
    
    However, this approach could be quite wasteful of precious d_off
    bit-space. Unlike DHT, where server id can change from entry to
    entry and thus encoding the server id in the transformed d_off
    is necessary, we could take a slightly relaxed approach in AFR.
    The approach is to save the subvolume where the last readdir
    request was sent in the fd_ctx. This consumes constant space (i.e
    no per-entry cache), and serves the purpose of avoiding d_off
    "misuse" (i.e using d_off from one server on another).
    
    The compromise here is NFS resuming readdir from a non-0 cookie
    after an extended delay (either anonymous FD has been reclaimed,
    or server has restarted). In such cases a subvolume is picked
    freshly. To make this fresh picking more deterministic (i.e, to
    pick the same subvolume whenever possible, even after reboots),
    the function afr_hash_child (used by afr_read_subvol_select_by_policy)
    is modified to skip all dynamic inputs (i.e PID) for the case
    of directories.
    
    BUG: 1191537
    Change-Id: I7e3bd8dfe346a9a8e428d7ddeada6cfb66e64e54
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/9638
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra Bhat <raghavendra>

Comment 3 Kaushal 2016-02-04 15:21:33 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v3.6.3, please open a new bug report.

glusterfs-v3.6.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2015-April/021669.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user