Bug 1638453

Summary:	Gfid mismatch seen on shards when lookup and mknod are in progress at the same time
Product:	[Community] GlusterFS	Reporter:	Krutika Dhananjay <kdhananj>
Component:	posix	Assignee:	Krutika Dhananjay <kdhananj>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	mainline	CC:	bugs
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-6.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1641429		Environment:
Last Closed:	2019-03-25 16:31:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1641429

Description Krutika Dhananjay 2018-10-11 15:15:46 UTC

Description of problem:

Occasionally, dd on a sharded file in tests/bugs/shard/bug-1251824.t fails with EIO.
It turns out this is caused by a gfid mismatch between the replicas.

On investigation, it was found that this is due to a race between posix mknod and posix lookup.

posix mknod has 3 important stages, among other operations:
1. creation of the file itself
2. setting the gfid xattr on the file, and
3. creating the gfid link under .glusterfs.
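The three stages can be sketched with plain POSIX calls on a scratch directory that stands in for a brick. This is a hypothetical illustration, not the posix translator's code. `sketch_mknod()`, `make_scratch_brick()` and the `user.sketch.gfid` xattr name are made up for the example. The handle is a flat `.glusterfs/<gfid>` hard link rather than GlusterFS's nested layout, and the gfid is a short string instead of a 16-byte UUID.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Hypothetical xattr name: GlusterFS keeps the gfid in a trusted.* xattr,
 * which needs privileges, so this sketch uses the user namespace instead. */
#define SKETCH_GFID_XATTR "user.sketch.gfid"

/* Creates a scratch "brick" (tmpl must end in "XXXXXX") that contains an
 * empty .glusterfs directory. Returns 0 on success, -1 on failure. */
int make_scratch_brick(char *tmpl)
{
    char gl[PATH_MAX];

    if (mkdtemp(tmpl) == NULL)
        return -1;
    snprintf(gl, sizeof gl, "%s/.glusterfs", tmpl);
    return mkdir(gl, 0755);
}

/* Sketch of the three stages of an entry-creating fop such as mknod.
 * Returns 0 on success or -errno from the first stage that fails. */
int sketch_mknod(const char *brick, const char *name, const char *gfid)
{
    char entry[PATH_MAX], handle[PATH_MAX];

    assert(brick != NULL && name != NULL && gfid != NULL);
    snprintf(entry, sizeof entry, "%s/%s", brick, name);
    snprintf(handle, sizeof handle, "%s/.glusterfs/%s", brick, gfid);

    /* Stage 1: create the entry itself under its parent directory. */
    if (mknod(entry, S_IFREG | 0644, 0) != 0)
        return -errno;

    /* Stage 2: set the gfid xattr. ENOTSUP is tolerated only so that the
     * sketch still runs on filesystems without user xattrs. */
    if (lsetxattr(entry, SKETCH_GFID_XATTR, gfid, strlen(gfid), XATTR_CREATE) != 0
        && errno != ENOTSUP)
        return -errno;

    /* Stage 3: create the gfid link (a hard link) under .glusterfs. */
    if (link(entry, handle) != 0)
        return -errno;

    return 0;
}
```

Calling `sketch_mknod()` a second time with the same name fails at stage 1 with `-EEXIST`, since `mknod(2)` never replaces an existing entry.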

Now assume the thread doing posix mknod has executed steps 1 and 2 and is on its way to executing step 3.
Meanwhile, a parallel lookup from another thread sees that loc->inode->gfid is NULL, so it performs gfid_heal, which also attempts to create the gfid link under .glusterfs.

Assume lookup wins the race and creates the gfid link first. posix_gfid_set(), called from mknod, then fails with EEXIST.
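The losing interleaving can be replayed deterministically, one step at a time, without real threads. In this hypothetical sketch (`replay_race()` and `make_scratch_brick()` are illustrative names, and the handle layout is flattened), the lookup thread's gfid heal is reduced to the single `link()` that creates the handle, and the xattr write of stage 2 is skipped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a scratch "brick" (tmpl must end in "XXXXXX") that contains an
 * empty .glusterfs directory. Returns 0 on success, -1 on failure. */
int make_scratch_brick(char *tmpl)
{
    char gl[PATH_MAX];

    if (mkdtemp(tmpl) == NULL)
        return -1;
    snprintf(gl, sizeof gl, "%s/.glusterfs", tmpl);
    return mkdir(gl, 0755);
}

/* Replays the interleaving in which lookup wins. Returns the errno seen by
 * the mknod thread's stage-3 link(), 0 if that link unexpectedly succeeds,
 * or -1 if the setup steps themselves fail. */
int replay_race(const char *brick, const char *name, const char *gfid)
{
    char entry[PATH_MAX], handle[PATH_MAX];

    assert(brick != NULL && name != NULL && gfid != NULL);
    snprintf(entry, sizeof entry, "%s/%s", brick, name);
    snprintf(handle, sizeof handle, "%s/.glusterfs/%s", brick, gfid);

    /* mknod thread: stage 1 (stage 2, the xattr write, is omitted here). */
    if (mknod(entry, S_IFREG | 0644, 0) != 0)
        return -1;

    /* lookup thread: gfid heal creates the .glusterfs link first. */
    if (link(entry, handle) != 0)
        return -1;

    /* mknod thread: stage 3 now finds the handle already in place. */
    return link(entry, handle) == 0 ? 0 : errno;
}
```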

In the older code, mknod under such conditions was NOT being treated as a failure.

But ever since the following commit was merged:

<commit-msg>

Parent: 788cda4c (glusterd: fix some coverity issues)
Author: karthik-us <ksubrahm>
AuthorDate: 2018-08-03 15:55:18 +0530
Commit: Amar Tumballi <amarts>
CommitDate: 2018-08-20 12:14:22 +0000

posix: Delete the entry if gfid link creation fails

Problem:
If the gfid link file inside .glusterfs is not present for a file,
the operations which are dependent on the gfid will fail,
complaining the link file does not exists inside .glusterfs.

Fix:

If the link file creation fails, fail the entry creation operation
and delete the original file.

Change-Id: Id767511de2da46b1f45aea45cb68b98d965ac96d
fixes: bz#1612037
Signed-off-by: karthik-us <ksubrahm>

</commit-msg>

... this behavior changed: the mknod is treated as a failure and the newly created entry is deleted.
When, at some later point, shard sends another mknod for the same shard, the file is created, although this time with a new gfid (since the "gfid-req" passed now is a new UUID). This leads to a gfid mismatch across the replicas.
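The resulting divergence can be modelled in memory. The sketch below is a toy model under stated assumptions, not AFR or posix code. Each replica is reduced to an `exists` flag and a gfid string, `lost_race` marks the replica where lookup created the handle first, and `undo_on_eexist` switches between the older behavior and the behavior after the quoted commit. `model_mknod()`, `replicas_diverge()` and the two gfid-req values are hypothetical placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory model of one shard on one replica brick. */
struct replica {
    bool exists;
    char gfid[37];  /* UUID string; empty when the entry is absent */
};

/* Models posix mknod on one replica. lost_race: a parallel lookup created
 * the .glusterfs link first. undo_on_eexist: fail the fop and delete the
 * entry, as after the quoted commit. Returns 0 or an errno value. */
static int model_mknod(struct replica *r, const char *gfid_req,
                       bool lost_race, bool undo_on_eexist)
{
    assert(r != NULL && gfid_req != NULL);
    if (r->exists)
        return EEXIST;                  /* existing entry keeps its old gfid */

    r->exists = true;                   /* stages 1 and 2 */
    snprintf(r->gfid, sizeof r->gfid, "%s", gfid_req);

    if (lost_race && undo_on_eexist) {  /* stage 3 hit EEXIST */
        r->exists = false;
        r->gfid[0] = '\0';
        return EEXIST;
    }
    return 0;
}

/* The first MKNOD loses the race on replica b only; a later MKNOD then
 * arrives with a fresh gfid-req. Returns true when both replicas hold the
 * shard but with different gfids. */
bool replicas_diverge(bool undo_on_eexist)
{
    struct replica a = {0}, b = {0};
    const char *first_req = "11111111-1111-1111-1111-111111111111";
    const char *second_req = "22222222-2222-2222-2222-222222222222";

    model_mknod(&a, first_req, false, undo_on_eexist);
    model_mknod(&b, first_req, true, undo_on_eexist);

    model_mknod(&a, second_req, false, undo_on_eexist);
    model_mknod(&b, second_req, false, undo_on_eexist);

    return a.exists && b.exists && strcmp(a.gfid, b.gfid) != 0;
}
```

`replicas_diverge(true)` returns true, because replica a keeps the first gfid-req while replica b is created with the second. `replicas_diverge(false)` returns false, because both replicas keep the first gfid.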

Version-Release number of selected component (if applicable):

How reproducible:
Fairly consistently. Just run the test tests/bugs/shard/bug-1251824.t in a loop on your laptop; I was able to hit it in less than 5 minutes.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Comment 1 Worker Ant 2018-10-17 09:34:38 UTC

REVIEW: https://review.gluster.org/21436 (storage/posix: Do not fail entry creation fops if gfid handle already exists) posted (#1) for review on master by Krutika Dhananjay

Comment 2 Worker Ant 2018-10-18 16:04:58 UTC

COMMIT: https://review.gluster.org/21436 committed in master by "Raghavendra Bhat" <raghavendra> with a commit message- storage/posix: Do not fail entry creation fops if gfid handle already exists

PROBLEM:
tests/bugs/shard/bug-1251824.t fails occasionally with EIO due to gfid
mismatch across replicas on the same shard when dd is executed.

CAUSE:
It turns out this is due to a race between posix_mknod() and posix_lookup().

posix mknod does 3 operations, among other things:
1. creation of the entry itself under its parent directory
2. setting the gfid xattr on the file, and
3. creating the gfid link under .glusterfs.

Consider a case where the thread doing posix_mknod() (initiated by shard)
has executed steps 1 and 2 and is on its way to executing 3. Meanwhile, a
parallel LOOKUP from another thread, noting that loc->inode->gfid is NULL,
tries to perform gfid_heal, where it attempts to create the gfid link
under .glusterfs and succeeds. As a result, posix_gfid_set() through
MKNOD (step 3) fails with EEXIST.

In the older code, MKNOD under such conditions was NOT being treated
as a failure. But commit e37ee6d changes this behavior by failing MKNOD,
causing the entry creation to be undone in posix_mknod() (it's another
matter that the stale gfid handle gets left behind if lookup has gone
ahead and gfid-healed it).
All of this happens on only one replica, while on the other replica MKNOD succeeds.
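The remark about the stale gfid handle follows from hard-link semantics: undoing the entry removes only one name of the inode. A hypothetical replay on a scratch directory (`stale_handle_links()` and `make_scratch_brick()` are illustrative names, and the handle layout is flattened) shows the handle surviving with a link count of 1 after the entry is gone.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a scratch "brick" (tmpl must end in "XXXXXX") that contains an
 * empty .glusterfs directory. Returns 0 on success, -1 on failure. */
int make_scratch_brick(char *tmpl)
{
    char gl[PATH_MAX];

    if (mkdtemp(tmpl) == NULL)
        return -1;
    snprintf(gl, sizeof gl, "%s/.glusterfs", tmpl);
    return mkdir(gl, 0755);
}

/* Replays the failing path on one replica: mknod creates the entry, a
 * racing lookup creates the gfid handle, mknod's own link() fails with
 * EEXIST and mknod undoes the entry. Returns the link count of the handle
 * left behind, or -1 on any other outcome. */
int stale_handle_links(const char *brick, const char *name, const char *gfid)
{
    char entry[PATH_MAX], handle[PATH_MAX];
    struct stat st;

    assert(brick != NULL && name != NULL && gfid != NULL);
    snprintf(entry, sizeof entry, "%s/%s", brick, name);
    snprintf(handle, sizeof handle, "%s/.glusterfs/%s", brick, gfid);

    if (mknod(entry, S_IFREG | 0644, 0) != 0)        /* mknod: stage 1 */
        return -1;
    if (link(entry, handle) != 0)                    /* lookup: gfid heal */
        return -1;
    if (link(entry, handle) == 0 || errno != EEXIST) /* mknod: stage 3 */
        return -1;
    if (unlink(entry) != 0)                          /* mknod: undo entry */
        return -1;

    /* The entry is gone, but the handle still pins the inode. */
    if (access(entry, F_OK) == 0 || lstat(handle, &st) != 0)
        return -1;
    return (int)st.st_nlink;
}
```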

Now if a parallel write causes the shard translator to send another MKNOD
of the same shard (shortly after AFR releases the entrylk from the first
MKNOD), the file is created on the other replica too, although with a
new gfid (since the "gfid-req" passed now is a new UUID). This leads
to a gfid mismatch across the replicas.

FIX:
The solution is to not fail MKNOD (or any other entry fop for that matter
that does posix_gfid_set()) if the .glusterfs link creation fails with EEXIST.
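A minimal sketch of stage 3 after the fix, assuming a hard-link handle for regular files. `create_gfid_link()` is a hypothetical helper, not GlusterFS's own function. Besides tolerating `EEXIST`, it checks that the existing handle refers to the same inode as the new entry. That check is an extra assumption of this sketch, based on the reasoning that a racing gfid heal would have linked this very file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a scratch "brick" (tmpl must end in "XXXXXX") that contains an
 * empty .glusterfs directory. Returns 0 on success, -1 on failure. */
int make_scratch_brick(char *tmpl)
{
    char gl[PATH_MAX];

    if (mkdtemp(tmpl) == NULL)
        return -1;
    snprintf(gl, sizeof gl, "%s/.glusterfs", tmpl);
    return mkdir(gl, 0755);
}

/* Hypothetical stage-3 helper: creates the gfid link and does not fail the
 * entry fop when link() reports EEXIST for a handle to the same inode.
 * Returns 0 on success or -errno. */
int create_gfid_link(const char *entry, const char *handle)
{
    struct stat e, h;

    assert(entry != NULL && handle != NULL);
    if (link(entry, handle) == 0)
        return 0;
    if (errno != EEXIST)
        return -errno;

    /* A racing gfid heal links this same inode; anything else is a clash. */
    if (lstat(entry, &e) != 0 || lstat(handle, &h) != 0)
        return -errno;
    return (e.st_dev == h.st_dev && e.st_ino == h.st_ino) ? 0 : -EEXIST;
}
```

With this helper, an MKNOD that loses the race to lookup still succeeds, while a handle that already belongs to a different file is still reported as `-EEXIST`.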

Change-Id: I84a5e54d214b6c47ed85671a880bb1c767a29f4d
fixes: bz#1638453
Signed-off-by: Krutika Dhananjay <kdhananj>

Comment 3 Shyamsundar 2019-03-25 16:31:17 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/