Bug 1651433 - Two data files in a small window during file migration and resultant corruption
Summary: Two data files in a small window during file migration and resultant corruption
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard: dht-rebalance-file
Depends On: 1330389
Blocks:
 
Reported: 2018-11-20 05:01 UTC by Susant Kumar Palai
Modified: 2020-03-12 12:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1330389
Environment:
Last Closed: 2020-03-12 12:54:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Susant Kumar Palai 2018-11-20 05:02:29 UTC
Description of problem:

At the end of migration, the following sequence of steps converts the src and dst file modes:

1. Remove the sticky and sgid bits from dst (this essentially converts dst to a data file).
2. Set the phase2 flags on src (this converts src to a linkto file).
3. Remove the linkto xattr on dst (essentially a NOP, since step 1 already converted dst to a data file).
4. Unlink src.
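
For illustration, here is a minimal sketch of that ordering using plain POSIX calls on paths. The real rebalance code works through syncops on the subvolumes and the concrete mode values differ, so treat the helper name and parameters below as purely hypothetical:

/* Illustrative only: the real rebalance code uses syncops on the
 * src/dst subvolumes, not raw syscalls on paths, and the mode
 * values passed in here are placeholders. */
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

static int
finish_migration_current (const char *src, const char *dst,
                          mode_t data_mode, mode_t phase2_mode)
{
        /* 1. dst becomes a plain data file here ...                */
        if (chmod (dst, data_mode) < 0)
                return -1;

        /*    <-- window: both src and dst look like data files --> */

        /* 2. ... but src is marked as phase 2 only afterwards      */
        if (chmod (src, phase2_mode) < 0)
                return -1;

        /* 3. essentially a NOP, dst is already a data file         */
        removexattr (dst, "trusted.glusterfs.dht.linkto");

        /* 4. drop the source                                       */
        return unlink (src);
}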

In the above scheme, in the time window after step 1 but before step 2, both src and dst are data files. This can cause subtle data corruption in the read path (though the window is very narrow):
1. Assume an application app1 has an fd (fd1) opened while the file is under migration (before step 1 of the migration algorithm).
2. Assume another application app2 opens an fd (fd2) after step 1 but before step 2. This fd is opened on dst.
3. Some writes are done on fd2; these writes end up on dst. app2 does an fsync and, on successful completion, notifies app1.
4. app1 reads the region written by app2. If this read happens before the sticky bit on src is removed (step 2 of the migration algorithm), the read is not attempted on dst, since dht_readv thinks the file is still in phase 1 of migration. As a result app1 does not see the data written by app2 (a stale read).
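
To make the application-level expectation concrete, here is a self-contained sketch of the app1/app2 coordination. The mount path and the pipe-based notification are illustrative; this program does not reproduce the race by itself, it only shows the write/fsync/notify/read pattern that should never observe stale data:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        const char *path = "/mnt/gv0/testfile";    /* illustrative mount path */
        char        buf[16] = {0};
        int         pfd[2];

        pipe (pfd);

        /* app1: fd1 is opened before/while the file is migrating */
        int fd1 = open (path, O_RDONLY);

        if (fork () == 0) {
                /* app2: opens in the window after step 1, writes,
                 * syncs, then notifies app1 over the pipe */
                int fd2 = open (path, O_WRONLY);
                write (fd2, "new-data", 8);
                fsync (fd2);
                write (pfd[1], "x", 1);            /* notify app1 */
                _exit (0);
        }

        /* app1: waits for the notification, then reads the region
         * app2 just wrote and fsync()ed */
        read (pfd[0], buf, 1);
        ssize_t nread = pread (fd1, buf, 8, 0);

        /* POSIX expectation: buf holds "new-data"; in the buggy
         * window dht_readv may still serve this read from src and
         * return stale contents instead */
        printf ("read %zd bytes: %.8s\n", nread, buf);
        return 0;
}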

The above scenario is a very narrow race involving two coordinated applications, but the applications' expectations are POSIX-compliant. Also note that if the rebalance process crashes in this window, the window becomes much bigger (it lasts until the next invocation of rebalance). I think the root of the problem is that dht_readv expects the file to be in phase 2 of migration before attempting the read on dst, but dst is converted to a data file before we set the phase2 flags on src.

The problem can be solved by setting the phase2 flags on src before converting dst to a data file. However, this has the side effect of there being no data file at all in that window, since the linkto xattr is already set on src. That can be overcome if we follow the steps below (see the sketch after the list):

1. Remove the linkto xattr from src (to ensure that merely setting the phase2 flags on src doesn't convert it into a linkto file).
2. Set the phase2 flags on src (src remains a data file even after this).
3. Remove the sticky and sgid bits from dst (this essentially converts dst to a data file).
4. Set the linkto xattr on src, pointing to dst.
5. Remove the linkto xattr on dst (essentially a NOP, since step 3 already converted dst to a data file).
6. Unlink src.
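
The same illustrative POSIX-level sketch as above, reordered as proposed (again, the helper, mode values and the linkto value written in step 4 are placeholders, not the real rebalance code):

#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Illustrative only, mirroring the earlier sketch. */
static int
finish_migration_proposed (const char *src, const char *dst,
                           const char *dst_subvol,   /* value for the linkto xattr */
                           mode_t data_mode, mode_t phase2_mode)
{
        /* 1. drop linkto from src so the phase2 flags alone do not
         *    turn it into a linkto file                            */
        removexattr (src, "trusted.glusterfs.dht.linkto");

        /* 2. mark src as phase 2; src stays a data file            */
        if (chmod (src, phase2_mode) < 0)
                return -1;

        /* 3. only now does dst become a data file                  */
        if (chmod (dst, data_mode) < 0)
                return -1;

        /* 4. point src's linkto at dst                             */
        setxattr (src, "trusted.glusterfs.dht.linkto",
                  dst_subvol, strlen (dst_subvol), 0);

        /* 5. essentially a NOP, dst is already a data file         */
        removexattr (dst, "trusted.glusterfs.dht.linkto");

        /* 6. drop the source                                       */
        return unlink (src);
}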

However, this has a problem: if the rebalance process crashes after step 2 but before step 3, all reads will be served from dst even though dst is not a data file yet. Then again, dst will be in sync with src even if it is not a data file (as all writes still go to dst). In fact, this situation can be identified when reads redirected to dst return iatts showing that dst is still not a data file. Couple that with a way of detecting a rebalance-process crash (for example, maintaining some state - analogous to the meta-lock - which gets cleaned up after a disconnect from the rebalance process), and we can prevent reads to dst in this specific scenario.
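
A rough, purely hypothetical sketch of that check on the read path (the helper names below do not exist in the DHT code; they only illustrate combining the mode returned in the iatt with knowledge of whether the rebalance process is still around):

#include <sys/stat.h>

/* Hypothetical: a simplified stand-in for the real DHT check on the
 * iatt returned by a read redirected to dst. */
static int
looks_like_linkto (mode_t mode)
{
        return (mode & S_ISVTX) && !(mode & 0777);
}

/* Hypothetical: decide whether a read redirected to dst should be
 * failed/retried instead of trusted, given dst's returned mode and
 * whether the rebalance process is known to be alive. */
static int
reject_read_from_dst (mode_t dst_mode, int rebalance_alive)
{
        return looks_like_linkto (dst_mode) && !rebalance_alive;
}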

Kind of a complicated scenario (less probable too?), but interesting nonetheless.

Version-Release number of selected component (if applicable):


How reproducible:
No reproducer; found through code review.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Worker Ant 2020-03-12 12:54:21 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/963 and will be tracked there from now on. Visit the GitHub issue URL for further details.

