Description of problem:
At the end of migration, following are the sequence of steps related to conversion of src and dst file modes:
1. remove sticky and sgid bits from dst (this essentially converts dst to datafile).
2. set phase2 flags on src (this converts src to linkto file)
3. remove linkto xattr on dst (this essentially is a NOP as step 1 converts dst to datafile).
4. unlink the src.
In the above scheme of things, in the time-window after 1 but before 2, both src and dst are data files. This can cause some subtle data corruptions in read path (though the window is very narrow).
1. Assume there is an fd opened by an application app1 - fd1 - while the file was in migration (before step 1 in migration algo).
2. Assume another fd opened by another application app2 - fd2 - after 1, but before 2. This fd would be opened on dst.
3. some writes are done on fd2. These writes end up on dst. The application app2 does an fsync and on successful completion, notifies app1.
4. app1 does a read to the region written by app2. But if this read is done before sticky bit on src is removed (step 2 in migration algo), the read is not attempted on dst as dht_readv thinks file is still under phase 1 of migration. This results in app1 not seeing data written by app2 (a stale read).
The above scenario is a very minute race corresponding to two co-ordinated applications. But the steps expectations are POSIX complaint. Also, note that if rebalance process crashes in this window, the windown becomes bigger (till next invocation of rebalance). I think the root of the problem is that dht_readv expects file to be in phase2 of migration before attempting read on dst, but dst is converted to data file before we set phase2 flags on src. The problem can be solved if we set phase2 flags on src before converting dst to data file. However, this has a side-effect of having no data file in that window as linkto xattrs are already set on src. This problem can be overcome if we follow following steps:
1. remove linkto xattr from src (to ensure just setting phase2 flags on src doesn't covert it into a linkto file).
2. set phase2 flags on src (src will remain a dst file even after this).
3. remove sticky and sgid bits from dst (this essentially converts dst to datafile).
4. set the linkto xattr on src to dst.
5. remove linkto xattr on dst (this essentially is a NOP as step 1 converts dst to datafile).
6. unlink the src.
However, this has a problem that if rebalance process crashes after step 2 but before step 3. All the reads will be from dst though dst is not a data-file yet. But again, dst will be in sync with src even if it is not a data-file (as all writes still go to dst). In fact this situation can be identified if reads redirected to dst return with iatts stating that dst is still not a data file. Couple this with an identification of rebalance process crash (like maintaining some state - analogous to meta-lock - which gets cleaned up after a disconnect from rebalance process), we can prevent reads to dst in this specific scenario.
Kind of a complicated scenario (less probable too?), but something interesting .
Version-Release number of selected component (if applicable):
No redproducer, found through code review
Steps to Reproduce:
Can this be tracked in upstream instead?