1. Created a 2x2 volume (bricks b1 to b4) on a 2-node cluster and fuse-mounted it on a client.
2. Brought down one brick, b1.
3. Started small-file creation on the mount.
4. Performed add-brick to convert it to an arbiter volume, i.e. from 2x2 to 2x(2+1), using a 3rd node for the newly added bricks. Let the new bricks be b5 and b6.
5. Ran `volume start force` to bring b1 back up.
6. I/O was still going on.
7. After I/O and self-heal completed, a few files were found to be missing on the newly added brick b5 (but present on the other bricks of the replica, i.e. b1 and b2). heal-info showed zero entries.
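Since heal-info reports nothing in this state, the discrepancy only shows up when the brick backends are compared directly. A minimal sketch of such a comparison follows. It assumes the brick directories of one replica set are all readable from the machine running the script (in the setup above they live on different nodes, so this would need remote mounts or copied listings), and the function names `list_entries` and `missing_entries` are made up for illustration. It compares names only, not contents, and skips the brick's internal `.glusterfs` directory.

```python
import os

def list_entries(brick_root):
    """Return relative paths of all files and directories under a brick,
    skipping the internal .glusterfs directory at the brick root."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # prune: do not descend into it
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.add(os.path.relpath(full, brick_root))
    return entries

def missing_entries(brick_roots):
    """Map each brick root to the sorted entries that exist on at least one
    other brick in the list but are absent on that brick."""
    listings = {root: list_entries(root) for root in brick_roots}
    union = set().union(*listings.values())
    return {root: sorted(union - names) for root, names in listings.items()}
```

Calling `missing_entries` with the (hypothetical) local paths of b1, b2 and b5 would list, per brick, the names that the other bricks have and it lacks.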
When add-brick was performed, the shd got the updated volfile first and it did a conservative merge (as expected) and reset the pending xattrs for entry-heal.
The fuse mount was still operating on the old graph (with replica 2), so the creates did not happen on b5. Only once the fuse mount also got the new graph did the creates go to all bricks.
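The window described above can be illustrated with a toy model. This is only a sketch, not gluster code: bricks are modelled as sets of file names, a "graph" is just the list of bricks a client writes to, and it ignores b1 being down. The names `old_graph`, `new_graph` and `create` are invented for the example.

```python
# Toy model: each brick is a set of file names.
bricks = {"b1": set(), "b2": set(), "b5": set()}

old_graph = ["b1", "b2"]        # replica 2, before add-brick
new_graph = ["b1", "b2", "b5"]  # replica 2 + arbiter, after add-brick

def create(graph, name):
    """Create a file on every brick in the client's current graph."""
    for brick in graph:
        bricks[brick].add(name)

# add-brick has happened and shd has already finished its merge,
# but the fuse client is still writing through the old graph.
create(old_graph, "f1")
create(old_graph, "f2")

# The fuse client now switches to the new graph.
create(new_graph, "f3")

missing_on_b5 = (bricks["b1"] | bricks["b2"]) - bricks["b5"]
print(sorted(missing_on_b5))  # ['f1', 'f2']
```

In the model, nothing records that b5 missed f1 and f2, because the client that created them did not know b5 existed. That matches heal-info showing zero entries.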
This is a gluster infra problem but is serious when replicate comes into the picture:
- If it were a plain distribute vol, the effect of the fuse client doing I/O on the old graph is that the files may get hashed based on the old layout.
- When replication is involved, this can lead to data loss:
In the above example, the files were present on b1 and b2 but not on b5. If for some reason, *later on*, an I/O happens that makes b5 the source for entry heal, then the heal will delete the files from b1 and b2.
We need to document this as a known issue, i.e. doing an add-brick to increase the replica count should only be done offline, when no I/O is going on.
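The data-loss path can be sketched with the same kind of toy model. It assumes a deliberately simplified entry heal in which every sink is made identical to the source; the real self-heal logic is more involved, and `entry_heal` is a name invented for this example.

```python
# Toy model: each brick is a set of file names.
bricks = {
    "b1": {"f1", "f2", "f3"},
    "b2": {"f1", "f2", "f3"},
    "b5": {"f3"},  # never received f1 and f2
}

def entry_heal(source, sinks):
    """Make each sink's name set equal to the source's.
    Returns the names that were removed from any sink."""
    removed = set()
    for sink in sinks:
        removed |= bricks[sink] - bricks[source]
        bricks[sink] = set(bricks[source])
    return removed

# Some later event blames b1 and b2, so b5 becomes the heal source.
lost = entry_heal("b5", ["b1", "b2"])
print(sorted(lost))  # ['f1', 'f2']
```

After the heal, all three bricks agree, but only on f3: the files that existed on two of the three bricks are gone.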
Edited the doc text slightly for the release notes.
Atin: Are there any inputs you can provide w.r.t. this issue? It's pretty old, but a customer enquired about it, as it's preventing them from increasing bricks without shutting down the production environment ...