Description of problem:
While server side heal is disabled, client side heal is not removing dirty flag for some files.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a 4+2 volume and mount
2. Touch 100 files
3. Kill one brick
4. write some data all files using dd
5. bricng the bricks UP
6. Append some data on all the bricks using dd, this will trigger heal on all the files
7. Read data from all the files using dd command.
At the end all files should be healed. However, I have observed that 2-3 files are still showing up in heal info.
When I looked for the getxattr, all the xattrs were same and dirty flag was still present for data fop.
REVIEW: https://review.gluster.org/21744 (cluster/ec: Don't enqueue an entry if it is already healing) posted (#2) for review on master by Ashish Pandey
There are two things we need to fix.
First, If there are number of entries to be healed and client touches all of them while SHD if off, it will trigger heal for all the files but not all the files will be placed in queue. This is because, for a single file there are number of fops coming and all are placing the same file in queue for healing which fills the queue quickly and new files will not get chance.
Second, When a file is started healing, sometimes we see that the dirty flag is not removed while version and size, have been healed and are same.
Need to find out why is this happening.
If the shd is OFF and if a client access this file it will not find any discrepancies as version and size on all the bricks are same, hence heal will not be triggered for this file and the dirty will remain as it is.
While debugging the failure of this patch and thinking of incorporating comments given by Pranith and Xavi,
I found that there is some design constraints to implement the idea of not enqueue an entry if it is already healing.
Consider 2+1 config and following scenario -
1 - Create volume and disable self heal daemon.
2 - Created a file wrote some data while all the bricks are UP.
3 - Kill one brick and write some data on the same file.
4 - Bring the brick UP.
5 - Now to trigger heal we will do "chmod 0666 file". This will do stst on file which will find the brick is not healthy and
trigger the heal.
6 - Now a synctask for the heal will be created and started which will call ec_heal_do, which in turn calls ec_heal_metadata and ec_heal_data.
7 - A fop setattr will also be called on the file to set permission.
Now, a sequence of steps could be like this-
a > Stat- which saw unhealthy file and triggered heal
b >ec_heal_metadata - took lock and healed metadata and healed metadata part of trusted.ec.version, release the lock on file. [At this point setattr is waiting for lock]
c > setattr takes the lock and found that the brick is still unhealthy as data version is not healed and miss matching. Mark the dirty for metadata version, unlock the file.
d > ec_heal_data takes the locks and heals the data.
Now, if we restrict only one fop to trigger heal, after step d, the file will contain dirty flag and mismatched metadata versions.
If we keep all the heal request from every fop in a queue and after every heal we check if the heal is needed or not then we will end up triggering heal for all the fop, defeats the purpose of the patch.
Please provide your comments. Am I correct in my understanding?
I have found a bug which was the actual cause of dirty flag remain set even after heal happened and all the version and size are matching.
This bug can only be visible when we have shd disabled as shd will clear the dirty flag if it has nothing to heal.
1 - Let's say we have disabled shd
2 - Create a file and then kill a brick
3 - Write data, around 1GB, on file which will be healed after bricks comes UP
4 - Bring the brick UP
5 - Do "chmod 0777 file" this will trigger heal.
6 - Immediately, start write on this file append 512 bytes using dd.
Now, while data healing was happening, write from mount came (step 6) and took the lock. It saw that healing flag is set on version xattr of file so it will send write on all the bricks. Before releasing lock it will also update version and size on ALL the bricks including brick which is healing.
However, in ec_update_info, we consider lock->good_mask to decide if we should unset the "dirty" flag or not which was set by this write fop, even if it has succeeded on all the bricks.
So, +1 of dirty flag will remain as it is by this write fop.
Now, data heal again got the lock and after completing data heal, we unset the dirty flag by decreasing it by the _same_ number which we found at the start of healing. This will not have the incremented value made by the write fop in step 6.
So, after healing a dirty flag will remain set on the file. This flag will never be unset if shd is not enabled.
REVIEW: https://review.gluster.org/21744 (cluster/ec: Don't enqueue an entry if it is already healing) merged (#11) on master by Xavi Hernandez
REVIEW: https://review.gluster.org/22907 (cluster/ec: Prevent double pre-op xattrops) posted (#1) for review on master by Pranith Kumar Karampuri
REVIEW: https://review.gluster.org/22907 (cluster/ec: Prevent double pre-op xattrops) merged (#4) on master by Pranith Kumar Karampuri