Description of problem:
The heal info command on a disperse (EC) volume lists new files that are still being written as entries pending heal.
When we want to update a 3-node EC cluster, we need to do the following steps:
1) Identify a node and make sure that at most the redundancy count of bricks is brought down
2) Kill glusterfs, glusterfsd and glusterd
3) Update the node
4) Bring glusterd back up
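The check in step 1 can be sketched as follows. This is only an illustration, not something taken from gluster itself: the function names, the `host:/path` brick strings and the example layout (one 4+2 subvolume spread over three nodes) are all assumptions made for the example.

```python
def bricks_on_node(subvolume_bricks, node):
    """Count the bricks of one disperse subvolume that live on `node`.

    Bricks are assumed to be written as "host:/path" strings.
    """
    return sum(1 for brick in subvolume_bricks if brick.split(":", 1)[0] == node)


def safe_to_take_down(subvolumes, node, redundancy):
    """Return True if stopping `node` removes at most `redundancy`
    bricks from every disperse subvolume."""
    return all(bricks_on_node(bricks, node) <= redundancy for bricks in subvolumes)


# Hypothetical layout: a single 4+2 subvolume, two bricks per node.
EXAMPLE_SUBVOLUMES = [
    ["node1:/bricks/b1", "node1:/bricks/b2",
     "node2:/bricks/b1", "node2:/bricks/b2",
     "node3:/bricks/b1", "node3:/bricks/b2"],
]

if __name__ == "__main__":
    print(safe_to_take_down(EXAMPLE_SUBVOLUMES, "node1", redundancy=2))
```

With a redundancy of 2 the node can be taken down, because each subvolume loses only two bricks. With the same layout and a redundancy of 1 the check would fail.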
After this, we now need to update the next node, but before that we need to make sure the healing is complete on the node which got updated.
For this we use the heal info command.
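A minimal sketch of how an admin script might total the pending entries is shown below. It assumes the heal info output contains one `Number of entries: N` line per brick; the function name and the sample text are made up for illustration and are not real output from this bug.

```python
def pending_heal_entries(heal_info_output):
    """Sum the per-brick 'Number of entries: N' lines of heal info output.

    Raises ValueError if a brick reports something other than a number,
    since the total cannot be trusted in that case.
    """
    total = 0
    for line in heal_info_output.splitlines():
        line = line.strip()
        if line.startswith("Number of entries:"):
            value = line.split(":", 1)[1].strip()
            if not value.isdigit():
                raise ValueError("brick did not report a count: %r" % line)
            total += int(value)
    return total


# Illustrative sample only; brick names and the file path are hypothetical.
SAMPLE_OUTPUT = """\
Brick node1:/bricks/b1
/dir/file1
Status: Connected
Number of entries: 1

Brick node2:/bricks/b1
Status: Connected
Number of entries: 0
"""

if __name__ == "__main__":
    print(pending_heal_entries(SAMPLE_OUTPUT))
```

A rolling-upgrade script would wait for this total to reach zero before moving to the next node, which is exactly the condition that never becomes true in the scenario described in this report.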
But if IO is in progress while doing this, heal info on an EC volume, unlike on an AFR volume, never drops to zero entries, because the files currently being written are shown as heal-pending entries.
As a result, an admin can never be sure when to update the next node, because he/she believes heal is still pending. However, these entries are spurious.
Steps and other details are available in 1347251 - IO error seen with Rolling or non-disruptive upgrade of an distribute-disperse(EC) volume from 3.1.2 to 3.1.3
Note: this can probably also be hit (not tested) when we bring down bricks in an EC volume, bring them back up, and keep waiting for heal to complete while IO is going on.
I was not able to hit the issue when I did a rolling upgrade of a 2 node 4+2 disperse volume from rhgs-3.1.2 to 3.1.3 with IO (dd'ing to a file) happening from a 3.1.2 client. The heal info was showing entries as long as IO was happening (shd was waiting for locks). Once the IO stopped, healing resumed and came to zero entries.
Nag, do you have a more consistent reproducer for the issue?
Also, please provide the getfattr outputs of the files from all bricks, and the logs. If the heal-info entries are spurious, then the trusted.ec* attributes of the file must be the same on all bricks (indicating no heal is pending).
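The comparison requested above could be scripted roughly like this. It is a sketch that assumes each brick's getfattr dump has `name=value` lines; the helper names and the hex values in the sample are invented for illustration and do not come from this bug.

```python
def parse_ec_xattrs(getfattr_output):
    """Extract the trusted.ec* attributes from one getfattr dump."""
    attrs = {}
    for line in getfattr_output.splitlines():
        line = line.strip()
        if line.startswith("trusted.ec") and "=" in line:
            name, value = line.split("=", 1)
            attrs[name] = value
    return attrs


def ec_xattrs_match(dumps):
    """Return True if every brick reports identical trusted.ec* attributes."""
    parsed = [parse_ec_xattrs(dump) for dump in dumps]
    return all(attrs == parsed[0] for attrs in parsed[1:])


# Hypothetical dumps from two bricks; the values are made up.
BRICK1_DUMP = """\
# file: bricks/b1/dir/file1
trusted.ec.size=0x0000000000100000
trusted.ec.version=0x00000000000000050000000000000005
"""
BRICK2_DUMP = """\
# file: bricks/b2/dir/file1
trusted.ec.size=0x0000000000100000
trusted.ec.version=0x00000000000000040000000000000005
"""

if __name__ == "__main__":
    print(ec_xattrs_match([BRICK1_DUMP, BRICK1_DUMP]))
    print(ec_xattrs_match([BRICK1_DUMP, BRICK2_DUMP]))
```

The `# file:` header line is skipped on purpose, because the brick path differs between bricks even when the attributes agree.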
A patch has been posted and merged upstream:
3.9 upstream patch: http://review.gluster.org/15627
I am not seeing this issue of spurious entries anymore on 3.8.4-13.
Note: the file that is currently being written can be seen in heal info due to a timing issue, which is acceptable.
Hence moving to verified.
(In reply to nchilaka from comment #9)
> QA verification:
> I am not seeing this issue of spurious entries anymore on 3.8.4-13
> Note, the file that is being written currently can be seen in the heal info
> due to timing issue, which is acceptable
Note: I am not seeing the file being written show up as heal pending (the whole purpose of the bz); however, such entries can be seen in a network partition case, which is expected.
> Hence moving to verified
So the fix is working.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.