Description of problem: On a replica 3 volume, if replace-brick is run on all three bricks (before self-heal has completed), we might end up with data loss. This is currently documented in the admin guide as a precaution the admin should take. However, it would be good if Gluster stopped the admin from making such a mistake.
Pranith, is this something we want to fix in gluster rather than Heketi? To prevent subsequent replace-bricks while one is already in progress, I think glusterd must be made to do what an admin would normally do by running heal-info, i.e. verify that the pending heal count is zero. If that is the way to go, then Ashiq's patch https://review.gluster.org/#/c/12154/ to get a heal-info summary would be helpful. glusterd would then need to run it, parse the XML for the number of entries, and allow/deny each replace-brick operation accordingly.
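The check proposed above could be sketched roughly as follows. This is only an illustration, not glusterd code: it shells out to `gluster volume heal <vol> info --xml` and sums the per-brick entry counts. The XML element name (`numberOfEntries`) is an assumption based on typical heal-info `--xml` output, and `allow_replace_brick` is a hypothetical helper name.

```python
# Sketch: gate a replace-brick on the pending heal count being zero,
# as suggested in comment 2. Illustrative only; not actual glusterd code.
import subprocess
import xml.etree.ElementTree as ET


def pending_heal_count(xml_text: str) -> int:
    """Sum the per-brick <numberOfEntries> fields from heal-info XML.

    The element name is an assumption about the heal-info --xml format.
    """
    root = ET.fromstring(xml_text)
    total = 0
    for entry in root.iter("numberOfEntries"):
        text = (entry.text or "").strip()
        if text.isdigit():
            total += int(text)
    return total


def allow_replace_brick(volume: str) -> bool:
    """Hypothetical gate: permit replace-brick only when no heals are pending."""
    out = subprocess.run(
        ["gluster", "volume", "heal", volume, "info", "--xml"],
        capture_output=True, text=True, check=True,
    ).stdout
    return pending_heal_count(out) == 0
```

As comment 4 below points out, a blanket "deny unless zero" gate like this would also block replacing the very brick that is causing the non-zero count, so the real check would need to be brick-aware.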
(In reply to Ravishankar N from comment #2)
> Pranith, is this something we want to fix in gluster rather than Heketi? To
> prevent subsequent replace-bricks while one is already in progress, I think
> glusterd must be made aware of what an admin would normally run heal-info
> and verify, i.e. ensure that pending heal count is zero. If that is the way
> to go, then Ashiq's patch https://review.gluster.org/#/c/12154/ to get
> healinfo summary would be helpful. glusterd then would need to run it, parse
> the xml for no. of entries and then allow/deny every replace brick operation.

That will lead to handling this part in all the management daemons. One of the recurring themes I have been hearing all this week is that CLIs shouldn't do anything dangerous. So it would be better to increase the intelligence of the replace-brick CLI, IMO.
(In reply to Pranith Kumar K from comment #3)
> That will lead to handling this part in all the management daemons. One of
> the recurring theme I have been hearing all this week is that the CLIs
> shouldn't do anything dangerous. So it would be better to increase
> intelligence of replace-brick CLI IMO.

`replace-brick` already works only with commit force, so that in itself is a good warning, I think. Also, if we disallow replace-brick when pending heals are not zero, a user cannot replace a faulty brick, because the faulty brick itself might be the reason why heal-info is non-zero in the first place (i.e. it can be the only 'sink' brick).
(In reply to Ravishankar N from comment #4)
> (In reply to Pranith Kumar K from comment #3)
> > That will lead to handling this part in all the management daemons. One of
> > the recurring theme I have been hearing all this week is that the CLIs
> > shouldn't do anything dangerous. So it would be better to increase
> > intelligence of replace-brick CLI IMO.
>
> `replace-brick` already works only with commit force, so that itself is a
> good warning I think.

That is interesting; in that case we need to make the command itself user-friendly, so that it works without a warning.

> Also, if we disallow replace-brick if pending heals
> are not zero, a user cannot replace a faulty brick because the faulty brick
> itself might be the reason why heal-info is non zero in the first place
> (i.e. it can be the only 'sink' brick).

Yes, that is correct. The question is: how do you propose to increase the intelligence of the replace-brick CLI so that replacing a brick is safer?
(In reply to Pranith Kumar K from comment #5)
> Yes that is correct, question is how do you propose to increase intelligence
> of replace-brick CLI so that it is safer to replace-brick?

For each of the entries needing heal, deciding which bricks are the sources/sinks can only be done by examining the xattrs on all bricks. Doing this in glusterd and responding back to the CLI within the CLI timeout can be an issue, like you said. What I can think of is this: if a brick is definitely not a source for any of the files, then its .glusterfs/indices/xattrop will contain only the xattrop base file and no other entries (entries undergoing I/O will still be under indices/dirty). So if a replace-brick command is given for a brick, the glusterd on that node can check whether the link count of the indices/xattrop base file is 1 and, if so, permit the operation.
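The link-count check described above could be sketched like this. It is a hedged illustration, not glusterd code: the directory layout (`.glusterfs/indices/xattrop` under the brick root), the base-file naming prefix (`xattrop-`), and the idea that each pending-heal index entry is a hard link to the base file are assumptions about AFR's on-disk index format as described in this comment.

```python
# Sketch of the proposed glusterd-side check: on the node hosting the brick
# to be replaced, permit replace-brick only if the xattrop base file has a
# link count of 1, i.e. no index entries are hard-linked to it, so the brick
# is not a heal source for any file. Illustrative only.
import os


def brick_safe_to_replace(brick_path: str) -> bool:
    """Return True if this brick's xattrop index shows no pending heals.

    Path layout and the "xattrop-" base-file prefix are assumptions
    about AFR's index format.
    """
    xattrop_dir = os.path.join(brick_path, ".glusterfs", "indices", "xattrop")
    for name in os.listdir(xattrop_dir):
        if name.startswith("xattrop-"):
            # st_nlink == 1 means only the base file itself exists;
            # every pending-heal index entry would add one hard link.
            return os.stat(os.path.join(xattrop_dir, name)).st_nlink == 1
    return True  # no base file found: nothing is pending on this brick
```

Note that this only needs local filesystem metadata on the brick's own node, which is why it avoids the CLI-timeout problem of inspecting xattrs across all bricks.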
(In reply to Ravishankar N from comment #6)
> (In reply to Pranith Kumar K from comment #5)
> > Yes that is correct, question is how do you propose to increase intelligence
> > of replace-brick CLI so that it is safer to replace-brick?
>
> For each of the entries needing heal, deciding which bricks are the
> sources/sinks can only be done by examining xattrs of all bricks. Doing this
> in glusterd and responding back to the CLI within the CLI time out can be an
> issue like you said. What I can think of is this: If a brick is definitely
> not a source for any of the files, then its .glusterfs/indices/xattrop will
> only have the xattrop-base-file and not any more entries (entries undergoing
> I/O will still be under indices/dirty). So if a replace-brick command is
> given for a brick, the glusterd on that node can check if link count of
> indices/xattrop/base-file is 1 and if yes, then permit the operation.

That is too much AFR-specific knowledge for glusterd to have, IMO. Can we do it in a generic fashion?
Closing the BZ as this is not a high-priority AFR bug at the moment. The documented procedure of waiting for pending heals to reach zero before doing a second replace-brick works, as noted in the bug description. Please re-open if you feel it needs to be re-prioritized.