Description of problem:
============================
On a cluster with a 1 x 3 (replicate) volume and a distribute volume (3 bricks), a replace-brick operation on the distribute volume kills the glustershd daemon process on every node in the cluster.

Version-Release number of selected component (if applicable):
============================
glusterfs-3.8.4-49

Steps to Reproduce:
============================
1) Create a 1 x 3 replicate volume and start it.
2) Create a distribute volume with 3 bricks and start it.
3) Run replace-brick on the distribute volume (a command-level sketch follows below).

Actual results:
============================
The glustershd self-heal daemon is killed on all nodes in the cluster.

Expected results:
============================
The self-heal daemon should not be killed. Replacing a brick in a distribute volume should not affect the self-heal daemon process.

Additional info:
============================

> before replacing the brick in the distribute volume

[root@dhcp35-153 ~]# ps -eaf | grep -i glustershd
root     16246     1  0 07:42 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/a0c711ad3004f4931540f945f85fcfb8.socket --xlator-option *replicate*.node-uuid=71259e62-428b-4d85-a187-e5c5660369f9
root     16288 14096  0 07:42 pts/1    00:00:00 grep --color=auto -i glustershd
[root@dhcp35-153 ~]#

> replace-brick

[root@dhcp35-153 ~]# gluster vol replace-brick voldist 10.70.35.153:/bricks/brick4/b4 10.70.35.153:/bricks/brick4/b4_1 commit force
volume replace-brick: success: replace-brick commit force operation successful
[root@dhcp35-153 ~]#

> check the self-heal daemon

[root@dhcp35-153 ~]# ps -eaf | grep -i glustershd
root     16865 14096  0 08:10 pts/1    00:00:00 grep --color=auto -i glustershd
[root@dhcp35-153 ~]#

[root@dhcp35-153 ~]# gluster v status
Status of volume: vol13
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.202:/bricks/brick5/b5                    49152     0          Y       4628
Brick 10.70.35.194:/bricks/brick5/b5                    49152     0          Y       28959
Brick 10.70.35.146:/bricks/brick5/b5                    49152     0          Y       28180
Self-heal Daemon on localhost                           N/A       N/A        N       N/A
Self-heal Daemon on dhcp35-202.lab.eng.blr.redhat.com   N/A       N/A        N       N/A
Self-heal Daemon on dhcp35-146.lab.eng.blr.redhat.com   N/A       N/A        N       N/A
Self-heal Daemon on dhcp35-194.lab.eng.blr.redhat.com   N/A       N/A        N       N/A

Task Status of Volume vol13
------------------------------------------------------------------------------
There are no active volume tasks

Volume vol43 is not started

Status of volume: voldist
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.153:/bricks/brick4/b4_1                  49153     0          Y       16409
Brick 10.70.35.202:/bricks/brick4/b4                    49156     0          Y       28937
Brick 10.70.35.194:/bricks/brick4/b4                    49156     0          Y       29111

Task Status of Volume voldist
------------------------------------------------------------------------------
There are no active volume tasks
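For quick reproduction, a minimal shell sketch of the steps above. The volume name volrep, the trailing "force" on create, and the exact brick paths are placeholders for illustration, not the exact layout used in this report:

#!/bin/bash
# Reproduction sketch only; "volrep" and the "force" flag on create are assumptions.

# 1) 1 x 3 replicate volume (this is what keeps glustershd running on each node)
gluster volume create volrep replica 3 \
    10.70.35.153:/bricks/brick5/b5 \
    10.70.35.202:/bricks/brick5/b5 \
    10.70.35.194:/bricks/brick5/b5 force
gluster volume start volrep

# 2) plain distribute volume with 3 bricks
gluster volume create voldist \
    10.70.35.153:/bricks/brick4/b4 \
    10.70.35.202:/bricks/brick4/b4 \
    10.70.35.194:/bricks/brick4/b4 force
gluster volume start voldist

# 3) replace a brick of the distribute volume
gluster volume replace-brick voldist \
    10.70.35.153:/bricks/brick4/b4 10.70.35.153:/bricks/brick4/b4_1 commit force

# check whether the self-heal daemon survived on this node
ps -eaf | grep -i glustershd | grep -v grep
gluster volume status volrep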
This is a negative test case, as replace-brick is supposed to be done only on bricks of a replica subvolume; replacing a brick of a plain distribute volume can lead to data loss. That said, what I think is happening is that, as part of replace-brick, glusterd kills the self-heal daemon on the assumption that it will restart shd with the new graph (containing the new brick path) once the replace-brick succeeds, but it probably is not doing so because this is a distribute volume. Need to check in the code, though.
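One quick way to check that theory from the CLI rather than the code is to compare the shd pid that glusterd tracks before and after the replace-brick, then look at glusterd's own log. This is only a sketch: the pidfile path is the one visible in the ps output above, and /var/log/glusterfs/glusterd.log assumes a default RPM install layout.

# run on one node in the cluster
OLD_PID=$(cat /var/run/gluster/glustershd/glustershd.pid 2>/dev/null)
gluster volume replace-brick voldist \
    10.70.35.153:/bricks/brick4/b4 10.70.35.153:/bricks/brick4/b4_1 commit force
NEW_PID=$(cat /var/run/gluster/glustershd/glustershd.pid 2>/dev/null)
echo "glustershd pid before=${OLD_PID:-none} after=${NEW_PID:-none}"
# see whether glusterd logged an attempt to restart shd after the operation
grep -i glustershd /var/log/glusterfs/glusterd.log | tail -n 20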
In RHGS-3.4, replace-brick will not be allowed for distribute-only volumes. I have a patch on the 3.12 branch now for the same. We can actually target this bug for 3.4.0 then?
https://review.gluster.org/18334 is the patch
(In reply to Atin Mukherjee from comment #3)
> In RHGS-3.4, replace-brick will not be allowed for distribute-only volumes.
> I have a patch on the 3.12 branch now for the same. We can actually target
> this bug for 3.4.0 then?

Makes sense. Feel free to assign the bug to yourself and mark the bug for 3.4.0 in the internal whiteboard.
Update:
==========
Verified the below scenario.

1. Created the replicate and distribute volumes.

2. Tried to do replace-brick; it fails as expected, per the patch mentioned in comment 5:

# gluster vol replace-brick dist 10.70.35.61:/bricks/brick1/b1 10.70.35.61:/bricks/brick1/b1_1 commit force
volume replace-brick: failed: replace-brick is not permitted on distribute only volumes. Please use add-brick and remove-brick operations instead.

3. Checked the glustershd pid:

# ps -eaf | grep -i glustershd
root     25630     1  0 05:29 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/ed8ce427ce1f40d0b8a8c3c5b162e9b7.socket --xlator-option *replicate*.node-uuid=be801d54-d39e-40cb-967c-0987cfd4f5f7
root     25871 20220  0 05:32 pts/0    00:00:00 grep --color=auto -i glustershd

4. Removed a brick from the distribute volume (commit after rebalance completed; see the sketch after this comment):

# gluster vol remove-brick dist 10.70.35.61:/bricks/brick1/b1 start
volume remove-brick start: success
ID: 4ede625b-1643-4100-b89b-27d322e63856

5. Added a brick to the distribute volume:

# gluster vol add-brick dist 10.70.35.61:/bricks/brick1/b1_new
volume add-brick: success

6. Checked the glustershd pid again (unchanged):

# ps -eaf | grep -i glustershd | grep -v grep
root     25630     1  0 05:29 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/ed8ce427ce1f40d0b8a8c3c5b162e9b7.socket --xlator-option *replicate*.node-uuid=be801d54-d39e-40cb-967c-0987cfd4f5f7

Changing status to Verified.
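For reference, a sketch of the supported brick-swap flow on a distribute-only volume, i.e. the add-brick/remove-brick sequence the new error message points to. The volume and brick names reuse the ones from the verification above; exact status output wording may vary by version.

# grow the volume first so the removed brick's data has somewhere to migrate
gluster volume add-brick dist 10.70.35.61:/bricks/brick1/b1_new

# start draining the old brick and watch the migration
gluster volume remove-brick dist 10.70.35.61:/bricks/brick1/b1 start
gluster volume remove-brick dist 10.70.35.61:/bricks/brick1/b1 status

# commit only once status reports the migration as completed, to avoid data loss
gluster volume remove-brick dist 10.70.35.61:/bricks/brick1/b1 commit

# glustershd should keep running with the same pid throughout
ps -eaf | grep -i glustershd | grep -v grep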
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607