Description of problem:
=======================
In my brick-mux setup, when one of the cluster nodes' glusterd holds a lock and we trigger a heal, the heal fails with the message below:

Launching heal operation to perform index self heal on volume cross3-23 has been unsuccessful on bricks that are down. Please check if all brick processes are running.

However, no bricks are down. I understand the node cannot tell whether the bricks are down or not, but since it is simply unable to communicate with the other node, it should report a more generic error. The glusterd log shows:

[2017-05-20 07:04:34.371968] W [glusterd-locks.c:572:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd4000) [0x7f058c6b5000] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd3f2e) [0x7f058c6b4f2e] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd930f) [0x7f058c6ba30f] ) 0-management: Lock for cross3-23 held by c4f9ba86-a666-4c72-a3cf-0d1339b36820

[root@dhcp35-45 ~]# gluster v heal cross3-23
Launching heal operation to perform index self heal on volume cross3-23 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
[root@dhcp35-45 ~]#
[root@dhcp35-45 ~]# gluster v status cross3-23
Another transaction is in progress for cross3-23. Please try again after sometime.
[root@dhcp35-45 ~]#
[root@dhcp35-45 ~]# gluster v status cross3-23
Status of volume: cross3-23
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick23/cross3-23    49152     0          Y       6094
Brick 10.70.35.130:/rhs/brick23/cross3-23   49152     0          Y       22705
Brick 10.70.35.122:/rhs/brick23/cross3-23   49152     0          Y       21893
Self-heal Daemon on localhost               N/A       N/A        Y       7811
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       23028
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       23835
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       7709

Task Status of Volume cross3-23
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up a three-node cluster, create a 1x3 volume, and make sure all bricks are up with no pending heals.
2. Simulate a situation where, say, n3's glusterd holds a lock for a volume-status transaction.
3. Issue a volume status; it reports "Another transaction is in progress for cross3-23. Please try again after sometime."
4. Trigger a manual heal with gluster v heal <vname>.

Actual results:
It throws a wrong error saying bricks are down.

Expected results:
It should throw a clearer, generic error (e.g. that another transaction holds the lock) instead of a misleading statement.
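For completeness, the lock contention in step 2 can be simulated without code changes. A minimal sketch, assuming placeholder node names n1/n3 and that a concurrent volume operation from another node holds the mgmt_v3 lock while it runs (the quoted outputs are the ones observed above):

```shell
# Hypothetical reproduction sketch; node names and the volume name
# cross3-23 are placeholders taken from this report.

# On node n3: keep issuing volume operations so glusterd repeatedly
# takes the mgmt_v3 lock for the volume.
while true; do
    gluster volume status cross3-23 > /dev/null 2>&1
done &

# On node n1, while n3 holds the lock:
gluster volume status cross3-23
# -> Another transaction is in progress for cross3-23.
#    Please try again after sometime.

gluster volume heal cross3-23
# -> (the bug) ... has been unsuccessful on bricks that are down.
#    Please check if all brick processes are running.
```

Any volume operation that takes the per-volume lock should work here; the loop just makes the race window easy to hit.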
Moving to POST, patch is https://review.gluster.org/#/c/15724/
Update:
=======
Build used: glusterfs-fuse-3.12.2-7.el7rhgs.x86_64

1. Create a 1x3 replicate volume and start it.
2. Bring one brick down.
3. Issue a heal.

# gluster vol status 13
Status of volume: 13
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.61:/bricks/brick2/b0         49154     0          Y       27632
Brick 10.70.35.174:/bricks/brick2/b1        N/A       N/A        N       N/A
Brick 10.70.35.17:/bricks/brick2/b1         49152     0          Y       17443
Self-heal Daemon on localhost               N/A       N/A        Y       27654
Self-heal Daemon on dhcp35-136.lab.eng.blr.redhat.com   N/A   N/A   Y   8012
Self-heal Daemon on dhcp35-17.lab.eng.blr.redhat.com    N/A   N/A   Y   17465
Self-heal Daemon on dhcp35-163.lab.eng.blr.redhat.com   N/A   N/A   Y   18956
Self-heal Daemon on dhcp35-214.lab.eng.blr.redhat.com   N/A   N/A   Y   17538
Self-heal Daemon on dhcp35-174.lab.eng.blr.redhat.com   N/A   N/A   Y   11243

Task Status of Volume 13
------------------------------------------------------------------------------
There are no active volume tasks

# gluster vol heal 13
Launching heal operation to perform index self heal on volume 13 has been unsuccessful: Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
#

Observation:
The message "Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details." is not appropriate for the brick-down scenario. It should be user friendly; in this case it should be something like "Bricks are down".

> Also, the patch mentioned in comment #4 looks different from the description of the problem. Could you please confirm?
Update:
=======
Build used: glusterfs-server-3.12.2-7.el7rhgs.x86_64

Scenarios verified:

1. Self-heal daemon disabled
# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful: Self-heal-daemon is disabled. Heal will not be triggered on volume 23
#

2. Self-heal daemon not running
# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful: Self-heal daemon is not running. Check self-heal daemon log file.
#

3. Volume stopped
# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful: Volume 23 is not started.
#

4. Brick down
# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful: Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
#

5. Lock held by another transaction
# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful: Another transaction is in progress for 23. Please try again after sometime.
#
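For reference, a hedged sketch of how each verification scenario above can be set up from the CLI. The commands are approximations, not the exact steps used during verification; process matching for the self-heal daemon in particular may differ by deployment:

```shell
# Hypothetical setup commands for the five scenarios (volume name "23").

# 1. Self-heal daemon disabled, via the volume option:
gluster volume set 23 self-heal-daemon off

# 2. Self-heal daemon not running: kill the glustershd process on this
#    node (matching by name is approximate; verify the PID first).
pkill -f glustershd

# 3. Volume stopped:
gluster volume stop 23

# 4. One brick down: kill the brick process using the Pid column from
#    "gluster volume status 23" (<brick-pid> is a placeholder).
kill <brick-pid>

# 5. Lock held: run a concurrent volume operation from another node
#    while issuing the heal, as in the original description.
```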
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days