When I issue heal statistics on an EC volume, it displays a wrong message saying it is unable to fetch from bricks which are offline. However, all the bricks are online.

[root@dhcp35-179 glusterfs]# gluster v heal dist-ec statistics
Gathering crawl statistics on volume dist-ec has been unsuccessful on bricks that are down. Please check if all brick processes are running.

[root@dhcp35-179 glusterfs]# gluster v status
Status of volume: dist-ec
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.179:/rhs/brick1/dist-ec      49152     0          Y       19313
Brick 10.70.35.180:/rhs/brick1/dist-ec      49152     0          Y       7020
Brick 10.70.35.86:/rhs/brick1/dist-ec       49152     0          Y       31806
Brick 10.70.35.9:/rhs/brick1/dist-ec        49152     0          Y       14347
Brick 10.70.35.153:/rhs/brick1/dist-ec      49152     0          Y       12841
Brick 10.70.35.79:/rhs/brick1/dist-ec       49152     0          Y       3774
Brick 10.70.35.179:/rhs/brick2/dist-ec      49153     0          Y       19332
Brick 10.70.35.180:/rhs/brick2/dist-ec      49153     0          Y       5787
Brick 10.70.35.86:/rhs/brick2/dist-ec       49153     0          Y       3770
Brick 10.70.35.9:/rhs/brick2/dist-ec        49153     0          Y       14366
Brick 10.70.35.153:/rhs/brick2/dist-ec      49153     0          Y       12860
Brick 10.70.35.79:/rhs/brick2/dist-ec       49153     0          Y       3793
Snapshot Daemon on localhost                49154     0          Y       21888
Self-heal Daemon on localhost               N/A       N/A        Y       22044
Snapshot Daemon on 10.70.35.180             49154     0          Y       15831
Self-heal Daemon on 10.70.35.180            N/A       N/A        Y       15934
Snapshot Daemon on 10.70.35.86              49154     0          Y       14409
Self-heal Daemon on 10.70.35.86             N/A       N/A        Y       14511
Snapshot Daemon on dhcp35-79.lab.eng.blr.redhat.com   49154  0   Y       12729
Self-heal Daemon on dhcp35-79.lab.eng.blr.redhat.com  N/A   N/A  Y       12833
Snapshot Daemon on 10.70.35.9               49154     0          Y       14254
Self-heal Daemon on 10.70.35.9              N/A       N/A        Y       14386
Snapshot Daemon on 10.70.35.153             49154     0          Y       12738
Self-heal Daemon on 10.70.35.153            N/A       N/A        Y       12881

Task Status of Volume dist-ec
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 1b482140-fd66-4d2e-9bae-903c74bcd875
Status               : completed

[root@dhcp35-179 glusterfs]# rpm -qa|grep gluste
glusterfs-libs-3.8.4-3.el7rhgs.x86_64
glusterfs-cli-3.8.4-3.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-3.el7rhgs.x86_64
glusterfs-server-3.8.4-3.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-fuse-3.8.4-3.el7rhgs.x86_64
glusterfs-api-3.8.4-3.el7rhgs.x86_64
glusterfs-3.8.4-3.el7rhgs.x86_64

[root@dhcp35-179 glusterfs]# r
Volume Name: dist-ec
Type: Distributed-Disperse
Volume ID: 3bcd582c-f0cd-446c-afce-bfa3b0b8e316
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.179:/rhs/brick1/dist-ec
Brick2: 10.70.35.180:/rhs/brick1/dist-ec
Brick3: 10.70.35.86:/rhs/brick1/dist-ec
Brick4: 10.70.35.9:/rhs/brick1/dist-ec
Brick5: 10.70.35.153:/rhs/brick1/dist-ec
Brick6: 10.70.35.79:/rhs/brick1/dist-ec
Brick7: 10.70.35.179:/rhs/brick2/dist-ec
Brick8: 10.70.35.180:/rhs/brick2/dist-ec
Brick9: 10.70.35.86:/rhs/brick2/dist-ec
Brick10: 10.70.35.9:/rhs/brick2/dist-ec
Brick11: 10.70.35.153:/rhs/brick2/dist-ec
Brick12: 10.70.35.79:/rhs/brick2/dist-ec
Options Reconfigured:
features.uss: enable
disperse.shd-max-threads: 3
disperse.heal-wait-qlength: 3
cluster.shd-max-threads: 4
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-179 glusterfs]#
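Since the CLI message disagrees with the actual brick state here, the "Online" column of `gluster v status` can be cross-checked directly. A minimal sketch of such a check, using an abridged sample of the captured output above in place of the live command (in practice you would pipe `gluster v status dist-ec` into the awk filter instead of the heredoc):

```shell
#!/bin/sh
# Count bricks whose Online column is "N" in `gluster v status` output.
# The heredoc below is sample data taken from this report; swap it for:
#   gluster v status dist-ec | awk '...'
offline=$(awk '/^Brick / && $(NF-1) == "N" { n++ } END { print n+0 }' <<'EOF'
Brick 10.70.35.179:/rhs/brick1/dist-ec      49152     0          Y       19313
Brick 10.70.35.180:/rhs/brick1/dist-ec      49152     0          Y       7020
Brick 10.70.35.86:/rhs/brick1/dist-ec       49152     0          Y       31806
EOF
)
echo "offline bricks: $offline"
# -> offline bricks: 0
```

If this prints 0 while `gluster v heal <vol> statistics` still claims bricks are down, the error message itself is wrong, which is exactly the behaviour reported here.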
The same behaviour can occasionally be observed on 3.7.x as well (3-node cluster, replica 3, type replicate).
Just to give a small update: I've encountered a similar issue in a 2-node cluster during an upgrade. Unfortunately, I didn't save the output of the commands.

One node was running 3.11.3-1 and the other one was still running 3.7.20-1. Although all bricks were online and everything was 5-by-5, I couldn't issue `% gluster volume heal`. The reason given was as above: "[...] has been unsuccessful on bricks that are down. Please check if all brick processes are running." As stated previously, I could see all bricks and PIDs from the remote node, everything as it should be. Once the whole cluster had been updated to 3.11.3, heal became available again.

There is one thing I don't understand about this bug, though. The original reporter reported the issue against 3.8.x (based on the provided output), yet the version set in the bug is 3.2. That's why I originally posted that this issue can be seen in 3.7.x as well.

Thanks.
(In reply to Zdenek Styblik from comment #7)
> Just to give a small update. I've encountered similar issue in 2 node
> cluster during upgrade. Unfortunately, I didn't save output of the commands.
>
> One node was running 3.11.3-1 and the other one was still running 3.7.20-1.
> Despite all bricks were online and everything was 5-by-5, I couldn't issue %
> gluster volume heal;. The reason given was as above ``[...] has been
> unsuccessful on bricks that are down. Please check if all brick processes
> are running.''. As stated previously, I could see all bricks and PIDs from
> remote node and everything as you should. Once the whole cluster has been
> updated to 3.11.3, heal became available.
>
> There is one thing I don't understand about this bug, though. Original
> reporter reported issue against 3.8.x(based on provided output), yet version
> set in bug is 3.2. That's why I've originally posted this issue can be seen
> in 3.7.x as well.
>
> Thanks.

I understand your confusion. The 3.2 in this bug is the version of enterprise Gluster (i.e. the Red Hat paid subscription product). That release is based on the 3.8.4 codeline (the same as the 3.8.4 community version). Yes, it is possible that this bug existed even before 3.8.x, as you mentioned. If a bug is raised against the product "Red Hat Gluster Storage", it means it was reported while testing the Red Hat paid subscription release. In such a case, kindly look into the description to identify the corresponding community version (in this case 3.8.4).
(In reply to Zdenek Styblik from comment #7)
> Just to give a small update. I've encountered similar issue in 2 node
> cluster during upgrade. Unfortunately, I didn't save output of the commands.
>
> One node was running 3.11.3-1 and the other one was still running 3.7.20-1.
> Despite all bricks were online and everything was 5-by-5, I couldn't issue %
> gluster volume heal;. The reason given was as above ``[...] has been
> unsuccessful on bricks that are down. Please check if all brick processes
> are running.''. As stated previously, I could see all bricks and PIDs from
> remote node and everything as you should. Once the whole cluster has been
> updated to 3.11.3, heal became available.
>
> There is one thing I don't understand about this bug, though. Original
> reporter reported issue against 3.8.x(based on provided output), yet version
> set in bug is 3.2. That's why I've originally posted this issue can be seen
> in 3.7.x as well.

One thing you'd need to keep an eye on while searching for any Gluster issues is the product type in Bugzilla. If you see a bug filed against "Red Hat Gluster Storage", please consider it a Red Hat product bug, not the community version. Bugs with "GlusterFS" as the product type are the ones reported by community users.

>
> Thanks.
The same issue is happening with 3.11.3.