Description of problem:
=========================
I have landed my testbed in a situation where I don't see brick info in vol status when it is issued from one node; however, I am able to see it from another node.
Also, the shd daemons of all the nodes are not seen in vol status from this node, but they are seen from another node.

problematic node view:
===========================
In the case below I don't even see the bricks:

[root@dhcp35-38 glusterfs]# gluster v status z-rep3-9
Status of volume: z-rep3-9
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------------
Self-heal Daemon on localhost                                     N/A       N/A        Y       963
Self-heal Daemon on dhcp35-184.lab.eng.blr.redhat.com             N/A       N/A        Y       19618
Self-heal Daemon on dhcp35-140.lab.eng.blr.redhat.com             N/A       N/A        Y       5480

Task Status of Volume z-rep3-9
----------------------------------------------------------------------------------------------------
There are no active volume tasks

In the case below I don't see the shd of the remaining 3 nodes, i.e. the nodes which don't host bricks:

[root@dhcp35-38 glusterfs]# gluster v status y-rep3-8
Status of volume: y-rep3-8
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------------
Brick dhcp35-140.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8  49152     0          Y       3216
Brick dhcp35-38.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8   49153     0          Y       26507
Brick dhcp35-184.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8  49152     0          Y       3266
Self-heal Daemon on localhost                                     N/A       N/A        Y       963
Self-heal Daemon on dhcp35-184.lab.eng.blr.redhat.com             N/A       N/A        Y       19618
Self-heal Daemon on dhcp35-140.lab.eng.blr.redhat.com             N/A       N/A        Y       5480

Task Status of Volume y-rep3-8
----------------------------------------------------------------------------------------------------
There are no active volume tasks

same info fetched from the good node:
================================
[root@dhcp35-140 test-scripts]# gluster v status y-rep3-8
Status of volume: y-rep3-8
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------------
Brick dhcp35-140.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8  49152     0          Y       3216
Brick dhcp35-38.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8   49153     0          Y       26507
Brick dhcp35-184.lab.eng.blr.redhat.com:/gluster/brick8/y-rep3-8  49152     0          Y       3266
Self-heal Daemon on localhost                                     N/A       N/A        Y       5480
Self-heal Daemon on dhcp35-218.lab.eng.blr.redhat.com             N/A       N/A        Y       10764
Self-heal Daemon on dhcp35-83.lab.eng.blr.redhat.com              N/A       N/A        Y       22560
Self-heal Daemon on dhcp35-127.lab.eng.blr.redhat.com             N/A       N/A        Y       5931
Self-heal Daemon on dhcp35-38.lab.eng.blr.redhat.com              N/A       N/A        Y       963
Self-heal Daemon on dhcp35-184.lab.eng.blr.redhat.com             N/A       N/A        Y       19618

Task Status of Volume y-rep3-8
----------------------------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-140 test-scripts]# gluster v status z-rep3-9
Status of volume: z-rep3-9
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------------
Brick dhcp35-83.lab.eng.blr.redhat.com:/gluster/brick9/z-rep3-9   49152     0          Y       3260
Brick dhcp35-127.lab.eng.blr.redhat.com:/gluster/brick9/z-rep3-9  49152     0          Y       3237
Brick dhcp35-218.lab.eng.blr.redhat.com:/gluster/brick9/z-rep3-9  49152     0          Y       3247
Self-heal Daemon on localhost                                     N/A       N/A        Y       5480
Self-heal Daemon on dhcp35-184.lab.eng.blr.redhat.com             N/A       N/A        Y       19618
Self-heal Daemon on dhcp35-38.lab.eng.blr.redhat.com              N/A       N/A        Y       963
Self-heal Daemon on dhcp35-218.lab.eng.blr.redhat.com             N/A       N/A        Y       10764
Self-heal Daemon on dhcp35-127.lab.eng.blr.redhat.com             N/A       N/A        Y       5931
Self-heal Daemon on dhcp35-83.lab.eng.blr.redhat.com              N/A       N/A        Y       22560

Task Status of Volume z-rep3-9
----------------------------------------------------------------------------------------------------
There are no active volume tasks

peer status shows all peers connected

Version-Release number of selected component (if applicable):
=======================
3.12.2-18.1

How reproducible:
===============
Hit it once

Steps to Reproduce:
1. Created about 22 volumes of type 1x3.
2. Started pumping I/O from 8 fuse clients to 8 of those volumes.
3. Started creating ~90 new volumes, then starting and deleting them; did that for about 3 iterations.
4. At the start of the 4th iteration, stopped this volume create/delete process.
5. During this process, had killed a brick on n2 and was bringing the brick back up for some volumes using start force.
6. Then restarted glusterd on n2.
(A rough shell sketch of this workflow is included after the Actual results below.)

Actual results:
==============
Seeing wrong vol status when issued from n2.
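For reference, here is a rough, untested shell sketch approximating the workflow above. The hostnames (n1, n2, n3), volume names, brick paths, mount points and the dd workload are placeholders chosen for illustration, not the exact ones used on this testbed:

#!/bin/bash
# Rough sketch of the reproduction workflow (placeholder names throughout).
# Assumes a trusted pool of nodes n1, n2, n3; run from n1 unless noted.

# 1. Create and start ~22 1x3 (replica 3) volumes
for i in $(seq 1 22); do
    gluster volume create vol-$i replica 3 \
        n1:/gluster/brick$i/vol-$i n2:/gluster/brick$i/vol-$i n3:/gluster/brick$i/vol-$i force
    gluster volume start vol-$i
done

# 2. On each of 8 fuse clients: mount one of the volumes and pump I/O
mount -t glusterfs n1:/vol-1 /mnt/vol-1
dd if=/dev/urandom of=/mnt/vol-1/testfile bs=1M count=1024 &

# 3./4. Churn: create, start and delete ~90 short-lived volumes, ~3 iterations
for iter in 1 2 3; do
    for i in $(seq 1 90); do
        gluster volume create tmp-$i replica 3 \
            n1:/gluster/tmpbrick$i/tmp-$i n2:/gluster/tmpbrick$i/tmp-$i n3:/gluster/tmpbrick$i/tmp-$i force
        gluster volume start tmp-$i
        gluster --mode=script volume stop tmp-$i
        gluster --mode=script volume delete tmp-$i
    done
done

# 5. On n2: pick one brick pid from "gluster volume status vol-1", kill it,
#    then bring the brick back with start force
BRICK_PID=12345            # placeholder: the actual pid comes from the status output
kill -9 "$BRICK_PID"
gluster volume start vol-1 force

# 6. On n2: restart glusterd, then compare the status views across nodes
systemctl restart glusterd
gluster volume status vol-1
gluster peer status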
sosreports and gluster-health reports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1637459
I'd like to highlight that when we write down steps to reproduce, we should mention explicitly from which nodes the commands were run, as that is a crucial data point for starting to analyze the issue. Unfortunately, it's not clear to me what exactly was tested here.
Now I understand that we have a fix posted for this mismatch in the caps value at https://review.gluster.org/#/c/21336/ . It would be worthwhile if you could also update the reproducer steps here, Sanju.
Reproducer steps:
1. In a cluster of n nodes, create a volume using bricks hosted on any n-1 of the nodes.
2. After volume creation, restart glusterd on any node.
3. Check peer status from the node where glusterd was restarted. The peer which is not hosting any bricks will be in the Rejected state.

Thanks,
Sanju
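A minimal command sketch of these reproducer steps, assuming a 3-node trusted pool with placeholder hostnames n1, n2 and n3; a plain 2-brick volume on n1 and n2 is used here just to keep n3 brick-less (the steps above don't mandate a specific volume type):

# Step 1: from n1, create and start a volume whose bricks live only on n1 and n2,
#         so that n3 is a peer hosting no bricks (paths and names are placeholders)
gluster volume create testvol n1:/gluster/brick1/testvol n2:/gluster/brick1/testvol force
gluster volume start testvol

# Step 2: restart glusterd on any node, e.g. on n1
systemctl restart glusterd

# Step 3: from the node where glusterd was restarted, check peer status;
#         per this bug, the brick-less peer (n3) shows up in the Rejected state
gluster peer status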
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827