Description of problem:
=======================
Had a 4-node brick-mux-enabled cluster with 3 volumes, one of which was a plain distribute 4x1 volume named 'dist'. The cluster was managed by RHGS Console. Executed a remove-brick from the UI for the fourth brick of 'dist', and it failed instantly. Unable to make out much from the UI logs, so logged into the CLI and saw the error messages below on the node where the rebalance was supposed to run:

[2017-08-26 07:30:19.773041] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-dist-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2017-08-26 07:30:19.773623] E [MSGID: 109027] [dht-rebalance.c:4401:gf_defrag_start_crawl] 0-dist-dht: Failed to start rebalance: look up on / failed

Gluster eventing was enabled on the cluster as well. It correctly showed the VOLUME_REMOVE_BRICK_START event but showed no event for its failure (that would be a separate issue); mentioning it here in case the glusterfs-events logs give a clue about what went wrong.

Stopped the failed rebalance operation. Then tried to set a quota usage-limit on the volume, and that failed as well, with the following errors:

[2017-08-26 07:00:32.795025] E [MSGID: 106176] [glusterd-quota.c:1939:glusterd_create_quota_auxiliary_mount] 0-management: Failed to mount glusterfs client. Please check the log file /var/log/glusterfs/quota-mount-dist.log for more details [Transport endpoint is not connected]
[2017-08-26 07:00:32.795117] E [MSGID: 106528] [glusterd-quota.c:2117:glusterd_op_stage_quota] 0-management: Failed to start aux mount

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-42

How reproducible:
=================
1:1

Additional info:
================
[2017-08-26 07:30:19.768379] E [socket.c:2360:socket_connect_finish] 0-dist-client-2: connection to 10.70.37.94:49154 failed (Connection refused); disconnecting socket
[2017-08-26 07:30:19.771804] E [socket.c:2360:socket_connect_finish] 0-dist-client-3: connection to 10.70.37.98:49154 failed (Connection refused); disconnecting socket
The message "W [MSGID: 109073] [dht-common.c:9279:dht_notify] 0-dist-dht: Received CHILD_DOWN. Exiting" repeated 3 times between [2017-08-26 07:30:19.754912] and [2017-08-26 07:30:19.771877]
[2017-08-26 07:30:19.773041] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-dist-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2017-08-26 07:30:19.773623] E [MSGID: 109027] [dht-rebalance.c:4401:gf_defrag_start_crawl] 0-dist-dht: Failed to start rebalance: look up on / failed
[2017-08-26 07:30:19.773888] I [MSGID: 109028] [dht-rebalance.c:5059:gf_defrag_status_get] 0-dist-dht: Rebalance is failed. Time taken is 0.00 secs
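The 'Connection refused' errors above point at the brick port (49154) on 10.70.37.94 and 10.70.37.98 not being reachable, which would also explain the quota aux mount failure, since both the rebalance crawl and the aux mount rely on an ordinary gluster client mount of the volume. As a generic sanity check (a sketch only, not the exact steps run for this bug; /mnt/distcheck is an arbitrary mount point), one could verify the brick sockets on the affected nodes and attempt a manual FUSE mount:

    # On the nodes that refused connections (10.70.37.94 / 10.70.37.98):
    ss -ltnp | grep 49154          # is anything listening on the brick port?
    ps -ef | grep glusterfsd       # with brick-mux enabled, several bricks share one glusterfsd process

    # On any node: try a manual FUSE mount of 'dist', the same kind of client
    # mount that the rebalance crawl and the quota aux mount depend on:
    mkdir -p /mnt/distcheck
    mount -t glusterfs 10.70.37.78:/dist /mnt/distcheck
    ls /mnt/distcheck                                   # triggers a lookup on / from the client side
    tail -n 50 /var/log/glusterfs/mnt-distcheck.log     # client log is named after the mount point
    umount /mnt/distcheck

If the manual mount throws the same 'Transport endpoint is not connected' errors, the rebalance and quota failures are just symptoms of those bricks being unreachable.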
[root@dhcp37-78 ~]# gluster v info dist

Volume Name: dist
Type: Distribute
Volume ID: 6711eb76-132e-4f79-ac65-86a745120ba3
Status: Started
Snapshot Count: 1
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.78:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick1/dist_0
Brick2: dhcp37-86.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick2/dist_1
Brick3: dhcp37-94.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick3/dist_2
Brick4: dhcp37-98.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick4/dist_3
Options Reconfigured:
features.scrub: Active
features.bitrot: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: *
user.cifs: enable
transport.address-family: inet
nfs.disable: off
cluster.brick-multiplex: enable

[root@dhcp37-78 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp37-86.lab.eng.blr.redhat.com
Uuid: 94928003-62e6-441a-a693-500f812dcbd9
State: Peer in Cluster (Connected)

Hostname: dhcp37-94.lab.eng.blr.redhat.com
Uuid: 71dfe4b0-eee2-4678-bb03-c149a57e1cfc
State: Peer in Cluster (Connected)

Hostname: dhcp37-98.lab.eng.blr.redhat.com
Uuid: c085554d-3b49-4615-af4d-d14086338e36
State: Peer in Cluster (Connected)

[root@dhcp37-78 ~]# rpm -qa | grep gluster
glusterfs-events-3.8.4-42.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-42.el7rhgs.x86_64
glusterfs-cli-3.8.4-42.el7rhgs.x86_64
python-gluster-3.8.4-42.el7rhgs.noarch
glusterfs-geo-replication-3.8.4-42.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.2.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
glusterfs-libs-3.8.4-42.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-42.el7rhgs.x86_64
glusterfs-fuse-3.8.4-42.el7rhgs.x86_64
glusterfs-server-3.8.4-42.el7rhgs.x86_64
glusterfs-rdma-3.8.4-42.el7rhgs.x86_64
glusterfs-api-3.8.4-42.el7rhgs.x86_64

[root@dhcp37-78 ~]# gluster v status
Another transaction is in progress for dist. Please try again after sometime.
Status of volume: ecvol
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.78:/bricks/brick0/ecvol0               49153     0          Y       21356
Brick 10.70.37.86:/bricks/brick0/ecvol1               49154     0          Y       23355
Brick 10.70.37.94:/bricks/brick0/ecvol2               49153     0          Y       18913
Self-heal Daemon on localhost                         N/A       N/A        Y       21388
Quota Daemon on localhost                             N/A       N/A        Y       21778
Self-heal Daemon on dhcp37-86.lab.eng.blr.redhat.com  N/A       N/A        Y       23393
Quota Daemon on dhcp37-86.lab.eng.blr.redhat.com      N/A       N/A        Y       23670
Self-heal Daemon on dhcp37-94.lab.eng.blr.redhat.com  N/A       N/A        Y       18946
Quota Daemon on dhcp37-94.lab.eng.blr.redhat.com      N/A       N/A        Y       19513
Self-heal Daemon on dhcp37-98.lab.eng.blr.redhat.com  N/A       N/A        Y       30296
Quota Daemon on dhcp37-98.lab.eng.blr.redhat.com      N/A       N/A        Y       30523

Task Status of Volume ecvol
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: master
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.78:/bricks/brick0/master_0             49154     0          Y       9439
Brick dhcp37-86.lab.eng.blr.redhat.com:/bricks/brick0/master_1  49152  0   Y       10544
NFS Server on localhost                               2049      0          Y       21378
Self-heal Daemon on localhost                         N/A       N/A        Y       21388
Quota Daemon on localhost                             N/A       N/A        Y       21778
NFS Server on dhcp37-86.lab.eng.blr.redhat.com        2049      0          Y       23381
Self-heal Daemon on dhcp37-86.lab.eng.blr.redhat.com  N/A       N/A        Y       23393
Quota Daemon on dhcp37-86.lab.eng.blr.redhat.com      N/A       N/A        Y       23670
NFS Server on dhcp37-94.lab.eng.blr.redhat.com        2049      0          Y       18937
Self-heal Daemon on dhcp37-94.lab.eng.blr.redhat.com  N/A       N/A        Y       18946
Quota Daemon on dhcp37-94.lab.eng.blr.redhat.com      N/A       N/A        Y       19513
NFS Server on dhcp37-98.lab.eng.blr.redhat.com        2049      0          Y       30287
Self-heal Daemon on dhcp37-98.lab.eng.blr.redhat.com  N/A       N/A        Y       30296
Quota Daemon on dhcp37-98.lab.eng.blr.redhat.com      N/A       N/A        Y       30523

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
[root@dhcp37-78 ~]#

'gluster volume status' of 'dist' fails with the error 'Another transaction is in progress for dist. Please try again after sometime.' The glusterd log file says that the lock for dist is held by the node itself. I am guessing that is due either to sosreports being triggered on all the nodes of the cluster at around the same time, or to the Console periodically querying the volume status.
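To narrow down what is holding the cluster-wide lock on 'dist', the glusterd logs on each node can be searched for lock acquire/release messages around the time of the failed status command and cross-checked against the sosreport collection times and the Console's polling interval. This is only an approximate sketch; the exact log file name and message text vary between glusterfs versions (on this release the glusterd log is typically /var/log/glusterfs/etc-glusterfs-glusterd.vol.log):

    # Run on each of the four nodes:
    grep -i lock /var/log/glusterfs/*glusterd*.log | grep -i dist | tail -n 20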
Have set the priority of this bug to medium, as multiple factors have come into play for the volume in question, 'dist'. Apart from the ones already mentioned in the description, 'dist' is a snapshot-restored volume. Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
Found a bug of a similar kind, bug 1408621, but that one was on a disperse volume. Please mark this bug as a duplicate if the root cause turns out to be the same.
Setting the KnownIssue flag for this and closing as WONTFIX, mainly because this behaviour follows from the current design of rebalance.