Bug 1485616 - Remove brick fails on a distribute volume - "Failed to start rebalance: look up on / failed"
Summary: Remove brick fails on a distribute volume - "Failed to start rebalance: look ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Vijay Bellur
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-26 08:45 UTC by Sweta Anandpara
Modified: 2018-10-24 09:55 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: A rebalance (or remove-brick) operation requires all subvolumes of the distribute volume to be online. Consequence: If any brick/subvolume is down while one of these operations is running, the rebalance (or remove-brick) process terminates. Workaround (if any): Ensure all bricks are online before starting these operations. Result: When all bricks are online, the rebalance (or remove-brick) operation completes without issues.
Clone Of:
Environment:
Last Closed: 2018-10-24 09:55:42 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1408621 0 medium CLOSED rebalance operation because of remove-brick failed on one of the cluster node 2021-02-22 00:41:40 UTC

Internal Links: 1408621

Description Sweta Anandpara 2017-08-26 08:45:17 UTC
Description of problem:
======================
Had a 4-node, brick-mux-enabled cluster with 3 volumes, one of which was a plain distribute 4*1 volume named 'dist'. The cluster was part of RHGS-Console. Executed a remove-brick from the UI for the fourth brick of 'dist', and it failed instantly. Unable to make out much from the UI logs, so logged into the CLI and saw the error messages below on the node where the rebalance was supposed to take place.

[2017-08-26 07:30:19.773041] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-dist-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2017-08-26 07:30:19.773623] E [MSGID: 109027] [dht-rebalance.c:4401:gf_defrag_start_crawl] 0-dist-dht: Failed to start rebalance: look up on / failed
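
The lookup failure on / is consistent with one or more subvolumes of 'dist' being unreachable when the rebalance process started. As a minimal sketch (assuming the 'dist' volume from this report), the brick state can be verified from any cluster node before retrying:

# All bricks of 'dist' should show Online = Y before rebalance/remove-brick is started
gluster volume status dist

# Brick processes that are reported offline can usually be restarted in place
gluster volume start dist force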

Had gluster-eventing enabled on the cluster as well. It correctly showed the VOLUME_REMOVE_BRICK_START event but showed no event for the subsequent failure (that would be a separate issue). Mentioning it here just FYI, in case the glusterfs-events logs give a clue about what went wrong.

Stopped the failed rebalance operation. Tried to set the quota usage-limit on the volume, and that failed as well with the error:

[2017-08-26 07:00:32.795025] E [MSGID: 106176] [glusterd-quota.c:1939:glusterd_create_quota_auxiliary_mount] 0-management: Failed to mount glusterfs client. Please check the log file /var/log/glusterfs/quota-mount-dist.log for more details [Transport endpoint is not connected]
[2017-08-26 07:00:32.795117] E [MSGID: 106528] [glusterd-quota.c:2117:glusterd_op_stage_quota] 0-management: Failed to start aux mount
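
The quota failure appears to stem from the same condition: the auxiliary mount is a glusterfs client mount, so it cannot come up while bricks are unreachable ('Transport endpoint is not connected'). A hedged sketch of retrying once the bricks are back online (the limit value is only an illustration, not from this report):

# Inspect the aux-mount log referenced in the error
less /var/log/glusterfs/quota-mount-dist.log

# Once all bricks are online again, retry setting the limit (example value)
gluster volume quota dist limit-usage / 10GB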



Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-42


How reproducible:
=================
1:1


Additional info:
================

[2017-08-26 07:30:19.768379] E [socket.c:2360:socket_connect_finish] 0-dist-client-2: connection to 10.70.37.94:49154 failed (Connection refused); disconnecting socket
[2017-08-26 07:30:19.771804] E [socket.c:2360:socket_connect_finish] 0-dist-client-3: connection to 10.70.37.98:49154 failed (Connection refused); disconnecting socket
The message "W [MSGID: 109073] [dht-common.c:9279:dht_notify] 0-dist-dht: Received CHILD_DOWN. Exiting" repeated 3 times between [2017-08-26 07:30:19.754912] and [2017-08-26 07:30:19.771877]
[2017-08-26 07:30:19.773041] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-dist-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2017-08-26 07:30:19.773623] E [MSGID: 109027] [dht-rebalance.c:4401:gf_defrag_start_crawl] 0-dist-dht: Failed to start rebalance: look up on / failed
[2017-08-26 07:30:19.773888] I [MSGID: 109028] [dht-rebalance.c:5059:gf_defrag_status_get] 0-dist-dht: Rebalance is failed. Time taken is 0.00 secs
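
The 'Connection refused' errors above identify the offline subvolumes (dist-client-2 and dist-client-3, i.e. the bricks on 10.70.37.94 and 10.70.37.98). A minimal cross-check on those nodes, assuming standard tooling is available:

# On the affected brick node: nothing should be listening on the refused port
ss -tlnp | grep 49154

# Compare against the port/PID table that gluster itself reports
gluster volume status dist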


[root@dhcp37-78 ~]# gluster v info dist
 
Volume Name: dist
Type: Distribute
Volume ID: 6711eb76-132e-4f79-ac65-86a745120ba3
Status: Started
Snapshot Count: 1
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.78:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick1/dist_0
Brick2: dhcp37-86.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick2/dist_1
Brick3: dhcp37-94.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick3/dist_2
Brick4: dhcp37-98.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick4/dist_3
Options Reconfigured:
features.scrub: Active
features.bitrot: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: *
user.cifs: enable
transport.address-family: inet
nfs.disable: off
cluster.brick-multiplex: enable
[root@dhcp37-78 ~]# 
[root@dhcp37-78 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp37-86.lab.eng.blr.redhat.com
Uuid: 94928003-62e6-441a-a693-500f812dcbd9
State: Peer in Cluster (Connected)

Hostname: dhcp37-94.lab.eng.blr.redhat.com
Uuid: 71dfe4b0-eee2-4678-bb03-c149a57e1cfc
State: Peer in Cluster (Connected)

Hostname: dhcp37-98.lab.eng.blr.redhat.com
Uuid: c085554d-3b49-4615-af4d-d14086338e36
State: Peer in Cluster (Connected)
[root@dhcp37-78 ~]# 
[root@dhcp37-78 ~]# rpm -qa | grep gluster
glusterfs-events-3.8.4-42.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-42.el7rhgs.x86_64
glusterfs-cli-3.8.4-42.el7rhgs.x86_64
python-gluster-3.8.4-42.el7rhgs.noarch
glusterfs-geo-replication-3.8.4-42.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.2.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
glusterfs-libs-3.8.4-42.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-42.el7rhgs.x86_64
glusterfs-fuse-3.8.4-42.el7rhgs.x86_64
glusterfs-server-3.8.4-42.el7rhgs.x86_64
glusterfs-rdma-3.8.4-42.el7rhgs.x86_64
glusterfs-api-3.8.4-42.el7rhgs.x86_64
[root@dhcp37-78 ~]# 
[root@dhcp37-78 ~]# 
[root@dhcp37-78 ~]# gluster v status
Another transaction is in progress for dist. Please try again after sometime.
 
Status of volume: ecvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.78:/bricks/brick0/ecvol0     49153     0          Y       21356
Brick 10.70.37.86:/bricks/brick0/ecvol1     49154     0          Y       23355
Brick 10.70.37.94:/bricks/brick0/ecvol2     49153     0          Y       18913
Self-heal Daemon on localhost               N/A       N/A        Y       21388
Quota Daemon on localhost                   N/A       N/A        Y       21778
Self-heal Daemon on dhcp37-86.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       23393
Quota Daemon on dhcp37-86.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       23670
Self-heal Daemon on dhcp37-94.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       18946
Quota Daemon on dhcp37-94.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       19513
Self-heal Daemon on dhcp37-98.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       30296
Quota Daemon on dhcp37-98.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       30523
 
Task Status of Volume ecvol
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.78:/bricks/brick0/master_0   49154     0          Y       9439 
Brick dhcp37-86.lab.eng.blr.redhat.com:/bri
cks/brick0/master_1                         49152     0          Y       10544
NFS Server on localhost                     2049      0          Y       21378
Self-heal Daemon on localhost               N/A       N/A        Y       21388
Quota Daemon on localhost                   N/A       N/A        Y       21778
NFS Server on dhcp37-86.lab.eng.blr.redhat.
com                                         2049      0          Y       23381
Self-heal Daemon on dhcp37-86.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       23393
Quota Daemon on dhcp37-86.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       23670
NFS Server on dhcp37-94.lab.eng.blr.redhat.
com                                         2049      0          Y       18937
Self-heal Daemon on dhcp37-94.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       18946
Quota Daemon on dhcp37-94.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       19513
NFS Server on dhcp37-98.lab.eng.blr.redhat.
com                                         2049      0          Y       30287
Self-heal Daemon on dhcp37-98.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       30296
Quota Daemon on dhcp37-98.lab.eng.blr.redha
t.com                                       N/A       N/A        Y       30523
 
Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp37-78 ~]#


'gluster volume status' for 'dist' fails with the error 'Another transaction is in progress for dist. Please try again after sometime.' The glusterd log file says that the lock for dist is held by glusterd itself. I am guessing that is due either to sosreports being triggered on all the nodes of the cluster at around the same time, or to the Console periodically polling for status.
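
If the stale cluster lock does not clear on its own, one commonly used recovery step (offered here only as a hedged suggestion; it was not verified in this bug) is to restart the glusterd instance that the log identifies as holding the lock:

# On the node whose glusterd log shows it is holding the lock for 'dist'
systemctl restart glusterd

# Then re-check
gluster volume status dist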

Comment 2 Sweta Anandpara 2017-08-26 08:47:25 UTC
Have set the priority of this bug to medium, as multiple factors have come into play for the volume in question, 'dist'. Other than the ones already mentioned in the description, 'dist' was a snapshot-restored volume.

Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Comment 3 Sweta Anandpara 2017-08-26 08:48:28 UTC
Found a similar bug, 1408621, but that was on a disperse volume. Please mark this bug as a duplicate if the root cause is found to be the same.

Comment 7 Amar Tumballi 2018-10-24 09:55:42 UTC
Set the KnownIssue flag for this and closing as WONTFIX, mainly because this is how rebalance is designed: it requires all subvolumes to be online.
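
For reference, a minimal sketch of the workaround sequence from the Doc Text, using the fourth brick of 'dist' from this report as the example brick:

BRICK=dhcp37-98.lab.eng.blr.redhat.com:/run/gluster/snaps/104406ff0ab34ef791363ae524737448/brick4/dist_3

# 1. Confirm every brick/subvolume of 'dist' is online before starting
gluster volume status dist

# 2. Start the remove-brick (internally a rebalance, so all subvolumes must be up)
gluster volume remove-brick dist $BRICK start

# 3. Watch progress and commit only after data migration completes
gluster volume remove-brick dist $BRICK status
gluster volume remove-brick dist $BRICK commit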

