Bug 1006824

Summary: add-brick and rebalance commands exit without an error message when one of the nodes is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Anush Shetty <ashetty>
Component: glusterd
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED DEFERRED
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 2.1
CC: spalai, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1286182, 1286190 (view as bug list)
Environment:
Last Closed: 2015-11-27 12:15:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1286182, 1286190

Description Anush Shetty 2013-09-11 10:47:00 UTC
Description of problem: On a 1x2 replicated volume, we brought down (powered off) one of the nodes and then tried adding 2 more bricks; add-brick failed without an error message. After the node was brought back up, add-brick succeeded, making it a 2x2 Distributed-Replicate volume.

After adding the bricks, rebalance was started. During the rebalance, a node hosting a source brick and a node hosting a destination brick were brought down. The rebalance status command then fails without an error message.

We had to restart the glusterd daemon to get the rebalance status command working.


Version-Release number of selected component (if applicable): glusterfs-3.4.0.33rhs-1.el6rhs.x86_64

How reproducible: Consistently 

Steps to reproduce:

1) Volume info of the original 1x2 replicated volume:

Volume Name: cinder-vol
Type: Replicate
Volume ID: 843455d2-3665-4cf5-8390-923f8edac27f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/rhs/brick2/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/rhs/brick2/s2
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165
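
The report starts from this existing volume; the create commands are not part of the original report. For reproduction, a setup matching the volume info above could be built roughly as follows (reconstructed sketch, option values copied from the listing, uid/gid 165 presumably being the cinder service account):

[root@rhshdp01 ~]# gluster volume create cinder-vol replica 2 rhshdp01.lab.eng.blr.redhat.com:/rhs/brick2/s1 rhshdp02.lab.eng.blr.redhat.com:/rhs/brick2/s2
[root@rhshdp01 ~]# gluster volume set cinder-vol performance.quick-read off
[root@rhshdp01 ~]# gluster volume set cinder-vol performance.read-ahead off
[root@rhshdp01 ~]# gluster volume set cinder-vol performance.io-cache off
[root@rhshdp01 ~]# gluster volume set cinder-vol performance.stat-prefetch off
[root@rhshdp01 ~]# gluster volume set cinder-vol cluster.eager-lock enable
[root@rhshdp01 ~]# gluster volume set cinder-vol network.remote-dio enable
[root@rhshdp01 ~]# gluster volume set cinder-vol storage.owner-uid 165
[root@rhshdp01 ~]# gluster volume set cinder-vol storage.owner-gid 165
[root@rhshdp01 ~]# gluster volume start cinder-vol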

2) Powered off rhshdp02.lab.eng.blr.redhat.com

3) Did an add-brick

[root@rhshdp01 ~]# gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3 rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4 

4) Add-brick did not succeed; it exited without an error message.
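
Since the failed command prints nothing, the shell exit status is the only immediate indication of failure. This check was not captured in the original session; as an illustrative verification one could re-run the command and inspect the status:

[root@rhshdp01 ~]# gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3 rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4
[root@rhshdp01 ~]# echo $?   # a non-zero value here would confirm the silent failure

Volume info after the failed attempt shows the bricks were not added: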

[root@rhshdp01 ~]# gluster volume info cinder-vol
 
Volume Name: cinder-vol
Type: Replicate
Volume ID: 843455d2-3665-4cf5-8390-923f8edac27f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/rhs/brick2/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/rhs/brick2/s2
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165

5) Brought up rhshdp02.lab.eng.blr.redhat.com. Tried add-brick again after clearing the extended attributes on the bricks to be added. All nodes are up. Add-brick succeeded.
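
The extended attributes were cleared because the earlier attempt had left the new bricks marked as already belonging to a volume (the glusterd log below reports "/rhs/brick2/s3 is already part of a volume" staging errors). The exact cleanup commands were not recorded; a minimal sketch, assuming the standard GlusterFS brick markers and the brick paths used here, is:

[root@rhshdp03 ~]# setfattr -x trusted.glusterfs.volume-id /rhs/brick2/s3   # remove the volume-id marker
[root@rhshdp03 ~]# setfattr -x trusted.gfid /rhs/brick2/s3                  # remove the root gfid marker
[root@rhshdp03 ~]# rm -rf /rhs/brick2/s3/.glusterfs                         # drop leftover internal metadata

and likewise for /rhs/brick2/s4 on rhshdp04. With the bricks clean, add-brick was retried: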

[root@rhshdp01 ~]# gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3 rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4 
volume add-brick: success
[root@rhshdp01 ~]# gluster volume info cinder-vol
 
Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 843455d2-3665-4cf5-8390-923f8edac27f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/rhs/brick2/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/rhs/brick2/s2
Brick3: rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3
Brick4: rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165

6) Brought down rhshdp02.lab.eng.blr.redhat.com again. Started rebalance. 

[root@rhshdp01 ~]# gluster volume rebalance cinder-vol start
volume rebalance: cinder-vol: success: Starting rebalance on volume cinder-vol has been successful.
ID: 6a803a22-07ca-48f2-a537-16dc2f9c05ad
[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             2             0             0    in progress            12.00
         rhshdp04.lab.eng.blr.redhat.com                0        0Bytes            10             0             0      completed             0.00
         rhshdp03.lab.eng.blr.redhat.com                0        0Bytes             0             0             0    in progress             0.00
volume rebalance: cinder-vol: success: 

7) Brought up rhshdp02.lab.eng.blr.redhat.com. Rebalance status successful.
 
[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             2             0             0    in progress            71.00
         rhshdp04.lab.eng.blr.redhat.com                0        0Bytes            10             0             0      completed             0.00
         rhshdp03.lab.eng.blr.redhat.com                0        0Bytes            10             0             0      completed             0.00
         rhshdp02.lab.eng.blr.redhat.com                0        0Bytes             0             0             0    not started             0.00
volume rebalance: cinder-vol: success: 

8) Brought down rhshdp02.lab.eng.blr.redhat.com and the node hosting the destination brick, rhshdp03.lab.eng.blr.redhat.com. The rebalance status command fails without any error message.

[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status

9) Brought both rhshdp02.lab.eng.blr.redhat.com and rhshdp03.lab.eng.blr.redhat.com back up. The rebalance status command still fails and exits without any error message.

[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status
[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status

10) Restarted glusterd and tried the rebalance status command again; this time it succeeded.

# service glusterd restart
Stopping glusterd:                                         [  OK  ]
Starting glusterd:                                         [  OK  ]
[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0             0    not started             0.00
         rhshdp04.lab.eng.blr.redhat.com                0        0Bytes            10             0             0      completed             0.00
         rhshdp03.lab.eng.blr.redhat.com                0        0Bytes             0             0             0    not started             0.00
         rhshdp02.lab.eng.blr.redhat.com                0        0Bytes             0             0             0    not started             0.00
volume rebalance: cinder-vol: success: 

11) Brought down rhshdp02.lab.eng.blr.redhat.com and the rebalance status command fails again.

[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status
[root@rhshdp01 ~]#

Actual results:

The add-brick and rebalance status commands fail without an error message.

Expected results:

Rebalance status should display the status of the bricks which are up. In case of failures, proper error messages should be displayed.

Additional info:

From glusterd log:

[2013-09-11 06:51:24.328394] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2013-09-11 06:51:24.328524] E [glusterd-utils.c:149:glusterd_lock] 0-management: Unable to get lock for uuid: b5550547-9704-490f-a891-b102766b7da5, lock held by: b5550547-9704-490f-a891-b102766b7da5
[2013-09-11 06:51:24.328551] E [glusterd-syncop.c:1195:gd_sync_task_begin] 0-management: Unable to acquire lock
[2013-09-11 06:51:27.066882] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.117:24007 failed (Connection reset by peer)
[2013-09-11 06:51:27.067426] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x32d1a0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x32d1a0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x32d1a0db4e]))) 0-management: forced unwinding frame type(glusterd mgmt) op(--(1)) called at 2013-09-11 06:47:45.319227 (xid=0x13x)
[2013-09-11 06:51:27.067508] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x32d1a0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x32d1a0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x32d1a0db4e]))) 0-management: forced unwinding frame type(Peer mgmt) op(--(4)) called at 2013-09-11 06:51:03.642904 (xid=0x14x)
[2013-09-11 06:51:27.067525] E [glusterd-rpc-ops.c:539:__glusterd_friend_update_cbk] 0-management: RPC Error
[2013-09-11 06:51:27.068255] I [glusterd-rpc-ops.c:554:__glusterd_friend_update_cbk] 0-management: Received RJT from uuid: 00000000-0000-0000-0000-000000000000
[2013-09-11 06:51:27.075611] I [socket.c:3034:socket_submit_request] 0-management: not connected (priv->connected = 0)
[2013-09-11 06:51:27.075660] W [rpc-clnt.c:1500:rpc_clnt_submit] 0-management: failed to submit rpc-request (XID: 0x15x Program: glusterd mgmt, ProgVers: 2, Proc: 3) to rpc-transport (management)
[2013-09-11 06:51:27.076883] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Unlocking failed on 9178fbf0-11a0-451c-90c5-853f0cca3e02. Please check log file for details.
[2013-09-11 06:51:27.077733] E [glusterd-syncop.c:1068:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2013-09-11 06:51:27.077848] I [socket.c:3108:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2013-09-11 06:51:27.077884] E [rpcsvc.c:1111:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1x, Program: GlusterD svc cli, ProgVers: 2, Proc: 13) to rpc-transport (socket.management)
[2013-09-11 06:51:27.077914] E [glusterd-utils.c:380:glusterd_submit_reply] 0-: Reply submission failed
[2013-09-11 06:51:27.077938] E [glusterd-utils.c:182:glusterd_unlock] 0-management: Cluster lock not held!
[2013-09-11 06:51:27.091029] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 9178fbf0-11a0-451c-90c5-853f0cca3e02, host: rhshdp02.lab.eng.blr.redhat.com, port: 0
[2013-09-11 06:51:27.095193] I [glusterd-handler.c:2190:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 9178fbf0-11a0-451c-90c5-853f0cca3e02
[2013-09-11 06:51:27.095260] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: 90cddee3-d79d-46eb-94c3-d79a4ed5dd55, hostname:rhshdp04.lab.eng.blr.redhat.com
[2013-09-11 06:51:27.095284] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: 6d63b2c2-6196-4e37-9968-99435cbcc85b, hostname:rhshdp03.lab.eng.blr.redhat.com
[2013-09-11 06:51:27.095303] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: b5550547-9704-490f-a891-b102766b7da5, hostname:10.70.36.116
[2013-09-11 06:51:27.095317] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2013-09-11 06:52:09.467959] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2013-09-11 06:52:09.475844] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Staging failed on rhshdp04.lab.eng.blr.redhat.com. Error: /rhs/brick2/s4 is already part of a volume
[2013-09-11 06:52:09.477549] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Staging failed on rhshdp03.lab.eng.blr.redhat.com. Error: /rhs/brick2/s3 is already part of a volume
[2013-09-11 06:53:16.508516] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2013-09-11 06:53:16.516415] I [glusterd-brick-ops.c:1011:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2013-09-11 06:53:17.585680] E [glusterd-utils.c:3803:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/fe157fc96c155dde83e914d047d996c4.socket error: Permission denied
[2013-09-11 06:53:18.596109] E [glusterd-utils.c:3803:glusterd_nodesvc_unlink_socket_file] 0-management: Failed to remove /var/run/3737a9a4523c64f7f1a0cde33e90348e.socket error: No such file or directory
[2013-09-11 06:53:18.604247] I [rpc-clnt.c:974:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-09-11 06:53:18.604403] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-09-11 06:53:18.604428] I [socket.c:3502:socket_init] 0-management: using system polling thread
[2013-09-11 06:53:18.605701] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-09-11 06:53:18.605872] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-09-11 06:53:18.606100] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-09-11 06:53:18.606178] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-09-11 06:53:18.606315] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-09-11 06:53:18.606548] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-09-11 06:53:42.045742] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-09-11 06:53:58.775568] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.117:24007 failed (Connection timed out)
[2013-09-11 06:54:02.509973] I [glusterd-utils.c:8073:glusterd_generate_and_set_task_id] 0-management: Generated task-id 6a803a22-07ca-48f2-a537-16dc2f9c05ad for key rebalance-id
[2013-09-11 06:54:07.524792] I [rpc-clnt.c:974:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-09-11 06:54:07.524938] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-09-11 06:54:07.524960] I [socket.c:3502:socket_init] 0-management: using system polling thread
[2013-09-11 06:54:07.525342] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.36.117:24007 failed (No route to host)
[2013-09-11 06:54:12.540025] E [glusterd-utils.c:7363:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-09-11 06:54:12.545936] E [glusterd-utils.c:7363:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2013-09-11 06:55:16.723707] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-09-11 06:55:16.749710] I [glusterd-handler.c:2028:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 9178fbf0-11a0-451c-90c5-853f0cca3e02
[2013-09-11 06:55:18.327970] I [glusterd-handler.c:2190:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 6d63b2c2-6196-4e37-9968-99435cbcc85b
[2013-09-11 06:55:18.328082] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: 9178fbf0-11a0-451c-90c5-853f0cca3e02, hostname:rhshdp02.lab.eng.blr.redhat.com
[2013-09-11 06:55:18.328109] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: 90cddee3-d79d-46eb-94c3-d79a4ed5dd55, hostname:rhshdp04.lab.eng.blr.redhat.com

Comment 2 Anush Shetty 2013-09-11 11:25:55 UTC
sosreports here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1006824