Bug 1304976

Summary: GlusterFS: sync lost between two boards
Product: [Community] GlusterFS
Reporter: xinsong <songxin_1980>
Component: glusterd
Assignee: Gaurav Kumar Garg <ggarg>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: medium
Version: 3.7.6
CC: amukherj, bugs, hgowtham, sankarshan, smohan, songxin_1980
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: powerpc
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-31 01:41:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description xinsong 2016-02-05 07:39:48 UTC
Description of problem:
GlusterFS: sync lost between two boards

Version-Release number of selected component (if applicable):
Glusterfs 3.7.6

How reproducible:


I use GlusterFS (version 3.7.6) in replicate mode to keep data in sync between two boards in a node.

When one of the boards is locked, replaced with a new board, and restarted, we see that sync is lost between the two boards. The mounted GlusterFS volume is not present on the replaced board.

The output of some gluster commands on the replaced board is shown below.

002500> gluster volume status c_glusterfs 
Status of volume: c_glusterfs 
Gluster process TCP Port RDMA Port Online Pid 
------------------------------------------------------------------------------ 
Brick 192.32.0.48:/opt/lvmdir/c2/brick 49240 0 Y 1293 

Task Status of Volume c_glusterfs 
------------------------------------------------------------------------------ 
There are no active volume tasks 



002500> gluster volume info 

Volume Name: c_glusterfs 
Type: Distribute 
Volume ID: 3625f7ff-2b92-4ac4-9967-7abf966eceef 
Status: Started 
Number of Bricks: 1 
Transport-type: tcp 
Bricks: 
Brick1: 192.32.0.48:/opt/lvmdir/c2/brick 
Options Reconfigured: 
performance.readdir-ahead: on 
network.ping-timeout: 4 
nfs.disable: on 



In the status output we don't see the brick process of the replaced board. The brick 192.32.0.48:/opt/lvmdir/c2/brick belongs to the other board, which was not replaced.

The output of the same commands on the other board is:


# gluster volume info 

Volume Name: c_glusterfs 
Type: Distribute 
Volume ID: 3625f7ff-2b92-4ac4-9967-7abf966eceef 
Status: Started 
Number of Bricks: 1 
Transport-type: tcp 
Bricks: 
Brick1: 192.32.0.48:/opt/lvmdir/c2/brick 
Options Reconfigured: 
performance.readdir-ahead: on 
network.ping-timeout: 4 
nfs.disable: on 

# gluster peer status 
Number of Peers: 2 

Hostname: 192.32.1.144 
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e 
State: Peer in Cluster (Connected) 

Hostname: 192.32.1.144 
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e 
State: Peer in Cluster (Connected) 

gluster peer status shows the same host twice, and the corresponding brick is missing from the volume info. Also, the command gluster volume status c_glusterfs hangs.
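
The duplicate entry can also be checked on disk in glusterd's peer store (a diagnostic sketch only, assuming the default glusterd working directory):

# Each file under the peer store is named after a peer UUID and records its
# hostname(s) and state. Listing and printing these files shows whether the
# duplicated host reported by "gluster peer status" is also present on disk.
ls -l /var/lib/glusterd/peers/
cat /var/lib/glusterd/peers/*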

In the gluster logs under /var/log/glusterfs we observed some errors.

cmd_history.log : 
volume add-brick c_glusterfs replica 2 192.32.1.144:/opt/lvmdir/c2/brick force : FAILED : Locking failed on 192.32.1.144. Please check log file for details. 
cli.log: 
[2016-01-30 04:32:40.179381] I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6 
[2016-01-30 04:32:40.191715] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 
[2016-01-30 04:32:40.193246] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now 
[2016-01-30 04:32:40.196551] I [cli-rpc-ops.c:2465:gf_cli_add_brick_cbk] 0-cli: Received resp to add brick 
[2016-01-30 04:32:40.196684] I [input.c:36:cli_batch] 0-: Exiting with: -1 


Can anyone help me analyze the reason?

Comment 2 Pranith Kumar K 2016-02-05 07:44:40 UTC
I just replied on the bug you raised and also on the mail you sent to gluster-users. It seems the volume is a Distribute volume, which means it doesn't sync. I also don't understand how you ended up in a situation where two of the peers have the same UUID and hostname. What are the steps you took to get into this situation? Which are the two bricks you want to be in sync? Maybe we can help once you give this information.

The gluster sync operation only syncs the volume information. Are you on IRC or somewhere else where this can be discussed further?
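
For comparison, a two-way replicated volume, which does keep data in sync between the bricks, would be set up roughly like this (a sketch only, reusing the addresses and brick paths from this report; adjust them for your setup):

# Create and start a replica 2 volume so that writes on one board are
# mirrored to the other ("force" mirrors the add-brick command used in
# this report). Afterwards, "gluster volume info" should show Type: Replicate.
gluster volume create c_glusterfs replica 2 \
    192.32.0.48:/opt/lvmdir/c2/brick \
    192.32.1.144:/opt/lvmdir/c2/brick force
gluster volume start c_glusterfs
gluster volume info c_glusterfs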

Pranith

Comment 3 xinsong 2016-02-16 01:48:52 UTC
Hi,
Thanks for your reply.
The gluster volume is shown as a Distribute volume because the add-brick command failed.

In this test case we manually replaced the board and ended up with two identical entries in the peer status.


Some of the error logs printed in cmd_history.log are shown below.

[2016-01-30 04:32:51.692254]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : FAILED : Another transaction is in progress for c_glusterfs. Please try again after sometime.
[2016-01-30 04:39:16.976377]  : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : FAILED : Locking failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:12:03.345309]  : volume status : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:12:44.349098]  : volume status all : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:30:04.436199]  : volume status all : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:30:04.437085]  : volume status : FAILED : Staging failed on 10.32.1.144. Please check log file for details

Some of the error logs printed in cli.log are shown below.

[2016-01-30 04:31:13.376653] I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
[2016-01-30 04:31:13.389877] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-30 04:31:13.391528] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2016-01-30 04:31:13.392530] I [cli-rpc-ops.c:2465:gf_cli_add_brick_cbk] 0-cli: Received resp to add brick
[2016-01-30 04:31:13.392687] I [input.c:36:cli_batch] 0-: Exiting with: -1
[2016-01-30 04:31:14.422344] I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
[2016-01-30 04:31:14.434340] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-30 04:31:14.434792] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2016-01-30 04:31:14.437131] I [cli-rpc-ops.c:2465:gf_cli_add_brick_cbk] 0-cli: Received resp to add brick
[2016-01-30 04:31:14.437266] I [input.c:36:cli_batch] 0-: Exiting with: -1



Thanks,
Xin

Comment 4 xinsong 2016-02-19 13:26:06 UTC
Hi,
From the logs on board 10.32.0.48 I found the following.

[2016-01-30 04:29:07.271245] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.32.1.144. Please check log file for details. 
[2016-01-30 04:29:07.271358] E [MSGID: 106151] [glusterd-syncop.c:1861:gd_sync_task_begin] 0-management: Locking Peers Failed. 
... 

At this point 10.32.0.48 had locked the volume c_glusterfs successfully, but it failed when locking 10.32.1.144. I think this may be because 10.32.0.48 tries to lock 10.32.1.144 twice, since the same host appears twice in the peer list, as shown below.

Number of Peers: 2 

Hostname: 10.32.1.144 
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e 
State: Peer in Cluster (Connected) 

Hostname: 10.32.1.144 
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e 
State: Peer in Cluster (Connected) 


And 10.32.0.48 did not release the c_glusterfs volume lock until 04:39:16, as shown below.
... 
[2016-01-30 04:39:16.976206] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.32.1.144. Please check log file for details. 
[2016-01-30 04:39:16.976292] E [MSGID: 106152] [glusterd-syncop.c:1562:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s) 
... 

From 04:29:07 to 04:39:16 the add-brick command kept trying to acquire the c_glusterfs volume lock, but it failed because the lock had not been released.

But I don't understand why 10.32.0.48 has two peers that are both 10.32.1.144.

Can anyone help me answer my question?

Comment 5 Atin Mukherjee 2016-02-22 05:00:33 UTC
All the error logs basically boil down to a single problem: you have duplicate peer entries. As Pranith pointed out, we are not sure how you ended up in this situation. Do you recollect what steps you performed to form the cluster and the volume operations that followed? Attaching the glusterd and cmd_history log files from all the nodes would certainly help in analysing the root cause.

Comment 6 xinsong 2016-03-31 01:41:53 UTC
Hi,
This issue has been solved. It was caused by replacing a bad node with a new node without copying glusterd.info (which holds the node's UUID) from the bad node to the new node.
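
For anyone hitting the same problem, the step we had missed when replacing the board looks roughly like this (a sketch; it assumes the default glusterd working directory and that the old board's glusterd.info was saved somewhere, shown here as a placeholder path):

# On the replacement board, before glusterd is started for the first time,
# restore the old board's identity so it rejoins the cluster with the same
# peer UUID instead of generating a new one.
# ("/backup/glusterd.info" is a placeholder for wherever the old file was kept.)
cp /backup/glusterd.info /var/lib/glusterd/glusterd.info
/etc/init.d/glusterd start        # or: systemctl start glusterd
gluster peer status               # each peer should now be listed only once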

Thanks,
Xin