Bug 1304976
Summary: | Glusterfs :sync lost between two boards | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | xinsong <songxin_1980> |
Component: | glusterd | Assignee: | Gaurav Kumar Garg <ggarg> |
Status: | CLOSED NOTABUG | QA Contact: | |
Severity: | urgent | Docs Contact: | |
Priority: | medium | ||
Version: | 3.7.6 | CC: | amukherj, bugs, hgowtham, sankarshan, smohan, songxin_1980 |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | powerpc | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-03-31 01:41:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
xinsong
2016-02-05 07:39:48 UTC
I just replied on the bug you raised and also on the mail you sent to gluster-users. It seems the volume is a Distribute volume, which means it doesn't sync. I also don't understand how you ended up in a situation where two of the peers have the same uuid and hostname. What steps did you take to get into this situation? Which two bricks do you want to be in sync? Maybe we can help once you give this information. gluster sync only syncs the volume information. Are you on IRC or somewhere else where this can be discussed further?

Pranith

Hi,

Thanks for your reply. The gluster volume is shown as a distribute volume because the add-brick failed. In this test case we manually replaced the board, and ended up with two identical entries in the peer status.

Some error logs printed in cmd_history.log are shown below:

[2016-01-30 04:32:51.692254] : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : FAILED : Another transaction is in progress for c_glusterfs. Please try again after sometime.
[2016-01-30 04:39:16.976377] : volume add-brick c_glusterfs replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : FAILED : Locking failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:12:03.345309] : volume status : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:12:44.349098] : volume status all : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:30:04.436199] : volume status all : FAILED : Staging failed on 10.32.1.144. Please check log file for details.
[2016-02-01 12:30:04.437085] : volume status : FAILED : Staging failed on 10.32.1.144. Please check log file for details.

Some error logs printed in cli.log are shown below:
[2016-01-30 04:31:13.376653] I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
[2016-01-30 04:31:13.389877] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-30 04:31:13.391528] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2016-01-30 04:31:13.392530] I [cli-rpc-ops.c:2465:gf_cli_add_brick_cbk] 0-cli: Received resp to add brick
[2016-01-30 04:31:13.392687] I [input.c:36:cli_batch] 0-: Exiting with: -1
[2016-01-30 04:31:14.422344] I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
[2016-01-30 04:31:14.434340] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-01-30 04:31:14.434792] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2016-01-30 04:31:14.437131] I [cli-rpc-ops.c:2465:gf_cli_add_brick_cbk] 0-cli: Received resp to add brick
[2016-01-30 04:31:14.437266] I [input.c:36:cli_batch] 0-: Exiting with: -1

Thanks,
Xin

Hi,

From the log on board 10.32.0.48 I found the following:

[2016-01-30 04:29:07.271245] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.32.1.144. Please check log file for details.
[2016-01-30 04:29:07.271358] E [MSGID: 106151] [glusterd-syncop.c:1861:gd_sync_task_begin] 0-management: Locking Peers Failed.
...

At this point 10.32.0.48 had locked the volume c_glusterfs successfully, but failed when locking 10.32.1.144. I think this may be because 10.32.0.48 tried to lock 10.32.1.144 twice, since the same host appears twice in the peer list:

Number of Peers: 2

Hostname: 10.32.1.144
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e
State: Peer in Cluster (Connected)

Hostname: 10.32.1.144
Uuid: bbe2a458-ad3d-406d-b233-b6027c12174e
State: Peer in Cluster (Connected)

And 10.32.0.48 did not release the volume c_glusterfs lock until 04:39:16, as below:

...
[2016-01-30 04:39:16.976206] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.32.1.144. Please check log file for details.
[2016-01-30 04:39:16.976292] E [MSGID: 106152] [glusterd-syncop.c:1562:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
...

From 04:29:07 to 04:39:16 the "add-brick" command kept trying to take the volume c_glusterfs lock, but it failed because the lock had not been released. What I don't understand is why 10.32.0.48 has two peers that are both 10.32.1.144. Can anyone help me answer this question?

All the error logs basically boil down to a single problem: you have duplicate peer entries. As Pranith pointed out, we are not sure how you ended up in this situation. Do you recall what steps you performed in forming the cluster and then in the volume operations? Attaching the glusterd and cmd_history log files from all the nodes would certainly help in analysing the root cause.

Hi,

This issue has been solved. It was caused by replacing a bad node with a new node without copying glusterd.info from the bad node to the new node.

Thanks,
Xin
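[Editorial note] The resolution Xin describes, preserving the old node's glusterd identity file when swapping in replacement hardware, can be sketched roughly as below. This is an illustrative sketch only, not an official replace-node procedure: glusterd stores the node UUID in /var/lib/glusterd/glusterd.info, and to keep the example runnable anywhere, scratch paths and a sample file stand in for the real node's paths and contents.

```shell
# Illustrative sketch: glusterd identifies a node by the UUID kept in
# /var/lib/glusterd/glusterd.info. If a replacement node starts glusterd
# without that file, a fresh UUID is generated, and the rest of the
# cluster can become confused about the node's identity (as in this bug).
# Scratch paths are used here so the example runs anywhere; on a real
# node GLUSTERD_DIR would be /var/lib/glusterd and BACKUP a saved copy
# of the failed node's glusterd.info.

GLUSTERD_DIR=$(mktemp -d)   # stand-in for /var/lib/glusterd
BACKUP=$(mktemp)            # stand-in for the saved glusterd.info

# Sample glusterd.info contents (UUID taken from the peer status above;
# the operating-version value is a hypothetical placeholder).
printf 'UUID=bbe2a458-ad3d-406d-b233-b6027c12174e\noperating-version=30706\n' > "$BACKUP"

# Restore the identity file BEFORE starting glusterd for the first time
# on the new node, with restrictive permissions.
install -m 0600 "$BACKUP" "$GLUSTERD_DIR/glusterd.info"

# Verify the UUID survived the copy.
grep '^UUID=' "$GLUSTERD_DIR/glusterd.info"
```

The key point, per Xin's closing comment, is ordering: the file must be in place before glusterd ever runs on the replacement node, since the daemon generates a new UUID on first start if the file is absent.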