Bug 1781003 - Gluster: Unable to add bricks to volume/ is already part of a volume/ Transport End point is not connected.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-09 04:44 UTC by akshsy
Modified: 2020-01-06 05:32 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-06 05:32:58 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
I have attached logs from both node one and node two. (58.18 KB, application/zip) 2019-12-09 04:44 UTC, akshsy
Gluster logs (110.53 KB, application/zip) 2019-12-23 07:36 UTC, akshsy

Description akshsy 2019-12-09 04:44:06 UTC
Created attachment 1643161 [details]
I have attached logs from both node one and node two.

Description of problem:

Unable to add bricks, or an added brick is not reflected on the other server.

I created a volume (say clusterfs) on a server (say server1). When I try to add a brick from server2 to the volume (clusterfs), I get an error saying the brick is already part of the volume.
So I ran "gluster volume info" on server2 and I can see two bricks in the volume, but when I run "gluster volume info" on server1 I can see only one brick (the one on server1).

Also, I had mounted the clusterfs volume on the server onto a directory, but now it is throwing the error "Transport endpoint is not connected".

Version-Release number of selected component (if applicable): CentOS, gluster 5.10


How reproducible: consistently reproducible.


Steps to Reproduce:
1. Start a fresh node. Create a volume.
2. Start three new nodes and add their bricks to the volume.
3. Hard-shutdown the first node and start it back up.
4. The volume is stable and the mount point exists.
5. Power off all the nodes, start the first node, detach the old peers, and remove all of the previous peers' bricks.
6. Start the second node, remove the previous bricks, and detach the old peers.
7. Perform a peer probe from node 1 to node 2.
8. Now try to add the brick from node 2; you will hit this issue (a rough command sketch follows these steps).
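For illustration only, the steps above correspond roughly to a command sequence like the one below; the hostnames node1/node2 and the <old-peer>/<new-count> values are placeholders, while the volume name and brick path are the ones that appear in the attached logs:

# on node1: create and start the volume (clusterfs and the brick path are from the logs; node names are placeholders)
gluster volume create clusterfs node1:/mnt/glusterfs/bricks/clusterfs force
gluster volume start clusterfs
# after the power cycles, still on node1: drop the old peers and their bricks
gluster volume remove-brick clusterfs replica <new-count> <old-peer>:/mnt/glusterfs/bricks/clusterfs force
gluster peer detach <old-peer> force
# re-form the two-node cluster and retry the add from node2
gluster peer probe node2
gluster volume add-brick clusterfs replica 2 node2:/mnt/glusterfs/bricks/clusterfs force   # fails: "... is already part of a volume"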

Actual results: Unable to add bricks, or a brick added on the second node is not reflected on the first node.


Expected results: I should be able to add the brick from the second node.


Additional info:

I have attached logs from both node one and node two.

Comment 1 Sahina Bose 2019-12-09 08:13:25 UTC
There seem to be some missing steps where you have not probed the nodes to form a cluster?
"16349c14-b378-4b09-bbb6-1c3407c171ba doesn't belong to the cluster. Ignoring request." - from the logs.

Moving to the community project for triage.

Comment 2 akshsy 2019-12-09 08:19:11 UTC
I have probed the server to form the cluster. You can see it in the node 1 cmd_history logs:

[2019-12-06 10:34:31.904048]  : peer probe 192.168.2.171 : SUCCESS    
[2019-12-06 10:34:54.566524]  : peer probe 192.168.2.171 : SUCCESS : Host 192.168.2.171 port 24007 already in peer list

Comment 3 Sanju 2019-12-11 05:04:29 UTC
I see the below errors in the glusterd logs:

[2019-12-06 10:34:47.101646] E [MSGID: 106053] [glusterd-utils.c:13943:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2019-12-06 10:34:47.129558] E [MSGID: 106073] [glusterd-brick-ops.c:2595:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
[2019-12-06 10:34:47.129681] E [MSGID: 106122] [glusterd-mgmt.c:299:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.

@Karthik, I couldn't find when the lsetxattr system call returns a "Transport endpoint is not connected" error. Please let us know.

Thanks,
Sanju

Comment 4 Karthik U S 2019-12-12 05:23:30 UTC
Hi Akshay,

According to the glusterd log, the add-brick is failing because the brick is already part of a volume.

[2019-12-06 10:34:47.035961] E [MSGID: 106451] [glusterd-utils.c:7705:glusterd_is_path_in_use] 0-management: /mnt/glusterfs/bricks/clusterfs is already part of a volume [File exists]

You will get this error if the brick you are trying to add already has either the "trusted.gfid" or the "trusted.glusterfs.volume-id" xattr present on its root. It looks like the brick was not formatted properly before being added to the volume again. In step 6, can you try formatting the brick, clearing any gluster-specific xattrs on it (use "getfattr -d -m . -e hex <brick-path>" to list the xattrs), removing the ".glusterfs" directory inside it, and then try adding it again? A rough sketch of that cleanup is below.
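As a minimal sketch, assuming the brick path from your logs (/mnt/glusterfs/bricks/clusterfs) and assuming you are clearing the brick in place rather than reformatting it, the cleanup would look something like:

getfattr -d -m . -e hex /mnt/glusterfs/bricks/clusterfs                    # list all xattrs on the brick root (path assumed from the logs)
setfattr -x trusted.glusterfs.volume-id /mnt/glusterfs/bricks/clusterfs    # remove the volume-id xattr, if present
setfattr -x trusted.gfid /mnt/glusterfs/bricks/clusterfs                   # remove the gfid xattr, if present
rm -rf /mnt/glusterfs/bricks/clusterfs/.glusterfs                          # remove the internal .glusterfs directory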

Regards,
Karthik

Comment 5 akshsy 2019-12-12 05:34:44 UTC
@karthik,

The brick was never part of the volume, as it was a new machine in the cluster. Here is what is happening.

When I try to add the brick from node 2, I get an error that the brick already exists.

So I ran "gluster volume info" on node 2, and I can see that both bricks (node 1 and node 2) are present in the volume.
But when I go to node 1 and run the same command, I don't see the node 2 brick in the volume; the only brick I can see is the one from node 1.

Our expectation is that if I run "gluster volume info" on both nodes, I should see both bricks in the volume on both nodes. But it seems the sync is not happening.

Please let me know if I am doing something wrong.

Thanks
akshay

Comment 6 Karthik U S 2019-12-12 05:52:20 UTC
According to the cmd_history.log, you are trying to add the same brick that was removed from the volume before:
[2019-12-06 10:31:49.899700]  : volume remove-brick clusterfs replica 4 192.168.2.171:/mnt/glusterfs/bricks/clusterfs force : SUCCESS

Then you detached and re-added the same node to the peer list:
[2019-12-06 10:31:55.783666]  : peer detach 192.168.2.171 force : SUCCESS
[2019-12-06 10:34:31.904048]  : peer probe 192.168.2.171 : SUCCESS

After that, you are trying to add the same brick on the same node back to the volume:
[2019-12-06 10:34:47.131632]  : volume add-brick clusterfs replica 2 192.168.2.171:/mnt/glusterfs/bricks/clusterfs force : FAILED : /mnt/glusterfs/bricks/clusterfs is already part of a volume

From these logs, looking at the IP, I doubt this is a new machine, and from the brick path it is the same as the one that was removed earlier. Please correct me if I am missing something here. Please get the xattrs on the brick before trying to add it, to confirm whether my hypothesis is right or not.
As far as the difference in the volume info is concerned, there seems to be a problem with syncing. I think restarting glusterd on both machines, one after the other, should solve the issue (a sketch is below). @Sanju, shouldn't glusterd get synced in step 7 after the peer probe?
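To spell out the workaround, it is just the following, run on one node at a time (assuming a systemd-based install; adjust for your init system):

systemctl restart glusterd    # on node 1; wait for "gluster peer status" to show the peer connected again
systemctl restart glusterd    # then the same command on node 2
gluster volume info           # the brick list should now match on both nodes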

Comment 7 akshsy 2019-12-17 02:47:38 UTC
Hi Karthik,

By "new machine" I mean a new node in the cluster which has some bricks from a previous, old cluster. I have tried the xattr steps but it is not helping. As I told you, the bricks are not in sync.

Meaning:

Node 1: It has only the node 1 brick.

Node 2: It has both the node 1 and node 2 bricks.

I am not sure why these are not in sync. But I restarted glusterd on both nodes, and then:

Node 1: It has only the node 1 brick.
Node 2: It has only the node 1 brick.

Now I have tried to add the node 2 brick again, and it is successful.

Meaning:

Node 1: It has both the node 1 and node 2 bricks.

Node 2: It has both the node 1 and node 2 bricks.

Is there any reason why the bricks are not in sync? We don't want to restart glusterd, as it is a heavy operation.

Is there any other solution?

Thanks
Akshay

Comment 8 Karthik U S 2019-12-19 12:05:02 UTC
@Akshay, Thanks for trying the workarounds suggested and getting back with the results.

@Sanju, could you let us know whether there is any problem in the glusterd graph sync part? After restarting glusterd, the issue went away in their setup. Looking at the description, it should be easily reproducible as well.

Comment 9 akshsy 2019-12-20 08:48:54 UTC
@Karthik,

Here are simpler steps:

1. Start a node (node 1).
2. Create and start a volume, say vol1, on node 1.
3. Start node 2.
4. Perform a peer probe between node 1 and node 2.
5. Add the brick of node 2 to vol1 (replica 2).
6. Power off node 1.
7. Start node 3.
8. Perform a peer probe from node 2 to node 3.
9. Try to add the brick of node 3 to vol1 (replica 3); see the rough command sequence below.
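As a rough sketch of those steps in command form (the hostnames node1/node2/node3 and the brick path /bricks/vol1 are placeholders; vol1 is the volume name from the steps above):

gluster volume create vol1 node1:/bricks/vol1 force                 # step 2, on node 1
gluster volume start vol1                                           # step 2, on node 1
gluster peer probe node2                                            # step 4
gluster volume add-brick vol1 replica 2 node2:/bricks/vol1 force    # step 5
# steps 6-7: power off node 1, bring up node 3, then on node 2:
gluster peer probe node3                                            # step 8
gluster volume add-brick vol1 replica 3 node3:/bricks/vol1 force    # step 9, fails with a commit error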

Comment 10 Sanju 2019-12-20 13:17:24 UTC
There should not be any out-of-sync situation with the above steps unless there are partial commits.

Akshay, have you seen any commit-failed messages with any of the operations?

Thanks,
Sanju

Comment 11 akshsy 2019-12-23 05:09:28 UTC
Yes, we do sometimes see a commit error while adding bricks:

[2019-12-23 05:04:16.766015]  : volume add-brick clusterfs replica 3 10.223.96.139:/new/new force : FAILED : Commit failed on localhost. Please check log file for details.

What does this signify? We are not sure why we are getting this error in the add-brick operation.

akshay

Comment 12 Sanju 2019-12-23 06:35:35 UTC
Commit failed means the operation failed in the commit phase. Every transaction goes through 4 phases: 1. locking, 2. staging, 3. commit, 4. unlock.
Locking is where the required locks are taken.
Staging is where the validations are done.
Commit is when the operation is actually done and written to the store.
Unlock releases the locks.

You will be seeing some errors in glusterd.log regarding the commit failure. Please check the log.
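For example, the relevant errors from comment #3 can be pulled out with something like this (assuming the default log location; adjust the path if your installation differs):

grep -E "MSGID: 106053|MSGID: 106073|MSGID: 106122" /var/log/glusterfs/glusterd.log    # add-brick / commit failure messages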

Thanks,
Sanju

Comment 13 akshsy 2019-12-23 07:36:59 UTC
Created attachment 1647289 [details]
Gluster logs

Comment 14 akshsy 2019-12-23 07:37:54 UTC
Attached Gluster logs

Comment 15 Sanju 2019-12-23 07:46:20 UTC
I see the same errors as in https://bugzilla.redhat.com/show_bug.cgi?id=1781003#c3. @Karthik, please help him with "Transport endpoint is not connected" error.

Comment 16 Karthik U S 2019-12-23 11:40:09 UTC
Hi Akshay,

- I guess the logs which you have attached are for the steps given in comment #9.
- If yes, then I guess you are hitting the bug https://bugzilla.redhat.com/show_bug.cgi?id=1572599
- To verify that, can you mount the volume after step 2 or step 5 (a sample mount command is given below), then continue with the rest of the steps and let us know whether you still hit the same problem?
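For instance, a fuse mount from any reachable node would look like this (the mount point /mnt/vol1 and hostname node1 are placeholders):

mkdir -p /mnt/vol1
mount -t glusterfs node1:/vol1 /mnt/vol1    # mount vol1 via the glusterfs fuse client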

Regards,
Karthik

Comment 17 Karthik U S 2019-12-23 11:52:38 UTC
(In reply to Sanju from comment #15)
> I see the same errors as in
> https://bugzilla.redhat.com/show_bug.cgi?id=1781003#c3. @Karthik, please
> help him with "Transport endpoint is not connected" error.

If the logs provided are for the steps mentioned in comment #9, then it is a different problem.
In the initial case, there was a mount for the volume. In that case, if it was still failing even after removing the gluster-specific xattrs and formatting the brick, then it is because of the graph sync issue: the 1st node was not aware of the brick added by the 2nd node.
In this case (comment #9) there is no mount as per the steps mentioned, and the add-brick will fail because the lookup fails on a quorum number of bricks. To confirm this, let us wait for the results of comment #16.

Comment 18 akshsy 2020-01-06 05:31:14 UTC
I have upgraded to Gluster 7 and it seems to have resolved the issue.

Thanks for the response.

I will let you guys know if I encounter any further issues.

