Created attachment 1643161 [details]
Logs from node one and node two.

Description of problem:
Unable to add bricks; an added brick is not reflected on the other server.

I created a volume (say clusterfs) on server 1. When I try to add a brick from server 2 to the volume (clusterfs), I get an error saying the brick is already part of the volume. Running "gluster volume info" on server 2 shows two bricks in the volume, but the same command on server 1 shows only one brick (server 1's). I had also mounted the volume clusterfs on a directory on the server, and that mount now throws the error "Transport endpoint is not connected".

Version-Release number of selected component (if applicable):
CentOS, Gluster 5.10

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Start a fresh node and create a volume.
2. Start three new nodes and add their bricks to the volume.
3. Hard-shutdown the first node and start it back up.
4. At this point the volume is stable and the mount point exists.
5. Power off all the nodes, start the first node, detach the old peers, and remove all the previous peers' bricks.
6. Start the second node, remove the previous bricks, and detach the old peers.
7. Perform a peer probe from node 1 to node 2.
8. Try to add the brick from node 2. You will hit this issue.

Actual results:
Unable to add bricks; the brick added on the second node is not reflected on the first node.

Expected results:
Adding the brick from the second node should succeed.

Additional info:
Logs from both node one and node two are attached.
There seem to be some missing steps here: have you probed the nodes to form a cluster? From the logs:

"16349c14-b378-4b09-bbb6-1c3407c171ba doesn't belong to the cluster. Ignoring request."

Moving to the community project for triage.
I have probed the server to form the cluster. You can see it in the node 1 cmd_history log:

[2019-12-06 10:34:31.904048] : peer probe 192.168.2.171 : SUCCESS
[2019-12-06 10:34:54.566524] : peer probe 192.168.2.171 : SUCCESS : Host 192.168.2.171 port 24007 already in peer list
I see the below errors in the glusterd logs:

[2019-12-06 10:34:47.101646] E [MSGID: 106053] [glusterd-utils.c:13943:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2019-12-06 10:34:47.129558] E [MSGID: 106073] [glusterd-brick-ops.c:2595:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
[2019-12-06 10:34:47.129681] E [MSGID: 106122] [glusterd-mgmt.c:299:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.

@Karthik, I couldn't find a case where the lsetxattr system call returns a "Transport endpoint is not connected" error. Please let us know.

Thanks,
Sanju
Hi Akshay,

According to the glusterd log, the add-brick is failing because the brick is already part of a volume.

[2019-12-06 10:34:47.035961] E [MSGID: 106451] [glusterd-utils.c:7705:glusterd_is_path_in_use] 0-management: /mnt/glusterfs/bricks/clusterfs is already part of a volume [File exists]

You will get this error if the brick you are trying to add has either of the xattrs "trusted.gfid" or "trusted.glusterfs.volume-id" present on its root. It looks like the brick was not formatted properly before being added to the volume again.

Can you try formatting the brick in step 6, clear any gluster-specific xattrs on the brick (use "getfattr -d -m . -e hex <brick-path>" to list the xattrs), remove the ".glusterfs" directory inside it, and then try adding it again?

Regards,
Karthik
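For anyone scripting this check, the "path already in use" condition described above can be sketched roughly as follows. This is a minimal illustration, not glusterd's actual code: the helper name brick_in_use is invented for the example, and the xattr names are the two volume markers mentioned in this comment. Note that the trusted.* xattr namespace is only visible to root, so the check must run with sufficient privileges.

```python
import os

# Volume markers glusterd looks for when deciding a path is already
# part of a volume (names taken from the comment above).
GLUSTER_XATTRS = ("trusted.gfid", "trusted.glusterfs.volume-id")

def brick_in_use(path):
    """Return True if the brick root still carries gluster volume markers.

    Must run as root to see the trusted.* namespace at all; a non-root
    caller will simply not see the markers.
    """
    try:
        present = os.listxattr(path)
    except OSError:
        # Path missing or xattrs unsupported: treat as not in use.
        return False
    return any(name in present for name in GLUSTER_XATTRS)
```

If this returns True, the brick needs its xattrs cleared (and the .glusterfs directory removed) before add-brick will accept it.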
@Karthik, the brick was never part of the volume, as it was a new machine in the cluster. Here is what is happening:

When I try to add the brick from node 2, I get "brick already exists". So I ran "gluster volume info" on node 2, and I can see both bricks (node 1 and node 2) present in the volume. But when I go to node 1 and run the same command, I don't see node 2's brick in the volume; I can only see the brick from node 1.

Our expectation is that "gluster volume info" should show both bricks on both nodes, but it seems the sync is not happening. Please let me know if I am doing something wrong.

Thanks,
Akshay
According to the cmd_history.log, you are trying to add the same brick which was removed from the volume before:

[2019-12-06 10:31:49.899700] : volume remove-brick clusterfs replica 4 192.168.2.171:/mnt/glusterfs/bricks/clusterfs force : SUCCESS

Then you detached and re-added the same node to the peer list:

[2019-12-06 10:31:55.783666] : peer detach 192.168.2.171 force : SUCCESS
[2019-12-06 10:34:31.904048] : peer probe 192.168.2.171 : SUCCESS

After that you are trying to add the same brick on the same node back to the volume:

[2019-12-06 10:34:47.131632] : volume add-brick clusterfs replica 2 192.168.2.171:/mnt/glusterfs/bricks/clusterfs force : FAILED : /mnt/glusterfs/bricks/clusterfs is already part of a volume

From these logs, looking at the IP I doubt this is a new machine, and the brick path is the same as the one which was removed earlier. Please correct me if I am missing something here. Please get the xattrs on the brick before trying to add it, to confirm whether my hypothesis is right or not.

As far as the difference in the volume info is concerned, there seems to be a problem with syncing. I think restarting glusterd on both machines, one after the other, should solve this issue.

@Sanju, shouldn't glusterd get synced in step 7 after the peer probe?
Hi Karthik,

"New machine" refers to a new node in the cluster which has some bricks from a previous old cluster. I have tried clearing the xattrs, but it is not helping. As I told you, the bricks are not in sync:

Node 1: it has only node 1's brick.
Node 2: it has both node 1's and node 2's bricks.

I am not sure why these are not in sync. After I restarted glusterd on both nodes:

Node 1: it has only node 1's brick.
Node 2: it has only node 1's brick.

Then I tried to add the brick from node 2, and it was successful:

Node 1: it has both node 1's and node 2's bricks.
Node 2: it has both node 1's and node 2's bricks.

Any idea why the bricks are not in sync? We don't want to restart glusterd, as it is a heavy operation. Is there any other solution?

Thanks,
Akshay
@Akshay, thanks for trying the suggested workarounds and getting back with the results.

@Sanju, could you let us know whether there is any problem in the glusterd graph sync part? After restarting glusterd the issue went away in their setup. Looking at the description, it should be easily reproducible as well.
@Karthik, here are simpler steps:

1. Start a node (node 1).
2. Create and start a volume, say vol1, on node 1.
3. Start node 2.
4. Peer probe node 2 from node 1.
5. Add node 2's brick (replica 2) to vol1.
6. Power off node 1.
7. Start node 3.
8. Perform a peer probe from node 2 to node 3.
9. Try to add node 3's brick to vol1 (replica 3).
There should not be any out-of-sync situation with the above steps unless there are partial commits. Akshay, have you seen any "commit failed" messages with any of the operations?

Thanks,
Sanju
Yes, we do sometimes see a commit error while adding bricks:

[2019-12-23 05:04:16.766015] : volume add-brick clusterfs replica 3 10.223.96.139:/new/new force : FAILED : Commit failed on localhost. Please check log file for details.

What does this signify? We are not sure why we are getting this error on an add-brick operation.

Akshay
"Commit failed" means the operation failed in the commit phase. Every transaction goes through 4 phases:

1. locking
2. staging
3. commit
4. unlock

Locking is where the required locks are taken. Staging is where the validations are done. Commit is where the operation is truly done and written to the store. Unlock releases the locks.

You will see some errors in glusterd.log regarding the commit failure. Please check the log.

Thanks,
Sanju
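The four-phase flow described above can be sketched as a toy coordinator. This is an illustration of the pattern only, with invented class and method names; glusterd's real implementation is in C and runs these phases as RPCs across peers. It also shows why a partial commit (some peers commit, one fails) is exactly what leaves volume info out of sync.

```python
class TxnError(Exception):
    """Raised when a phase fails on some peer."""

class Transaction:
    """Toy model of the lock / stage / commit / unlock flow."""

    def __init__(self, peers):
        self.peers = peers

    def run(self, op):
        locked = []
        try:
            for p in self.peers:           # 1. locking: take cluster locks
                p.lock()
                locked.append(p)
            for p in self.peers:           # 2. staging: validate the op
                if not p.stage(op):
                    raise TxnError("staging failed on %s" % p.name)
            for p in self.peers:           # 3. commit: apply and persist
                if not p.commit(op):
                    # Peers earlier in the list have already committed:
                    # this partial commit is what leaves them out of sync.
                    raise TxnError("commit failed on %s" % p.name)
        finally:
            for p in locked:               # 4. unlock: always release locks
                p.unlock()
```

Even on failure the locks are released, but the committed state can diverge between peers, which matches the out-of-sync symptom in this report.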
Created attachment 1647289 [details] Gluster logs
Attached Gluster logs
I see the same errors as in https://bugzilla.redhat.com/show_bug.cgi?id=1781003#c3. @Karthik, please help him with "Transport endpoint is not connected" error.
Hi Akshay,

- I guess the logs you have attached are for the steps given in comment #9.
- If so, then I guess you are hitting the bug https://bugzilla.redhat.com/show_bug.cgi?id=1572599
- To verify that, can you mount the volume after step 2 or step 5, then continue with the rest of the steps and let us know whether you still hit the same problem?

Regards,
Karthik
(In reply to Sanju from comment #15)
> I see the same errors as in
> https://bugzilla.redhat.com/show_bug.cgi?id=1781003#c3. @Karthik, please
> help him with "Transport endpoint is not connected" error.

If the logs provided are for the steps mentioned in comment #9, then it is a different problem. In the initial case there was a mount for the volume; if add-brick was still failing even after removing the gluster-specific xattrs and formatting the brick, it was because of the graph sync issue, where the 1st node was not aware of the brick added by the 2nd node. In this case (comment #9) there is no mount as per the steps mentioned, and the add-brick will fail because the lookup fails on a quorum number of bricks. To confirm this, let us wait for the results of comment #16.
I have upgraded to Gluster 7, and it seems to have resolved the issue. Thanks for the response. I will let you know if I encounter any further issues.