Bug 1129675

Summary: [SNAPSHOT]: Once the snapshot is restored, "gluster volume heal <vol-name> info" shows "Transport endpoint is not connected" and writes from the client are pending for this brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: snapshot
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: asriram, mzywusko, rhs-bugs, sankarshan, smohan
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: SNAPSHOT
Fixed In Version:
Doc Type: Known Issue
Doc Text:
If glusterd is down on one of the nodes in the cluster, or if the node itself is down, performing a snapshot restore operation leads to inconsistency.
Workaround (if any): Perform a snapshot restore only when all the nodes and their corresponding glusterd services are running. After a restore, restart the glusterd service using the "service glusterd start" command if either of the following occurs (see the shell sketch after the metadata fields below):
- Executing the "gluster volume heal <vol-name> info" command displays the error message "Transport endpoint is not connected".
- An error occurs when clients try to connect to the glusterd service.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-16 16:03:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1087818    
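
A minimal shell sketch of the workaround from the Doc Text above, assuming the affected volume is vol1 and that glusterd is managed as an init service (as on RHEL 6); the names are illustrative:

# On the node whose glusterd was down (or which was itself down) during the restore:
service glusterd start

# From any node, verify the heal info error is gone and the brick is reachable again:
gluster volume heal vol1 info
gluster volume status vol1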

Description Rahul Hinduja 2014-08-13 12:57:16 UTC
Description of problem:
========================

In a scenario where a snapshot restore is performed while glusterd is down on one of the nodes in the cluster, the restore is successful and an entry with status 2:1 is added to the missed_snaps_list. When glusterd is brought back online, the missed entry is replayed and its status is updated to 2:2, which means the restore succeeded on this node as well.
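
The replay can be observed on the node that was down once its glusterd comes back up; a rough sketch, assuming the default glusterd working directory (/var/lib/glusterd) and the missed_snaps_list location used by this release:

# Each missed entry ends with an op:status pair; per the description above,
# 2:1 (restore pending) should change to 2:2 (restore done) after glusterd replays it.
cat /var/lib/glusterd/snaps/missed_snaps_list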

However, if you then issue the command "gluster volume heal <vol-name> info", it returns the error "Transport endpoint is not connected" for the restored brick (on the node where glusterd was down during the restore), even though "gluster volume status" shows the brick as online.

As follows:
===========

[root@inception ~]# gluster volume heal vol1 info
Brick inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps
Number of entries: 0

Brick rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/48bef55ddc1a4266ba49a7873d91c457/brick2/b2
Status: Transport endpoint is not connected

Brick rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/48bef55ddc1a4266ba49a7873d91c457/brick3/b2/
Number of entries: 0

Brick rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/48bef55ddc1a4266ba49a7873d91c457/brick4/b2/
Number of entries: 0

[root@inception ~]# 



Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.6.0.27-1.el6rhs.x86_64


How reproducible:
=================
always


Steps to Reproduce:
===================
1. Create a 4-node cluster.
2. Create a 2x2 volume.
3. Create a snapshot (snap1) of the volume.
4. Check "gluster volume heal <vol-name> info"; it should succeed.
5. Bring down glusterd on one of the nodes (for example, node2).
6. Take the volume offline using "gluster volume stop <vol-name>".
7. Restore the volume to snap1; the restore should be successful.
8. Start the volume.
9. Start glusterd on node2.
10. Check "gluster volume status <vol-name>"; it should list all processes as online.
11. Check "gluster volume heal <vol-name> info".
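
The steps above as a rough command outline (volume, snapshot, and node names are illustrative; a 4-node trusted storage pool with the 2x2 volume already created is assumed):

# From node1:
gluster snapshot create snap1 vol1
gluster volume heal vol1 info        # succeeds for all bricks

# On node2:
service glusterd stop

# From node1:
gluster volume stop vol1
gluster snapshot restore snap1       # reported successful
gluster volume start vol1

# On node2:
service glusterd start

# From node1:
gluster volume status vol1           # all processes listed as online
gluster volume heal vol1 info        # node2 brick: "Transport endpoint is not connected"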

Actual results:
===============

The command reports "Status: Transport endpoint is not connected" for the brick hosted on the node where glusterd was down during the restore (node2).


Expected results:
=================

Once glusterd is brought back online at step 9, "gluster volume heal <vol-name> info" should not report "Transport endpoint is not connected".

Comment 2 Rahul Hinduja 2014-08-13 13:55:04 UTC
Marking this bug urgent as the client also does not connect to the brick that is part of node2. Any writes from the client are pending for this brick.

Comment 7 Shalaka 2014-09-20 09:37:32 UTC
Please review and sign off the edited doc text.

Comment 8 Shalaka 2014-09-26 05:45:25 UTC
Canceling need_info as Rajesh reviewed and signed off the doc text during the online review meeting.