Bug 846619

Summary: Client doesn't reconnect after server comes back online
Product: [Community] GlusterFS Reporter: Martin Erlingsson <martin.erlingsson>
Component: unclassified Assignee: Vijay Bellur <vbellur>
Status: CLOSED DEFERRED QA Contact:
Severity: urgent Docs Contact:
Priority: high    
Version: 3.3.0 CC: bugs, gluster-bugs, joe, mailbox, viktor
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-12-14 19:40:29 UTC Type: Bug
Attachments:
extracts from log files, cli output, statedumps (flags: none)

Description Martin Erlingsson 2012-08-08 09:05:10 UTC
Created attachment 602973
extracts from log files, cli output, statedumps

Description of problem:
The client doesn't reconnect after a server comes back online, even though the client log and the volume status report that it does. A manual remount is required.
We have one client and two servers. All three run Debian Squeeze, and we compiled glusterfs-3.3.0 from source on each of them.

Version-Release number of selected component (if applicable):
Glusterfs-3.3.0

How reproducible:
100% reproducible in our lab setup

Steps to Reproduce:
1. On the client, run ls -lR in the mounted gluster folder to start a directory listing.
2. Pull the network cable from server2 for one minute. The directory listing stalls until the gluster timeout expires, then resumes with only one server.
3. Reconnect the network cable to server2.
4. Create a new file in the mounted gluster folder on the client. Run ls -lR in the mounted gluster folder on the client and in the exported folder on both servers.
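Step 4 can be partly scripted. The Python sketch below is illustrative only. It assumes that each server's exported folder (/data/exp) is reachable as a local path from wherever the check runs, for example by running the second function on each server. The helper names `write_marker` and `bricks_missing` are hypothetical. The sketch creates a uniquely named file through the client mount and reports which brick directories do not contain it.

```python
import os
import uuid
from pathlib import Path


def write_marker(mount_dir: str) -> str:
    """Create a uniquely named empty file through the client's gluster mount."""
    name = f"reconnect-check-{uuid.uuid4().hex}"
    Path(mount_dir, name).touch()
    return name


def bricks_missing(name: str, brick_dirs: list[str]) -> list[str]:
    """Return the brick directories where the marker file did NOT appear.

    With a healthy two-way replica the result should be empty; in the
    failure described in this report, server2's brick would be listed.
    """
    return [b for b in brick_dirs if not os.path.isfile(os.path.join(b, name))]
```

Typical use would be `name = write_marker("/storage/asd")` on the client, followed by `bricks_missing(name, ["/data/exp"])` on each server. A non-empty result on a server means the write never reached that brick.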
  
Actual results:
The new file shows up only on server1. The glusterfsd process uses ~10% CPU time on server1 but ~0% on server2. Redundancy is lost.

Expected results:
The client should properly reconnect to server2 when it comes back online. The new file should show up on both servers, the glusterfsd process should be using CPU time on both servers, and redundancy should be re-acquired.

Additional info:
Despite not connecting properly, the client log states that the client connects to the reconnected server: "Connected to 10.128.196.182:24010, attached to remote volume '/data/exp'.". Both servers show all bricks connected when running gluster volume status or gluster volume info.
The same problem occurs when I unplug server2 during a dbench run on the gluster mount, and regardless of which of the two servers I unplug.
In this state, I can restore a proper connection to both servers by manually unmounting and remounting the filesystem (umount /storage/asd, then mount -a). If I instead unplug the remaining server from the network without remounting, I get "Transport endpoint is not connected" when trying to use the mounted gluster folder.
The client and the servers are virtual machines in VMware vSphere. I unplug the network by deselecting "Connected" in the virtual machine's network settings.
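To compare what the client log claims with what actually reaches the bricks, you can extract the "Connected to ..." messages quoted above from the client log. This is a minimal Python sketch. The regular expression is written only against the message text quoted in this report, and the function name `reported_connections` is hypothetical. The log path you pass in is an assumption that depends on the installation.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the client-log message quoted in this report, e.g.
#   Connected to 10.128.196.182:24010, attached to remote volume '/data/exp'.
# Only the message text is matched; any timestamp or prefix before it is ignored.
CONNECTED_RE = re.compile(
    r"Connected to (?P<host>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+), "
    r"attached to remote volume '(?P<volume>[^']+)'"
)


def reported_connections(log_lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Yield (host, port, remote_volume) for every 'Connected to ...' line."""
    for line in log_lines:
        match = CONNECTED_RE.search(line)
        if match:
            yield match.group("host"), int(match.group("port")), match.group("volume")
```

In the situation described here, running this over the client log after the cable is reconnected would still list the reconnected server, even though new files do not reach its brick. A "connected" log entry on its own is therefore not evidence that replication has resumed.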

Attaching file: 
gluster-connect-bug-2012-08-08.txt with extracts from log files, gluster commands output, and volume statedumps from the servers.

Comment 1 Niels de Vos 2014-11-27 14:53:46 UTC
The version this bug was reported against no longer receives updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there is no update before 9 December 2014, this bug will be closed automatically.