Bug 762486 (GLUSTER-754)

Summary: enable tcp keepalive
Product: [Community] GlusterFS
Reporter: Krishna Srinivas <krishna>
Component: protocol
Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: urgent
Version: 3.0.0
CC: amarts, anush, gluster-bugs, pavan
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Attachments:
Trace of a connection sending keep-alives every ten seconds (flags: none)

Description Krishna Srinivas 2010-03-24 04:10:41 UTC
On Tue, Mar 23, 2010 at 11:28 PM, Anand Avati <avati> wrote:
>
>> http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#preventingdisconnection
>> >
>> > Fixing this problem in glusterfs is very simple, just call on socket fd:
>> >   optval = 1;
>> >   optlen = sizeof(optval);
>> >   if(setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen) < 0) {
>> >      /* ERROR */
>> >   }
>> >
>> > This is very light weight on the network - by default it sends a
>> > keepalive packet every 2 hours (configurable in /proc).
>
> Is the keepalive interval tunable per-socket at the system call level? Setting the TCP keepalive is simple and easy, but I'm wondering whether 2 hours is frequent enough. Can we make it something like 10 minutes?

http://www.linux.org/docs/ldp/howto/TCP-Keepalive-HOWTO/programming.html

Yes, it looks like that can be configured too (per socket fd) with this call:

setsockopt(s, SOL_TCP, TCP_KEEPIDLE, &optval, optlen)

You can choose the default value for this at your discretion and
make it configurable. Shall I confirm to Humedica that we will do this
in our code?

Krishna
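
The per-socket tuning discussed above could be sketched roughly as follows. This is a minimal, Linux-oriented sketch and not the code that was eventually committed: the helper name `tune_keepalive` and its parameters are hypothetical, and the 10-minute idle time in the usage note simply mirrors the value suggested in the email. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux socket options described in the TCP-Keepalive-HOWTO.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */
#include <sys/socket.h>

/*
 * Hypothetical helper: enable keepalive on fd and override the
 * system-wide timers for this socket only.
 *
 *   idle_secs   - idle time before the first probe is sent
 *   intvl_secs  - delay between unanswered probes
 *   probe_count - unanswered probes before the connection is dropped
 *
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int
tune_keepalive(int fd, int idle_secs, int intvl_secs, int probe_count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof(intvl_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probe_count, sizeof(probe_count)) < 0)
        return -1;

    return 0;
}
```

For example, `tune_keepalive(fd, 600, 60, 5)` would send the first probe after 10 minutes of idle time, then retry every 60 seconds up to 5 times. The retry interval and probe count here are illustrative values only.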

Comment 1 Krishna Srinivas 2010-03-24 07:09:53 UTC
The customer has a firewall between clients and servers. This firewall
drops idle TCP connections, which causes the first access on the
client mount point to return an error. Subsequent accesses work fine.
Hence automated scripts that are the first to run after an idle
connection has been dropped will fail.

Apparently this is a common problem:
http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#preventingdisconnection

Fixing this problem in glusterfs is very simple, just call on socket fd:
  optval = 1;
  optlen = sizeof(optval);
  if(setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen) < 0) {
     /* ERROR */
  }

This is very lightweight on the network - by default Linux sends the
first keepalive probe only after the connection has been idle for
2 hours (configurable in /proc/sys/net/ipv4/tcp_keepalive_time). This
will be useful for other customers who have firewalls that terminate
idle connections.

Krishna
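
A slightly fuller version of the snippet above, wrapped in a function so that it compiles on its own, might look like the following. This is only a sketch: the function name `enable_keepalive` is hypothetical, and the read-back with `getsockopt` is an extra sanity check that the proposal itself does not include.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on SO_KEEPALIVE for fd and confirm that
 * the kernel accepted it. Timers are left at the system defaults.
 *
 * Returns 0 if keepalive is enabled afterwards, -1 otherwise.
 */
int
enable_keepalive(int fd)
{
    int optval = 1;
    socklen_t optlen = sizeof(optval);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen) < 0)
        return -1;

    /* Read the option back; a non-zero value means it is enabled. */
    optval = 0;
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &optval, &optlen) < 0)
        return -1;

    return optval ? 0 : -1;
}
```

With only `SO_KEEPALIVE` set, the probe timing still comes from the system-wide defaults, so a firewall with an idle timeout shorter than those defaults would still drop the connection.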

Comment 2 Shehjar Tikoo 2010-05-17 03:18:30 UTC
Setting priority to P1 so that this shows up in my list of priority bugs.

Comment 3 Shehjar Tikoo 2010-05-17 06:01:11 UTC
Created attachment 204
Proposed patch (as in full description) as an attachment

Trace of a connection sending keep-alives every ten seconds. View using wireshark.

Option for enabling keep-alive:
option transport.socket.keepalive-interval 10

Comment 4 Anand Avati 2010-05-21 04:32:11 UTC
PATCH: http://patches.gluster.com/patch/3287 in master (socket: Support TCP-KEEPALIVE)

Comment 5 Anand Avati 2010-05-21 04:32:30 UTC
PATCH: http://patches.gluster.com/patch/3288 in release-3.0 (socket: Support TCP-KEEPALIVE)

Comment 6 Anand Avati 2010-05-26 04:26:05 UTC
PATCH: http://patches.gluster.com/patch/3303 in master (socket: make tcp keepalive work on OS X)

Comment 7 Anand Avati 2010-05-27 06:00:59 UTC
PATCH: http://patches.gluster.com/patch/3322 in release-3.0 (socket: make tcp keepalive work on OS X)
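
For readers without access to the patches, the following sketch shows one way a single keepalive interval (such as the value given to `transport.socket.keepalive-interval`) could be applied to a socket on both Linux and OS X. It is an assumption-laden illustration, not the committed code: the function name `set_keepalive_interval` is hypothetical, and using the same value for both the idle time and the probe interval is a guess about the intended semantics. The one platform fact relied on is that OS X names the idle-time option `TCP_KEEPALIVE` where Linux uses `TCP_KEEPIDLE`.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable keepalive on fd and ask for a probe
 * roughly every interval_secs seconds of idle time.
 *
 * Returns 0 on success, -1 on error or if interval_secs <= 0.
 */
int
set_keepalive_interval(int fd, int interval_secs)
{
    int on = 1;

    if (interval_secs <= 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

    /* Idle time before the first probe: the option name differs by OS. */
#if defined(TCP_KEEPIDLE)            /* Linux */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#elif defined(TCP_KEEPALIVE)         /* OS X */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPALIVE,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#endif

    /* Delay between subsequent probes, where the platform exposes it. */
#if defined(TCP_KEEPINTVL)
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#endif

    return 0;
}
```

Calling `set_keepalive_interval(fd, 10)` would correspond to the ten-second setting used for the attached trace, under the assumptions stated above.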