Hide Forgot
Description of problem: Log messages that are seen when a remote endpoint is not reachable for a time period greater than the configured frame timeout (default 30mins). The frame timeout is the length of the duration for which the 'client' waits on the 'server', that is not disconnected, for a response. After frame time-out, the 'client' bails the RPC. This event is logged since its valuable information about the network status of that connection pair (ie, client-server). But this log message can't be unambiguously identified with a connection pair. This makes debugging such network failures frustrating. Version-Release number of selected component (if applicable): How reproducible: Always Steps to Reproduce: 1. Create a volume (say test-vol) and start it. 2.Add the following iptables rule, # iptables -I INPUT -p tcp --dport 24007 -j DROP 3. # gluster volume status test-vol Actual results: call bail logs don't have connection pair information Expected results: call bail logs should have enough information to determine the connection pair. Additional info: Upstream fix - http://review.gluster.org/6091
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/16279/
KP, Please review the doc text. I've edited it and made it a little less technical for the purpose of errata.
Pavithra, The edits look good to me.
Tested with glusterfs -3.4.0.55rhs-1.el6rhs Verified with the following steps : 1. Created a 2 Node cluster 2. Created a distribute replicate volume with 2X2 bricks 3. Started the volume 4. Set the frame-timeout to 10seconds (i.e) gluster volume set <vol-name> network.frame-timeout 10 5. Set the ping-timeout to max value (i.e) 1013 seconds (i.e) gluster volume set <vol-name> network.ping-timeout 1013 6. Fuse mount the volume 7. Block all the traffic to glusterd and bricks (i.e) iptables -I INPUT 1 -p tcp --dport 553:24007 -j DROP 8. execute "df- Th" Following snip is found in mount log : <snip> [2014-01-13 18:09:37.167035] E [rpc-clnt.c:208:call_bail] 0-drvol-client-0: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.115990. timeout = 10 for 10.70.37.10:49153 [2014-01-13 18:09:37.167092] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-0: remote operation failed: Transport endpoint is not connected [2014-01-13 18:09:37.167110] E [rpc-clnt.c:208:call_bail] 0-drvol-client-2: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116044. timeout = 10 for 10.70.37.10:49154 [2014-01-13 18:09:37.167120] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-2: remote operation failed: Transport endpoint is not connected [2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153 [2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153 [2014-01-13 18:09:37.167151] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-1: remote operation failed: Transport endpoint is not connected [2014-01-13 18:09:37.167170] E [rpc-clnt.c:208:call_bail] 0-drvol-client-3: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116061. timeout = 10 for 10.70.37.11:49154 </snip> From the above snip its clear that call bail logs have corresponding IP address of the machine
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html