Bug 1034547 - call_bail logs should have remote host information for debugging purposes
Summary: call_bail logs should have remote host information for debugging purposes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: RHGS 2.1.2
Assignee: krishnan parthasarathi
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-11-26 05:28 UTC by krishnan parthasarathi
Modified: 2015-11-03 23:05 UTC (History)
7 users (show)

Fixed In Version: glusterfs-3.4.0.49rhs
Doc Type: Bug Fix
Doc Text:
Previously, when a mount process waits for response from a brick process, for longer than 30 minutes (default timeout), the mount would fail the request from the application. This event would generate log messages with information about the request that timed out. With this update, the information in the log messages is enhanced to provide the brick processes' hostname to help with meaningful analysis of log messages.
Clone Of:
Environment:
Last Closed: 2014-02-25 08:06:06 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0208 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #2 2014-02-25 12:20:30 UTC

Description krishnan parthasarathi 2013-11-26 05:28:59 UTC
Description of problem:
Log messages that are seen when a remote endpoint is not reachable for a time period greater than the configured frame timeout (default 30mins). The frame timeout is the length of the duration for which the 'client' waits on the 'server', that is not disconnected, for a response. After frame time-out, the 'client' bails the RPC. This event is logged since its valuable information about the network status of that connection pair (ie, client-server).
But this log message can't be unambiguously identified with a connection pair. This makes debugging such network failures frustrating.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Create a volume (say test-vol) and start it. 

2.Add the following iptables rule,
# iptables -I INPUT -p tcp --dport 24007 -j DROP
 
3. # gluster volume status test-vol

Actual results:
call bail logs don't have connection pair information

Expected results:
call bail logs should have enough information to determine the connection pair.

Additional info:
Upstream fix - http://review.gluster.org/6091

Comment 2 Niels de Vos 2013-12-04 11:59:42 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/16279/

Comment 3 Pavithra 2013-12-31 06:10:39 UTC
KP,

Please review the doc text. I've edited it and made it a little less technical for the purpose of errata.

Comment 4 krishnan parthasarathi 2014-01-03 10:49:31 UTC
Pavithra,
The edits look good to me.

Comment 5 SATHEESARAN 2014-01-13 12:55:47 UTC
Tested with glusterfs -3.4.0.55rhs-1.el6rhs

Verified with the following steps :
1. Created a 2 Node cluster
2. Created a distribute replicate volume with 2X2 bricks
3. Started the volume
4. Set the frame-timeout to 10seconds
 (i.e) gluster volume set <vol-name> network.frame-timeout 10
5. Set the ping-timeout to max value (i.e) 1013 seconds
(i.e) gluster volume set <vol-name> network.ping-timeout 1013
6. Fuse mount the volume
7. Block all the traffic to glusterd and bricks
(i.e) iptables -I INPUT 1 -p tcp --dport 553:24007 -j DROP
8. execute "df- Th" 

Following snip is found in mount log :
<snip>
[2014-01-13 18:09:37.167035] E [rpc-clnt.c:208:call_bail] 0-drvol-client-0: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.115990. timeout = 10 for 10.70.37.10:49153
[2014-01-13 18:09:37.167092] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-0: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167110] E [rpc-clnt.c:208:call_bail] 0-drvol-client-2: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116044. timeout = 10 for 10.70.37.10:49154
[2014-01-13 18:09:37.167120] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153
[2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153
[2014-01-13 18:09:37.167151] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-1: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167170] E [rpc-clnt.c:208:call_bail] 0-drvol-client-3: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116061. timeout = 10 for 10.70.37.11:49154
</snip>

From the above snip its clear that call bail logs have corresponding IP address of the machine

Comment 7 errata-xmlrpc 2014-02-25 08:06:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html


Note You need to log in before you can comment on or make changes to this bug.