Bug 1034547

Summary: call_bail logs should have remote host information for debugging purposes
Product: Red Hat Gluster Storage Reporter: krishnan parthasarathi <kparthas>
Component: glusterfsAssignee: krishnan parthasarathi <kparthas>
Status: CLOSED ERRATA QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: high    
Version: 2.0CC: kparthas, ndevos, nsathyan, psriniva, vagarwal, vbellur, vraman
Target Milestone: ---Keywords: ZStream
Target Release: RHGS 2.1.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0.49rhs Doc Type: Bug Fix
Doc Text:
Previously, when a mount process waits for response from a brick process, for longer than 30 minutes (default timeout), the mount would fail the request from the application. This event would generate log messages with information about the request that timed out. With this update, the information in the log messages is enhanced to provide the brick processes' hostname to help with meaningful analysis of log messages.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-02-25 08:06:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description krishnan parthasarathi 2013-11-26 05:28:59 UTC
Description of problem:
Log messages that are seen when a remote endpoint is not reachable for a time period greater than the configured frame timeout (default 30mins). The frame timeout is the length of the duration for which the 'client' waits on the 'server', that is not disconnected, for a response. After frame time-out, the 'client' bails the RPC. This event is logged since its valuable information about the network status of that connection pair (ie, client-server).
But this log message can't be unambiguously identified with a connection pair. This makes debugging such network failures frustrating.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Create a volume (say test-vol) and start it. 

2.Add the following iptables rule,
# iptables -I INPUT -p tcp --dport 24007 -j DROP
 
3. # gluster volume status test-vol

Actual results:
call bail logs don't have connection pair information

Expected results:
call bail logs should have enough information to determine the connection pair.

Additional info:
Upstream fix - http://review.gluster.org/6091

Comment 2 Niels de Vos 2013-12-04 11:59:42 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/16279/

Comment 3 Pavithra 2013-12-31 06:10:39 UTC
KP,

Please review the doc text. I've edited it and made it a little less technical for the purpose of errata.

Comment 4 krishnan parthasarathi 2014-01-03 10:49:31 UTC
Pavithra,
The edits look good to me.

Comment 5 SATHEESARAN 2014-01-13 12:55:47 UTC
Tested with glusterfs -3.4.0.55rhs-1.el6rhs

Verified with the following steps :
1. Created a 2 Node cluster
2. Created a distribute replicate volume with 2X2 bricks
3. Started the volume
4. Set the frame-timeout to 10seconds
 (i.e) gluster volume set <vol-name> network.frame-timeout 10
5. Set the ping-timeout to max value (i.e) 1013 seconds
(i.e) gluster volume set <vol-name> network.ping-timeout 1013
6. Fuse mount the volume
7. Block all the traffic to glusterd and bricks
(i.e) iptables -I INPUT 1 -p tcp --dport 553:24007 -j DROP
8. execute "df- Th" 

Following snip is found in mount log :
<snip>
[2014-01-13 18:09:37.167035] E [rpc-clnt.c:208:call_bail] 0-drvol-client-0: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.115990. timeout = 10 for 10.70.37.10:49153
[2014-01-13 18:09:37.167092] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-0: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167110] E [rpc-clnt.c:208:call_bail] 0-drvol-client-2: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116044. timeout = 10 for 10.70.37.10:49154
[2014-01-13 18:09:37.167120] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153
[2014-01-13 18:09:37.167131] E [rpc-clnt.c:208:call_bail] 0-drvol-client-1: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x13x sent = 2014-01-13 18:09:17.116020. timeout = 10 for 10.70.37.11:49153
[2014-01-13 18:09:37.167151] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 0-drvol-client-1: remote operation failed: Transport endpoint is not connected
[2014-01-13 18:09:37.167170] E [rpc-clnt.c:208:call_bail] 0-drvol-client-3: bailing out frame type(GlusterFS 3.3) op(STATFS(14)) xid = 0x12x sent = 2014-01-13 18:09:17.116061. timeout = 10 for 10.70.37.11:49154
</snip>

From the above snip its clear that call bail logs have corresponding IP address of the machine

Comment 7 errata-xmlrpc 2014-02-25 08:06:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html