Bug 1394219

Summary: Better logging when reporting failures of the kind "<file-path> Failing MKNOD as quorum is not met"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Krutika Dhananjay <kdhananj>
Component: replicate
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: bugs, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1392865
Environment:
Last Closed: 2017-03-23 06:18:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1392865
Bug Blocks: 1351528, 1393629, 1393630, 1393631

Description Krutika Dhananjay 2016-11-11 12:33:23 UTC
+++ This bug was initially created as a clone of Bug #1392865 +++

Description of problem:

When debugging quorum failures, it often helps to know the exact error code with which the fop is being failed.
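
The fix (review 15800 below) makes this message carry the error string as well. As a standalone illustration of the idea, here is a minimal C sketch that appends the strerror() text of the op_errno in brackets, the way the fixed message does; the real change goes through gluster's gf_msg() logging framework, so every name below is illustrative only:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: mimic the improved AFR quorum-failure message.
     * Before the fix the log stopped at "... quorum is not met"; the fix
     * also propagates the op_errno so its error string lands in the log. */
    static void
    log_quorum_failure (const char *subvol, const char *id,
                        const char *fop, int op_errno)
    {
            fprintf (stderr,
                     "W [afr_handle_quorum] %s: %s: Failing %s as quorum "
                     "is not met [%s]\n",
                     subvol, id, fop, strerror (op_errno));
    }

    int
    main (void)
    {
            /* ENOTCONN prints "Transport endpoint is not connected" on
             * glibc, matching the AFTER log line in comment 7. */
            log_quorum_failure ("0-rep3-replicate-0",
                                "0621b9e8-f9c2-435e-b679-267dfba99ef2",
                                "WRITE", ENOTCONN);
            return 0;
    }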


--- Additional comment from Worker Ant on 2016-11-08 06:56:35 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-11-09 05:27:50 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-11-09 10:35:02 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#3) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2016-11-09 19:45:19 EST ---

COMMIT: http://review.gluster.org/15800 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit f2e6cf9d9c578c58cdd348ebcd14dcbfdfbafadd
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Nov 8 17:24:41 2016 +0530

    cluster/afr: When failing fop due to lack of quorum, also log error string
    
    Change-Id: I39de2bcfc660f23a4229a885a0e0420ca949ffc9
    BUG: 1392865
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/15800
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 2 Krutika Dhananjay 2016-11-11 12:41:39 UTC
The error string in this particular log message, had it been logged, would have been much more helpful in debugging https://bugzilla.redhat.com/show_bug.cgi?id=1370350.

This bug was raised to fix exactly that: the quorum-loss log message should also print the error string, for better debuggability.

Also note that the presence of this new information in the log can easily be verified as part of the verification of https://bugzilla.redhat.com/show_bug.cgi?id=1370350.

-Krutika

Comment 4 Krutika Dhananjay 2016-11-14 10:22:45 UTC
https://code.engineering.redhat.com/gerrit/#/c/90066/

Comment 7 Nag Pavan Chilakam 2017-01-11 08:40:16 UTC
QA validation:
This is what I see when quorum is lost and the fuse client detects it. I confirmed the expected output with dev; it is as below, hence moving to verified.

BEFORE:

[2017-01-10 12:17:28.684316] W [MSGID: 108001] [afr-transaction.c:804:afr_handle_quorum] 0-rep3-replicate-0: /dir1/linux-4.9.2/Documentation/devicetree/bindings/display/arm,pl11x.txt: Failing SETATTR as quorum is not met

AFTER:

[2017-01-10 14:21:07.187875] W [MSGID: 108001] [afr-transaction.c:806:afr_handle_quorum] 0-rep3-replicate-0: 0621b9e8-f9c2-435e-b679-267dfba99ef2: Failing WRITE as quorum is not met [Transport endpoint is not connected]


Test version: 3.8.4-11


===> Test version: glusterfs-libs-3.8.4-3.el7rhgs.x86_64
 
[2017-01-11 06:21:14.112038] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f381bb905b2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f381b95675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f381b95686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f381b957fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f381b9588a0] ))))) 0-rep3-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-01-11 06:21:14.102905 (xid=0x2b87ea)
[2017-01-11 06:21:14.116366] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:14.119704] I [socket.c:3401:socket_submit_request] 0-rep3-client-0: not connected (priv->connected = 0)
[2017-01-11 06:21:14.119751] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-rep3-client-0: failed to submit rpc-request (XID: 0x2b87eb Program: GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport (rep3-client-0)
[2017-01-11 06:21:14.119770] W [MSGID: 114031] [client-rpc-fops.c:974:client3_3_fsync_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:14.119794] W [MSGID: 108035] [afr-transaction.c:2150:afr_changelog_fsync_cbk] 0-rep3-replicate-0: fsync(83c34e00-0e26-4089-ba4a-1f54bcb14213) failed on subvolume rep3-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2017-01-11 06:21:14.119948] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-0: disconnected from rep3-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-11 06:21:14.120191] E [MSGID: 114058] [client-handshake.c:1533:client_query_portmap_cbk] 0-rep3-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-01-11 06:21:14.409300] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-rep3-client-0: remote operation failed
[2017-01-11 06:21:14.410305] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:24.704657] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-0: disconnected from rep3-client-0. Client process will keep trying to connect to glusterd until brick's port is available


===> Test version: glusterfs-api-3.8.4-11.el7rhgs.x86_64

[2017-01-11 06:51:28.108980] W [socket.c:590:__socket_rwv] 0-rep3-client-1: readv on 10.70.47.105:49154 failed (No data available)
[2017-01-11 06:51:28.111944] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff70a0b2602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff709e7876e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff709e7887e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7ff709e79fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7ff709e7a874] ))))) 0-rep3-client-1: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-01-11 06:51:28.108777 (xid=0xd20c8)
[2017-01-11 06:51:28.111991] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-rep3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:51:28.112031] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-1: disconnected from rep3-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-11 06:51:28.112217] W [MSGID: 114061] [client-common.c:460:client_pre_fsync] 0-rep3-client-1:  (dce618ba-01fa-453d-b8ed-67af2f77d1aa) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-01-11 06:51:28.112267] W [MSGID: 108035] [afr-transaction.c:2221:afr_changelog_fsync_cbk] 0-rep3-replicate-0: fsync(dce618ba-01fa-453d-b8ed-67af2f77d1aa) failed on subvolume rep3-client-1. Transaction was WRITE [File descriptor in bad state]
[2017-01-11 06:51:33.425876] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-rep3-client-1: remote operation failed
[2017-01-11 06:51:33.430676] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-rep3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:51:38.485138] E [MSGID: 114058] [client-handshake.c:1533:client_query_portmap_cbk] 0-rep3-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-01-11 06:51:38.485228] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-1: disconnected from rep3-client-1. Client process will keep trying to connect to glusterd until brick's port is available
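
For quick reference while reading the excerpts above: the bracketed suffixes are plain strerror() text for the errno that now gets propagated into the messages. A tiny sketch (values and strings as on glibc/Linux) that prints the mappings seen in these logs:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
            /* errno values whose strings appear in the excerpts above:
             * ENOTCONN -> "Transport endpoint is not connected"
             * EBADFD   -> "File descriptor in bad state"
             * ENODATA  -> "No data available" */
            int errs[] = { ENOTCONN, EBADFD, ENODATA };
            for (unsigned i = 0; i < sizeof (errs) / sizeof (errs[0]); i++)
                    printf ("%3d  %s\n", errs[i], strerror (errs[i]));
            return 0;
    }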





Steps:
1) Create a 3-node cluster.
2) Create a 1x3 volume.
3) Enable server quorum by setting cluster.server-quorum-type to server.
4) Mount the volume on a client and do a dd of 5 GB.
5) While the file write is in progress, bring down glusterd on n2 and n3.

The above logs will be observed on the fuse client; the before/after difference in the quorum-failure message is the one shown at the top of this comment. (A libgfapi sketch that drives the same write path from code follows below.)
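
For completeness, step 4 can also be driven from code rather than a dd on the fuse mount. A minimal libgfapi sketch, assuming a volume named rep3 and a server n1 (both placeholders; build with gcc -lgfapi). Once quorum is lost, the write fails and errno carries the same reason the new log line prints:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <glusterfs/api/glfs.h>

    int
    main (void)
    {
            static char buf[1 << 20];       /* 1 MB of zeroes per write */
            glfs_t *fs = glfs_new ("rep3"); /* volume name: placeholder */

            if (!fs)
                    return 1;
            glfs_set_volfile_server (fs, "tcp", "n1", 24007); /* placeholder host */
            if (glfs_init (fs) != 0)
                    return 1;

            glfs_fd_t *fd = glfs_creat (fs, "/bigfile", O_WRONLY, 0644);
            if (!fd) {
                    glfs_fini (fs);
                    return 1;
            }

            /* Keep writing; bring down glusterd on n2 and n3 meanwhile.
             * Once quorum is lost the write fails and errno explains why
             * (ENOTCONN in the run captured above). */
            for (int i = 0; i < 5120; i++) {        /* ~5 GB in total */
                    if (glfs_write (fd, buf, sizeof (buf), 0) < 0) {
                            fprintf (stderr, "write failed: %s\n",
                                     strerror (errno));
                            break;
                    }
            }

            glfs_close (fd);
            glfs_fini (fs);
            return 0;
    }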


Comment 9 errata-xmlrpc 2017-03-23 06:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html