Bug 1394219 - Better logging when reporting failures of the kind "<file-path> Failing MKNOD as quorum is not met"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Krutika Dhananjay
QA Contact: nchilaka
URL:
Whiteboard:
Depends On: 1392865
Blocks: 1351528 1393629 1393630 1393631
 
Reported: 2016-11-11 12:33 UTC by Krutika Dhananjay
Modified: 2017-03-23 06:18 UTC
CC: 4 users

Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1392865
Environment:
Last Closed: 2017-03-23 06:18:30 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Krutika Dhananjay 2016-11-11 12:33:23 UTC
+++ This bug was initially created as a clone of Bug #1392865 +++

Description of problem:

Often, when debugging quorum failures, it helps to know the exact error code that caused the fop to be failed.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2016-11-08 06:56:35 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#1) for review on master by Krutika Dhananjay (kdhananj@redhat.com)

--- Additional comment from Worker Ant on 2016-11-09 05:27:50 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#2) for review on master by Krutika Dhananjay (kdhananj@redhat.com)

--- Additional comment from Worker Ant on 2016-11-09 10:35:02 EST ---

REVIEW: http://review.gluster.org/15800 (cluster/afr: When failing fop due to lack of quorum, also log error string) posted (#3) for review on master by Krutika Dhananjay (kdhananj@redhat.com)

--- Additional comment from Worker Ant on 2016-11-09 19:45:19 EST ---

COMMIT: http://review.gluster.org/15800 committed in master by Pranith Kumar Karampuri (pkarampu@redhat.com) 
------
commit f2e6cf9d9c578c58cdd348ebcd14dcbfdfbafadd
Author: Krutika Dhananjay <kdhananj@redhat.com>
Date:   Tue Nov 8 17:24:41 2016 +0530

    cluster/afr: When failing fop due to lack of quorum, also log error string
    
    Change-Id: I39de2bcfc660f23a4229a885a0e0420ca949ffc9
    BUG: 1392865
    Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
    Reviewed-on: http://review.gluster.org/15800
    Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
    Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>

Comment 2 Krutika Dhananjay 2016-11-11 12:41:39 UTC
The error string in this particular log message, had it been logged, would have been much more helpful in debugging https://bugzilla.redhat.com/show_bug.cgi?id=1370350.

This bug was raised to fix exactly that: the quorum-loss log message should also print the error string, for better debuggability.

Also note that the presence of this new information in the log can easily be verified as part of the verification of https://bugzilla.redhat.com/show_bug.cgi?id=1370350.

-Krutika

Comment 4 Krutika Dhananjay 2016-11-14 10:22:45 UTC
https://code.engineering.redhat.com/gerrit/#/c/90066/

Comment 7 nchilaka 2017-01-11 08:40:16 UTC
qa_validation:
This is what i see when quorum is lost and the fuse client discovers
(confirmed with dev on the expected o/p, which is as below hence moving to verified

BEFORE:

[2017-01-10 12:17:28.684316] W [MSGID: 108001] [afr-transaction.c:804:afr_handle_quorum] 0-rep3-replicate-0: /dir1/linux-4.9.2/Documentation/devicetree/bindings/display/arm,pl11x.txt: Failing SETATTR as quorum is not met

AFTER:

[2017-01-10 14:21:07.187875] W [MSGID: 108001] [afr-transaction.c:806:afr_handle_quorum] 0-rep3-replicate-0: 0621b9e8-f9c2-435e-b679-267dfba99ef2: Failing WRITE as quorum is not met [Transport endpoint is not connected]


Test version: 3.8.4-11




===> test version: glusterfs-libs-3.8.4-3.el7rhgs.x86_64
 
[2017-01-11 06:21:14.112038] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f381bb905b2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f381b95675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f381b95686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f381b957fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f381b9588a0] ))))) 0-rep3-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-01-11 06:21:14.102905 (xid=0x2b87ea)
[2017-01-11 06:21:14.116366] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:14.119704] I [socket.c:3401:socket_submit_request] 0-rep3-client-0: not connected (priv->connected = 0)
[2017-01-11 06:21:14.119751] W [rpc-clnt.c:1640:rpc_clnt_submit] 0-rep3-client-0: failed to submit rpc-request (XID: 0x2b87eb Program: GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport (rep3-client-0)
[2017-01-11 06:21:14.119770] W [MSGID: 114031] [client-rpc-fops.c:974:client3_3_fsync_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:14.119794] W [MSGID: 108035] [afr-transaction.c:2150:afr_changelog_fsync_cbk] 0-rep3-replicate-0: fsync(83c34e00-0e26-4089-ba4a-1f54bcb14213) failed on subvolume rep3-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2017-01-11 06:21:14.119948] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-0: disconnected from rep3-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-11 06:21:14.120191] E [MSGID: 114058] [client-handshake.c:1533:client_query_portmap_cbk] 0-rep3-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-01-11 06:21:14.409300] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-rep3-client-0: remote operation failed
[2017-01-11 06:21:14.410305] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-rep3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:21:24.704657] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-0: disconnected from rep3-client-0. Client process will keep trying to connect to glusterd until brick's port is available


=====================> test version: glusterfs-api-3.8.4-11.el7rhgs.x86_64




[2017-01-11 06:51:28.108980] W [socket.c:590:__socket_rwv] 0-rep3-client-1: readv on 10.70.47.105:49154 failed (No data available)
[2017-01-11 06:51:28.111944] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7ff70a0b2602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7ff709e7876e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7ff709e7887e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7ff709e79fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7ff709e7a874] ))))) 0-rep3-client-1: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-01-11 06:51:28.108777 (xid=0xd20c8)
[2017-01-11 06:51:28.111991] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-rep3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:51:28.112031] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-1: disconnected from rep3-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-11 06:51:28.112217] W [MSGID: 114061] [client-common.c:460:client_pre_fsync] 0-rep3-client-1:  (dce618ba-01fa-453d-b8ed-67af2f77d1aa) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-01-11 06:51:28.112267] W [MSGID: 108035] [afr-transaction.c:2221:afr_changelog_fsync_cbk] 0-rep3-replicate-0: fsync(dce618ba-01fa-453d-b8ed-67af2f77d1aa) failed on subvolume rep3-client-1. Transaction was WRITE [File descriptor in bad state]
[2017-01-11 06:51:33.425876] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-rep3-client-1: remote operation failed
[2017-01-11 06:51:33.430676] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-rep3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-11 06:51:38.485138] E [MSGID: 114058] [client-handshake.c:1533:client_query_portmap_cbk] 0-rep3-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-01-11 06:51:38.485228] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-rep3-client-1: disconnected from rep3-client-1. Client process will keep trying to connect to glusterd until brick's port is available





Steps:
1) Create a 3-node cluster.
2) Create a 1x3 volume.
3) Enable server quorum by setting cluster.server-quorum-type to server.
4) Mount the volume on a client and do a dd of 5 GB.
5) While the file write is happening, bring down glusterd on n2 and n3.
The above logs will be observed on the fuse mount.


Comment 9 errata-xmlrpc 2017-03-23 06:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

