Bug 1351672 - Possible bug in portmap on client reconnection
Summary: Possible bug in portmap on client reconnection
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: 3.7.11
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1333317 1379216 1379217
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-06-30 14:23 UTC by Oleksandr Natalenko
Modified: 2017-03-08 10:49 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-08 10:49:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Oleksandr Natalenko 2016-06-30 14:23:01 UTC
Occasionally in replica 2 setup one of the node (1.2.3.22) gets the following errors in /var/log/glusterfs/etc-glusterfs-glusterd.vol.log file:

===
[2016-06-30 11:00:27.034268] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.22:65488
[2016-06-30 11:00:27.034487] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2016-06-30 11:00:30.029924] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.77:65530  
[2016-06-30 11:00:30.029957] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully  
[2016-06-30 11:00:30.029995] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.77:65530
[2016-06-30 11:00:30.030005] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2016-06-30 11:00:32.380737] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 1.2.3.60:65534 
===

1.2.3.22 and 1.2.3.77 are GlusterFS server nodes (replica 2), and 1.2.3.60 is FUSE client.

After those warnings/errors the following happens on the client:

===
[2016-06-30 11:00:32.351619] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-asterisk_agents_videos-client-2: server 1.2.3.22:49214 has not responded in the last 10 seconds, disco
nnecting.
[2016-06-30 11:00:32.361727] E [rpc-clnt.c:370:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd4a47f8b42] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1d
e)[0x7fd4a45c376e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fd4a45c387e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fd4a45c4fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7fd4a45c5828] ))))) 0-asterisk_agents_videos-client-2: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-06-30 11:00:17.764449 (xid=0xa9865d)
[2016-06-30 11:00:32.361755] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-asterisk_agents_videos-client-2: remote operation failed [Transport endpoint is not connected]
[2016-06-30 11:00:32.378754] I [socket.c:3304:socket_submit_request] 0-asterisk_agents_videos-client-2: not connected (priv->connected = 0)
[2016-06-30 11:00:32.380399] W [rpc-clnt.c:1644:rpc_clnt_submit] 0-asterisk_agents_videos-client-2: failed to submit rpc-request (XID: 0xa98681 Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (asterisk_agents_videos-client-2)

…

[2016-06-30 11:00:38.177288] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-asterisk_agents_videos-client-2: remote operation failed. Path: /2016 (ab47c5bd-c117-4423-b624-b6f88e243304) [Transport endpoint is not connected]  
[2016-06-30 11:00:38.182921] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-asterisk_agents_videos-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]  
[2016-06-30 11:00:38.186938] E [MSGID: 114031] [client-rpc-fops.c:1730:client3_3_entrylk_cbk] 0-asterisk_agents_videos-client-2: remote operation failed [Transport endpoint is not connected]
…
===

As far as I understand, client have lost a connection to one of the bricks in distributed-replicated volume. Kaushal explained me in #gluster:

===
[15:51] <kshlm> That error says that the process doesn't understand the GLusterFS FOP RPCs.
[15:51] <kshlm> Sometimes a client tries to send requests to glusterd instead of bricks.
[15:52] <kshlm> I've seen it sporadically, but never bothered to root cause it.
[15:52] <kshlm> It could possibly occur because of a client portmap request failing.
[15:53] <kshlm> A client translator needs to connect to glusterd to get the port of a brick by doing a portmap query.
[15:53] <kshlm> It should reconnect to the brick port once it gets the port.
[15:54] <kshlm> But maybe sometimes this reconnection doesn't happen.
[15:54] <kshlm> An client xlator continues to use the glusterd connection itself to send IO requests.
[15:55] <kshlm> The portmap query and reconnection code is not very clean. This is something I'm hoping to get fixed soon.
[15:56] <kshlm> You may have to restart those client processes to get them to reconnect correctly to the bricks.
===

Restarting the client fixed the issue, and entries in replica started to heal.

Also, another hint from Kaushal:

===
[17:04] <kshlm> This patch https://review.gluster.org/14254 is cleaning up portmap reconnection.
[17:05] <kshlm> It mentions encryption in the title, but the changed to reconnection apply to unencrypted connections as well.
===

The issue happened several times for us, and once it led to split-brain that required manual intervention.

Also, custom gfapi was affected by this as well (however, another volume and another brick was involved), and the app was restarted to fix the issue. Here is short log from this app:

===
Jun 30 11:00:27 1.2.3.22 ottd[6785]: [2016-06-30 11:00:27.021060] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-ott_chunks-client-2: server 1.2.3.22:49217 has not responded in the last 10 seconds, disconnecting.
===

Comment 1 Kaushal 2016-06-30 14:28:16 UTC
As Oleksander has already added, https://review.gluster.org/14254, for #1333317 should help fix this issue. The change cleans up the portmap query/reconnection process in the client xlator.

I'm adding this bug as a dependency onto #1333317 to help track this for now.

Comment 2 Kaushal 2017-03-08 10:49:21 UTC
This bug is getting closed because GlusteFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.


Note You need to log in before you can comment on or make changes to this bug.