Description of problem:

I'm copying a 3.9 GB file to a gluster mount (served by a remote host), and the transfer is *always* interrupted after transferring about 958 MB with an "rpc_clnt_ping_timer_expired" error. I first thought it was a network problem, but after retrying 5 times with the same result, it seems to be a bug in GlusterFS. The network admins confirmed they are not restricting the transfer. Besides, transferring other big files doesn't trigger this problem.

Version-Release number of selected component (if applicable):

glusterfs-3.7.11-1.el7.x86_64
glusterfs-fuse-3.7.11-1.el7.x86_64
glusterfs-libs-3.7.11-1.el7.x86_64
glusterfs-client-xlators-3.7.11-1.el7.x86_64
glusterfs-cli-3.7.11-1.el7.x86_64
glusterfs-api-3.7.11-1.el7.x86_64
glusterfs-server-3.7.11-1.el7.x86_64
Linux h8.imatronix.com 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.2.1511 (Core)

How reproducible:

cp -a big.qcow2 /mnt/backups/destfile

Actual results:

The transfer is interrupted after approx. 958 MB:
Test 1: transferred 980864 (x 1024 bytes)
Test 2: transferred 980992 (x 1024 bytes)
Other tests are very similar.

Expected results:

The 'cp' process completes.

Additional info:

Using rsync instead of cp works, but dmesg reports:
kernel: INFO: task rsync:12452 blocked for more than 120 seconds.
I have been copying big files for half a year without this issue. This is a production environment, so the packages have not been updated in all that time. I'm reporting this because it seems to be the kind of bug that turns chronic (stays present for many releases), since it looks more like a network problem and is a hassle to reproduce... but I am able to reproduce it and will keep the offending file around for some time in case it is needed.

Logs:

[2016-08-27 00:42:01.722293] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-backups-h9-client-0: server <host-9>:49152 has not responded in the last 42 seconds, disconnecting.
[2016-08-27 00:42:01.724838] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.354197 (xid=0x5dfbd8)
[2016-08-27 00:42:01.724894] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-backups-h9-client-0: remote operation failed [Transport endpoint is not connected]
[2016-08-27 00:42:01.725230] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.453085 (xid=0x5dfbd9)
[2016-08-27 00:42:01.725524] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.454572 (xid=0x5dfbda)
[2016-08-27 00:42:01.725803] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.491924 (xid=0x5dfbdb)
[2016-08-27 00:42:01.726074] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.615236 (xid=0x5dfbdc)
[2016-08-27 00:42:01.726345] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.616659 (xid=0x5dfbdd)
[2016-08-27 00:42:01.726625] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.722719 (xid=0x5dfbde)
[2016-08-27 00:42:01.726897] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.724086 (xid=0x5dfbdf)
[2016-08-27 00:42:01.727159] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.761585 (xid=0x5dfbe0)
[2016-08-27 00:42:01.737321] I [socket.c:3304:socket_submit_request] 0-backups-h9-client-0: not connected (priv->connected = 0)
[2016-08-27 00:42:01.737364] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe2 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.737675] W [fuse-bridge.c:2302:fuse_writev_cbk] 0-glusterfs-fuse: 9885708: WRITE => -1 (Transport endpoint is not connected)
[2016-08-27 00:42:01.737758] W [MSGID: 114029] [client-rpc-fops.c:4415:client3_3_writev] 0-backups-h9-client-0: failed to send the fop
[2016-08-27 00:42:01.738026] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-08-27 00:41:19.712717 (xid=0x5dfbe1)
[2016-08-27 00:42:01.738056] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk] 0-backups-h9-client-0: socket disconnected
[2016-08-27 00:42:01.738082] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-backups-h9-client-0: disconnected from backups-h9-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2016-08-27 00:42:01.738331] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe3 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738420] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe4 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738492] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe5 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738568] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe6 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738980] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739572] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe8 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739639] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe9 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739680] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbea Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739718] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbeb Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739760] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 9885709: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2016-08-27 00:42:01.747159] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-backups-h9-client-0: changing port to 49152 (from 0)
[2016-08-27 00:42:01.761133] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-backups-h9-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-08-27 00:42:01.763595] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-backups-h9-client-0: Connected to backups-h9-client-0, attached to remote volume '/home/datacenter/gluster-bricks/backups-h9'.
[2016-08-27 00:42:01.763639] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-backups-h9-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-08-27 00:42:01.763663] I [MSGID: 114042] [client-handshake.c:1056:client_post_handshake] 0-backups-h9-client-0: 3 fds open - Delaying child_up until they are re-opened
[2016-08-27 00:42:01.771345] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 0-backups-h9-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2016-08-27 00:42:01.773563] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-backups-h9-client-0: Server lk version = 1
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-backups-h9-client-0: remote operation failed [Transport endpoint is not connected]" repeated 18 times between [2016-08-27 00:42:01.724894] and [2016-08-27 00:42:01.739734]
The message "W [MSGID: 114029] [client-rpc-fops.c:4415:client3_3_writev] 0-backups-h9-client-0: failed to send the fop" repeated 9 times between [2016-08-27 00:42:01.737758] and [2016-08-27 00:42:01.739809]
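[Editorial note, not part of the original report: the "42 seconds" in the first log line matches GlusterFS's default client-side ping timeout, which is controlled by the `network.ping-timeout` volume option. A minimal sketch of raising it for this volume, assuming the volume name backups-h9 taken from the log prefixes; note that a larger timeout only widens the window for a stalled brick to respond, it does not address the underlying stall:]

```shell
# Raise the client ping timeout for the backups-h9 volume from the
# default 42 s to 90 s (the value is illustrative, not a recommendation):
gluster volume set backups-h9 network.ping-timeout 90

# Verify the change (it appears under "Options Reconfigured"):
gluster volume info backups-h9
```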
Hi,

It would help us if you could provide the following additional information:

- The volume type of the volume the file is being copied to. The output of `gluster volume info` would be good.
- Is the timer expiry always for the same server? In the logs you've pasted, <host-9> has timed out.
- The brick logs from <host-9> (the server which timed out) would be helpful; they can help identify why the brick is blocked.
- Just to clarify again: it's only this one file which causes problems, and nothing else?

Thanks.
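[Editorial note, not part of the original report: the information requested above can be gathered with commands along these lines. The volume name (backups-h9) and brick path (/home/datacenter/gluster-bricks/backups-h9) come from the log lines earlier in this report; the brick-log filename is an assumption based on the usual convention of the brick path with '/' replaced by '-'. Adjust as needed:]

```shell
# Volume type and brick layout (run on any gluster server node):
gluster volume info backups-h9

# Brick log on the server that timed out (<host-9>); the filename is
# assumed to follow the brick-path-with-dashes convention:
tail -n 500 /var/log/glusterfs/bricks/home-datacenter-gluster-bricks-backups-h9.log
```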
Hi Kaushal,

1) Volume type:

Volume Name: backups-h9
Type: Distribute
Volume ID: 782b8005-6db7-4b91-9854-a9a7ae326fef
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: <host-9>:/home/datacenter/gluster-bricks/backups-h9
Options Reconfigured:
performance.readdir-ahead: on

2) Yes, <host-9> is the gluster server.

3) Brick logs (the mount point is on <host-8>):

[2016-08-27 00:27:00.792426] W [socket.c:1236:__socket_read_simple_msg] 0-tcp.backups-h9-server: reading from socket failed. Error (Connection timed out), peer (200.63.97.100:65526)
[2016-08-27 00:27:00.792708] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-backups-h9-server: disconnecting connection from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-115
[2016-08-27 00:27:00.792761] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:27:00.792807] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp
[2016-08-27 00:27:00.792931] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-backups-h9-server: Shutting down connection <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-115
[2016-08-27 00:27:44.561842] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-backups-h9-server: accepted client from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116 (version: 3.7.11)
[2016-08-27 00:27:00.792794] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:48.792359] W [socket.c:1236:__socket_read_simple_msg] 0-tcp.backups-h9-server: reading from socket failed. Error (Connection timed out), peer (200.63.97.100:65525)
[2016-08-27 00:45:48.792446] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-backups-h9-server: disconnecting connection from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116
[2016-08-27 00:45:48.792500] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:48.792542] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp3
[2016-08-27 00:45:48.792663] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-backups-h9-server: Shutting down connection <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116
[2016-08-27 00:45:48.792531] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:58.712780] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-backups-h9-server: accepted client from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-117 (version: 3.7.11)
[2016-08-29 06:07:02.214318] I [MSGID: 100011] [glusterfsd.c:1323:reincarnate] 0-glusterfsd: Fetching the volume file from server...

4) Yes, file transfers have been working for months without issues and this is the only conflicting file. It fails each time I retry the 'cp'.

Do you see something suspicious?
PS: I have no control over, or information about, the firewalls in the middle. Maybe there is some software detecting a given byte sequence and dropping the connection.
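[Editorial note, not part of the original report: one way to test this hypothesis without GlusterFS in the loop is to carve out the bytes around the ~958 MB stall point and push just that slice through the same network path with another tool (scp or nc), then compare checksums on both ends. A minimal sketch; `big.qcow2`, the 950 MB offset, and the 16 MB slice size are placeholder assumptions, and a sparse stand-in file is created so the sketch runs as-is:]

```shell
# Carve out the byte region around the stall point and checksum it.
# FILE / OFFSET_MB / SLICE_MB are assumptions -- point them at the real
# big.qcow2 and the offset where the transfer actually stalls.
FILE=${FILE:-big.qcow2}
OFFSET_MB=${OFFSET_MB:-950}
SLICE_MB=${SLICE_MB:-16}

# Sparse stand-in file so the sketch is runnable without the real 3.9 GB image.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count=0 seek=$((OFFSET_MB + SLICE_MB)) 2>/dev/null

# Extract the slice and checksum it; then re-send slice.bin over scp or nc
# and compare md5sums on the receiving side.
dd if="$FILE" of=slice.bin bs=1M skip="$OFFSET_MB" count="$SLICE_MB" 2>/dev/null
md5sum slice.bin
```

If the same slice also stalls when sent over scp or nc, the problem is in the network path rather than in GlusterFS; if it goes through cleanly, the byte-sequence hypothesis is weakened.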
This bug is being closed because GlusterFS 3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in a newer GlusterFS release, please reopen it against that release.