Bug 1370683

Summary: "rpc_clnt_ping_timer_expired" when copying big file (reproducible)
Product: [Community] GlusterFS Reporter: Christopher Pereira <kripper>
Component: fuseAssignee: bugs <bugs>
Status: CLOSED EOL QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 3.7.11CC: bugs, kaushal, rgowdapp
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: rpc-ping-timeout
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-08 10:52:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Christopher Pereira 2016-08-27 01:04:00 UTC
Description of problem:

I'm copying a 3.9 G file to a gluster mount (served on a remote host) and the transfer is *always* interrupted after transfering 958 MB with a "rpc_clnt_ping_timer_expired" error.

I thought it was a network problem, but after retrying 5 times with same result, it seems to be a bug in GlusterFS.

Network admins confirmed they are not restricting the transfer. Besides, transferring other big files doesn't trigger this problem.

Version-Release number of selected component (if applicable):

glusterfs-3.7.11-1.el7.x86_64
glusterfs-fuse-3.7.11-1.el7.x86_64
glusterfs-libs-3.7.11-1.el7.x86_64
glusterfs-client-xlators-3.7.11-1.el7.x86_64
glusterfs-cli-3.7.11-1.el7.x86_64
glusterfs-api-3.7.11-1.el7.x86_64
glusterfs-server-3.7.11-1.el7.x86_64

Linux h8.imatronix.com 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

CentOS Linux release 7.2.1511 (Core)

How reproducible:

cp -a big.qcow2 /mnt/backups/destfile'

Actual results:

Transfer is interrupted after aprox. 958 MB:

Test 1: transferred 980864 (x 1024 bytes)
Test 2: transferred 980992 (x 1024 bytes)
Other tests are very similar

Expected results:

Complete 'cp' process.

Additional info:

Using rsync instead of cp works, but dmesg reports:
kernel: INFO: task rsync:12452 blocked for more than 120 seconds.

I have been copying big files for half a year without this issue.
This is a production environment, so packages have not been updated all this time.

I'm reporting because this seems to be the kind of bug that turns chronic (stays present for many releases) since it looks more like a network problem and because of the hassle to reproduce...but I'm able to reproduce and will keep the offending file for some time in case it is needed.

Logs:

[2016-08-27 00:42:01.722293] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-backups-h9-client-0: server <host-9>:49152 has not responded in the last 42 seconds, disconnecting.
[2016-08-27 00:42:01.724838] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.354197 (xid=0x5dfbd8)
[2016-08-27 00:42:01.724894] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-backups-h9-client-0: remote operation failed [Transport endpoint is not connected]
[2016-08-27 00:42:01.725230] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.453085 (xid=0x5dfbd9)
[2016-08-27 00:42:01.725524] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.454572 (xid=0x5dfbda)
[2016-08-27 00:42:01.725803] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.491924 (xid=0x5dfbdb)
[2016-08-27 00:42:01.726074] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.615236 (xid=0x5dfbdc)
[2016-08-27 00:42:01.726345] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.616659 (xid=0x5dfbdd)
[2016-08-27 00:42:01.726625] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.722719 (xid=0x5dfbde)
[2016-08-27 00:42:01.726897] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.724086 (xid=0x5dfbdf)
[2016-08-27 00:42:01.727159] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-08-27 00:41:13.761585 (xid=0x5dfbe0)
[2016-08-27 00:42:01.737321] I [socket.c:3304:socket_submit_request] 0-backups-h9-client-0: not connected (priv->connected = 0)
[2016-08-27 00:42:01.737364] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe2 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.737675] W [fuse-bridge.c:2302:fuse_writev_cbk] 0-glusterfs-fuse: 9885708: WRITE => -1 (Transport endpoint is not connected)
[2016-08-27 00:42:01.737758] W [MSGID: 114029] [client-rpc-fops.c:4415:client3_3_writev] 0-backups-h9-client-0: failed to send the fop
[2016-08-27 00:42:01.738026] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f63bfca0ae2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f63bfa6b90e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f63bfa6ba1e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f63bfa6d40a] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f63bfa6dc38] ))))) 0-backups-h9-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-08-27 00:41:19.712717 (xid=0x5dfbe1)
[2016-08-27 00:42:01.738056] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk] 0-backups-h9-client-0: socket disconnected
[2016-08-27 00:42:01.738082] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-backups-h9-client-0: disconnected from backups-h9-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2016-08-27 00:42:01.738331] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe3 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738420] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe4 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738492] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe5 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738568] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe6 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.738980] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739572] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe8 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739639] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbe9 Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739680] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbea Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739718] W [rpc-clnt.c:1606:rpc_clnt_submit] 0-backups-h9-client-0: failed to submit rpc-request (XID: 0x5dfbeb Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (backups-h9-client-0)
[2016-08-27 00:42:01.739760] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 9885709: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2016-08-27 00:42:01.747159] I [rpc-clnt.c:1868:rpc_clnt_reconfig] 0-backups-h9-client-0: changing port to 49152 (from 0)
[2016-08-27 00:42:01.761133] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-backups-h9-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-08-27 00:42:01.763595] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-backups-h9-client-0: Connected to backups-h9-client-0, attached to remote volume '/home/datacenter/gluster-bricks/backups-h9'.
[2016-08-27 00:42:01.763639] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-backups-h9-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2016-08-27 00:42:01.763663] I [MSGID: 114042] [client-handshake.c:1056:client_post_handshake] 0-backups-h9-client-0: 3 fds open - Delaying child_up until they are re-opened
[2016-08-27 00:42:01.771345] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 0-backups-h9-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2016-08-27 00:42:01.773563] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-backups-h9-client-0: Server lk version = 1
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-backups-h9-client-0: remote operation failed [Transport endpoint is not connected]" repeated 18 times between [2016-08-27 00:42:01.724894] and [2016-08-27 00:42:01.739734]
The message "W [MSGID: 114029] [client-rpc-fops.c:4415:client3_3_writev] 0-backups-h9-client-0: failed to send the fop" repeated 9 times between [2016-08-27 00:42:01.737758] and [2016-08-27 00:42:01.739809]

Comment 1 Kaushal 2016-08-30 12:32:09 UTC
Hi,

It would help us if you could provide the following additional information.

- The volume type of the volume to which the file is being copied to. The output of `gluster volume info` will be good.

- Is the timer expiry always for the same server? In the logs you've pasted, <host-9> has timed out.

- The brick logs from <host-9> or the server which timed-out will be helpful. It can help identify why the brick is blocked.

- Just to clarify again, it's just this one file which causes problems, and nothing else.

Thanks.

Comment 2 Christopher Pereira 2016-08-30 19:28:59 UTC
Hi Kaushal,

1) Volume Type:

Volume Name: backups-h9
Type: Distribute
Volume ID: 782b8005-6db7-4b91-9854-a9a7ae326fef
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: <host-9>:/home/datacenter/gluster-bricks/backups-h9
Options Reconfigured:
performance.readdir-ahead: on

2) Yes, <host-9> is the gluster server.

3) Brick logs (mount point is on <host-8>):

[2016-08-27 00:27:00.792426] W [socket.c:1236:__socket_read_simple_msg] 0-tcp.backups-h9-server: reading from socket failed. Error (Connection timed out), peer (200.63.97.100:65526)
[2016-08-27 00:27:00.792708] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-backups-h9-server: disconnecting connection from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-115
[2016-08-27 00:27:00.792761] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:27:00.792807] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp
[2016-08-27 00:27:00.792931] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-backups-h9-server: Shutting down connection <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-115
[2016-08-27 00:27:44.561842] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-backups-h9-server: accepted client from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116 (version: 3.7.11)
[2016-08-27 00:27:00.792794] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:48.792359] W [socket.c:1236:__socket_read_simple_msg] 0-tcp.backups-h9-server: reading from socket failed. Error (Connection timed out), peer (200.63.97.100:65525)
[2016-08-27 00:45:48.792446] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-backups-h9-server: disconnecting connection from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116
[2016-08-27 00:45:48.792500] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:48.792542] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp3
[2016-08-27 00:45:48.792663] I [MSGID: 101055] [client_t.c:420:gf_client_unref] 0-backups-h9-server: Shutting down connection <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-116
[2016-08-27 00:45:48.792531] I [MSGID: 115013] [server-helpers.c:294:do_fd_cleanup] 0-backups-h9-server: fd cleanup on /ixdc/images/snapshots/ixdc.vda.dOUjqL.tmp2
[2016-08-27 00:45:58.712780] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-backups-h9-server: accepted client from <host-8>-4531-2016/05/18-13:54:19:979576-backups-h9-client-0-0-117 (version: 3.7.11)
[2016-08-29 06:07:02.214318] I [MSGID: 100011] [glusterfsd.c:1323:reincarnate] 0-glusterfsd: Fetching the volume file from server...

4) Yes, file transfers have been working for months without issues and this is the only conflicting file. It fails each time I retry to 'cp' it.

Do you see something suspicious?

Comment 3 Christopher Pereira 2016-08-30 19:33:57 UTC
PS: I have no control or information about the firewalls in the middle. Maybe there is some software detecting a given byte sequence and dropping the connection.

Comment 4 Kaushal 2017-03-08 10:52:33 UTC
This bug is getting closed because GlusteFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.