| Summary: | Infiniteband errors on large file find | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Allen Lu <allen> | ||||||||||||
| Component: | ib-verbs | Assignee: | Raghavendra G <raghavendra> | ||||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |||||||||||||
| Severity: | low | Docs Contact: | |||||||||||||
| Priority: | medium | ||||||||||||||
| Version: | 3.1.0 | CC: | anush, gluster-bugs, mrobbert, rahulcs, vijay | ||||||||||||
| Target Milestone: | --- | ||||||||||||||
| Target Release: | --- | ||||||||||||||
| Hardware: | x86_64 | ||||||||||||||
| OS: | Linux | ||||||||||||||
| Whiteboard: | |||||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||
| Clone Of: | Environment: | ||||||||||||||
| Last Closed: | Type: | --- | |||||||||||||
| Regression: | RTNR | Mount Type: | fuse | ||||||||||||
| Documentation: | DNR | CRM: | |||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
| Attachments: |
|
||||||||||||||
|
Description
Mike Robbert
2010-11-05 19:08:15 UTC
Created attachment 368 Created attachment 369 Created attachment 370 Created attachment 371 Created attachment 372 One final comment before I leave for the weekend. At first I thought that the errors I was seeing may have been due to the fact that the directory I was looking at had been untouched on the servers since I was running 3.0.5 (which by the way ran quite well on the same hardware) and I've done lots of changes and tests to the server code and maybe there was some kind of hash change since then. so, on a whim I decided to try to untar a similar directory, openmpi-1.4.2.tar.bz2. I started the process and after a couple of files it appeared to hang for a minute or so and similar messages showed up in the logs before it did a few more files. so, writing new files to the volume works no better than reading. Filing on behalf of Michael who will research more to get reproducible steps. Today I did a test that failed and produced lots of log output. I tried to do a find on a large directory in order to produce a file list of that directory. I have a copy of this directory on Lustre and a find there took 22 seconds. When I did it on Gluster it sat there for a minute or two then exited with an error after producing no output. [root@compute-8-27 imfergus]# find wy1980/ > /tmp/wy1980.filelist find: wy1980/: Invalid argument Here are the logs: [2010-11-03 12:21:17.302496] E [client-handshake.c:116:rpc_client_ping_timer_expired] gluster-test-client-0: Server 172.16.2.4:24009 has not responded in the last 42 seconds, disconnecting. [2010-11-03 12:21:17.303459] E [rdma.c:4443:rdma_event_handler] rpc-transport/rdma: gluster-test-client-0: pollin received on tcp socket (peer: 172.16.2.4:24009) after handshake is complete [2010-11-03 12:21:17.304020] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(OPENDIR(20)) called at 2010-11-03 12:20:14.912201 [2010-11-03 12:21:17.304085] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-11-03 12:20:35.261201 [2010-11-03 12:21:17.315900] E [client3_1-fops.c:1857:client3_1_lookup_cbk] : error [2010-11-03 12:21:17.315927] I [dht-common.c:369:dht_revalidate_cbk] gluster-test-dht: subvolume gluster-test-client-0 returned -1 (Invalid argument) [2010-11-03 12:21:17.315971] I [client-handshake.c:993:select_server_supported_programs] gluster-test-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310) [2010-11-03 12:21:17.316147] I [client-handshake.c:829:client_setvolume_cbk] gluster-test-client-0: Connected to 172.16.2.4:24009, attached to remote volume '/mnt/brick1'. [2010-11-03 12:22:41.389021] E [client-handshake.c:116:rpc_client_ping_timer_expired] gluster-test-client-2: Server 172.16.2.6:24009 has not responded in the last 42 seconds, disconnecting. [2010-11-03 12:22:41.389132] E [client-handshake.c:116:rpc_client_ping_timer_expired] gluster-test-client-3: Server 172.16.2.7:24009 has not responded in the last 42 seconds, disconnecting. [2010-11-03 12:22:41.389163] E [client-handshake.c:116:rpc_client_ping_timer_expired] gluster-test-client-1: Server 172.16.2.5:24009 has not responded in the last 42 seconds, disconnecting. [2010-11-03 12:22:41.389211] E [rdma.c:4443:rdma_event_handler] rpc-transport/rdma: gluster-test-client-1: pollin received on tcp socket (peer: 172.16.2.5:24009) after handshake is complete [2010-11-03 12:22:41.389565] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-03 12:21:17.315689 [2010-11-03 12:22:41.389665] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-11-03 12:21:59.346239 [2010-11-03 12:22:41.389688] E [rdma.c:4443:rdma_event_handler] rpc-transport/rdma: gluster-test-client-3: pollin received on tcp socket (peer: 172.16.2.7:24009) after handshake is complete [2010-11-03 12:22:41.389920] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-03 12:21:17.312567 [2010-11-03 12:22:41.389963] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-11-03 12:21:59.346230 [2010-11-03 12:22:41.389982] E [rdma.c:4443:rdma_event_handler] rpc-transport/rdma: gluster-test-client-2: pollin received on tcp socket (peer: 172.16.2.6:24009) after handshake is complete [2010-11-03 12:22:41.390209] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-03 12:21:17.312555 [2010-11-03 12:22:41.402188] I [dht-layout.c:588:dht_layout_normalize] gluster-test-dht: found anomalies in /imfergus/wy1980. holes=1 overlaps=0 [2010-11-03 12:22:41.402244] W [fuse-bridge.c:180:fuse_entry_cbk] glusterfs-fuse: 243: LOOKUP() /imfergus/wy1980 => -1 (Stale NFS file handle) [2010-11-03 12:22:41.402385] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2b42825bb689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2b42825bae2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2b42825bad9e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-11-03 12:21:59.346216 [2010-11-03 12:22:41.404245] I [client-handshake.c:993:select_server_supported_programs] gluster-test-client-2: Using Program GlusterFS-3.1.0, Num (1298437), Version (310) [2010-11-03 12:22:41.404517] I [client-handshake.c:829:client_setvolume_cbk] gluster-test-client-2: Connected to 172.16.2.6:24009, attached to remote volume '/mnt/brick3'. [2010-11-03 12:22:41.404547] I [client-handshake.c:698:client_post_handshake] gluster-test-client-2: 1 fds open - Delaying child_up until they are re-opened [2010-11-03 12:22:41.405754] E [client3_1-fops.c:1857:client3_1_lookup_cbk] : error [2010-11-03 12:22:41.405996] I [client-handshake.c:993:select_server_supported_programs] gluster-test-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310) [2010-11-03 12:22:41.406059] I [client-handshake.c:993:select_server_supported_programs] gluster-test-client-3: Using Program GlusterFS-3.1.0, Num (1298437), Version (310) [2010-11-03 12:22:41.406131] E [client3_1-fops.c:1857:client3_1_lookup_cbk] : error [2010-11-03 12:22:41.406152] W [fuse-bridge.c:180:fuse_entry_cbk] glusterfs-fuse: 245: LOOKUP() /imfergus/wy1980 => -1 (Invalid argument) [2010-11-03 12:22:41.406211] I [client-handshake.c:829:client_setvolume_cbk] gluster-test-client-1: Connected to 172.16.2.5:24009, attached to remote volume '/mnt/brick2'. [2010-11-03 12:22:41.406244] I [client-handshake.c:698:client_post_handshake] gluster-test-client-1: 1 fds open - Delaying child_up until they are re-opened [2010-11-03 12:22:41.406298] I [client-handshake.c:829:client_setvolume_cbk] gluster-test-client-3: Connected to 172.16.2.7:24009, attached to remote volume '/mnt/brick4'. [2010-11-03 12:22:41.406313] I [client-handshake.c:698:client_post_handshake] gluster-test-client-3: 1 fds open - Delaying child_up until they are re-opened Any thoughts on what is going on? ls and cd on all directories above this are working fine today. There are no logs on the servers. Mike *** Bug 2043 has been marked as a duplicate of this bug. *** PATCH: http://patches.gluster.com/patch/5656 in master (rpc-transport/rdma: set the default quota of recv buffers to 128.) Hi Michael/Allen, Can you check whether this patch fixes the issue? We are not able to reproduce the issue locally. regards, Raghavendra |