Description of problem:

When I run bonnie++ on my 1x2 NFS mounted volume I get:

Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...Can't write block.: Input/output error
Can't write block 1433870.

real    35m28.539s
user    0m0.308s
sys     0m8.005s

In /var/log/messages on the server being mounted I saw:

Jun 21 10:42:29 storage-qe03 GlusterFS[7288]: [2013-06-21 14:42:29.488010] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-0: cannot lookup the saved frame corresponding to xid (137339)
Jun 21 10:43:02 storage-qe03 logger: 2013-06-21 10:43:02 /usr/bin/rhts-test-runner.sh 86519 2640 hearbeat...
Jun 21 10:43:13 storage-qe03 GlusterFS[7288]: [2013-06-21 14:43:13.537970] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-1: cannot lookup the saved frame corresponding to xid (137500)

And on the client I saw:

Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK
Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK
Jun 21 10:43:19 storage-qe05 kernel: nfs: server storage-qe03.lab.eng.rdu2.redhat.com OK

I have reproduced this on two different sets of bare-metal systems and have attached sosreports from the second reproduction attempt.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.10rhs-1.el6rhs.x86_64

How reproducible:
Both times I have tried so far.

Steps to Reproduce:
1. Run bonnie++ on a 1x2 NFS mounted volume from a RHEL 6.4 client

Actual results:
I/O error on the client.

Expected results:
Normal bonnie++ operation.

Additional info:
Sosreports attached.
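For reference, a minimal sketch of this kind of setup; the hostnames, brick paths, and mount point below are placeholders, not the exact values from the sosreports:

# On the RHS servers: create and start a 1x2 replicated volume
# (placeholder hostnames and brick paths).
gluster volume create testvol replica 2 \
    storage-qe03:/bricks/testvol_brick0 \
    storage-qe04:/bricks/testvol_brick1
gluster volume start testvol

# On the RHEL 6.4 client: mount the volume over Gluster NFS (NFSv3 over TCP)
# and run bonnie++ as root against the mount.
mount -t nfs -o vers=3,tcp storage-qe03:/testvol /mnt/nfsdir
time bonnie++ -u root -d /mnt/nfsdir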
More errors from the server being mounted:

[2013-06-21 14:42:29.010032] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137608x sent = 2013-06-21 14:12:28.982549. timeout = 1800
[2013-06-21 14:42:29.486573] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137607x sent = 2013-06-21 14:12:28.975136. timeout = 1800
[2013-06-21 14:42:29.488010] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-testvol-client-0: cannot lookup the saved frame corresponding to xid (137339)
[2013-06-21 14:42:29.488132] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137606x sent = 2013-06-21 14:12:28.967714. timeout = 1800
[2013-06-21 14:42:29.488328] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137605x sent = 2013-06-21 14:12:28.960302. timeout = 1800
[2013-06-21 14:42:29.488404] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2013-06-21 14:12:29.057388 (xid=0x137609x)
[2013-06-21 14:42:29.493496] E [rpc-clnt.c:207:call_bail] 0-testvol-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x137604x sent = 2013-06-21 14:12:28.952882. timeout = 1800
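The "timeout = 1800" in the call_bail messages appears to correspond to the client-stack RPC frame timeout (the Gluster NFS server acts as a client to the bricks). A rough sketch of how that could be inspected or bumped while debugging; the option name is standard, but the value below is only illustrative, and raising it would not address whatever made the bricks stop responding:

# Reconfigured options (including network.frame-timeout, if set) show up here.
gluster volume info testvol

# Illustrative only: raising the frame timeout delays the bail-out, it does
# not fix the underlying brick disconnects.
gluster volume set testvol network.frame-timeout 3600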
Created attachment 763893 [details] sos server mounted NFS
Created attachment 763894 [details] Sos server
Created attachment 763895 [details] sos client
In the NFS server log, I see the bricks are not connected. Log snippet:
============

0-testvol-client-0
------------------
[2013-06-21 14:42:29.559517] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-testvol-client-0: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:42:29.559565] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2013-06-21 14:12:36.611217 (xid=0x138333x)
[2013-06-21 14:42:29.559572] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-testvol-client-0: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:42:29.559591] W [rpc-clnt.c:1487:rpc_clnt_submit] 0-testvol-client-0: failed to submit rpc-request (XID: 0x156467x Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (testvol-client-0)

0-testvol-client-1
------------------
[2013-06-21 14:43:15.323237] W [client-rpc-fops.c:1811:client3_3_fxattrop_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2013-06-21 14:43:15.323252] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f9daeb6def8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f9daeb6db23] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f9daeb6d23e]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2013-06-21 14:43:13.538009 (xid=0x155907x)

Could you check why they were down?

Thanks,
Santosh
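(For reference, a few commands that could be run on the servers to check whether the brick processes were actually down around 14:42; the volume name is from this report, and the log path is the usual default, so adjust as needed.)

# Brick, NFS server, and self-heal daemon status for the volume.
gluster volume status testvol

# Confirm the brick processes themselves are still running.
ps aux | grep glusterfsd

# Look for disconnects or crashes around the failure window in the brick logs.
grep -iE "disconnect|crash|shutting down" /var/log/glusterfs/bricks/*.log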
I tried with the latest build, i.e. glusterfs-devel-3.4.0.12rhs-1.

My bonnie++ test PASSED with the following results on a 1x2 distributed-replicate volume.

===========Test result start=============
# /opt/qa/tools/bonnie++ -u root -d /mnt/nfsdir
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
santosh-1.locald 4G  1527  98 23227   3  3847   1  3155  99 288897  31  1325  98
Latency             12597us    5126ms    6014ms    9277us    6590us    2987ms
Version  1.96       ------Sequential Create------ --------Random Create--------
santosh-1.localdoma -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   170   2   553   8   491   6   253   3  1225   9   772   6
Latency             51831us    1127ms   74608us   52319us    8247us   64872us
1.96,1.96,santosh-1.localdomain,1,1372295002,4G,,1527,98,23227,3,3847,1,3155,99,288897,31,1325,98,16,,,,,170,2,553,8,491,6,253,3,1225,9,772,6,12597us,5126ms,6014ms,9277us,6590us,2987ms,51831us,1127ms,74608us,52319us,8247us,64872us
===============Test result End====================

Could you retry the test case? I am moving it to ON_QA for the time being; please feel free to reopen if you hit the issue again.

Thanks,
Santosh
I saw this problem in the glusterfs-3.4.0.10rhs-2.el6rhs.x86_64 and glusterfs-3.4.0.11rhs-2.el6rhs.x86_64 builds, but I haven't hit it in daily runs since the glusterfs-devel-3.4.0.12rhs-1 build dropped. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html