Description of problem:
========================
When the same file is modified from different clients, using truncate or other tools such as dd, the operation hangs on the client(s). The brick log also shows the following messages:

[2016-11-22 11:34:52.002005] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2df4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (tcp.cfops-server)
[2016-11-22 11:34:52.002576] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f13da709d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18689) [0x7f13da2a9689] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f13da29a186] ) 0-: Reply submission failed
[2016-11-22 11:34:52.002638] I [MSGID: 115013] [server-helpers.c:296:do_fd_cleanup] 0-cfops-server: fd cleanup on /file.1
[2016-11-22 11:34:52.002950] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-cfops-server: Shutting down connection dhcp35-126.lab.eng.blr.redhat.com-14670-2016/11/22-11:07:32:276386-cfops-client-0-0-0
The message "I [MSGID: 115013] [server-helpers.c:296:do_fd_cleanup] 0-cfops-server: fd cleanup on /file.1" repeated 2 times between [2016-11-22 11:34:52.002638] and [2016-11-22 11:34:52.002680]
[2016-11-22 11:36:59.289861] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-cfops-server: accepted client from dhcp35-126.lab.eng.blr.redhat.com-2527-2016/11/22-11:36:57:368392-cfops-client-0-0-0 (version: 3.8.4)
[2016-11-22 11:46:57.012641] E [inodelk.c:304:__inode_unlock_lock] 0-cfops-locks: Matching lock not found for unlock 0-9223372036854775807, by 6c6091ed947f0000 on 0x7f13d4008e60
[2016-11-22 11:46:57.012727] E [MSGID: 136002] [decompounder.c:370:dc_finodelk_cbk] 0-cfops-decompounder: fop number 2 failed. Unwinding. [Invalid argument]
[2016-11-22 11:46:57.012975] E [MSGID: 115090] [server-rpc-fops.c:2087:server_compound_cbk] 0-cfops-server: 533: COMPOUND0 (820c579f-9c93-4716-8009-79b6cda76672) ==> (Invalid argument) [Invalid argument]

Version-Release number of selected component (if applicable):
3.8.4-5

How reproducible:
================
Reproduced with a hit ratio of about 50%; seen at least 4-5 times.

Steps to Reproduce:
1. Create a 1x2 volume and enable compound fops.
2. Mount the volume on two clients.
3. From both clients, try to truncate the same file to different sizes.

One of the clients hangs, or sometimes both. The brick logs show some cfops errors.
[root@dhcp35-37 ~]# gluster v status cfops
Status of volume: cfops
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/cfops         49153     0          Y       26953
Brick 10.70.35.116:/rhs/brick2/cfops        49153     0          Y       5548
Self-heal Daemon on localhost               N/A       N/A        Y       26973
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       22350
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       24188
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       22694
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       5570
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       19391

Task Status of Volume cfops
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-37 ~]# gluster v info cfops

Volume Name: cfops
Type: Replicate
Volume ID: d4fab55e-8d96-4675-aa69-664b26170a16
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/cfops
Brick2: 10.70.35.116:/rhs/brick2/cfops
Options Reconfigured:
cluster.shd-max-threads: 4
cluster.use-compound-fops: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-37 ~]#
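
For anyone trying to reproduce this, the steps above can be sketched roughly as follows. This is only an illustration, not the exact commands used in the original run: the mount point /mnt/cfops, the iteration count, the sizes, and the truncate_loop helper are all assumptions. The brick paths and the cluster.use-compound-fops option are taken from the volume info above. The setup commands are left as comments so that the helper can be tried on any local file first.

```shell
#!/bin/sh
# Volume setup, run once on a server node (assumed invocation; bricks and
# option name are from the "gluster v info" output in this report):
#   gluster volume create cfops replica 2 \
#       10.70.35.37:/rhs/brick2/cfops 10.70.35.116:/rhs/brick2/cfops
#   gluster volume start cfops
#   gluster volume set cfops cluster.use-compound-fops on
#
# On each of the two clients (mount point is an assumption):
#   mount -t glusterfs 10.70.35.37:/cfops /mnt/cfops

# Hypothetical helper: truncate one file repeatedly, alternating between
# two sizes. Run it on both clients at the same time against the same file.
truncate_loop() {
    file=$1      # file to truncate, e.g. /mnt/cfops/file.1
    count=$2     # number of truncate calls to issue
    size_a=$3    # size in bytes used on even iterations
    size_b=$4    # size in bytes used on odd iterations
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ $((i % 2)) -eq 0 ]; then
            truncate -s "$size_a" "$file" || return 1
        else
            truncate -s "$size_b" "$file" || return 1
        fi
        i=$((i + 1))
    done
}

# Example invocation on a client (values are illustrative):
#   truncate_loop /mnt/cfops/file.1 1000 1024 4096
```

Giving each client a different pair of sizes keeps the two clients requesting conflicting lengths for the same file, which matches step 3. If the hang occurs, the truncate call on one or both clients should simply not return.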
Patch https://code.engineering.redhat.com/gerrit/91332 fixes this bug and is merged now. Moving this bug to MODIFIED state.
Hit this while validating RFE 1360978 - [RFE] Reducing number of network round trips.
Ran the test case for which this BZ was raised on 3.8.4-8 and no longer see the hang. Also tried with a brick down and with compound fops not enabled. All cases passed, hence moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html