Bug 1397364 - [compound FOPs]: file operation hangs with compound fops
Summary: [compound FOPs]: file operation hangs with compound fops
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Krutika Dhananjay
QA Contact: nchilaka
URL:
Whiteboard:
Depends On:
Blocks: 1351528
 
Reported: 2016-11-22 11:54 UTC by nchilaka
Modified: 2017-03-23 06:20 UTC (History)

Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 06:20:45 UTC




Links
System: Red Hat Product Errata
ID: RHSA-2017:0486
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description nchilaka 2016-11-22 11:54:00 UTC
Description of problem:
========================
When I try to modify the same file from different clients, using truncate or other tools such as dd,
the operation hangs on the client(s).
The brick log also shows the following messages:


[2016-11-22 11:34:52.002005] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x2df4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (tcp.cfops-server)
[2016-11-22 11:34:52.002576] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f13da709d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18689) [0x7f13da2a9689] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f13da29a186] ) 0-: Reply submission failed
[2016-11-22 11:34:52.002638] I [MSGID: 115013] [server-helpers.c:296:do_fd_cleanup] 0-cfops-server: fd cleanup on /file.1
[2016-11-22 11:34:52.002950] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-cfops-server: Shutting down connection dhcp35-126.lab.eng.blr.redhat.com-14670-2016/11/22-11:07:32:276386-cfops-client-0-0-0
The message "I [MSGID: 115013] [server-helpers.c:296:do_fd_cleanup] 0-cfops-server: fd cleanup on /file.1" repeated 2 times between [2016-11-22 11:34:52.002638] and [2016-11-22 11:34:52.002680]
[2016-11-22 11:36:59.289861] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-cfops-server: accepted client from dhcp35-126.lab.eng.blr.redhat.com-2527-2016/11/22-11:36:57:368392-cfops-client-0-0-0 (version: 3.8.4)
[2016-11-22 11:46:57.012641] E [inodelk.c:304:__inode_unlock_lock] 0-cfops-locks:  Matching lock not found for unlock 0-9223372036854775807, by 6c6091ed947f0000 on 0x7f13d4008e60
[2016-11-22 11:46:57.012727] E [MSGID: 136002] [decompounder.c:370:dc_finodelk_cbk] 0-cfops-decompounder: fop number 2 failed. Unwinding. [Invalid argument]
[2016-11-22 11:46:57.012975] E [MSGID: 115090] [server-rpc-fops.c:2087:server_compound_cbk] 0-cfops-server: 533: COMPOUND0 (820c579f-9c93-4716-8009-79b6cda76672) ==> (Invalid argument) [Invalid argument]
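
To pull only the error-level entries out of a brick log like the excerpt above, a small filter can help. This is a rough sketch, not a Gluster tool: the regular expression is inferred only from the lines shown in this report ([timestamp], one-letter level, optional [MSGID: n], [file:line:function], message), and the names LOG_LINE and error_entries are made up for illustration.

```python
import re

# Pattern inferred from the log lines in this report (an assumption,
# not an official Gluster log grammar).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] "            # [2016-11-22 11:46:57.012727]
    r"(?P<level>[A-Z]) "               # E, I, W, ...
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"  # optional [MSGID: 136002]
    r"\[(?P<src>[^\]]+)\] "            # [decompounder.c:370:dc_finodelk_cbk]
    r"(?P<msg>.*)$"                    # rest of the line
)


def error_entries(lines):
    """Return a dict per line whose level is 'E'; non-matching lines are skipped."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") == "E":
            entries.append(match.groupdict())
    return entries
```

Lines that do not follow this shape, such as the 'The message "..." repeated 2 times' summary line, are ignored rather than parsed.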



Version-Release number of selected component (if applicable):
3.8.4-5


How reproducible:
================
Reproducible about 50% of the time.
Seen at least 4-5 times.

Steps to Reproduce:
1. Create a 1x2 volume and enable compound fops.
2. Mount the volume on two clients.
3. From both clients, try to truncate the same file to different sizes.



One of the clients hangs, or sometimes both.

Brick logs show some cfops errors.
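
The client side of the steps above can be approximated with a small script started on each client at about the same time, each with a different size. This is only a sketch under assumptions: the function name truncate_with_watchdog, the default round count, and the timeout are invented for illustration and are not part of the original test, which used truncate and dd directly. The watchdog thread is there so that a hung truncate is reported instead of blocking the script.

```python
import os
import threading


def truncate_with_watchdog(path, size, rounds=100, timeout=60.0):
    """Truncate `path` to `size` bytes `rounds` times in a worker thread.

    Returns "ok" if all truncates returned, "error" if one raised OSError,
    and "hung" if the worker did not finish within `timeout` seconds.
    """
    errors = []

    def work():
        try:
            for _ in range(rounds):
                os.truncate(path, size)
        except OSError as exc:
            errors.append(exc)

    # Daemon thread: a truncate stuck in the kernel should not keep
    # the interpreter from exiting after the result is reported.
    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        return "hung"
    return "error" if errors else "ok"
```

For example, one client could call truncate_with_watchdog("/mnt/cfops/file.1", 1024) while the other calls it with 4096 on the same file (the mount point /mnt/cfops is hypothetical). A "hung" result on either client would correspond to the behaviour reported here.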

Comment 3 nchilaka 2016-11-22 13:00:12 UTC
[root@dhcp35-37 ~]# gluster v status cfops
Status of volume: cfops
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/cfops         49153     0          Y       26953
Brick 10.70.35.116:/rhs/brick2/cfops        49153     0          Y       5548 
Self-heal Daemon on localhost               N/A       N/A        Y       26973
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       22350
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       24188
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       22694
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       5570 
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       19391
 
Task Status of Volume cfops
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp35-37 ~]# gluster v info cfops
 
Volume Name: cfops
Type: Replicate
Volume ID: d4fab55e-8d96-4675-aa69-664b26170a16
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/cfops
Brick2: 10.70.35.116:/rhs/brick2/cfops
Options Reconfigured:
cluster.shd-max-threads: 4
cluster.use-compound-fops: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
[root@dhcp35-37 ~]#

Comment 6 Krutika Dhananjay 2016-11-28 06:23:52 UTC
Patch https://code.engineering.redhat.com/gerrit/91332 fixes this bug and is merged now. Moving this bug to MODIFIED state.

Comment 8 nchilaka 2016-11-29 06:56:57 UTC
Hit this while validating RFE 1360978 - [RFE]Reducing number of network round trips

Comment 9 nchilaka 2016-12-13 13:08:54 UTC
Ran the test case for which this bz was raised on 3.8.4-8.
No longer seeing the hang.
Also tried with a brick down and with compound fops disabled.
All cases passed, hence moving to verified.

Comment 11 errata-xmlrpc 2017-03-23 06:20:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

