Bug 1410425

Summary: [GNFS+EC] Cthon failures/issues with Lock/Special Test cases on disperse volume with GNFS mount
Product: [Community] GlusterFS
Reporter: Pranith Kumar K <pkarampu>
Component: disperse
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: urgent
Version: mainline
CC: amukherj, aspandey, bugs, jahernan, jthottan, kkeithle, msaini, nchilaka, ndevos, pkarampu, rcyriac, rhinduja, rhs-bugs, sarumuga, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1408705
Environment:
Last Closed: 2017-05-30 18:38:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1408705
Bug Blocks:

Comment 1 Pranith Kumar K 2017-01-05 13:17:30 UTC
Looks like there is an issue with posix locks in the EC xlator.

Test #3 - Try to lock just the 1st byte.
	Parent: 3.0  - F_TLOCK [               0,               1] PASSED.
	Child:  3.1  - F_TEST  [               0,               1] PASSED.
	Child:  3.2  - F_TEST  [               0,          ENDING] PASSED.
	Child:  3.3  - F_TEST  [               1,               1] FAILED!
	Child:  **** Expected success, returned EACCES...
	Child:  **** Probably implementation error.

Here the parent process holds a lock starting at byte '0' with length '1'. The child process is trying to take a lock starting at byte '1' with length '1', which should ideally have been granted.


(gdb) bt
#0  client3_3_lk (frame=0x7f8e5388bdbc, this=0x7f8e44020b00, data=0x7f8e48db7ec0) at client-rpc-fops.c:5349
#1  0x00007f8e4836e4be in client_lk (frame=0x7f8e5388bdbc, this=<optimized out>, fd=<optimized out>, cmd=5, 
    lock=<optimized out>, xdata=<optimized out>) at client.c:1652
#2  0x00007f8e4370204d in ec_wind_lk (ec=0x7f8e44067700, fop=0x7f8e380dc06c, idx=0) at ec-locks.c:1003
#3  0x00007f8e436f7f08 in ec_dispatch_mask (fop=fop@entry=0x7f8e380dc06c, mask=63) at ec-common.c:521
#4  0x00007f8e436f8159 in ec_dispatch_all (fop=fop@entry=0x7f8e380dc06c) at ec-common.c:597
#5  0x00007f8e43704280 in ec_manager_lk (fop=0x7f8e380dc06c, state=<optimized out>) at ec-locks.c:1028
#6  0x00007f8e436f779b in __ec_manager (fop=0x7f8e380dc06c, error=0) at ec-common.c:2287
#7  0x00007f8e436f148c in ec_gf_lk (frame=<optimized out>, this=<optimized out>, fd=<optimized out>, 
    cmd=<optimized out>, flock=<optimized out>, xdata=<optimized out>) at ec.c:888
#8  0x00007f8e48141865 in dht_lk (frame=frame@entry=0x7f8e53894808, this=this@entry=0x7f8e440320f0, 
    fd=fd@entry=0x7f8e5609d0f8, cmd=cmd@entry=5, flock=flock@entry=0x7f8e48db87a0, xdata=xdata@entry=0x0)
    at dht-inode-read.c:1067
#9  0x00007f8e55e24e39 in default_lk (frame=0x7f8e53894808, this=<optimized out>, fd=0x7f8e5609d0f8, cmd=5, 
    lock=0x7f8e48db87a0, xdata=0x0) at defaults.c:2524
#10 0x00007f8e43bd39e4 in io_stats_lk (frame=0x7f8e53889a20, this=0x7f8e44033e70, fd=0x7f8e5609d0f8, cmd=5, 
    lock=0x7f8e48db87a0, xdata=0x0) at io-stats.c:3345
#11 0x00007f8e43495752 in nfs_fop_lk (nfsx=<optimized out>, xl=0x7f8e44033e70, nfu=nfu@entry=0x7f8e48db8bc0, 
    fd=0x7f8e5609d0f8, cmd=cmd@entry=5, flock=flock@entry=0x7f8e48db87a0, cbk=0x7f8e434be0c0 <nlm4svc_test_cbk>, 
    local=0x7f8e398cbd04) at nfs-fops.c:1565
#12 0x00007f8e43498105 in nfs_lk (nfsx=<optimized out>, xl=<optimized out>, nfu=nfu@entry=0x7f8e48db8bc0, 
    fd=<optimized out>, cmd=cmd@entry=5, flock=flock@entry=0x7f8e48db87a0, 
    cbk=cbk@entry=0x7f8e434be0c0 <nlm4svc_test_cbk>, local=local@entry=0x7f8e398cbd04) at nfs-generics.c:151
#13 0x00007f8e434be6bc in nlm4_test_fd_resume (carg=carg@entry=0x7f8e398cbd04) at nlm4.c:806
#14 0x00007f8e434be74a in nlm4_test_resume (carg=0x7f8e398cbd04) at nlm4.c:830
#15 0x00007f8e434baa9c in nfs3_fh_resolve_inode_done (cs=cs@entry=0x7f8e398cbd04, inode=inode@entry=0x7f8e41e25128)
    at nfs3-helpers.c:3619
#16 0x00007f8e434bb2db in nfs3_fh_resolve_inode (cs=0x7f8e398cbd04) at nfs3-helpers.c:3828
#17 0x00007f8e434bb385 in nfs3_fh_resolve_resume (cs=cs@entry=0x7f8e398cbd04) at nfs3-helpers.c:3860
#18 0x00007f8e434bb5a8 in nfs3_fh_resolve_root (cs=cs@entry=0x7f8e398cbd04) at nfs3-helpers.c:3915
#19 0x00007f8e434bb7f1 in nfs3_fh_resolve_and_resume (cs=cs@entry=0x7f8e398cbd04, fh=fh@entry=0x7f8e48db98f0, 
    entry=entry@entry=0x0, resum_fn=resum_fn@entry=0x7f8e434be6f0 <nlm4_test_resume>) at nfs3-helpers.c:4011
#20 0x00007f8e434be311 in nlm4svc_test (req=0x7f8e48040b70) at nlm4.c:887
#21 0x00007f8e55b6f775 in rpcsvc_handle_rpc_call (svc=0x7f8e440480e0, trans=trans@entry=0x7f8e44522130, 
    msg=<optimized out>) at rpcsvc.c:695
#22 0x00007f8e55b6f95b in rpcsvc_notify (trans=0x7f8e44522130, mydata=<optimized out>, event=<optimized out>, 
    data=<optimized out>) at rpcsvc.c:789
#23 0x00007f8e55b71893 in rpc_transport_notify (this=this@entry=0x7f8e44522130, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f8e44422750) at rpc-transport.c:538
#24 0x00007f8e4a6632d4 in socket_event_poll_in (this=this@entry=0x7f8e44522130) at socket.c:2267
#25 0x00007f8e4a665785 in socket_event_handler (fd=<optimized out>, idx=31, data=0x7f8e44522130, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#26 0x00007f8e55e05650 in event_dispatch_epoll_handler (event=0x7f8e48db9e80, event_pool=0x7f8e5791df00)
    at event-epoll.c:571
#27 event_dispatch_epoll_worker (data=0x7f8e579708b0) at event-epoll.c:674
#28 0x00007f8e54c0cdc5 in start_thread () from /lib64/libpthread.so.0
#29 0x00007f8e5455173d in clone () from /lib64/libc.so.6
(gdb) f 5
#5  0x00007f8e43704280 in ec_manager_lk (fop=0x7f8e380dc06c, state=<optimized out>) at ec-locks.c:1028
1028	            ec_dispatch_all(fop);
(gdb) l
1023	            }
1024	
1025	        /* Fall through */
1026	
1027	        case EC_STATE_DISPATCH:
1028	            ec_dispatch_all(fop);
1029	
1030	            return EC_STATE_PREPARE_ANSWER;
1031	
1032	        case EC_STATE_PREPARE_ANSWER:
(gdb) f 4
#4  0x00007f8e436f8159 in ec_dispatch_all (fop=fop@entry=0x7f8e380dc06c) at ec-common.c:597
597	                ec_dispatch_mask(fop, fop->remaining);
(gdb) l
592	
593	        if (ec_child_select(fop)) {
594	                fop->expected = gf_bits_count(fop->remaining);
595	                fop->first = 0;
596	
597	                ec_dispatch_mask(fop, fop->remaining);
598	        }
599	}
600	
601	void ec_dispatch_min(ec_fop_data_t * fop)
(gdb) f 3
#3  0x00007f8e436f7f08 in ec_dispatch_mask (fop=fop@entry=0x7f8e380dc06c, mask=63) at ec-common.c:521
521	            fop->wind(ec, fop, idx);
(gdb) l
516	    idx = 0;
517	    while (mask != 0)
518	    {
519	        if ((mask & 1) != 0)
520	        {
521	            fop->wind(ec, fop, idx);
522	        }
523	        idx++;
524	        mask >>= 1;
525	    }
(gdb) f 2
#2  0x00007f8e4370204d in ec_wind_lk (ec=0x7f8e44067700, fop=0x7f8e380dc06c, idx=0) at ec-locks.c:1003
1003	    STACK_WIND_COOKIE(fop->frame, ec_lk_cbk, (void *)(uintptr_t)idx,
(gdb) p fop->flock
$11 = {l_type = 1, l_whence = 0, l_start = 0, l_len = 512, l_pid = 164, l_owner = {len = 36, 
    data = "164.eng.blr.redhat.com", '\000' <repeats 987 times>}}
(gdb) f 8
#8  0x00007f8e48141865 in dht_lk (frame=frame@entry=0x7f8e53894808, this=this@entry=0x7f8e440320f0, 
    fd=fd@entry=0x7f8e5609d0f8, cmd=cmd@entry=5, flock=flock@entry=0x7f8e48db87a0, xdata=xdata@entry=0x0)
    at dht-inode-read.c:1067
1067	        STACK_WIND (frame, dht_lk_cbk, lock_subvol, lock_subvol->fops->lk, fd,
(gdb) p *flock
$12 = {l_type = 1, l_whence = 0, l_start = 1, l_len = 1, l_pid = 164, l_owner = {len = 36, 
    data = "164.eng.blr.redhat.com", '\000' <repeats 987 times>}}

>>>> As we can see above, down to the DHT layer flock->l_start was '1' and l_len was '1'.

(gdb) f 7
#7  0x00007f8e436f148c in ec_gf_lk (frame=<optimized out>, this=<optimized out>, fd=<optimized out>, 
    cmd=<optimized out>, flock=<optimized out>, xdata=<optimized out>) at ec.c:888
888	    ec_lk(frame, this, -1, minimum, default_lk_cbk, NULL, fd, cmd,
(gdb) p *flock
value has been optimized out
(gdb) p fop
No symbol "fop" in current context.
(gdb) l
883	                 int32_t cmd, struct gf_flock * flock, dict_t * xdata)
884	{
885	    int32_t minimum = EC_MINIMUM_ALL;
886	    if (flock->l_type == F_UNLCK)
887	            minimum = EC_MINIMUM_ONE;
888	    ec_lk(frame, this, -1, minimum, default_lk_cbk, NULL, fd, cmd,
889	          flock, xdata);
890	
891	    return 0;
892	}
(gdb) p *flock
value has been optimized out
(gdb) f 6
#6  0x00007f8e436f779b in __ec_manager (fop=0x7f8e380dc06c, error=0) at ec-common.c:2287
2287	        fop->state = fop->handler(fop, fop->state);
(gdb) p fop
$13 = (ec_fop_data_t *) 0x7f8e380dc06c
(gdb) p fop->flock
$14 = {l_type = 1, l_whence = 0, l_start = 0, l_len = 512, l_pid = 164, l_owner = {len = 36, 
    data = "164.eng.blr.redhat.com", '\000' <repeats 987 times>}}
(gdb) 

>>> But in the EC layer, fop->flock.l_start got reset to '0' and the lock length to 512, so the child process trying to take a lock on a non-conflicting range gets access denied.
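
To make the failure mode concrete, here is a minimal standalone C sketch (not the actual ec code; align_range and EC_BLOCK are made-up names for illustration) of what aligning a byte-range lock to 512-byte multiples does to the child's [1,1] request: it becomes [0,512), which overlaps the parent's lock on byte 0 and explains the unexpected EACCES.

#include <stdint.h>
#include <stdio.h>

#define EC_BLOCK 512  /* chunk size assumed for illustration */

/* Hypothetical helper mirroring inodelk-style alignment:
 * expand [start, start+len) to EC_BLOCK boundaries. */
static void align_range(uint64_t *start, uint64_t *len)
{
    uint64_t end = (*len == 0) ? UINT64_MAX : *start + *len; /* len 0 == to EOF */

    *start -= *start % EC_BLOCK;                             /* round start down */
    if (end != UINT64_MAX) {
        end = ((end + EC_BLOCK - 1) / EC_BLOCK) * EC_BLOCK;  /* round end up */
        *len = end - *start;
    } else {
        *len = 0;
    }
}

int main(void)
{
    uint64_t start = 1, len = 1;   /* the child's F_TEST request [1,1] */

    align_range(&start, &len);
    /* Prints "aligned lock: start=0 len=512" - exactly the values seen in
     * fop->flock in the backtrace above, now overlapping the parent's
     * lock on byte 0. */
    printf("aligned lock: start=%llu len=%llu\n",
           (unsigned long long)start, (unsigned long long)len);
    return 0;
}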

Comment 2 Xavi Hernandez 2017-01-05 13:40:01 UTC
The problem is that the locked range is treated exactly as it's done in inodelk (i.e. the range is aligned to multiples of 512). However in this case it doesn't make sense to do that transformation because the range itself doesn't have any meaning for ec (if a later write tries to modify any region of the file, the proper inodelk will be taken).

I think we should remove the transformation and simply pass the input values to the lower subvolumes.
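
A rough sketch of the direction suggested here, assuming the fix is simply to stop rounding application fcntl lock ranges (the helper name below is hypothetical; the real change went into ec-locks.c via the patch referenced later in this bug):

#include <fcntl.h>

/* Hypothetical: how ec should prepare an application's posix lock before
 * winding it to the subvolumes - forward l_start/l_len untouched. Block
 * alignment stays appropriate only for ec's internal inodelk ranges. */
static void ec_prepare_posix_lock(struct flock *out, const struct flock *in)
{
    *out = *in;   /* no 512-byte rounding of the byte range */
}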

Comment 3 Pranith Kumar K 2017-01-05 13:43:42 UTC
(In reply to Xavier Hernandez from comment #2)
> The problem is that the locked range is treated exactly as it's done in
> inodelk (i.e. the range is aligned to multiples of 512). However in this
> case it doesn't make sense to do that transformation because the range
> itself doesn't have any meaning for ec (if a later write tries to modify any
> region of the file, the proper inodelk will be taken).
> 
> I think we should remove the transformation and simply pass the input values
> to the lower subvolumes.

Good that we are on the same page on this. I have the patch ready; I cloned the bug so I could send that patch :-). I wonder what happens with mandatory locking, where a write will be rejected if there is a lock on a region. I will post the patch as soon as I am done finding the answer.

Comment 4 Xavi Hernandez 2017-01-05 13:56:52 UTC
Mandatory locks will need some additional work.

If mandatory locks are handled by the features/locks xlator, it will need some additional info for each write to know the real offset/length of each write. Otherwise I don't see a way to allow a fine grained mandatory lock support for ec.
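
To illustrate the kind of extra information being discussed, here is a small self-contained C sketch (purely hypothetical; the struct and field names are invented and this is not part of any posted patch): a mandatory-lock check would have to look at the application's original offset/length, because the stripe-aligned range that ec actually writes would report false conflicts, just like the fcntl case in comment 1.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical: the range the application asked to write vs. the
 * stripe-aligned range ec would read/modify/write on the bricks. */
struct write_ranges {
    uint64_t user_offset, user_len;     /* what the client requested   */
    uint64_t stripe_offset, stripe_len; /* expanded to full EC stripes */
};

/* Check the mandatory lock [lock_off, lock_off+lock_len) against the
 * user range only; using the stripe range would reject valid writes. */
static int conflicts_with_mandatory_lock(const struct write_ranges *w,
                                         uint64_t lock_off, uint64_t lock_len)
{
    uint64_t w_end = w->user_offset + w->user_len;
    uint64_t l_end = lock_off + lock_len;

    return w->user_offset < l_end && lock_off < w_end;
}

int main(void)
{
    struct write_ranges w = {1, 1, 0, 512};  /* mirrors the numbers above */

    /* A mandatory lock held on byte 0 only: no real conflict with a write
     * to byte 1, even though the stripe-aligned range overlaps it. */
    printf("conflict=%d\n", conflicts_with_mandatory_lock(&w, 0, 1));
    return 0;
}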

Comment 5 Pranith Kumar K 2017-01-05 14:03:36 UTC
(In reply to Xavier Hernandez from comment #4)
> Mandatory locks will need some additional work.
> 
> If mandatory locks are handled by the features/locks xlator, it will need
> some additional info for each write to know the real offset/length of each
> write. Otherwise I don't see a way to allow a fine grained mandatory lock
> support for ec.

Yay! I was thinking of passing that in xdata as well. Okay, for now let's send this off; we can work on that as part of another bug. The only other bug I saw is doing dispatch_all for nonblocking locks. The posix locks xlator does lock merging, so when we lock and unlock it may truncate the lock range.

i.e. if we already have a lock from 0-10 and then take a lock for 5-15 with the same fd/owner, it becomes a single lock 0-15. Now suppose we unlock 5-15 because some other node hit EAGAIN due to a parallel conflicting lock on, say, range 11-12. The resulting unlock of 5-15 will leave only 0-5 locked.
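
To make that merge/truncate behaviour concrete, here is a small standalone C program (illustrative only, not GlusterFS code; the temp-file path and exact byte ranges are arbitrary choices): it takes the two overlapping locks, unlocks 5-15, and then asks the kernel from a child process what is still locked.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void set_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        perror("F_SETLK");
        exit(1);
    }
}

int main(void)
{
    int fd = open("/tmp/lk-merge-demo", O_RDWR | O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    set_lock(fd, F_WRLCK, 0, 11);   /* lock bytes 0-10                      */
    set_lock(fd, F_WRLCK, 5, 11);   /* lock bytes 5-15: coalesced into 0-15 */
    set_lock(fd, F_UNLCK, 5, 11);   /* unlock bytes 5-15                    */

    if (fork() == 0) {
        /* Child: posix locks are per-process, so F_GETLK reports whatever
         * the parent still holds that would conflict with a full-file lock. */
        struct flock q;

        memset(&q, 0, sizeof(q));
        q.l_type = F_WRLCK;
        q.l_whence = SEEK_SET;
        q.l_start = 0;
        q.l_len = 0;                /* 0 == up to end of file */
        if (fcntl(fd, F_GETLK, &q) == -1) {
            perror("F_GETLK");
            _exit(1);
        }
        if (q.l_type == F_UNLCK)
            printf("no conflicting lock left\n");
        else
            printf("remaining lock: start=%lld len=%lld\n",
                   (long long)q.l_start, (long long)q.l_len);
        _exit(0);
    }
    wait(NULL);
    close(fd);
    return 0;
}

On Linux this prints "remaining lock: start=0 len=5", i.e. only bytes 0-4 stay locked, which is why replaying an unlock after another node returns EAGAIN can silently shrink a lock the application still expects to hold.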

In afr the lock is always wound incrementally, one node after the other.

Comment 6 Xavi Hernandez 2017-01-09 07:50:22 UTC
(In reply to Pranith Kumar K from comment #5)
> Yay! I was thinking of passing that in xdata as well. Okay, for now let's
> send this off; we can work on that as part of another bug.

Yes, that can be done in another bug.

> The only other bug I saw is doing dispatch_all for nonblocking locks. The
> posix locks xlator does lock merging, so when we lock and unlock it may
> truncate the lock range.
> 
> i.e. if we already have a lock from 0-10 and then take a lock for 5-15 with
> the same fd/owner, it becomes a single lock 0-15. Now suppose we unlock 5-15
> because some other node hit EAGAIN due to a parallel conflicting lock on,
> say, range 11-12. The resulting unlock of 5-15 will leave only 0-5 locked.
> 
> In afr the lock is always wound incrementally, one node after the other.

It seems OK to me to do incremental locking for the lk fop.
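
For completeness, a minimal stand-alone sketch of that incremental, transactional pattern (conceptual only; pthread mutexes stand in for per-subvolume nonblocking locks and none of the names below are real ec symbols): acquire the locks one after another and, on the first failure, roll back everything already taken so no partial lock/unlock state is left behind.

#include <pthread.h>
#include <stdio.h>

#define NSUBVOLS 6  /* stand-in for the bricks of a 4+2 disperse volume */

static pthread_mutex_t subvol_lock[NSUBVOLS];

/* Try the locks one after another (nonblocking). If any attempt fails,
 * release the ones already held instead of leaving a partial set behind. */
static int lock_all_incremental(void)
{
    for (int i = 0; i < NSUBVOLS; i++) {
        if (pthread_mutex_trylock(&subvol_lock[i]) != 0) {
            while (--i >= 0)
                pthread_mutex_unlock(&subvol_lock[i]);
            return -1;   /* caller may retry, e.g. with blocking locks */
        }
    }
    return 0;
}

int main(void)
{
    for (int i = 0; i < NSUBVOLS; i++)
        pthread_mutex_init(&subvol_lock[i], NULL);

    if (lock_all_incremental() == 0) {
        printf("all %d subvolume locks acquired\n", NSUBVOLS);
        for (int i = 0; i < NSUBVOLS; i++)
            pthread_mutex_unlock(&subvol_lock[i]);
    } else {
        printf("lock attempt failed; everything rolled back\n");
    }
    return 0;
}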

Comment 7 Worker Ant 2017-01-20 13:04:20 UTC
REVIEW: http://review.gluster.org/16445 (cluster/ec: Fix cthon failures observed with ec volumes) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 8 Worker Ant 2017-01-27 12:21:46 UTC
REVIEW: https://review.gluster.org/16445 (cluster/ec: Fix cthon failures observed with ec volumes) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 9 Worker Ant 2017-01-29 16:45:51 UTC
COMMIT: https://review.gluster.org/16445 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit f2406fa6155267fa747d9342092ee7709a2531a9
Author: Pranith Kumar K <pkarampu>
Date:   Fri Jan 27 16:17:49 2017 +0530

    cluster/ec: Fix cthon failures observed with ec volumes
    
    Since EC already winds one write after other there is no need to align
    application fcntl locks with ec blocks. Also added this locking to be
    done as a transaction to prevent partial upgrade/downgrade of locks
    happening.
    
    BUG: 1410425
    Change-Id: I7ce8955c2174f62b11e5cb16140e30ff0f7c4c31
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/16445
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 10 Shyamsundar 2017-05-30 18:38:06 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/