Bug 1238047 - Crash in Quota enforcer
Summary: Crash in Quota enforcer
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: quota
Version: mainline
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1238048 1238049 1238747 1242898
TreeView+ depends on / blocked
 
Reported: 2015-07-01 05:26 UTC by Raghavendra G
Modified: 2018-08-29 03:18 UTC (History)
2 users (show)

Fixed In Version: glusterfs-4.1.3 (or higher)
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1238048 (view as bug list)
Environment:
Last Closed: 2018-08-29 03:18:19 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)

Description Raghavendra G 2015-07-01 05:26:01 UTC
Description of problem:
Following crash was observed while running regression test:
./tests/bugs/quota/bug-1235182.t

(gdb) p local->validate_loc.inode
$27 = (inode_t *) 0x7f1e4acd406c
(gdb) bt
#0  0x00000039694028b8 in uuid_is_null () from /lib64/libuuid.so.1
#1  0x00007f1e618110ef in gf_uuid_is_null (uuid=0x8 <Address 0x8 out of bounds>) at ../../../../../libglusterfs/src/compat-uuid.h:45
#2  quota_enforcer_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f1e6cd75244)
    at ../../../../../xlators/features/quota/src/quota-enforcer-client.c:169
#3  0x00007f1e6dcc6db4 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f1e5c0f2180, pollin=0x7f1e5c0c35e0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:759
#4  0x00007f1e6dcc7620 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f1e5c0f21b0, event=<optimized out>, data=<optimized out>)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:887
#5  0x00007f1e6dcc3633 in rpc_transport_notify (this=this@entry=0x7f1e5c0f5240, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
    data=data@entry=0x7f1e5c0c35e0) at ../../../../rpc/rpc-lib/src/rpc-transport.c:538
#6  0x00007f1e63f6ad45 in socket_event_poll_in (this=this@entry=0x7f1e5c0f5240) at ../../../../../rpc/rpc-transport/socket/src/socket.c:2285
#7  0x00007f1e63f6dbbc in socket_event_handler (fd=<optimized out>, idx=<optimized out>, data=0x7f1e5c0f5240, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:2398
#8  0x00007f1e6df5a90a in event_dispatch_epoll_handler (event=0x7f1e63d37e20, event_pool=0x1801c90) at ../../../libglusterfs/src/event-epoll.c:570
#9  event_dispatch_epoll_worker (data=0x1849540) at ../../../libglusterfs/src/event-epoll.c:673
#10 0x000000395ec07d14 in start_thread () from /lib64/libpthread.so.0
#11 0x000000395e8f199d in clone () from /lib64/libc.so.6
(gdb) p inode
$28 = (inode_t *) 0x0
(gdb) p local->validate_loc.inode
$29 = (inode_t *) 0x7f1e4acd406c
(gdb) f 2
#2  quota_enforcer_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f1e6cd75244)
    at ../../../../../xlators/features/quota/src/quota-enforcer-client.c:169
169                if ((!gf_uuid_is_null (inode->gfid))

As can be seen above, inode is NULL. However, local->validate_loc.inode is _NOT_ NULL. This indicates that most likely there is a race (somebody overwrote local->validate_loc.inode). Looking through the code, I found following implementation of quota_check_limit.

<quota_check_limit>

                ret = quota_check_object_limit (frame, ctx, priv, _inode, this,
                                                &op_errno, just_validated,
                                                local, &skip_check);
                if (skip_check == _gf_true)
                        goto done;

                if (ret) {
                        if (op_errno != EDQUOT)
                                gf_msg (this->name, GF_LOG_ERROR, 0,
                                        Q_MSG_ENFORCEMENT_FAILED, "Failed to "
                                        "check quota object limit");
                        goto err;
                }

                ret = quota_check_size_limit (frame, ctx, priv, _inode, this,
                                              &op_errno, just_validated, delta,
                                              local, &skip_check);
                if (skip_check == _gf_true)
                        goto done;

</quota_check_limit>

There are couple of issues with above implementation:
1. We continue to enforce on ancestors even while we are validating state of one of the ancestors. Earlier (with just implementation of quota_check_size_limit) this was not the case. Once we invoke validation, the enforcement was _stalled_ till we here back the result of validation. Current code just invokes validation and continues with enforcement. Note that local is shared for all the activities - enforcement, validation and (probably) build_ancestry. So, only one of enforcement, validation or build-ancestry can be active at any time.

2. This is a slight variant of problem 1. Assume that quota_check_size_limit invokes validation. This doesn't stop quota_check_inode_limit from again invoking validation (since first validation is not complete _yet_). So, quota_check_inode_limit invokes another instance of validation. Note that both instances of validation are sharing same local causing races.

The fix is that as mentioned in 1, only one of (that too only one instance of) enforcement, validation and build_ancestry can be in progress at any time. Otherwise if we want to make parallel, each one of them should've their own locals/frames to avoid corruption.

PS: Similar problem exists while enforcing limits on hardlinks.

Version-Release number of selected component (if applicable):
master


How reproducible:
Intermittent

Steps to Reproduce:
1. running ./tests/bugs/quota/bug-1235182.t in a loop hits the bug once in a while.

2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Mohammed Rafi KC 2015-07-07 12:25:33 UTC
upstream patch : http://review.gluster.org/11510


Note You need to log in before you can comment on or make changes to this bug.