Bug 1333023 - readdir-ahead does not fetch xattrs that md-cache needs in it's internal calls
Summary: readdir-ahead does not fetch xattrs that md-cache needs in it's internal calls
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: md-cache
Version: mainline
Hardware: All
OS: Linux
high
high
Target Milestone: ---
Assignee: Prashanth Pai
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1331260 1334699 1334700
TreeView+ depends on / blocked
 
Reported: 2016-05-04 14:18 UTC by Prashanth Pai
Modified: 2016-11-22 13:03 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1331260
: 1334699 1334700 (view as bug list)
Environment:
Last Closed: 2016-11-22 13:03:51 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: v3.9.0
Embargoed:


Attachments (Terms of Use)
tcpdump of issue (47.97 KB, application/octet-stream)
2016-05-04 14:32 UTC, Prashanth Pai
no flags Details

Comment 1 Prashanth Pai 2016-05-04 14:30:17 UTC
There's no readdir-ahead under "component" section.

A gluster-swift test case fails when "cache-swift-metadata" is turned on (default).

Problem: glusterfs wrongly reports to gluster-swift that the xattr does not exist even though it exists in the backend. This triggers gluster-swift to create the xattr, thus over-writing the existing xattr causing serious metadata loss!

This is roughly the series of operations done by gluster-swift object server:


[pid  2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 216
[pid  2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"0","X-Object-Manifest":"c1/segments1/","ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.27522","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 216) = 216
[pid  2937] close(10)                   = 0
[pid  2937] close(8)                    = 0
[pid  2937] open("/mnt/gluster-object/one/c1/segments1/0", O_RDONLY|O_CLOEXEC) = 10
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189
[pid  2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"f97c5d29941bfb1b2fdab0874906ab82","X-Timestamp":"1461849857.48425","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189
[pid  2937] close(10)                   = 0
[pid  2937] close(8)                    = 0
[pid  2937] open("/mnt/gluster-object/one/c1/segments1/1", O_RDONLY|O_CLOEXEC) = 10
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189
[pid  2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"b8a9f715dbb64fd5c56e7783c6820a61","X-Timestamp":"1461849867.95959","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189
[pid  2937] close(10)                   = 0
[pid  2937] close(8)                    = 0
[pid  2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available)
[pid  2937] close(13)                   = 0
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available)
[pid  2937] fsetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178, 0) = 0
[pid  2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 178
[pid  2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178) = 178
[pid  2937] close(10)                   = 0
[pid  2937] close(8)                    = 0

As it can be seen in above strace output, In the second set of getattr() operations, glusterfs client returns ENODATA to gluster-swift. This should never happen as the xattr named user.swift.metadata always exists on the object in the backend bricks.

This issue cannot be reproduced when either one of these volume options are turned off.

# gluster volume set cache-swift-metadata off
OR
# gluster volume set one readdir-ahead off


1. md-cache caches user.swift.metadata xattr either on lookup_cbk or getxattr_cbk path.
2. md-cache returns the xattr from cache the first time gluster-swift asks for it.
3. When gluster-swift asks for the xattr the second time after a brief period, md-cache finds that it's not in it's cache. It reports back to gluster-swift with ENODATA. This is apparently how md-cache does negative caching.

Between operation 2 and 3 listed above, readdirp_cbk would have updated the cache for the file. During this update, the concerned xattr isn't present.

#0  mdc_inode_xatt_set (this=<value optimized out>, inode=<value optimized out>, dict=<value optimized out>) at md-cache.c:598
#1  0x00007f6f26ac975e in mdc_readdirp_cbk (frame=0x7f6f3308c678, cookie=<value optimized out>, this=0x7f6f20010630, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at md-cache.c:1998
#2  0x00007f6f26edc814 in qr_readdirp_cbk (frame=0x7f6f3308ca80, cookie=<value optimized out>, this=<value optimized out>, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at quick-read.c:520
$28 = (struct md_cache *) 0x7f6f20078090
$29 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 1}

Breakpoint 1, mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675
675	                if (!mdc->xattr)
#0  mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675
#1  0x00007f6f26aca4c1 in mdc_getxattr (frame=0x7f6f3308c9d4, this=0x7f6f20010630, loc=0x7f6f1c0493e0, key=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at md-cache.c:1808
#2  0x00007f6f268ac681 in io_stats_getxattr (frame=0x7f6f3308cb2c, this=0x7f6f200119e0, loc=0x7f6f1c0493e0, name=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at io-stats.c:2289
$97 = (struct md_cache *) 0x7f6f20078090
$98 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 0}


During readdirp, md-cache will set the xattr keys it wants to cache in xdata which is fetched from the backend bricks. However, during opendir_cbk fop, readdir-ahead xlator internally issues readdirp() calls which do not have any of the keys to be cached set in xdata. So they are not fetched. For the next readdirp() call, readdir-ahead returns the dirent entries from it's cache. And as these entries do not contain the xattrs, md-cache updates it's cache with wrong information which is inconsistent with the state in backend bricks. This wrong information is served to FUSE applications for a small window of time.

Comment 2 Prashanth Pai 2016-05-04 14:32:55 UTC
Created attachment 1153898 [details]
tcpdump of issue

Notice that in tcpdump traces, readdirp() calls do not contain "user.swift.metadata" in xdata.

Comment 3 Vijay Bellur 2016-05-04 16:32:38 UTC
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#1) for review on master by Prashanth Pai (ppai)

Comment 4 Vijay Bellur 2016-05-04 16:32:42 UTC
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#1) for review on master by Prashanth Pai (ppai)

Comment 5 Vijay Bellur 2016-05-04 16:42:22 UTC
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#2) for review on master by Prashanth Pai (ppai)

Comment 6 Vijay Bellur 2016-05-04 16:42:26 UTC
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#2) for review on master by Prashanth Pai (ppai)

Comment 7 Vijay Bellur 2016-05-04 16:46:40 UTC
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#3) for review on master by Prashanth Pai (ppai)

Comment 8 Vijay Bellur 2016-05-04 16:46:44 UTC
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#3) for review on master by Prashanth Pai (ppai)

Comment 9 Vijay Bellur 2016-05-10 09:05:43 UTC
COMMIT: http://review.gluster.org/14214 committed in master by Raghavendra G (rgowdapp) 
------
commit 0c73e7050c4d30ace0c39cc9b9634e9c1b448cfb
Author: Prashanth Pai <ppai>
Date:   Wed May 4 16:56:50 2016 +0530

    readdir-ahead: Prefetch xattrs needed by md-cache
    
    Problem:
    Negative cache feature implementation in md-cache requires xattrs
    returned by posix to be intercepted for every call that can possibly
    return xattrs. This includes readdirp(). This is crucial to treat
    missing keys in cache as a case of negative entry (returns ENODATA)
    
    md-cache puts names of xattrs that it wants to cache in xdata and
    passes it down to posix which returns the specified xattrs in the
    callback. This is done in lookup() and readdirp(). Hence, a xattr
    that is cached can be invalidated during readdirp_cbk too.
    
    This is based on the assumption that readdirp() will always return
    all xattrs that md-cache is interested in. However, this is not the
    case when readdirp() call is served from readdir-ahead's cache.
    readdir-ahead xlator will pre-fetch dentries during opendir_cbk
    and readdirp. These internal readdirp() calls made by readdir-ahead
    xlator does not set xdata in it's requests. Hence, no xattrs are
    fetched and stored in it's internal cache.
    
    This causes metadata loss in gluster-swift. md-cache returns ENODATA
    during getxattr() call even though the xattr for that object exists on
    the brick. On receiving ENODATA, gluster-swift will create new metadata
    and do setxattr(). This results in loss of information stored in
    existing xattr.
    
    Fix:
    During opendir, md-cache will communicate to readdir-ahead asking it
    to store the names of xattrs it's interested in so that readdir-ahead
    can fetch those in all subsequent internal readdirp() calls issued by
    it. This stored names of xattrs is invalidated/updated on the next
    real readdirp() call issued by application. This readdirp() call will
    have xdata set correctly by md-cache xlator.
    
    BUG: 1333023
    Change-Id: I32d46f93a99d4ec34c741f3c52b0646d141614f9
    Reviewed-on: http://review.gluster.org/14214
    Tested-by: Prashanth Pai <ppai>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Tested-by: Gluster Build System <jenkins.com>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>


Note You need to log in before you can comment on or make changes to this bug.