There's no readdir-ahead choice under the "component" section.

A gluster-swift test case fails when "cache-swift-metadata" is turned on (the default).

Problem: glusterfs wrongly reports to gluster-swift that the xattr does not exist even though it exists on the backend. This triggers gluster-swift to create the xattr, thus overwriting the existing xattr and causing serious metadata loss!

This is roughly the series of operations done by the gluster-swift object server:

[pid 2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 216
[pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"0","X-Object-Manifest":"c1/segments1/","ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.27522","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 216) = 216
[pid 2937] close(10) = 0
[pid 2937] close(8) = 0
[pid 2937] open("/mnt/gluster-object/one/c1/segments1/0", O_RDONLY|O_CLOEXEC) = 10
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189
[pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"f97c5d29941bfb1b2fdab0874906ab82","X-Timestamp":"1461849857.48425","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189
[pid 2937] close(10) = 0
[pid 2937] close(8) = 0
[pid 2937] open("/mnt/gluster-object/one/c1/segments1/1", O_RDONLY|O_CLOEXEC) = 10
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189
[pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"b8a9f715dbb64fd5c56e7783c6820a61","X-Timestamp":"1461849867.95959","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189
[pid 2937] close(10) = 0
[pid 2937] close(8) = 0
[pid 2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available)
[pid 2937] close(13) = 0
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available)
[pid 2937] fsetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178, 0) = 0
[pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 178
[pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178) = 178
[pid 2937] close(10) = 0
[pid 2937] close(8) = 0

As can be seen in the above strace output, in the second set of fgetxattr() operations the glusterfs client returns ENODATA to gluster-swift. This should never happen, as the xattr named user.swift.metadata always exists on the object on the backend bricks.

This issue cannot be reproduced when either one of these volume options is turned off:

# gluster volume set one cache-swift-metadata off
OR
# gluster volume set one readdir-ahead off

1. md-cache caches the user.swift.metadata xattr on either the lookup_cbk or the getxattr_cbk path.
2. md-cache returns the xattr from its cache the first time gluster-swift asks for it.
3. When gluster-swift asks for the xattr a second time after a brief period, md-cache finds that it's not in its cache and reports ENODATA back to gluster-swift. This is how md-cache does negative caching.

Between operations 2 and 3 listed above, readdirp_cbk would have updated the cache for the file, and during this update the concerned xattr isn't present.
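The failure mode visible in the strace can be sketched as a small Python model (illustrative only; the injected getter/setter closures stand in for the real fgetxattr()/fsetxattr() syscalls on an open fd, and the function name is made up):

```python
import errno

def read_or_create_metadata(fgetx, fsetx, new_meta):
    """Read-or-create pattern used by the gluster-swift object server:
    ENODATA is treated as "no metadata yet", so fresh metadata is written."""
    try:
        return fgetx("user.swift.metadata")
    except OSError as err:
        if err.errno != errno.ENODATA:
            raise
        # When ENODATA is spurious (the client stack lied), this overwrites
        # the metadata that still exists on the backend bricks.
        fsetx("user.swift.metadata", new_meta)
        return new_meta

# Simulate the bug: the brick really holds metadata, but md-cache's
# negative cache reports ENODATA anyway.
brick = {"user.swift.metadata": b'{"X-Object-Manifest":"c1/segments1/"}'}

def spurious_enodata(key):
    raise OSError(errno.ENODATA, "No data available")

read_or_create_metadata(spurious_enodata, brick.__setitem__, b'{"fresh":"metadata"}')
print(brick["user.swift.metadata"])  # the original manifest metadata is gone
```

This is why the trace ends with an fsetxattr() of a fresh 178-byte value: from gluster-swift's point of view, ENODATA legitimately means "create the metadata".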
#0  mdc_inode_xatt_set (this=<value optimized out>, inode=<value optimized out>, dict=<value optimized out>) at md-cache.c:598
#1  0x00007f6f26ac975e in mdc_readdirp_cbk (frame=0x7f6f3308c678, cookie=<value optimized out>, this=0x7f6f20010630, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at md-cache.c:1998
#2  0x00007f6f26edc814 in qr_readdirp_cbk (frame=0x7f6f3308ca80, cookie=<value optimized out>, this=<value optimized out>, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at quick-read.c:520

$28 = (struct md_cache *) 0x7f6f20078090
$29 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 1}

Breakpoint 1, mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675
675             if (!mdc->xattr)
#0  mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675
#1  0x00007f6f26aca4c1 in mdc_getxattr (frame=0x7f6f3308c9d4, this=0x7f6f20010630, loc=0x7f6f1c0493e0, key=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at md-cache.c:1808
#2  0x00007f6f268ac681 in io_stats_getxattr (frame=0x7f6f3308cb2c, this=0x7f6f200119e0, loc=0x7f6f1c0493e0, name=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at io-stats.c:2289

$97 = (struct md_cache *) 0x7f6f20078090
$98 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 0}

During readdirp, md-cache sets the xattr keys it wants to cache in xdata, which is then fetched from the backend bricks. However, during the opendir_cbk fop, the readdir-ahead xlator internally issues readdirp() calls which do not have any of the keys to be cached set in xdata, so those xattrs are not fetched. For the next readdirp() call, readdir-ahead returns the dirent entries from its cache. As these entries do not contain the xattrs, md-cache updates its cache with wrong information, inconsistent with the state on the backend bricks. This wrong information is served to FUSE applications for a small window of time.
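The interaction described above can be modeled in a few lines of Python (a toy simulation with made-up class and field names, not the real md-cache.c/readdir-ahead.c structures): readdir-ahead preloads dentries without the xattr keys in xdata, md-cache then treats the empty xattr dict as authoritative, and the next getxattr() hits the negative cache.

```python
import errno

# Backend bricks: every object really does carry the xattr.
BRICKS = {"manifest": {"user.swift.metadata": b'{"X-Type":"Object"}'}}

class ReaddirAhead:
    """Toy readdir-ahead: preloads dentries during opendir with empty xdata."""
    def opendir(self):
        # Internal readdirp(): no xattr keys requested, so none are fetched.
        self.preloaded = {name: {} for name in BRICKS}

    def readdirp(self, xdata_keys):
        # Bug: served from the preload cache; xdata_keys are ignored.
        return self.preloaded

class MDCache:
    """Toy md-cache: a populated per-inode xattr dict is authoritative."""
    def __init__(self):
        self.xattr = {}  # object name -> cached xattr dict (absent = no cache)

    def readdirp_cbk(self, entries):
        self.xattr.update(entries)  # caches the (empty!) xattr dicts

    def getxattr(self, name, key):
        cached = self.xattr.get(name)
        if cached is None:
            return BRICKS[name][key]   # cache miss: go to the bricks
        if key not in cached:
            return -errno.ENODATA      # negative-cache hit
        return cached[key]

ra, mdc = ReaddirAhead(), MDCache()
ra.opendir()
mdc.readdirp_cbk(ra.readdirp(xdata_keys=["user.swift.metadata"]))
print(mdc.getxattr("manifest", "user.swift.metadata"))  # -ENODATA, yet the brick has it
```

The empty dict cached by readdirp_cbk here plays the role of `xattr = 0x0` in the GDB struct dumps above.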
Created attachment 1153898 [details]
tcpdump of issue

Notice that in the tcpdump traces, readdirp() calls do not contain "user.swift.metadata" in xdata.
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#1) for review on master by Prashanth Pai (ppai)
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#1) for review on master by Prashanth Pai (ppai)
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#2) for review on master by Prashanth Pai (ppai)
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#2) for review on master by Prashanth Pai (ppai)
REVIEW: http://review.gluster.org/14213 (md-cache: Don't cache swift metadata by default) posted (#3) for review on master by Prashanth Pai (ppai)
REVIEW: http://review.gluster.org/14214 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#3) for review on master by Prashanth Pai (ppai)
COMMIT: http://review.gluster.org/14214 committed in master by Raghavendra G (rgowdapp)
------
commit 0c73e7050c4d30ace0c39cc9b9634e9c1b448cfb
Author: Prashanth Pai <ppai>
Date:   Wed May 4 16:56:50 2016 +0530

    readdir-ahead: Prefetch xattrs needed by md-cache

    Problem:
    The negative cache feature in md-cache requires the xattrs returned by
    posix to be intercepted for every call that can possibly return xattrs.
    This includes readdirp(). This is crucial for treating a missing key in
    the cache as a negative entry (returning ENODATA).

    md-cache puts the names of the xattrs it wants to cache in xdata and
    passes it down to posix, which returns the specified xattrs in the
    callback. This is done in lookup() and readdirp(). Hence, a cached
    xattr can be invalidated during readdirp_cbk too. This is based on the
    assumption that readdirp() will always return all the xattrs that
    md-cache is interested in. However, this is not the case when a
    readdirp() call is served from readdir-ahead's cache. The readdir-ahead
    xlator pre-fetches dentries during opendir_cbk and readdirp(). These
    internal readdirp() calls made by the readdir-ahead xlator do not set
    xdata in their requests; hence, no xattrs are fetched and stored in its
    internal cache.

    This causes metadata loss in gluster-swift: md-cache returns ENODATA
    during a getxattr() call even though the xattr for that object exists
    on the brick. On receiving ENODATA, gluster-swift creates new metadata
    and does a setxattr(), which destroys the information stored in the
    existing xattr.

    Fix:
    During opendir, md-cache asks readdir-ahead to store the names of the
    xattrs it is interested in, so that readdir-ahead can fetch them in all
    subsequent internal readdirp() calls it issues. These stored xattr
    names are invalidated/updated on the next real readdirp() call issued
    by the application, which has xdata set correctly by the md-cache
    xlator.

    BUG: 1333023
    Change-Id: I32d46f93a99d4ec34c741f3c52b0646d141614f9
    Reviewed-on: http://review.gluster.org/14214
    Tested-by: Prashanth Pai <ppai>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Tested-by: Gluster Build System <jenkins.com>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>
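The fix in the commit above can be sketched with the same kind of toy Python model (hypothetical names, not the real xlator code): md-cache hands its list of xattr names to readdir-ahead at opendir time, so the internal prefetch requests those keys too, and a real application readdirp() later refreshes the stored key list.

```python
# Backend bricks: every object carries the xattr md-cache cares about.
BRICKS = {"manifest": {"user.swift.metadata": b'{"X-Type":"Object"}'}}

class FixedReaddirAhead:
    """Toy post-fix readdir-ahead: the preload honours md-cache's keys."""
    def opendir(self, mdc_xattr_keys):
        # Fix: md-cache communicates its keys of interest during opendir,
        # so the internal readdirp() now requests them from posix.
        self.keys = list(mdc_xattr_keys)
        self.preloaded = {
            name: {k: v for k, v in xattrs.items() if k in self.keys}
            for name, xattrs in BRICKS.items()
        }

    def readdirp(self, xdata_keys):
        # A real application readdirp() invalidates/updates the stored keys.
        self.keys = list(xdata_keys)
        return self.preloaded

ra = FixedReaddirAhead()
ra.opendir(["user.swift.metadata"])
entries = ra.readdirp(["user.swift.metadata"])
print(entries["manifest"]["user.swift.metadata"])  # the xattr survives the preload
```

With the xattr present in the preloaded entries, md-cache's readdirp_cbk caches correct data and the later getxattr("user.swift.metadata") no longer hits the negative cache.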