REVIEW: http://review.gluster.org/14282 (readdir-ahead: Prefetch xattrs needed by md-cache) posted (#1) for review on release-3.7 by Prashanth Pai (ppai)
Please add a public description of this bug.
A gluster-swift test case fails when "cache-swift-metadata" is turned on (default). Problem: glusterfs wrongly reports to gluster-swift that the xattr does not exist even though it exists in the backend. This triggers gluster-swift to create the xattr, thus over-writing the existing xattr causing serious metadata loss! This is roughly the series of operations done by gluster-swift object server: [pid 2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 216 [pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"0","X-Object-Manifest":"c1/segments1/","ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.27522","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 216) = 216 [pid 2937] close(10) = 0 [pid 2937] close(8) = 0 [pid 2937] open("/mnt/gluster-object/one/c1/segments1/0", O_RDONLY|O_CLOEXEC) = 10 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189 [pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"f97c5d29941bfb1b2fdab0874906ab82","X-Timestamp":"1461849857.48425","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189 [pid 2937] close(10) = 0 [pid 2937] close(8) = 0 [pid 2937] open("/mnt/gluster-object/one/c1/segments1/1", O_RDONLY|O_CLOEXEC) = 10 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 189 [pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":"3","ETag":"b8a9f715dbb64fd5c56e7783c6820a61","X-Timestamp":"1461849867.95959","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/x-www-form-urlencoded"}", 189) = 189 [pid 2937] close(10) = 0 [pid 2937] close(8) = 0 [pid 2937] open("/mnt/gluster-object/one/c1/manifest", O_RDONLY|O_CLOEXEC) = 10 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available) [pid 2937] close(13) = 0 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = -1 ENODATA (No data available) [pid 2937] fsetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178, 0) = 0 [pid 2937] fgetxattr(10, "user.swift.metadata", 0x0, 0) = 178 [pid 2937] fgetxattr(10, "user.swift.metadata", "{"Content-Length":0,"ETag":"d41d8cd98f00b204e9800998ecf8427e","X-Timestamp":"1462164335.35395","X-Object-Type":"file","X-Type":"Object","Content-Type":"application/octet-stream"}", 178) = 178 [pid 2937] close(10) = 0 [pid 2937] close(8) = 0 As it can be seen in above strace output, In the second set of getattr() operations, glusterfs client returns ENODATA to gluster-swift. This should never happen as the xattr named user.swift.metadata always exists on the object in the backend bricks. This issue cannot be reproduced when either one of these volume options are turned off. # gluster volume set cache-swift-metadata off OR # gluster volume set one readdir-ahead off 1. md-cache caches user.swift.metadata xattr either on lookup_cbk or getxattr_cbk path. 2. md-cache returns the xattr from cache the first time gluster-swift asks for it. 3. When gluster-swift asks for the xattr the second time after a brief period, md-cache finds that it's not in it's cache. It reports back to gluster-swift with ENODATA. This is apparently how md-cache does negative caching. Between operation 2 and 3 listed above, readdirp_cbk would have updated the cache for the file. During this update, the concerned xattr isn't present. #0 mdc_inode_xatt_set (this=<value optimized out>, inode=<value optimized out>, dict=<value optimized out>) at md-cache.c:598 #1 0x00007f6f26ac975e in mdc_readdirp_cbk (frame=0x7f6f3308c678, cookie=<value optimized out>, this=0x7f6f20010630, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at md-cache.c:1998 #2 0x00007f6f26edc814 in qr_readdirp_cbk (frame=0x7f6f3308ca80, cookie=<value optimized out>, this=<value optimized out>, op_ret=4, op_errno=2, entries=0x7f6f251dc850, xdata=0x7f6f32aa7b14) at quick-read.c:520 $28 = (struct md_cache *) 0x7f6f20078090 $29 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 1} Breakpoint 1, mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675 675 if (!mdc->xattr) #0 mdc_inode_xatt_get (this=<value optimized out>, inode=<value optimized out>, dict=0x7f6f251dcac8) at md-cache.c:675 #1 0x00007f6f26aca4c1 in mdc_getxattr (frame=0x7f6f3308c9d4, this=0x7f6f20010630, loc=0x7f6f1c0493e0, key=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at md-cache.c:1808 #2 0x00007f6f268ac681 in io_stats_getxattr (frame=0x7f6f3308cb2c, this=0x7f6f200119e0, loc=0x7f6f1c0493e0, name=0x7f6f1c048350 "user.swift.metadata", xdata=0x0) at io-stats.c:2289 $97 = (struct md_cache *) 0x7f6f20078090 $98 = {md_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, md_nlink = 1, md_uid = 0, md_gid = 0, md_atime = 1461997370, md_atime_nsec = 442368834, md_mtime = 1461997370, md_mtime_nsec = 442368834, md_ctime = 1461997370, md_ctime_nsec = 476368690, md_rdev = 0, md_size = 0, md_blocks = 0, xattr = 0x0, linkname = 0x0, ia_time = 1461997385, xa_time = 1461997385, lock = 0} During readdirp, md-cache will set the xattr keys it wants to cache in xdata which is fetched from the backend bricks. However, during opendir_cbk fop, readdir-ahead xlator internally issues readdirp() calls which do not have any of the keys to be cached set in xdata. So they are not fetched. For the next readdirp() call, readdir-ahead returns the dirent entries from it's cache. And as these entries do not contain the xattrs, md-cache updates it's cache with wrong information which is inconsistent with the state in backend bricks. This wrong information is served to FUSE applications for a small window of time.
COMMIT: http://review.gluster.org/14282 committed in release-3.7 by Raghavendra G (rgowdapp) ------ commit 938f45746d245bfbe443d88d5cb0cf61f625f68b Author: Prashanth Pai <ppai> Date: Wed May 4 16:56:50 2016 +0530 readdir-ahead: Prefetch xattrs needed by md-cache Problem: Negative cache feature implementation in md-cache requires xattrs returned by posix to be intercepted for every call that can possibly return xattrs. This includes readdirp(). This is crucial to treat missing keys in cache as a case of negative entry (returns ENODATA) md-cache puts names of xattrs that it wants to cache in xdata and passes it down to posix which returns the specified xattrs in the callback. This is done in lookup() and readdirp(). Hence, a xattr that is cached can be invalidated during readdirp_cbk too. This is based on the assumption that readdirp() will always return all xattrs that md-cache is interested in. However, this is not the case when readdirp() call is served from readdir-ahead's cache. readdir-ahead xlator will pre-fetch dentries during opendir_cbk and readdirp. These internal readdirp() calls made by readdir-ahead xlator does not set xdata in it's requests. Hence, no xattrs are fetched and stored in it's internal cache. This causes metadata loss in gluster-swift. md-cache returns ENODATA during getxattr() call even though the xattr for that object exists on the brick. On receiving ENODATA, gluster-swift will create new metadata and do setxattr(). This results in loss of information stored in existing xattr. Fix: During opendir, md-cache will communicate to readdir-ahead asking it to store the names of xattrs it's interested in so that readdir-ahead can fetch those in all subsequent internal readdirp() calls issued by it. This stored names of xattrs is invalidated/updated on the next real readdirp() call issued by application. This readdirp() call will have xdata set correctly by md-cache xlator. > Reviewed-on: http://review.gluster.org/14214 > Tested-by: Prashanth Pai <ppai> > NetBSD-regression: NetBSD Build System <jenkins.org> > CentOS-regression: Gluster Build System <jenkins.com> > Tested-by: Gluster Build System <jenkins.com> > Smoke: Gluster Build System <jenkins.com> > Reviewed-by: Raghavendra G <rgowdapp> > Tested-by: Raghavendra G <rgowdapp> BUG: 1334700 Change-Id: I32d46f93a99d4ec34c741f3c52b0646d141614f9 (cherry picked from commit 0c73e7050c4d30ace0c39cc9b9634e9c1b448cfb) Reviewed-on: http://review.gluster.org/14282 Tested-by: Prashanth Pai <ppai> NetBSD-regression: NetBSD Build System <jenkins.org> Tested-by: Gluster Build System <jenkins.com> CentOS-regression: Gluster Build System <jenkins.com> Smoke: Gluster Build System <jenkins.com> Reviewed-by: Raghavendra G <rgowdapp>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.12, please open a new bug report. glusterfs-3.7.12 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user