Description of problem:
=======================
It appears that if we restart a volume, the negative lookup caching (nl-cache) feature stops working.

With nl caching enabled, the first lookup of a non-existent file generally takes about 7-13 LOOKUP FOPs, depending on the dht hashed/unhashed bricks. On a second lookup, the lookup generally takes about 1 LOOKUP FOP on each brick.

I then did a volume restart and reran the above checks, and found that even repeated negative lookups take as many LOOKUP FOPs as with nl-cache disabled.

I checked the trusted vol file and didn't note any discrepancies; the nl-cache translators got loaded.

Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp35-192 negative]# rpm -qa|grep gluster
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
[root@dhcp35-192 negative]# rpm -qa|grep samba
samba-common-4.6.3-0.el7rhgs.noarch
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
samba-common-tools-4.6.3-0.el7rhgs.x86_64
samba-common-libs-4.6.3-0.el7rhgs.x86_64
samba-4.6.3-0.el7rhgs.x86_64
samba-client-4.6.3-0.el7rhgs.x86_64
samba-libs-4.6.3-0.el7rhgs.x86_64
samba-client-libs-4.6.3-0.el7rhgs.x86_64
[root@dhcp35-192 negative]#

How reproducible:
=================
Consistent on my setup.

Steps to Reproduce:
===================
1. Had a 3 node cluster with samba set up using https://mojo.redhat.com/docs/DOC-1027985 (no ctdb setup)
2. Created a 2x2 volume as below ===> vol bricks are part of n2 and n3 nodes, while n1 was the samba server node

[root@dhcp35-192 ~]# gluster v info

Volume Name: negative
Type: Distributed-Replicate
Volume ID: 48c809a8-8a82-4806-aaf0-5ea0bbf423cb
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.215:/rhs/brick1/negative
Brick2: 10.70.35.214:/rhs/brick1/negative
Brick3: 10.70.35.215:/rhs/brick2/negative
Brick4: 10.70.35.214:/rhs/brick2/negative
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.cache-invalidation-timeout: 3000
features.cache-invalidation: on
performance.nl-cache: on
nfs.disable: on
transport.address-family: inet
server.allow-insecure: on
storage.batch-fsync-delay-usec: 0

3. Started volume profiling.
4. Mounted using cifs on a RHEL client.
5. Did some amount of testing wrt nl caching (for about half a day).
6. Created a directory dir4.
7. Did a negative lookup by issuing ls twice; says file not existing, as expected. The profiling showed about 7-13 FOPs.
8. Again did ls twice to see nl-cache working; FOPs reduced to about 1 per brick.
9. Restarted the volume.
10. Repeated steps 7 and 8 again and again, but the LOOKUP FOP count is not reducing.

Before restart, first negative lookup:

[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
      9.69     60.00 us     60.00 us     60.00 us             1  GETXATTR
     90.31    111.80 us     82.00 us    143.00 us             5  LOOKUP

    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
     12.04    124.00 us    124.00 us    124.00 us             1  GETXATTR
     87.96    113.25 us     63.00 us    252.00 us             8  LOOKUP

    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
      9.87     70.00 us     70.00 us     70.00 us             1  GETXATTR
     90.13    127.80 us    100.00 us    152.00 us             5  LOOKUP

    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
     12.18    147.00 us    147.00 us    147.00 us             1  GETXATTR
     87.82    132.50 us     79.00 us    299.00 us             8  LOOKUP

    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Before restart, second negative lookup:

[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    336.00 us    336.00 us    336.00 us             1  LOOKUP

    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    338.00 us    338.00 us    338.00 us             1  LOOKUP

    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    144.00 us    144.00 us    144.00 us             1  LOOKUP

    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    263.00 us    263.00 us    263.00 us             1  LOOKUP

    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
[root@dhcp35-192 ~]#

#############################

After restart, first negative lookup:

[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    121.29 us     77.00 us    251.00 us             7  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    116.08 us     53.00 us    446.00 us            13  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    125.92 us     69.00 us    477.00 us            13  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    113.86 us     74.00 us    249.00 us             7  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

After restart, second negative lookup:

[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    121.29 us     77.00 us    251.00 us             7  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    116.08 us     53.00 us    446.00 us            13  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    125.92 us     69.00 us    477.00 us            13  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
    100.00    113.86 us     74.00 us    249.00 us             7  LOOKUP

    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes

Workaround:
===========
Post restart, turn the nl-cache option off and then on again.
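The workaround can be applied with the gluster CLI (volume name "negative" as in this report; this is only a sketch of the off/on toggle, requiring a live cluster):

```
# gluster volume set negative performance.nl-cache off
# gluster volume set negative performance.nl-cache on
```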
RCA:
====
The cache is stored in the directory's inode_ctx, and the cache time is set to the time the inode_ctx was created; it is not updated when new cache entries are added to the inode_ctx (it is cumbersome to keep a timeout per entry, and updating a global time would result in stale cache of older entries). Consider the following sequence of operations:

At T1:
  lookup dir1/f1 (f1 doesn't exist)
    dir1 inode_ctx->cache_time = T1, cache = "f1"
  lookup dir1/f2 (f2 doesn't exist)
    dir1 inode_ctx->cache_time remains unchanged, cache = "f1", "f2"

At T2:
  CHILD_DOWN / all or some of the bricks go down. Anything cached before T2 is considered invalid, but is not cleared from memory.

At T3:
  lookup dir1/f1
    dir1 inode_ctx->cache_time (T1) < T2, hence do not serve from cache; the lookup is sent to the brick.

At T4:
  lookup dir1/f3
    dir1 inode_ctx->cache_time (T1) < T2, hence do not serve from cache; the lookup is sent to the brick.

Thus the cache of dir1 is invalid until T1 + 10 min (the cache timeout), and for those 10 minutes no new entries are added to dir1's cache either.

Consequence: After any brick down/up, the cache of all existing directories is invalid and cannot be corrected for up to 10 minutes. Thus, after a brick down/up, nl-cache doesn't work for the directories that already had a cache, for a maximum of 10 minutes.

Solution: If inode_ctx->cache_time < brick-down time, then delete the existing cache and start populating a fresh cache.
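The timing logic in the RCA can be sketched as a small C model. The names here are hypothetical (this is not the actual nl-cache xlator code); it only illustrates the validity check and the proposed fix:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the per-directory nl-cache context: a single
 * timestamp for the whole cache, set when the inode_ctx is created. */
typedef struct {
    long cache_time;   /* when this directory's cache was (re)created */
    int  num_entries;  /* stand-in for the cached negative entries    */
} nlc_ctx;

/* A cache is served only if it was created after the last brick-down
 * event and is still within the global cache timeout. */
static bool nlc_cache_valid(const nlc_ctx *ctx, long now,
                            long last_down, long timeout)
{
    if (ctx->cache_time <= last_down)
        return false;          /* cached before CHILD_DOWN: stale */
    if (now - ctx->cache_time > timeout)
        return false;          /* past the cache timeout          */
    return true;
}

/* The fix: when the cache predates the brick-down event, purge it and
 * restart caching immediately instead of waiting out the old timeout. */
static void nlc_refresh_if_stale(nlc_ctx *ctx, long now, long last_down)
{
    if (ctx->cache_time <= last_down) {
        ctx->num_entries = 0;  /* delete the existing cache */
        ctx->cache_time  = now; /* fresh cache starts now   */
    }
}
```

Without the refresh step, a directory cached at T1 stays unusable until T1 + timeout after a brick-down at T2; with it, the first lookup after T2 repopulates the cache, so subsequent negative lookups are served again.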
Upstream patch : https://review.gluster.org/17453
(In reply to Atin Mukherjee from comment #5)
> Upstream patch : https://review.gluster.org/17453

One more upstream patch https://review.gluster.org/#/c/17450/ is required here.
downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/108886/
https://code.engineering.redhat.com/gerrit/#/c/108892/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774