Bug 1450080 - [Negative Lookup]: negative lookup feature doesn't seem to work on restart of volume
Summary: [Negative Lookup]: negative lookup feature doesn't seem to work on restart of volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Poornima G
QA Contact: Vivek Das
URL:
Whiteboard:
Depends On:
Blocks: 1417151 1458539 1460899
 
Reported: 2017-05-11 13:51 UTC by Nag Pavan Chilakam
Modified: 2018-11-30 05:37 UTC
CC List: 3 users

Fixed In Version: glusterfs-3.8.4-28
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1458539
Environment:
Last Closed: 2017-09-21 04:41:45 UTC
Embargoed:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHBA-2017:2774 | 0 | normal | SHIPPED_LIVE | glusterfs bug fix and enhancement update | 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-05-11 13:51:26 UTC
Description of problem:
========================
It seems that if we restart a volume, the negative lookup caching feature is lost.
I did a lookup of a non-existent file with nl caching enabled.
Generally the first lookup takes about 7-13 LOOKUP FOPs, depending on the DHT hashed/unhashed brick.
On a second lookup, the count generally dropped to about 1 LOOKUP FOP on each brick.

I then did a volume restart and reran the above checks, and found that even on repetitive negative lookups, the lookup FOPs take the same as without nl-cache enabled.

I checked the trusted volfile and didn't note any discrepancies; the nl-cache translator was loaded.
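
For reference, one way to confirm the translator in the trusted volfile (the path follows the standard /var/lib/glusterd layout and is an assumption about this setup):
[root@dhcp35-192 ~]# grep -A2 nl-cache /var/lib/glusterd/vols/negative/trusted-negative.tcp-fuse.vol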


Version-Release number of selected component (if applicable):
======
[root@dhcp35-192 negative]# rpm -qa|grep gluster
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
[root@dhcp35-192 negative]# rpm -qa|grep samba
samba-common-4.6.3-0.el7rhgs.noarch
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
samba-common-tools-4.6.3-0.el7rhgs.x86_64
samba-common-libs-4.6.3-0.el7rhgs.x86_64
samba-4.6.3-0.el7rhgs.x86_64
samba-client-4.6.3-0.el7rhgs.x86_64
samba-libs-4.6.3-0.el7rhgs.x86_64
samba-client-libs-4.6.3-0.el7rhgs.x86_64
[root@dhcp35-192 negative]# 


How reproducible:
===============
Consistently reproducible on my setup.

Steps to Reproduce:
1. had a 3-node cluster with Samba set up using https://mojo.redhat.com/docs/DOC-1027985 (no CTDB setup)
2. created a 2x2 volume as below (the volume bricks are on nodes n2 and n3, while n1 was the Samba server node)
[root@dhcp35-192 ~]# gluster v info
 
Volume Name: negative
Type: Distributed-Replicate
Volume ID: 48c809a8-8a82-4806-aaf0-5ea0bbf423cb
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.215:/rhs/brick1/negative
Brick2: 10.70.35.214:/rhs/brick1/negative
Brick3: 10.70.35.215:/rhs/brick2/negative
Brick4: 10.70.35.214:/rhs/brick2/negative
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.cache-invalidation-timeout: 3000
features.cache-invalidation: on
performance.nl-cache: on
nfs.disable: on
transport.address-family: inet
server.allow-insecure: on
storage.batch-fsync-delay-usec: 0

3. started volume profiling
4. mounted the volume using CIFS on a RHEL client
5. did some testing w.r.t. nl caching (for about half a day)
6. created a directory dir4
7. did a negative lookup by issuing ls twice:
   the file was reported as not existing, as expected;
   profiling showed about 7-13 LOOKUP FOPs
8. did ls twice again to see nl-cache working:
   FOPs reduced to about 1 per brick

9. restarted the volume
10. repeated steps 7 and 8 again and again, but the lookup FOP count did not reduce (see the transcript sketch below)
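
A minimal transcript sketch of steps 6-10. The mount point /mnt/negative and the non-existent file name f1 are assumptions for illustration; the actual names are not in this report:

On the RHEL client:
# mkdir /mnt/negative/dir4
# ls /mnt/negative/dir4/f1      <- negative lookup
# ls /mnt/negative/dir4/f1      <- repeat to exercise the cache

On a server node, between lookups:
[root@dhcp35-192 ~]# gluster v profile negative info incremental

Restart the volume (answer y at the stop prompt):
[root@dhcp35-192 ~]# gluster volume stop negative
[root@dhcp35-192 ~]# gluster volume start negative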



before restart


first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.69      60.00 us      60.00 us      60.00 us              1    GETXATTR
     90.31     111.80 us      82.00 us     143.00 us              5      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.04     124.00 us     124.00 us     124.00 us              1    GETXATTR
     87.96     113.25 us      63.00 us     252.00 us              8      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.87      70.00 us      70.00 us      70.00 us              1    GETXATTR
     90.13     127.80 us     100.00 us     152.00 us              5      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.18     147.00 us     147.00 us     147.00 us              1    GETXATTR
     87.82     132.50 us      79.00 us     299.00 us              8      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes



second negative lookup 
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     336.00 us     336.00 us     336.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     338.00 us     338.00 us     338.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     144.00 us     144.00 us     144.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     263.00 us     263.00 us     263.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
[root@dhcp35-192 ~]# 



#############################
after restart

first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes


second negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes






Workaround:
========
After the volume restart, turn the nl-cache option off and then on again; see the example below.
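
A minimal example of the two commands (standard gluster volume set syntax, using this report's volume name):
[root@dhcp35-192 ~]# gluster volume set negative performance.nl-cache off
[root@dhcp35-192 ~]# gluster volume set negative performance.nl-cache on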

Comment 2 Poornima G 2017-05-12 12:50:26 UTC
RCA:

The cache is stored in the directory's inode_ctx, and the cache time is set to the time the inode_ctx was created; it is not updated as new cache entries are added to the inode_ctx (it is cumbersome to keep a timeout per entry, and refreshing the shared timestamp would let older entries be served stale past their timeout). Consider the sequence of operations below:

At T1, lookup dir1/f1 (f1 doesn't exist)
       dir1 inode_ctx->cache_time = T1
            cache = "f1"
       lookup dir1/f2 (f2 doesn't exist)
       dir1 inode_ctx->cache_time remains unchanged
            cache = "f1", "f2"

CHILD_DOWN (all/some of the bricks go down) at T2;
       thus anything cached before T2 is considered invalid, but is not cleared from memory.

At T3, lookup dir1/f1
       dir1 inode_ctx->cache_time (T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

At T4, lookup dir1/f3
       dir1 inode_ctx->cache_time (T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

Thus the cache of dir1 is invalid until T1 + 10 min (the cache timeout), and during those 10 min no new entries are cached for dir1.

Consequence:
After any brick down/up, the cache of all existing directories is invalid and cannot be repopulated for up to 10 min. Thus, after a brick down/up, nl-cache does not work for directories that already had a cache, for up to 10 min.



Solution:
If inode_ctx->cache_time < the brick-down time, delete the existing cache and start populating a fresh cache.
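
A minimal C sketch of the timestamp check and the proposed fix. This is illustrative only: the type and function names (nlc_ctx_t, nlc_cache_usable, nlc_clear_cache, last_child_down) are hypothetical stand-ins, not the actual nl-cache xlator symbols.

#include <stdio.h>
#include <time.h>

/* Illustrative stand-in for the nl-cache xlator's per-directory inode
 * context; one shared timestamp covers all cached negative entries. */
typedef struct {
    time_t cache_time;   /* set when the ctx was created, not per entry */
    int    entry_count;  /* cached negative entries for this directory */
} nlc_ctx_t;

static time_t last_child_down; /* time of the most recent CHILD_DOWN */

/* Drop the stale cache and restamp, so fresh entries can be cached
 * immediately instead of waiting out the old timeout window. */
static void nlc_clear_cache(nlc_ctx_t *ctx, time_t now)
{
    ctx->entry_count = 0;
    ctx->cache_time  = now;
}

/* Before the fix: a ctx older than the last CHILD_DOWN was merely
 * skipped, leaving the directory uncacheable until cache_time + timeout
 * elapsed. The fix: treat it as invalid AND rebuild it on the spot. */
static int nlc_cache_usable(nlc_ctx_t *ctx, time_t now)
{
    if (ctx->cache_time <= last_child_down) {
        nlc_clear_cache(ctx, now); /* the proposed fix */
        return 0;                  /* this lookup still hits the bricks */
    }
    return 1;                      /* serve the negative lookup from cache */
}

int main(void)
{
    nlc_ctx_t dir1 = { .cache_time = 100, .entry_count = 2 }; /* T1 */
    last_child_down = 200;                                    /* T2 */

    printf("usable at T3: %d\n", nlc_cache_usable(&dir1, 300)); /* 0: rebuilt */
    printf("usable at T4: %d\n", nlc_cache_usable(&dir1, 400)); /* 1: fresh */
    return 0;
}

With this change, the first lookup after a brick down/up still goes to the bricks, but it restamps and repopulates the cache, so subsequent negative lookups can again be served locally instead of waiting out the old 10-minute window.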

Comment 5 Atin Mukherjee 2017-06-04 08:03:28 UTC
Upstream patch : https://review.gluster.org/17453

Comment 6 Atin Mukherjee 2017-06-13 05:01:41 UTC
(In reply to Atin Mukherjee from comment #5)
> Upstream patch : https://review.gluster.org/17453

One more upstream patch https://review.gluster.org/#/c/17450/ is required here.

Comment 11 errata-xmlrpc 2017-09-21 04:41:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

