Bug 1450080

Summary:	[Negative Lookup]: negative lookup features doesn't seem to work on restart of volume
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Nag Pavan Chilakam <nchilaka>
Component:	samba	Assignee:	Poornima G <pgurusid>
Status:	CLOSED ERRATA	QA Contact:	Vivek Das <vdas>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.3	CC:	amukherj, rhinduja, rhs-smb
Target Milestone:	---
Target Release:	RHGS 3.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.8.4-28	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1458539 (view as bug list)		Environment:
Last Closed:	2017-09-21 04:41:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1417151, 1458539, 1460899

Description Nag Pavan Chilakam 2017-05-11 13:51:26 UTC

Description of problem:
========================
Seems like if we restart a volume , the negative lookup caching feature is lost.
I did a lookup of a non existing file with nl caching enabled
generally the first lookup takes about 7-13 lookup FOPs depending on the dht hash/unhashed brick
On a second lookup, I notice that the lookup FOP generally  seemed to take about 1 lookup FOP on each brick

I then did a volume restart and reran above checks and found that even on repeatitive  negative lookups, the lookup FOPs take the same as without nl cache enabled

I saw the trusted vol file and didn't note any descrepencies, as the nl cache translators got loaded


Version-Release number of selected component (if applicable):
======
[root@dhcp35-192 negative]# rpm -qa|grep gluster
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
[root@dhcp35-192 negative]# rpm -qa|grep samba
samba-common-4.6.3-0.el7rhgs.noarch
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
samba-common-tools-4.6.3-0.el7rhgs.x86_64
samba-common-libs-4.6.3-0.el7rhgs.x86_64
samba-4.6.3-0.el7rhgs.x86_64
samba-client-4.6.3-0.el7rhgs.x86_64
samba-libs-4.6.3-0.el7rhgs.x86_64
samba-client-libs-4.6.3-0.el7rhgs.x86_64
[root@dhcp35-192 negative]# 


How reproducible:
===============
consistent on my setup

Steps to Reproduce:
1.had a 3 node cluster with samba setup using https://mojo.redhat.com/docs/DOC-1027985(no ctdb setup)
2.created a 2x2 volume as below ===>vol bricks are part of n2 and n3 nodes, while n1 was the samba server node
[root@dhcp35-192 ~]# gluster v info
 
Volume Name: negative
Type: Distributed-Replicate
Volume ID: 48c809a8-8a82-4806-aaf0-5ea0bbf423cb
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.215:/rhs/brick1/negative
Brick2: 10.70.35.214:/rhs/brick1/negative
Brick3: 10.70.35.215:/rhs/brick2/negative
Brick4: 10.70.35.214:/rhs/brick2/negative
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.cache-invalidation-timeout: 3000
features.cache-invalidation: on
performance.nl-cache: on
nfs.disable: on
transport.address-family: inet
server.allow-insecure: on
storage.batch-fsync-delay-usec: 0

VOLUME PROFILE IS STARTED
4. mounted using cifs on a rhel client
5.did some amount of testing  wrt nl caching(for about half a day)
6. created a directory dir4 
7. did a negative lookup by issuing ls x2 
says file not existing as expected 
the profiling mentioned about 7-13 fops
8. again did a ls x2 to see nl cache working
fop redcued to about 1 per brick

9. restarted vol
10. repeated 7,8 steps again and again, but the lookup fop count is not reducing



before restart


first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.69      60.00 us      60.00 us      60.00 us              1    GETXATTR
     90.31     111.80 us      82.00 us     143.00 us              5      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.04     124.00 us     124.00 us     124.00 us              1    GETXATTR
     87.96     113.25 us      63.00 us     252.00 us              8      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.87      70.00 us      70.00 us      70.00 us              1    GETXATTR
     90.13     127.80 us     100.00 us     152.00 us              5      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.18     147.00 us     147.00 us     147.00 us              1    GETXATTR
     87.82     132.50 us      79.00 us     299.00 us              8      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes



second negative lookup 
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     336.00 us     336.00 us     336.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     338.00 us     338.00 us     338.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     144.00 us     144.00 us     144.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     263.00 us     263.00 us     263.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
[root@dhcp35-192 ~]# 



#############################
on restart

first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes


second negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: 10.70.35.214:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.214:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: 10.70.35.215:/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes






Workaround:
========
post restart turn off and again turn on nl cache option

Comment 2 Poornima G 2017-05-12 12:50:26 UTC

RCA:

The cache is stored in the directories inode_ctx, and the time of cache is set to the time inode_ctx was created, and is not updated as and when new cache entries are added to the inode_ctx(as it is cumbersome to have timeout for each entry, and updating global time will result in stale cache of older entries). Consider a sequence of operation as below:

At T1, lookup dir1/f1 (f1 doesn't exist)
       dir1 inode_ctx->cache_time = T1
            cache = "f1"
       lookup dir f2 (f2 doesn't exist)
       dir1 inode_ctx->cache_time remains unchanged
            cache = "f1", "f2"

Child_DOWN / all /some of the bricks down at T2
       thus anything cached before T2 is considered invalid but is not cleared from memory.

At T3, lookup dir1/f1
       dir1 inode_ctx->cache_time(T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

At T4, lookup dir1/f3
       dir1 inode_ctx->cache_time(T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

Thus the cache of dir1 is invalid until T1+10 min(cache timeout), and also for the 10 min no new cache is added to the dir1.

Consequence:
After any brick down/up, the cache of all the existing directories is invalid and cannot be corrected for max 10 min. Thus after brick down/up the nl-cache doesn't work for the directories that already had the cache, for the max of 10 min.



Solution:
If the inoce_ctx->cachetime < brick_down time, then delete the existing cache and start populating fresh cache.

Comment 5 Atin Mukherjee 2017-06-04 08:03:28 UTC

Upstream patch : https://review.gluster.org/17453

Comment 6 Atin Mukherjee 2017-06-13 05:01:41 UTC

(In reply to Atin Mukherjee from comment #5)
> Upstream patch : https://review.gluster.org/17453

One more upstream patch https://review.gluster.org/#/c/17450/ is required here.

Comment 7 Atin Mukherjee 2017-06-13 10:27:26 UTC

downstream patches:

https://code.engineering.redhat.com/gerrit/#/c/108886/
https://code.engineering.redhat.com/gerrit/#/c/108892/

Comment 11 errata-xmlrpc 2017-09-21 04:41:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774