Bug 1458539 - [Negative Lookup]: negative lookup feature doesn't seem to work on restart of volume
Summary: [Negative Lookup]: negative lookup feature doesn't seem to work on restart of volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: gluster-smb
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1450080
Blocks: 1460899
 
Reported: 2017-06-04 05:34 UTC by Poornima G
Modified: 2017-09-05 17:33 UTC
CC: 6 users

Fixed In Version: glusterfs-3.12.0
Clone Of: 1450080
Clones: 1460899
Environment:
Last Closed: 2017-09-05 17:33:02 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Poornima G 2017-06-04 05:34:50 UTC
+++ This bug was initially created as a clone of Bug #1450080 +++

Description of problem:
========================
It seems that if we restart a volume, the negative lookup caching feature is lost.
I did a lookup of a non-existent file with nl caching enabled.
Generally the first lookup takes about 7-13 LOOKUP FOPs, depending on the dht hashed/unhashed brick.
On a second lookup, it generally took about 1 LOOKUP FOP on each brick.

I then restarted the volume and reran the above checks, and found that even on repetitive negative lookups the LOOKUP FOP count is the same as without nl-cache enabled.

I checked the trusted vol file and didn't notice any discrepancies; the nl-cache translators were loaded.


Version-Release number of selected component (if applicable):
======
[root@dhcp35-192 negative]# rpm -qa|grep gluster
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
[root@dhcp35-192 negative]# rpm -qa|grep samba
samba-common-4.6.3-0.el7rhgs.noarch
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
samba-common-tools-4.6.3-0.el7rhgs.x86_64
samba-common-libs-4.6.3-0.el7rhgs.x86_64
samba-4.6.3-0.el7rhgs.x86_64
samba-client-4.6.3-0.el7rhgs.x86_64
samba-libs-4.6.3-0.el7rhgs.x86_64
samba-client-libs-4.6.3-0.el7rhgs.x86_64
[root@dhcp35-192 negative]# 


How reproducible:
===============
consistent on my setup

Steps to Reproduce:
1. Had a 3-node cluster with Samba set up using https://mojo.redhat.com/docs/DOC-1027985 (no CTDB setup)
2. Created a 2x2 volume as below (vol bricks are on nodes n2 and n3, while n1 was the Samba server node)
[root@dhcp35-192 ~]# gluster v info
 
Volume Name: negative
Type: Distributed-Replicate
Volume ID: 48c809a8-8a82-4806-aaf0-5ea0bbf423cb
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: :/rhs/brick1/negative
Brick2: :/rhs/brick1/negative
Brick3: :/rhs/brick2/negative
Brick4: :/rhs/brick2/negative
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.cache-invalidation-timeout: 3000
features.cache-invalidation: on
performance.nl-cache: on
nfs.disable: on
transport.address-family: inet
server.allow-insecure: on
storage.batch-fsync-delay-usec: 0

3. Volume profile is started.
4. Mounted the volume using CIFS on a RHEL client.
5. Did some testing w.r.t. nl caching (for about half a day).
6. Created a directory dir4.
7. Did a negative lookup by issuing ls x2; it said the file does not exist, as expected. Profiling showed about 7-13 LOOKUP FOPs.
8. Again did ls x2 to see nl cache working; the FOP count reduced to about 1 per brick.

9. Restarted the volume.
10. Repeated steps 7 and 8 again and again, but the LOOKUP FOP count does not reduce.



before restart


first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.69      60.00 us      60.00 us      60.00 us              1    GETXATTR
     90.31     111.80 us      82.00 us     143.00 us              5      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.04     124.00 us     124.00 us     124.00 us              1    GETXATTR
     87.96     113.25 us      63.00 us     252.00 us              8      LOOKUP
 
    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      9.87      70.00 us      70.00 us      70.00 us              1    GETXATTR
     90.13     127.80 us     100.00 us     152.00 us              5      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 16 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
     12.18     147.00 us     147.00 us     147.00 us              1    GETXATTR
     87.82     132.50 us      79.00 us     299.00 us              8      LOOKUP
 
    Duration: 5 seconds
   Data Read: 0 bytes
Data Written: 0 bytes



second negative lookup 
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     336.00 us     336.00 us     336.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     338.00 us     338.00 us     338.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     144.00 us     144.00 us     144.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 17 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     263.00 us     263.00 us     263.00 us              1      LOOKUP
 
    Duration: 6 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
[root@dhcp35-192 ~]# 



#############################
on restart

first negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes


second negative lookup
[root@dhcp35-192 ~]# gluster v profile negative info incremental
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     121.29 us      77.00 us     251.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     116.08 us      53.00 us     446.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick1/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     125.92 us      69.00 us     477.00 us             13      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
 
Brick: :/rhs/brick2/negative
----------------------------------------
Interval 13 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
    100.00     113.86 us      74.00 us     249.00 us              7      LOOKUP
 
    Duration: 8 seconds
   Data Read: 0 bytes
Data Written: 0 bytes






Workaround:
========
Post restart, turn the nl-cache option off and then on again.
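
For reference, on this setup the workaround corresponds to the following volume-set commands (using the volume name "negative" from this report):
gluster volume set negative performance.nl-cache off
gluster volume set negative performance.nl-cache on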

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-11 09:51:31 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Poornima G on 2017-05-12 08:50:26 EDT ---

RCA:

The cache is stored in the directory's inode_ctx, and the cache time is set to the time the inode_ctx was created; it is not updated as new cache entries are added to the inode_ctx (it is cumbersome to have a timeout for each entry, and updating a global time would result in stale cache for the older entries). Consider the following sequence of operations:

At T1, lookup dir1/f1 (f1 doesn't exist)
       dir1 inode_ctx->cache_time = T1
            cache = "f1"
       lookup dir1/f2 (f2 doesn't exist)
       dir1 inode_ctx->cache_time remains unchanged
            cache = "f1", "f2"

CHILD_DOWN (all/some of the bricks go down) at T2
       thus anything cached before T2 is considered invalid but is not cleared from memory.

At T3, lookup dir1/f1
       dir1 inode_ctx->cache_time(T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

At T4, lookup dir1/f3
       dir1 inode_ctx->cache_time(T1) < T2, hence do not serve from cache.
       lookup is sent to the brick.

Thus the cache of dir1 is invalid until T1 + 10 min (the cache timeout), and for those 10 min no new cache is added to dir1.

Consequence:
After any brick down/up, the cache of all existing directories is invalid and cannot be corrected for up to 10 min. Thus, after a brick down/up, nl-cache does not work for directories that already had a cache, for up to 10 min.



Solution:
If inode_ctx->cache_time < brick_down_time, then delete the existing cache and start populating a fresh cache.
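
The sketch below models this RCA in standalone C; the names (dir_cache_t, cache_is_valid, add_entry) and the fixed-size entry array are simplified stand-ins, not the actual nl-cache xlator code. It shows why a single per-directory cache time leaves the cache unusable until the old timestamp ages out, and how purging when cache_time <= brick_down_time restores caching immediately:

#include <stdio.h>
#include <string.h>
#include <time.h>

#define CACHE_TIMEOUT 600            /* 10 min, as in the RCA above */
#define MAX_ENTRIES   8

typedef struct {
    time_t cache_time;               /* set when the cache is (re)started */
    char   entries[MAX_ENTRIES][64]; /* names known not to exist */
    int    count;
} dir_cache_t;

/* Cache is usable only if it was built after the last brick-down event
 * and has not aged past the timeout. */
static int cache_is_valid(const dir_cache_t *c, time_t brick_down_time, time_t now)
{
    if (c->count == 0)
        return 0;
    if (c->cache_time <= brick_down_time)    /* built before CHILD_DOWN */
        return 0;
    return (now - c->cache_time) < CACHE_TIMEOUT;
}

/* Proposed fix: if the existing cache predates the brick-down, purge it
 * and restart the clock instead of adding entries to a stale cache. */
static void add_entry(dir_cache_t *c, const char *name,
                      time_t brick_down_time, time_t now)
{
    if (c->count > 0 && c->cache_time <= brick_down_time)
        c->count = 0;                        /* drop the stale cache */
    if (c->count == 0)
        c->cache_time = now;                 /* fresh cache, fresh time */
    if (c->count < MAX_ENTRIES) {
        strncpy(c->entries[c->count], name, 63);
        c->entries[c->count][63] = '\0';
        c->count++;
    }
}

int main(void)
{
    dir_cache_t dir1 = {0};
    time_t t1 = time(NULL);

    add_entry(&dir1, "f1", 0, t1);           /* negative lookup of dir1/f1 at T1 */
    time_t t2 = t1 + 5;                      /* CHILD_DOWN at T2 */

    printf("valid right after brick down: %d\n", cache_is_valid(&dir1, t2, t2 + 1));
    add_entry(&dir1, "f3", t2, t2 + 2);      /* negative lookup of dir1/f3 at T3 */
    printf("valid after re-populating:    %d\n", cache_is_valid(&dir1, t2, t2 + 3));
    return 0;
}

Without the purge in add_entry, the second cache_is_valid call would also return 0 until CACHE_TIMEOUT elapses, which is the behaviour described in the consequence above.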

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-15 05:48:25 EDT ---

This bug is automatically being provided 'pm_ack+' for the release flag 'rhgs-3.3.0', the current release of Red Hat Gluster Storage 3 under active development, having been appropriately marked for the release, and having been provided ACK from Development and QE.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-05-15 06:03:05 EDT ---

Since this bug has been approved for the RHGS 3.3.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.3.0+', and through the Internal Whiteboard entry of '3.3.0', the Target Release is being automatically set to 'RHGS 3.3.0'

Comment 1 Worker Ant 2017-06-04 05:36:22 UTC
REVIEW: https://review.gluster.org/17453 (nl-cache: Fix a possible crash and stale cache) posted (#1) for review on master by Poornima G (pgurusid)

Comment 2 Worker Ant 2017-06-08 05:24:49 UTC
REVIEW: https://review.gluster.org/17453 (nl-cache: Fix a possible crash and stale cache) posted (#2) for review on master by Poornima G (pgurusid)

Comment 3 Worker Ant 2017-06-09 05:33:27 UTC
REVIEW: https://review.gluster.org/17453 (nl-cache: Fix a possible crash and stale cache) posted (#3) for review on master by Poornima G (pgurusid)

Comment 4 Worker Ant 2017-06-12 05:03:12 UTC
REVIEW: https://review.gluster.org/17453 (nl-cache: Fix a possible crash and stale cache) posted (#4) for review on master by Poornima G (pgurusid)

Comment 5 Worker Ant 2017-06-12 05:14:44 UTC
REVIEW: https://review.gluster.org/17453 (nl-cache: Fix a possible crash and stale cache) posted (#5) for review on master by Poornima G (pgurusid)

Comment 6 Worker Ant 2017-06-13 05:01:21 UTC
COMMIT: https://review.gluster.org/17453 committed in master by Atin Mukherjee (amukherj) 
------
commit 7674584fa53944a4e982e217798f31a3d1ef313b
Author: Poornima G <pgurusid>
Date:   Fri May 26 15:45:57 2017 +0530

    nl-cache: Fix a possible crash and stale cache
    
    Issue1:
    Consider the following sequence of operations:
       ...
       nlc_ctx = nlc_ctx_get (inode i1)
       ....... -> nlc_clear_cache (i1) gets called as a part of nlc_invalidate
                                       or any other callers
                  ...
                   GF_FREE (i1's nlc_ctx)
       LOCK (nlc_ctx->lock);  ->  This will result in crash as the ctx
                                  got freed in nlc_clear_cache.
    
    Issue2:
       lookup on dir1/file1 result in ENOENT
       add cache to dir1 at time T1
       ....
       CHILD_DOWN at T2
       lookup on dir1/file2 result in ENOENT
       add cache to dir1, but the cache time is still T1
       lookup on dir1/file2 - should have been served from cache
                              but the cache time is T1 < T2, hence
                              cache is considered as invalid.
    So, after CHILD_DOWN the right thing would be to clear the cache
    and restart caching on that inode.
    
    Solution:
    Do not free nlc_ctx in nlc_clear_cache, but only in inode_forget()
    The fix for both issue1 and issue2 is interleaved, hence sending it
    as a single patch.
    
    Change-Id: I83d8ed36c049a93567c6d7e63d045dc14ccbb397
    BUG: 1458539
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: https://review.gluster.org/17453
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
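
A minimal, standalone C sketch of the lifetime change described in the commit above; the struct layout and helper bodies here are simplified stand-ins, not the actual nl-cache xlator code. The point is that clearing the cache only empties it under the lock, while the context itself is freed only once the inode is forgotten, so a concurrent caller can no longer take a lock inside freed memory:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    char          **entries;   /* cached negative entries */
    int             count;
} nlc_ctx_t;

/* After the fix: clearing the cache only empties it under the lock;
 * it does NOT free the context itself. */
static void nlc_clear_cache(nlc_ctx_t *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    for (int i = 0; i < ctx->count; i++)
        free(ctx->entries[i]);
    free(ctx->entries);
    ctx->entries = NULL;
    ctx->count = 0;
    pthread_mutex_unlock(&ctx->lock);
}

/* The context is released only when the inode is forgotten, i.e. when
 * no other code path can still be racing to take ctx->lock. */
static void nlc_inode_forget(nlc_ctx_t *ctx)
{
    nlc_clear_cache(ctx);
    pthread_mutex_destroy(&ctx->lock);
    free(ctx);
}

int main(void)
{
    nlc_ctx_t *ctx = calloc(1, sizeof(*ctx));
    pthread_mutex_init(&ctx->lock, NULL);

    nlc_clear_cache(ctx);   /* e.g. on CHILD_DOWN: cache gone, ctx still alive */
    nlc_clear_cache(ctx);   /* a later or racing caller can still lock safely */
    nlc_inode_forget(ctx);  /* ctx is freed only here */
    return 0;
}

Before the fix, the first nlc_clear_cache would have freed ctx outright, so any caller already past the ctx lookup would take ctx->lock on freed memory (Issue1), and the stale per-inode cache time caused Issue2.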

Comment 7 Shyamsundar 2017-09-05 17:33:02 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

