Bug 2131391 - fuse readdir cache sometimes corrupted
Summary: fuse readdir cache sometimes corrupted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Miklos Szeredi
QA Contact: Boyang Xue
URL:
Whiteboard:
Depends On:
Blocks: 2142657
 
Reported: 2022-09-30 20:34 UTC by Frank Sorenson
Modified: 2023-05-16 10:37 UTC (History)
CC List: 5 users (show)

Fixed In Version: kernel-4.18.0-441.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2142657 (view as bug list)
Environment:
Last Closed: 2023-05-16 08:53:09 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer program (5.48 KB, text/plain) - 2022-09-30 20:34 UTC, Frank Sorenson
proposed fix (upstream) (1.15 KB, text/plain) - 2022-10-05 08:17 UTC, Miklos Szeredi
patch to control file and directory caching separately (2.15 KB, patch) - 2022-10-07 19:08 UTC, Frank Sorenson
proposed fix (v2) (1.64 KB, message/rfc822) - 2022-10-17 11:10 UTC, Miklos Szeredi
proposed patch (v3) (1.75 KB, patch) - 2022-10-19 15:09 UTC, Miklos Szeredi


Links
Gitlab redhat/rhel/src/kernel rhel-8 merge_requests 3741 (last updated 2022-11-14 15:26:08 UTC)
Red Hat Issue Tracker RHELPLAN-135474 (last updated 2022-10-01 09:42:26 UTC)
Red Hat Product Errata RHSA-2023:2951 (last updated 2023-05-16 08:53:48 UTC)

Description Frank Sorenson 2022-09-30 20:34:54 UTC
Created attachment 1915352 [details]
reproducer program

Description of problem:

When multiple processes perform parallel directory listings and readdir results are cached, a process listing a cached directory will periodically begin reading from a page containing bogus data.

The kernel typically detects this bogus data due to an invalid (very large) entry name length/entry size, outputs a WARNING, and returns EIO to userspace.  The kernel will continue to (attempt to) parse this invalid data on every readdir of the directory until the cache times out.


[3109130.031012] WARNING: CPU: 2 PID: 317313 at fs/fuse/readdir.c:396 fuse_readdir+0x5bb/0x680 [fuse]
[3109130.031070] CPU: 2 PID: 317313 Comm: find Kdump: loaded Tainted: GF       W  OE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[3109130.031072] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[3109130.031081] RIP: 0010:fuse_readdir+0x5bb/0x680 [fuse]
[3109130.031100] Call Trace:
[3109130.031125]  iterate_dir+0x13c/0x190
[3109130.031128]  ksys_getdents64+0x9c/0x130
[3109130.031130]  ? iterate_dir+0x190/0x190
[3109130.031133]  __x64_sys_getdents64+0x16/0x20
[3109130.031138]  do_syscall_64+0x5b/0x1a0
[3109130.031142]  entry_SYSCALL_64_after_hwframe+0x65/0xca

377 static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
378                                                void *addr, unsigned int size,
379                                                struct dir_context *ctx)
380 {
381         unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
382         enum fuse_parse_result res = FOUND_NONE;
...
396                 if (WARN_ON(dirent->namelen > FUSE_NAME_MAX))
397                         return FOUND_ERR;


Examining a vmcore or using systemtap shows that the contents of the page are unrecognizable (beginning at offset 0).  namelen will be some large number, such as 100663296 (0x6000000) or 1269538688 (0x4bab9f80).



Version-Release number of selected component (if applicable):

RHEL kernel versions that support fuse cache_readdir, at least since 8.3; seen with:
  kernel-4.18.0-240.22.1.el8_3
  kernel-4.18.0-348.20.1.el8_5
  kernel-4.18.0-372.27.1.el8_6

also reproduced with upstream kernel

fuse-libs versions which support readdir caching:
  fuse-libs-2.9.7-16.el8 and newer


How reproducible:

easy


Steps to Reproduce:

# dnf install fuse{,-libs}-2.9.7-16.el8.x86_64 fuse3{,-devel,-libs,-common}-3.3.0-16.el8.x86_64

download & install cvmfs rpms from https://cernvm.cern.ch/fs/

https://ecsft.cern.ch/dist/cvmfs/cvmfs-2.9.4/cvmfs-2.9.4-1.el8.x86_64.rpm
https://ecsft.cern.ch/dist/cvmfs/cvmfs-2.9.4/cvmfs-fuse3-2.9.4-1.el8.x86_64.rpm
http://ecsft.cern.ch/dist/cvmfs/cvmfs-config/cvmfs-config-default-2.0-2.noarch.rpm


simple cvmfs setup:
    /etc/cvmfs/default.local:
        CVMFS_REPOSITORIES="$((echo oasis.opensciencegrid.org;echo cms.cern.ch;ls /cvmfs)|sort -u|paste -sd ,)"
        CVMFS_HTTP_PROXY="DIRECT"

        # and if you need to limit the size of the cached data:
        # CVMFS_QUOTA_LIMIT=500
        # or to move it to another location (from default location of /var/lib/cvmfs) -- it will then complain if the limit is less than 1 GiB
        # CVMFS_CACHE_BASE=/other/path/cvmfs

setup and restart autofs:
    /etc/auto.master.d/cvmfs.autofs:
        /cvmfs /etc/auto.cvmfs

    # systemctl restart autofs.service


compile the provided walk_tree.c:
    # gcc -Wall walk_tree.c -o walk_tree -g


start multiple processes crawling a cvmfs filesystem:
    usage: ./walk_tree <child_threads> <starting path>

    # ./walk_tree 8 /cvmfs/oasis.opensciencegrid.org
        tid 5670, child 0: alive
        tid 5671, child 1: alive
        tid 5672, child 2: alive
        tid 5673, child 3: alive
        tid 5675, child 5: alive
        tid 5676, child 6: alive
        tid 5674, child 4: alive
        tid 5677, child 7: alive
        tid 5672, child 2: error getting directory entries in '/cvmfs/oasis.opensciencegrid.org/geant4/externals/cmake/v3_9_0/source/cmake-3.9.0/Tests/TryCompile/Inner', inode # 327519: Input/output error
tid 5673, child 3: error getting directory entries in '/cvmfs/oasis.opensciencegrid.org/geant4/externals/cmake/v3_9_0/source/cmake-3.9.0/Tests/TryCompile/Inner', inode # 327519: Input/output error
tid 5673, child 3: exiting with ERROR
tid 5676, child 6: error getting directory entries in '/cvmfs/oasis.opensciencegrid.org/geant4/externals/cmake/v3_9_0/source/cmake-3.9.0/Tests/TryCompile/Inner', inode # 327519: Input/output error
...


Actual results:

EIO is returned to userspace when listing the directory, and the kernel outputs a WARNING.


Expected results:

no kernel warnings or errors when accessing the filesystem



Additional info:

Thus far, this has only been reproduced with cvmfs2, despite some attempts to create a stripped-down filesystem to simplify debugging.

However, the issue occurs after the kernel does a sanity check on the response from the userspace filesystem and has copied the dirents into cache.  Therefore, the problem appears to be entirely within the kernel.

The fact that this only occurs when multiple processes are listing the same directories would also suggest that there may be a race in the kernel code which manages the readdir cache.
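
(Illustrative sketch only, not the attached walk_tree.c reproducer: the failing call pattern is simply a getdents64() loop on an affected directory, which starts failing with EIO once the cached page contains bogus data, matching the "Input/output error" output above. Names such as list_dir are hypothetical.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read a directory with raw getdents64() calls; on an affected mount this
 * fails with EIO as soon as the corrupted readdir cache page is parsed. */
static int list_dir(const char *path)
{
	char buf[64 * 1024];
	int fd = open(path, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	for (;;) {
		long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));

		if (n < 0) {	/* EIO surfaces here */
			fprintf(stderr, "getdents64(%s): %s\n", path, strerror(errno));
			close(fd);
			return -1;
		}
		if (n == 0)	/* end of directory */
			break;
	}
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	return (argc > 1 && list_dir(argv[1]) != 0) ? 1 : 0;
}
```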

Comment 1 Miklos Szeredi 2022-10-03 09:04:26 UTC
Can you please provide a crashdump (echo 1 > /proc/sys/kernel/panic_on_warn) and upload it to https://galvatron-x86.cee.redhat.com/

Comment 3 Frank Sorenson 2022-10-03 18:40:30 UTC
looking at the vmcore:

[  852.595240] WARNING: CPU: 3 PID: 15069 at fs/fuse/readdir.c:396 fuse_readdir+0x5bb/0x680 [fuse]
[  852.595323] Kernel panic - not syncing: panic_on_warn set ...
               
[  852.595451] CPU: 3 PID: 15069 Comm: walk_tree Kdump: loaded Not tainted 4.18.0-372.31.1.el8_6.x86_64 #1
[  852.595574] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[  852.595708] Call Trace:
[  852.595928]  dump_stack+0x41/0x60
[  852.596126]  panic+0xe7/0x2ac
[  852.596281]  ? fuse_readdir+0x5bb/0x680 [fuse]
[  852.596500]  __warn.cold.14+0x31/0x40
[  852.596712]  ? fuse_readdir+0x5bb/0x680 [fuse]
[  852.596898]  ? fuse_readdir+0x5bb/0x680 [fuse]
[  852.597048]  report_bug+0xb1/0xd0
[  852.597176]  ? terminate_walk+0x7a/0xe0
[  852.597319]  do_error_trap+0x9e/0xd0
[  852.597456]  do_invalid_op+0x36/0x40
[  852.597608]  ? fuse_readdir+0x5bb/0x680 [fuse]
[  852.597761]  invalid_op+0x14/0x20
[  852.597902] RIP: 0010:fuse_readdir+0x5bb/0x680 [fuse]
[  852.598064] Code: c1 48 39 c2 0f 85 81 00 00 00 48 d1 e9 48 89 8b f0 02 00 00 e9 74 fb ff ff 4d 89 fe 48 8b 5c 24 10 4c 8b 64 24 08 4c 8b 3c 24 <0f> 0b c7 44 24 1c ff ff ff ff e9 5a fe ff ff 4d 89 fe 48 8b 5c 24
[  852.598417] RSP: 0018:ffffbcd3036abe10 EFLAGS: 00010286
[  852.598586] RAX: 0000000000000090 RBX: ffff9971d2edaa00 RCX: 000000008949ffff
[  852.598749] RDX: 000000008949ffff RSI: 0000000000000000 RDI: ffff9971d2edacf8
[  852.598918] RBP: ffffbcd3036abe80 R08: ffffbcd3036abd80 R09: 0000000000000000
[  852.599083] R10: ffffbcd3036abe90 R11: ffff99718d86b000 R12: ffffe54584361ac0
[  852.599244] R13: ffff9970b30d99c0 R14: ffffbcd3036abed0 R15: ffff997182ccab00
[  852.599446]  iterate_dir+0x13c/0x190
[  852.599619]  ksys_getdents64+0x9c/0x130
[  852.599752]  ? iterate_dir+0x190/0x190
[  852.599891]  __x64_sys_getdents64+0x16/0x20
[  852.600031]  do_syscall_64+0x5b/0x1a0
[  852.600173]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  852.600325] RIP: 0033:0x7f3a1857f78d
[  852.600468] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48
[  852.600854] RSP: 002b:00007ffdb84e18e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
[  852.601026] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a1857f78d
[  852.601192] RDX: 0000000000010000 RSI: 000000000105e930 RDI: 000000000000001b
[  852.601367] RBP: 00007ffdb84e19d0 R08: 0000000000000000 R09: 0000000000b5e680
[  852.601529] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400c30
[  852.601713] R13: 00007ffdb84e2780 R14: 0000000000000000 R15: 0000000000000000

PID: 15069    TASK: ffff997241270000  CPU: 3    COMMAND: "walk_tree"

    [exception RIP: fuse_readdir+0x5bb]
    RIP: ffffffffc021521b  RSP: ffffbcd3036abe10  RFLAGS: 00010286
    RAX: 0000000000000090  RBX: ffff9971d2edaa00  RCX: 000000008949ffff
    RDX: 000000008949ffff  RSI: 0000000000000000  RDI: ffff9971d2edacf8
    RBP: ffffbcd3036abe80   R8: ffffbcd3036abd80   R9: 0000000000000000
    R10: ffffbcd3036abe90  R11: ffff99718d86b000  R12: ffffe54584361ac0
    R13: ffff9970b30d99c0  R14: ffffbcd3036abed0  R15: ffff997182ccab00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffffbcd3036abe88] iterate_dir at ffffffff9775593c
#11 [ffffbcd3036abec8] ksys_getdents64 at ffffffff977565bc
#12 [ffffbcd3036abf30] __x64_sys_getdents64 at ffffffff97756666
#13 [ffffbcd3036abf38] do_syscall_64 at ffffffff9740430b
#14 [ffffbcd3036abf50] entry_SYSCALL_64_after_hwframe at ffffffff97e000ad

the getdents64() is on fd 27:
PID: 15069    TASK: ffff997241270000  CPU: 3    COMMAND: "walk_tree"
ROOT: /    CWD: /var/tmp/test_fs
 FD       FILE            DENTRY           INODE       TYPE PATH
 27 ffff997182ccab00 ffff9971d2ec36c0 ffff9971d2edaa00 DIR  /cvmfs/oasis.opensciencegrid.org/geant4/externals/boost/v1_57_0/source/boost_1_57_0/doc/html/boost_asio/reference/windows__basic_stream_handle/get_implementation


the WARNING is at:

377 static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
378                                                void *addr, unsigned int size,
379                                                struct dir_context *ctx)
380 {
381         unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
382         enum fuse_parse_result res = FOUND_NONE;
383 
384         WARN_ON(offset >= size);
385 
386         for (;;) {
387                 struct fuse_dirent *dirent = addr + offset;
388                 unsigned int nbytes = size - offset;
389                 size_t reclen;
390 
391                 if (nbytes < FUSE_NAME_OFFSET || !dirent->namelen)
392                         break;

/usr/src/debug/kernel-4.18.0-372.31.1.el8_6/linux-4.18.0-372.31.1.el8_6.x86_64/fs/fuse/readdir.c: 391
0xffffffffc0214fc3 <fuse_readdir+0x363>:        cmp    $0x17,%eax
0xffffffffc0214fc6 <fuse_readdir+0x366>:        jbe    0xffffffffc02152ab <fuse_readdir+0x64b>

^^^^^ (nbytes < FUSE_NAME_OFFSET)


0xffffffffc0214fcc <fuse_readdir+0x36c>:        mov    0x10(%r11),%ecx
0xffffffffc0214fd0 <fuse_readdir+0x370>:        movl   $0x0,0x1c(%rsp)
0xffffffffc0214fd8 <fuse_readdir+0x378>:        test   %ecx,%ecx
0xffffffffc0214fda <fuse_readdir+0x37a>:        je     0xffffffffc0215084 <fuse_readdir+0x424>

^^^^^ (! dirent->namelen)

dirent is in %r11
dirent->namelen is in %ecx

    RAX: 0000000000000090  RBX: ffff9971d2edaa00  RCX: 000000008949ffff
    R10: ffffbcd3036abe90  R11: ffff99718d86b000  R12: ffffe54584361ac0

393 
394                 reclen = FUSE_DIRENT_SIZE(dirent); /* derefs ->namelen */
395 
396                 if (WARN_ON(dirent->namelen > FUSE_NAME_MAX))
397                         return FOUND_ERR;

/usr/src/debug/kernel-4.18.0-372.31.1.el8_6/linux-4.18.0-372.31.1.el8_6.x86_64/fs/fuse/readdir.c: 396
0xffffffffc0215001 <fuse_readdir+0x3a1>:        cmp    $0x400,%ecx
0xffffffffc0215007 <fuse_readdir+0x3a7>:        ja     0xffffffffc021520a <fuse_readdir+0x5aa>

^^^^^ (dirent->namelen > FUSE_NAME_MAX)
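
For reference, both constants in those comparisons fall straight out of the dirent layout; the definitions below are the upstream ones from include/uapi/linux/fuse.h and fs/fuse/fuse_i.h (assumed unchanged in this kernel). FUSE_NAME_OFFSET is the 24-byte fixed header, so "cmp $0x17; jbe" is the nbytes < FUSE_NAME_OFFSET test and "cmp $0x400; ja" is the namelen > FUSE_NAME_MAX test. A standalone snippet to sanity-check the values:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* include/uapi/linux/fuse.h */
struct fuse_dirent {
	uint64_t ino;
	uint64_t off;
	uint32_t namelen;
	uint32_t type;
	char name[];
};

#define FUSE_NAME_MAX 1024	/* fs/fuse/fuse_i.h */
#define FUSE_NAME_OFFSET offsetof(struct fuse_dirent, name)
#define FUSE_DIRENT_ALIGN(x) \
	(((x) + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1))
#define FUSE_DIRENT_SIZE(d) \
	FUSE_DIRENT_ALIGN(FUSE_NAME_OFFSET + (d)->namelen)

int main(void)
{
	printf("FUSE_NAME_OFFSET = %zu (0x%zx)\n",
	       FUSE_NAME_OFFSET, FUSE_NAME_OFFSET);	/* 24, 0x18 */
	printf("FUSE_NAME_MAX    = %d (0x%x)\n",
	       FUSE_NAME_MAX, FUSE_NAME_MAX);		/* 1024, 0x400 */
	return 0;
}
```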

crash> struct fuse_dirent.namelen ffff99718d86b000 -d
  namelen = 2303328255,

...
0xffffffffc021520a <fuse_readdir+0x5aa>:        mov    %r15,%r14
0xffffffffc021520d <fuse_readdir+0x5ad>:        mov    0x10(%rsp),%rbx
0xffffffffc0215212 <fuse_readdir+0x5b2>:        mov    0x8(%rsp),%r12
0xffffffffc0215217 <fuse_readdir+0x5b7>:        mov    (%rsp),%r15
/usr/src/debug/kernel-4.18.0-372.31.1.el8_6/linux-4.18.0-372.31.1.el8_6.x86_64/fs/fuse/readdir.c: 396
0xffffffffc021521b <fuse_readdir+0x5bb>:        ud2    

crash> struct fuse_dirent ffff99718d86b000
struct fuse_dirent {
  ino = 0x8b480000441f0fff,
  off = 0xf95ce8f7894c1053,
  namelen = 0x8949ffff,
  type = 0xc08548c7,
  name = 0xffff99718d86b018 "\017\204\220\376\377\377H\213\005\371\266*"

crash> rd ffff99718d86b000 10
ffff99718d86b000:  8b480000441f0fff f95ce8f7894c1053   ...D..H.S.L...\.
ffff99718d86b010:  c08548c78949ffff 8b48fffffe90840f   ..I..H........H.
ffff99718d86b020:  38394c002ab6f905 f64100000120840f   ...*.L98.. ...A.
ffff99718d86b030:  0000cf840f010147 8324558b24438b00   G.........C$.U$.
ffff99718d86b040:  000000da840ffff8 000209840ffffa83   ................



crash> inode.i_mapping ffff9971d2edaa00
  i_mapping = 0xffff9971d2edab78,

crash> address_space.i_pages 0xffff9971d2edab78 -ox
struct address_space {
  [ffff9971d2edab80] struct xarray i_pages;

crash> tree -t x -r address_space.i_pages 0xffff9971d2edab78 -s page.index
ffffe54584361ac0
      index = 0x0,

crash> kmem ffffe54584361ac0
      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe54584361ac0 10d86b000 ffff9971d2edab78        0  3 57ffffc00000a1 locked,lru,waiters

crash> ptov 10d86b000
VIRTUAL           PHYSICAL     
ffff99718d86b000  10d86b000


crash> fuse_inode.rdc ffff9971d2edaa00
    rdc = {
      cached = 0x1,
      size = 0x90,
      pos = 0x90,
      version = 0x2,
      mtime = {
        tv_sec = 0x54521895,
        tv_nsec = 0x0
      },
      iversion = 0x0,

crash> struct file.private_data ffff997182ccab00
  private_data = 0xffff9970b30d99c0,

crash> fuse_file.open_flags 0xffff9970b30d99c0
  open_flags = 0x8,

include/uapi/linux/fuse.h
#define FOPEN_DIRECT_IO         (1 << 0)
#define FOPEN_KEEP_CACHE        (1 << 1)
#define FOPEN_NONSEEKABLE       (1 << 2)
#define FOPEN_CACHE_DIR         (1 << 3)




the inode is also found in this process, which is currently opening that directory:
PID: 15075    TASK: ffff9972424b4d80  CPU: 2    COMMAND: "walk_tree"
 #0 [ffffbcd30167f9a0] __schedule at ffffffff97da1861
 #1 [ffffbcd30167fa30] schedule at ffffffff97da1df5
 #2 [ffffbcd30167fa40] io_schedule at ffffffff97da21f2
 #3 [ffffbcd30167fa50] __lock_page at ffffffff976832ed
 #4 [ffffbcd30167fad8] invalidate_inode_pages2_range at ffffffff97694e29
 #5 [ffffbcd30167fc60] fuse_finish_open at ffffffffc020ce81 [fuse]
 #6 [ffffbcd30167fc88] fuse_open_common at ffffffffc020d07a [fuse]
 #7 [ffffbcd30167fcd0] do_dentry_open at ffffffff9773b832
 #8 [ffffbcd30167fd00] path_openat at ffffffff977501ee
 #9 [ffffbcd30167fdd8] do_filp_open at ffffffff97752503
#10 [ffffbcd30167fee0] do_sys_open at ffffffff9773d0a4
#11 [ffffbcd30167ff38] do_syscall_64 at ffffffff9740430b
#12 [ffffbcd30167ff50] entry_SYSCALL_64_after_hwframe at ffffffff97e000ad

openat() for:
 FD       FILE            DENTRY           INODE       TYPE PATH
 25 ffff99718171d000 ffff9971d2e7d780 ffff9971d2edd080 DIR  /cvmfs/oasis.opensciencegrid.org/geant4/externals/boost/v1_57_0/source/boost_1_57_0/doc/html/boost_asio/reference/windows__basic_stream_handle

crash> filename.name ffff99718242d000
  name = 0xffff99718242d020 "get_implementation",



symbols in scope at 0xffffffffc020ce7c in 'fuse_finish_open'
        void fuse_finish_open(struct inode *, struct file *)

        inode - len 8:
                * in register $rdi
                * in register $rbp
        file - len 8: in register $rbx
        ff - len 8: in register $r13
        fc - len 8: in register $r12

r13 got stored in invalidate_inode_pages2:

 +RBP: 0xffff9971d2edaa00  << inode
 +RBX: 0xffff9971824b5500  << file
 +R12: 0xffff997241eabe00  << fuse_conn
 +R13: 0xffff9970b30d93c0  << fuse_file

crash> struct file.private_data 0xffff9971824b5500
  private_data = 0xffff9970b30d93c0,
(just making sure)

the invalidate_inode_pages2_range() is called from:

fs/fuse/file.c
 198 void fuse_finish_open(struct inode *inode, struct file *file)
 199 {
 200         struct fuse_file *ff = file->private_data;
 201         struct fuse_conn *fc = get_fuse_conn(inode);
 202 
 203         if (!(ff->open_flags & FOPEN_KEEP_CACHE))
 204                 invalidate_inode_pages2(inode->i_mapping);


crash> fuse_file.open_flags 0xffff9970b30d93c0
  open_flags = 0x8,

so FOPEN_CACHE_DIR

mm/truncate.c:
780 int invalidate_inode_pages2(struct address_space *mapping)
781 {
782         return invalidate_inode_pages2_range(mapping, 0, -1);
783 }

So the cached pages are being invalidated because FOPEN_KEEP_CACHE isn't set.

However, this is a directory, and libfuse says this flag has no effect with an opendir:

        /** Can be filled in by open. It signals the kernel that any
            currently cached file data (ie., data that the filesystem
            provided the last time the file was open) need not be
            invalidated. Has no effect when set in other contexts (in
            particular it does nothing when set by opendir()). */
        unsigned int keep_cache : 1;


So I believe this means that directories are being cached, but the cache is then immediately invalidated on the next opendir.

I suspect the page we're choking on while reading from cache has just recently been attached to this address_space, but does not actually hold any cached fuse_dirents (hence the bogus data).

Comment 4 Miklos Szeredi 2022-10-05 08:17:32 UTC
Created attachment 1916128 [details]
proposed fix (upstream)

untested patch against upstream kernel

Comment 9 Miklos Szeredi 2022-10-06 11:17:17 UTC
There might also be an issue with FOPEN_KEEP_CACHE usage.  It might be required, despite the misleading comment in libfuse.

FOPEN_CACHE_DIR: cache directory contents and use it if available
FOPEN_KEEP_CACHE: if not set, clear the cache on open

They are orthogonal, both need to be used for effective caching.  FOPEN_CACHE_DIR without FOPEN_KEEP_CACHE means: reset the cache, but build up another one.
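
For illustration (a minimal sketch, not the cvmfs code; assumes a libfuse 3 whose struct fuse_file_info exposes the cache_readdir bit, added upstream around libfuse 3.5), this is how a filesystem's opendir handler would request both flags:

```c
#define FUSE_USE_VERSION 35
#include <fuse.h>

/* Hypothetical handler; wire it into struct fuse_operations as .opendir.
 * libfuse forwards both bits to the kernel in the FUSE open reply:
 *   cache_readdir -> FOPEN_CACHE_DIR   (build/use the readdir cache)
 *   keep_cache    -> FOPEN_KEEP_CACHE  (do not drop the cache on each open)
 */
static int example_opendir(const char *path, struct fuse_file_info *fi)
{
	(void)path;
	fi->cache_readdir = 1;
	fi->keep_cache = 1;
	return 0;
}
```

Compiled against fuse3 (pkg-config fuse3) along with the rest of the filesystem's operations.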

Comment 11 Frank Sorenson 2022-10-07 19:08:15 UTC
Created attachment 1916749 [details]
patch to control file and directory caching separately

(In reply to Miklos Szeredi from comment #9)
> There might also be an issue with FOPEN_KEEP_CACHE usage.  It might be
> required, despite the misleading comment in libfuse.

I did spot this...

How about something like this (untested) patch against upstream?

Comment 12 Miklos Szeredi 2022-10-07 19:32:31 UTC
No, I think the existing kernel behavior is okay.  Not setting FOPEN_KEEP_CACHE means: invalidate the current cache, but continue building the cache (if caching is enabled, which is the default for regular files).  This is a perfectly valid concept for the directory cache as well.

The issue seems to be with the libfuse API documentation, and possibly with cvmfs code.

Comment 13 Frank Sorenson 2022-10-08 02:42:53 UTC
Comment on attachment 1916749 [details]
patch to control file and directory caching separately

okay, gotcha

Comment 15 Boyang Xue 2022-10-11 11:20:13 UTC
I failed to reproduce this bug by following the steps in comment #0. Is there something I've missed? I will try it again, and will also try from other data centers.

log
```
[root@kvm102 ~]# rpm -qa | grep -E "fuse|cvm"
fuse3-libs-3.3.0-16.el8.x86_64
cvmfs-config-default-2.0-2.noarch
fuse-common-3.3.0-16.el8.x86_64
fuse3-3.3.0-16.el8.x86_64
cvmfs-2.9.4-1.el8.x86_64
fuse3-devel-3.3.0-16.el8.x86_64
fuse-libs-2.9.7-16.el8.x86_64
fuse-2.9.7-16.el8.x86_64
cvmfs-fuse3-2.9.4-1.el8.x86_64
[root@kvm102 ~]# uname -a
Linux kvm102 4.18.0-424.el8.x86_64 #1 SMP Mon Sep 5 20:37:40 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@kvm102 ~]# cat /etc/cvmfs/default.local
CVMFS_REPOSITORIES=cms.cern.ch,oasis.opensciencegrid.org
CVMFS_HTTP_PROXY=DIRECT
[root@kvm102 ~]# cat /etc/auto.master.d/cvmfs.autofs
/cvmfs /etc/auto.cvmfs
[root@kvm102 ~]# systemctl status autofs.service
● autofs.service - Automounts filesystems on demand
   Loaded: loaded (/usr/lib/systemd/system/autofs.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/autofs.service.d
           └─50-cvmfs.conf
   Active: active (running) since Tue 2022-10-11 08:31:01 EDT; 2h 18min ago
 Main PID: 28625 (automount)
    Tasks: 6 (limit: 14592)
   Memory: 1.4G
   CGroup: /system.slice/autofs.service
           └─28625 /usr/sbin/automount --systemd-service --dont-check-daemon

Oct 11 08:50:37 kvm102 cvmfs2[29138]: (oasis.opensciencegrid.org) released nested catalogs
Oct 11 09:05:48 kvm102 cvmfs2[29066]: high watermark of pinned files (1500M > 1500M)
Oct 11 09:05:48 kvm102 cvmfs2[29075]: (cvmfs-config.cern.ch) released nested catalogs
Oct 11 09:05:48 kvm102 cvmfs2[29138]: (oasis.opensciencegrid.org) released nested catalogs
Oct 11 09:20:18 kvm102 cvmfs2[29066]: clean up cache until at most 2048000 KB is used
Oct 11 09:24:58 kvm102 cvmfs2[29066]: high watermark of pinned files (1529M > 1500M)
Oct 11 09:24:58 kvm102 cvmfs2[29075]: (cvmfs-config.cern.ch) released nested catalogs
Oct 11 09:24:58 kvm102 cvmfs2[29138]: (oasis.opensciencegrid.org) released nested catalogs
Oct 11 09:43:48 kvm102 cvmfs2[29138]: (oasis.opensciencegrid.org) CernVM-FS: unmounted /cvmfs/oasis.opensciencegrid.org (oasis.opensciencegrid.org)
Oct 11 09:47:34 kvm102 cvmfs2[29075]: (cvmfs-config.cern.ch) CernVM-FS: unmounted /cvmfs/cvmfs-config.cern.ch (cvmfs-config.cern.ch)

[root@kvm102 ~]# ./walk_tree 8 /cvmfs/oasis.opensciencegrid.org
tid 28994, child 0: alive 
tid 28997, child 3: alive                                                                      
tid 28998, child 4: alive   
tid 28999, child 5: alive  
tid 28995, child 1: alive
tid 29000, child 6: alive              
tid 28996, child 2: alive                                                                      
tid 29001, child 7: alive
tid 28995, child 1: exiting with NO ERROR                                                      
tid 28997, child 3: exiting with NO ERROR                                                      
tid 29000, child 6: exiting with NO ERROR                                                      
tid 28994, child 0: exiting with NO ERROR                                                      
tid 29001, child 7: exiting with NO ERROR                                                      
tid 28998, child 4: exiting with NO ERROR                                                      
tid 28996, child 2: exiting with NO ERROR                                                      
tid 28999, child 5: exiting with NO ERROR                                                      
tid 0, child -1: child 0 exited                                                                                                                                                               
tid 0, child -1: child 1 exited                                                                                                                                                               
tid 0, child -1: child 2 exited
tid 0, child -1: child 3 exited
tid 0, child -1: child 4 exited
tid 0, child -1: child 5 exited
tid 0, child -1: child 6 exited
tid 0, child -1: child 7 exited

# no warning in kernel log
```

Comment 16 Frank Sorenson 2022-10-11 22:30:16 UTC
fuse-2.9.7-16.el8 / fuse3-3.3.0-16.el8 should be recent enough to contain the readdir caching code:


%changelog
* Mon May 30 2022 Pavel Reichl <preichl> - 2.9.7-16
- Back-port max_pages support,
- caching symlinks in kernel page cache,
- and in-kernel readdir caching
- Fixed rhbz#2080000


and cvmfs 2.9.4-1.el8 is the same version I've been testing with

kernel 4.18.0-424.el8.x86_64 should definitely support readdir caching.




How long did the walk_tree run?

does cvmfs have contents?  for example:

# ls -al /cvmfs/oasis.opensciencegrid.org
total 56
drwxr-xr-x. 22 cvmfs cvmfs 4096 Nov 16  2017 .
drwxr-xr-x.  4 cvmfs cvmfs 4096 Oct  3  2018 accre
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 atlas
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 auger
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 cmssoft
drwxr-xr-x.  5 cvmfs cvmfs 1024 Nov 16  2017 csiu
-rw-r--r--.  1 cvmfs cvmfs  511 Jun 14  2021 .cvmfsdirtab
drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 enmr
drwxrwxr-x.  2 cvmfs cvmfs 4096 Jan 17  2016 fermilab
drwxrwxr-x.  5 cvmfs cvmfs 1024 Apr  4  2017 geant4
drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 glow
drwxr-xr-x. 19 cvmfs cvmfs 4096 May  3 19:51 gluex
drwxrwxr-x.  6 cvmfs cvmfs 1024 Nov 16  2017 ilc
drwxr-xr-x.  7 cvmfs cvmfs 4096 Dec 21  2020 jlab
drwxrwxr-x.  6 cvmfs cvmfs 4096 Mar 30  2020 ligo
drwxr-xr-x. 10 cvmfs cvmfs 4096 Oct  7 10:24 mis
drwxr-xr-x.  4 cvmfs cvmfs 1024 Nov 17  2017 nanohub
drwxrwxr-x.  2 cvmfs cvmfs 1024 Nov 17  2017 nova
drwxrwxr-x.  8 cvmfs cvmfs 4096 Jun 23  2020 osg
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 osg-software
drwxr-xr-x. 17 cvmfs cvmfs 1024 Nov 17  2017 sbgrid

Comment 17 Boyang Xue 2022-10-12 02:52:17 UTC
(In reply to Frank Sorenson from comment #16)
> fuse-2.9.7-16.el8 / fuse3-3.3.0-16.el8 should be recent enough to contain
> the readdir caching code:
> 
> 
> %changelog
> * Mon May 30 2022 Pavel Reichl <preichl> - 2.9.7-16
> - Back-port max_pages support,
> - caching symlinks in kernel page cache,
> - and in-kernel readdir caching
> - Fixed rhbz#2080000
> 
> 
> and cvmfs 2.9.4-1.el8 is the same version I've been testing with
> 
> kernel 4.18.0-424.el8.x86_64 should definitely support 
> 
> 
> 
> 
> How long did the walk_tree run?

It took 70 minutes for a single run.

> 
> does cvmfs have contents?  for example:
> 
> # ls -al /cvmfs/oasis.opensciencegrid.org
> total 56
> drwxr-xr-x. 22 cvmfs cvmfs 4096 Nov 16  2017 .
> drwxr-xr-x.  4 cvmfs cvmfs 4096 Oct  3  2018 accre
> drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 atlas
> drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 auger
> drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 cmssoft
> drwxr-xr-x.  5 cvmfs cvmfs 1024 Nov 16  2017 csiu
> -rw-r--r--.  1 cvmfs cvmfs  511 Jun 14  2021 .cvmfsdirtab
> drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 enmr
> drwxrwxr-x.  2 cvmfs cvmfs 4096 Jan 17  2016 fermilab
> drwxrwxr-x.  5 cvmfs cvmfs 1024 Apr  4  2017 geant4
> drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 glow
> drwxr-xr-x. 19 cvmfs cvmfs 4096 May  3 19:51 gluex
> drwxrwxr-x.  6 cvmfs cvmfs 1024 Nov 16  2017 ilc
> drwxr-xr-x.  7 cvmfs cvmfs 4096 Dec 21  2020 jlab
> drwxrwxr-x.  6 cvmfs cvmfs 4096 Mar 30  2020 ligo
> drwxr-xr-x. 10 cvmfs cvmfs 4096 Oct  7 10:24 mis
> drwxr-xr-x.  4 cvmfs cvmfs 1024 Nov 17  2017 nanohub
> drwxrwxr-x.  2 cvmfs cvmfs 1024 Nov 17  2017 nova
> drwxrwxr-x.  8 cvmfs cvmfs 4096 Jun 23  2020 osg
> drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 osg-software
> drwxr-xr-x. 17 cvmfs cvmfs 1024 Nov 17  2017 sbgrid

Yes. During the `walk_tree` run, my `ls` output:
```
[root@kvm102 ~]# ls -al /cvmfs/oasis.opensciencegrid.org
total 56
drwxr-xr-x. 22 cvmfs cvmfs 4096 Nov 16  2017 .
-rw-r--r--.  1 cvmfs cvmfs  511 Jun 14  2021 .cvmfsdirtab
drwxr-xr-x.  4 cvmfs cvmfs 4096 Oct  3  2018 accre
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 atlas
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 auger
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 cmssoft
drwxr-xr-x.  5 cvmfs cvmfs 1024 Nov 16  2017 csiu
drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 enmr
drwxrwxr-x.  2 cvmfs cvmfs 4096 Jan 17  2016 fermilab
drwxrwxr-x.  5 cvmfs cvmfs 1024 Apr  4  2017 geant4
drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 16  2017 glow
drwxr-xr-x. 19 cvmfs cvmfs 4096 May  3 20:51 gluex
drwxrwxr-x.  6 cvmfs cvmfs 1024 Nov 16  2017 ilc
drwxr-xr-x.  7 cvmfs cvmfs 4096 Dec 21  2020 jlab
drwxrwxr-x.  6 cvmfs cvmfs 4096 Mar 30  2020 ligo
drwxr-xr-x. 10 cvmfs cvmfs 4096 Oct  7 11:24 mis
drwxr-xr-x.  4 cvmfs cvmfs 1024 Nov 17  2017 nanohub
drwxrwxr-x.  2 cvmfs cvmfs 1024 Nov 17  2017 nova
drwxrwxr-x.  8 cvmfs cvmfs 4096 Jun 23  2020 osg
drwxr-xr-x.  2 cvmfs cvmfs 1024 Nov 16  2017 osg-software
drwxr-xr-x. 17 cvmfs cvmfs 1024 Nov 17  2017 sbgrid
drwxrwxr-x.  3 cvmfs cvmfs 1024 Nov 17  2017 snoplussnolabca
```

I have tried:
1) randomizing the number of child threads from 1-16 (my system is a 4 vCPU KVM guest)
2) running the test from the RDU2 data center (my original test was done from a PEK2 system)
Neither attempt reproduced the bug.

Comment 18 Miklos Szeredi 2022-10-17 11:10:09 UTC
Created attachment 1918546 [details]
proposed fix (v2)

Attaching updated patch.

Comment 21 Miklos Szeredi 2022-10-19 15:09:13 UTC
Created attachment 1919090 [details]
proposed patch (v3)

Updated patch

Comment 26 Boyang Xue 2022-10-26 10:03:55 UTC
Since we have qa_ack+ and devel_ack+ and the BZ status is ASSIGNED, I'm setting the ITR to 8.8. Developer, please set the DTM when it's ready.

Comment 28 Boyang Xue 2022-11-27 05:37:59 UTC
TEST PASS.

Unable to reproduce the bug with the reproducer. Verified by running regression tests. No regression was found in the tests.

Reproduced with kernel-4.18.0-439.el8. Link to Beaker jobs:
https://url.corp.redhat.com/bz2131391-reproduce

Verified with kernel-4.18.0-437.el8.mr3741_221114_1537.g9461. Link to Beaker jobs:
https://url.corp.redhat.com/bz2131391-verify

Comment 33 Boyang Xue 2022-12-02 01:18:54 UTC
TEST PASS.

Unable to reproduce the bug with the reproducer. Verified by running regression tests. No regression was found in the tests.

Reproduced with kernel-4.18.0-439.el8. Link to Beaker jobs:
https://url.corp.redhat.com/bz2131391-reproduce

Verified with kernel-4.18.0-441.el8. Link to Beaker jobs:
https://url.corp.redhat.com/bz2131391-final-verify

Comment 35 errata-xmlrpc 2023-05-16 08:53:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2951

