Bug 1757872

Summary: crash in is_size_safe_to_change in 3.10.0-1062.el7 kernel as a result of another condition / race similar to earlier bug 1580165

Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: CIFS
Version: 7.7
Reporter: Dave Wysochanski <dwysocha>
Assignee: Dave Wysochanski <dwysocha>
QA Contact: xiaoli feng <xifeng>
Docs Contact:
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Reproducer
CC: fsorenso, gsierohu, jaeshin, jstancek, lsahlber, piastryyy, rbergant, xifeng, xzhou
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-3.10.0-1109.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-31 19:33:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1711360

Description Dave Wysochanski 2019-10-02 15:34:32 UTC
Description of problem:
A crash similar to the one reported in https://bugzilla.redhat.com/show_bug.cgi?id=1580165#c59 has been seen, with a similar problem in the openFileList locking.

Version-Release number of selected component (if applicable):
kernel-3.10.0-1062.1.1.el7

How reproducible:
TBD - reproducible by at least one customer

Steps to Reproduce:
TBD

Actual results:
kernel panic

Expected results:
no panic

Additional info:
We have a vmcore and initial analysis that I will paste from the other bug.

Comment 3 Dave Wysochanski 2019-10-03 14:53:04 UTC
Hey Ronnie, I am not sure what approach to take here.  It looks like for RHEL7:
1) a global spinlock was added, rather than the new cifsFileInfo.open_file_lock added by your upstream commit
487317c99477 ("cifs: add spinlock for the openFileList to cifsInodeInfo")

2) the patch did not fix the site of the crash in is_inode_writable

Should we revisit the patch for RHEL7 or add to it?  We have a customer who has hit this a few times in a week and is willing to test a new patch.

Comment 8 Dave Wysochanski 2019-10-04 22:42:43 UTC
Today I tried to set up a reproducer environment for this but was not successful in reproducing the crash.  However, I have some ideas on how to make a crash more likely.
I used a small variant of the reproducer from the original bug 1580165 and ran a few iterations of it with a few users.  I also started a small SystemTap script to study the kernel structures and the sizes of the lists.

I am using a fairly simple setup without kerberos, but using 'cifscreds' for each local user and these mount options:
mount -vvvv -t cifs -osec=ntlmssp,multiuser,vers=2.1,user=shares //cifs-server/example /mnt/example

Comment 9 Dave Wysochanski 2019-10-05 03:09:08 UTC
Ok, this reproduces very easily for me now on 3.10.0-1062.el7.x86_64.  I just needed to increase the number of instances of the ./openloop.py program.  If I run 100 instances in parallel as the same user, then log in as another user and cd to the same cifs mount, the machine panics within about 10 seconds.  I'll have to refine this, but for now here are the basics.

1. As root mount the share
# mount -vvvv -t cifs -osec=ntlmssp,multiuser,vers=2.1,user=shares //cifs-server/example /mnt/example

2. su to user1
[root@rhel7u7-node1 ~]# su - cifs1-user

3. cd to the share, and run cifscreds
[cifs1-user@rhel7u7-node1 ~]$ cd /mnt/example
[cifs1-user@rhel7u7-node1 example]$ ls
ls: reading directory .: Permission denied
[cifs1-user@rhel7u7-node1 example]$ cifscreds add cifs-server
Password:
[cifs1-user@rhel7u7-node1 example]$ ls
cifs1-user-file1  cifs2-user-file1  cifs3-user-file1  large-tarball.tgz  lib  openloop.py  testfile

4. Run 100 instances of the program that reads a small file (a few bytes) in a loop
[cifs1-user@rhel7u7-node1 example]$ for i in `seq 1 100`; do ./openloop.py & done

5. On another terminal, ssh in and su to user2, cd to /mnt/example, run cifscreds, then ls the directory

<crash in about 10 seconds>


I just started running the same test with 3.10.0-1062.1.1.el7.bz1757872.1.x86_64 and so far so good.  Will let it run overnight with 3 users, each of which has 100 instances of the program running.

Comment 10 Dave Wysochanski 2019-10-05 03:12:37 UTC
Relevant /etc/samba/smb.conf (on server)
[global]
...
        security = user
...
        encrypt passwords = yes
...
        server signing = mandatory
        client signing = mandatory
...
[example]
        path = /samba/example
        browsable = yes
        writable = yes
        read only = no
        valid users = shares,cifs1-user,cifs2-user,cifs3-user

Comment 11 Dave Wysochanski 2019-10-06 11:05:02 UTC
I ran the test overnight, even with 3*100 instances, on the patched kernel-3.10.0-1062.1.1.el7.bz1757872.1.src.rpm build; it did not crash, and I saw no other side effects.

This kernel has 3 patches:
* Thu Oct 03 2019 Dave Wysochanski <dwysocha> [3.10.0-1062.1.1.el7.bz1757872.1]
- [fs] Revert "[fs] cifs: add more spinlocks to pretect against (Dave Wysochansk) [1757872]
- [fs] cifs: add spinlock for the openFileList to cifsInodeInfo (Dave Wysochansk) [1757872]
- [fs] cifs: use cifsInodeInfo->open_file_lock while iterating (Dave Wysochansk) [1757872]

Comment 13 Frank Sorenson 2019-10-09 13:34:55 UTC
(In reply to Dave Wysochanski from comment #9)
> Ok this reproduces very easily for me now on 3.10.0-1062.el7.x86_64.  Just
> need to increase the number of threads running the ./openloop.py program. 
> If I run 100 iterations in parallel on the same user then login with another
> user and cd to the same cifs mount the machine will panic within about 10
> seconds.  I'll have to try and refine but for now here's the basics.

I can confirm the crash is reproduced on -1062, though it took 2 users, each running 200 copies of openloop.py, and took 8-10 minutes before crashing (unsure whether the 'ls' got involved in my case).  Perhaps with more threads it would crash sooner.

System was a VM with 4 CPUs and 1.5 GiB of memory


openloop.py is the same as before, with the counter and its output removed:

#!/usr/bin/env python

while True:
        with open('testfile', 'r') as myfile:
                content = myfile.read()
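Since the comment above speculates that more threads would crash sooner, a multi-threaded variant of the loop could drive many more open/close cycles from a single process.  This is my sketch, not something attached to the bug; the file name, thread count, and bounded iteration count are all assumptions (the original loops forever over 'testfile' on the cifs mount):

```python
#!/usr/bin/env python
# Hypothetical multi-threaded variant of openloop.py (a sketch, not from
# this bug): each thread repeatedly opens and reads a small file, so one
# process drives many concurrent open/close cycles against the mount.
import threading

PATH = 'testfile'     # assumption: a small readable file on the cifs mount
NUM_THREADS = 16      # assumption: raise this to widen the race window
ITERATIONS = 1000     # bounded so the sketch terminates; the original loops forever

# For a standalone dry run, make sure the file exists first.
with open(PATH, 'w') as f:
    f.write('hello\n')

def openloop():
    for _ in range(ITERATIONS):
        with open(PATH, 'r') as myfile:
            myfile.read()

threads = [threading.Thread(target=openloop) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Running several such processes as different users would approximate the 2-users-times-200-copies setup described above while using far fewer processes.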

Comment 14 Dave Wysochanski 2019-10-09 19:21:01 UTC
*** Bug 1752315 has been marked as a duplicate of this bug. ***

Comment 16 Jan Stancek 2019-10-26 10:03:03 UTC
Patch(es) committed on kernel-3.10.0-1106.el7

Comment 21 Dave Wysochanski 2019-10-27 16:59:38 UTC
Ok, after looking at this more closely, I agree we need to take 1d4dff701fcb481eebe50b9fec01b10d23bf09ed; if we don't, we introduce another race and a possible use-after-free, because the second patch above changed cifs_get_writable_file from taking tcon->open_file_lock to taking cinode->open_file_lock.  The patch solved one problem and created another.

Ronnie, how do we handle this?  Should we open another bug and submit the patch, or is there some way to add the patch onto this bug?

We also need to address it in rhel8 bug (in POST but not yet committed): https://bugzilla.redhat.com/show_bug.cgi?id=1757865

Comment 22 Dave Wysochanski 2019-10-27 17:16:41 UTC
FWIW, I think the race window for the new problem fixed by Pavel's patch is much smaller and harder to hit.  That probably explains why the customer, our testing, and the upstream buildbot on 5.4-rc3 did not find the problem.  I guess we could take a wait-and-see approach with a new bug filed.

Comment 23 Dave Wysochanski 2019-10-27 22:56:03 UTC
I'm going to post Pavel's follow-up patch on this bug, because we don't want to introduce another problem.

Comment 24 Dave Wysochanski 2019-10-27 23:01:35 UTC
Final upstream commit ID for the patch that needs to be posted:

commit 1a67c415965752879e2e9fad407bc44fc7f25f23
Author: Pavel Shilovsky <pshilov>
Date:   Wed Oct 23 15:37:19 2019 -0700

    CIFS: Fix use after free of file info structures
    
    Currently the code assumes that if a file info entry belongs
    to lists of open file handles of an inode and a tcon then
    it has non-zero reference. The recent changes broke that
    assumption when putting the last reference of the file info.
    There may be a situation when a file is being deleted but
    nothing prevents another thread to reference it again
    and start using it. This happens because we do not hold
    the inode list lock while checking the number of references
    of the file info structure. Fix this by doing the proper
    locking when doing the check.
    
    Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo")
    Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
    Cc: Stable <stable.org>
    Reviewed-by: Ronnie Sahlberg <lsahlber>
    Signed-off-by: Pavel Shilovsky <pshilov>
    Signed-off-by: Steve French <stfrench>
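The rule the commit message describes can be illustrated in user-space terms.  This is a sketch in Python, not kernel code, and the names (FileInfo, open_files, list_lock) are my inventions standing in for the kernel structures: the reference-count check and the list removal must happen under the same lock that guards list lookups, otherwise another thread can find the entry after its last reference is dropped and start using freed state.

```python
# User-space sketch (not kernel code) of the locking rule the commit
# describes: drop-and-check of the refcount and removal from the list
# happen under the same lock that lookups take, so no thread can find
# an entry whose last reference has already been put.
import threading

list_lock = threading.Lock()   # stands in for the inode's open-file list lock
open_files = []                # stands in for the inode's openFileList

class FileInfo(object):
    def __init__(self):
        self.refcount = 1

def put_file_info(f):
    # Correct pattern: decrement the reference and, if it was the last
    # one, unlink from the list -- all while holding the list lock.
    with list_lock:
        f.refcount -= 1
        if f.refcount == 0:
            open_files.remove(f)
            return True        # caller may now free f
    return False

def find_file_info():
    # Lookups take the same lock and bump the refcount before returning,
    # so a found entry can never be freed out from under the caller.
    with list_lock:
        for f in open_files:
            f.refcount += 1
            return f
    return None
```

The buggy pattern the commit fixes is the opposite ordering: checking the count outside the lock, which leaves a window where find_file_info can re-reference an entry that is already on its way to being freed.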

Comment 26 Dave Wysochanski 2019-10-28 20:58:57 UTC
*** Bug 1685457 has been marked as a duplicate of this bug. ***

Comment 27 Dave Wysochanski 2019-10-29 18:58:32 UTC
Adding Pavel from Microsoft so he can see when this gets fixed.

Comment 31 Jan Stancek 2019-11-03 15:44:12 UTC
Patch(es) committed on kernel-3.10.0-1109.el7

Comment 40 Ronnie Sahlberg 2019-12-02 22:43:39 UTC
*** Bug 1770284 has been marked as a duplicate of this bug. ***

Comment 43 errata-xmlrpc 2020-03-31 19:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1016