Bug 1757872
Summary: | crash in is_size_safe_to_change in 3.10.0-1062.el7 kernel as a result of another condition / race similar to earlier bug 1580165 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Dave Wysochanski <dwysocha> |
Component: | kernel | Assignee: | Dave Wysochanski <dwysocha> |
kernel sub component: | CIFS | QA Contact: | xiaoli feng <xifeng> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | fsorenso, gsierohu, jaeshin, jstancek, lsahlber, piastryyy, rbergant, xifeng, xzhou |
Version: | 7.7 | Keywords: | Reproducer |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | kernel-3.10.0-1109.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-03-31 19:33:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1711360 |
Description (Dave Wysochanski, 2019-10-02 15:34:32 UTC)
Hey Ronnie, I am not sure what approach to take here. It looks like for RHEL7:

1. A global spinlock was added, not the new cifsFileInfo.open_file_lock added by your upstream commit 487317c99477 ("cifs: add spinlock for the openFileList to cifsInodeInfo").
2. The patch did not fix the site of the crash in is_inode_writable.

Should we revisit the patch for RHEL7 or add to it? We have a customer who has hit this a few times in a week and is willing to test a new patch.

Today I tried to set up a reproducer environment for this but was not successful in reproducing the crash. However, I have some ideas on how to make a crash more likely. I used a small variant of the reproducer in the original bug 1580165 and ran a few iterations of it with a few users. I started a small stap script to study the kernel structures and the size of the lists. I am using a fairly simple setup without kerberos, but using 'cifscreds' for each local user and these mount options:

    mount -vvvv -t cifs -osec=ntlmssp,multiuser,vers=2.1,user=shares //cifs-server/example /mnt/example

Ok, this reproduces very easily for me now on 3.10.0-1062.el7.x86_64. I just need to increase the number of threads running the ./openloop.py program. If I run 100 iterations in parallel as the same user, then log in as another user and cd to the same cifs mount, the machine will panic within about 10 seconds. I'll have to try to refine this, but for now here are the basics.

1. As root, mount the share:

        # mount -vvvv -t cifs -osec=ntlmssp,multiuser,vers=2.1,user=shares //cifs-server/example /mnt/example

2. su to user1:

        [root@rhel7u7-node1 ~]# su - cifs1-user

3. cd to the share, and run cifscreds:

        [cifs1-user@rhel7u7-node1 ~]$ cd /mnt/example
        [cifs1-user@rhel7u7-node1 example]$ ls
        ls: reading directory .: Permission denied
        [cifs1-user@rhel7u7-node1 example]$ cifscreds add cifs-server
        Password:
        [cifs1-user@rhel7u7-node1 example]$ ls
        cifs1-user-file1  cifs2-user-file1  cifs3-user-file1  large-tarball.tgz  lib  openloop.py  testfile

4.
Run 100 iterations of the program that reads a small file (a few bytes):

        [cifs1-user@rhel7u7-node1 example]$ for i in `seq 1 100`; do ./openloop.py & done

5. ssh in on another terminal, su to user2, cd to /mnt/example, run cifscreds, then ls the directory.

\<crash in about 10 seconds\>

I just started running the same test with 3.10.0-1062.1.1.el7.bz1757872.1.x86_64 and so far so good. I will let it run overnight with 3 users, each of which has 100 instances of the program running.

Relevant /etc/samba/smb.conf (on the server):

    [global]
    ...
    security = user
    ...
    encrypt passwords = yes
    ...
    server signing = mandatory
    client signing = mandatory
    ...
    [example]
    path = /samba/example
    browsable = yes
    writable = yes
    read only = no
    valid users = shares,cifs1-user,cifs2-user,cifs3-user

I ran the test overnight, even with 3*100 instances, on the patched kernel-3.10.0-1062.1.1.el7.bz1757872.1.src.rpm, and it did not crash or show any other side effects. This kernel has 3 patches:

    * Thu Oct 03 2019 Dave Wysochanski <dwysocha> [3.10.0-1062.1.1.el7.bz1757872.1]
    - [fs] Revert "[fs] cifs: add more spinlocks to pretect against (Dave Wysochansk) [1757872]
    - [fs] cifs: add spinlock for the openFileList to cifsInodeInfo (Dave Wysochansk) [1757872]
    - [fs] cifs: use cifsInodeInfo->open_file_lock while iterating (Dave Wysochansk) [1757872]

(In reply to Dave Wysochanski from comment #9)
> Ok this reproduces very easily for me now on 3.10.0-1062.el7.x86_64. Just
> need to increase the number of threads running the ./openloop.py program.
> If I run 100 iterations in parallel on the same user then login with another
> user and cd to the same cifs mount the machine will panic within about 10
> seconds. I'll have to try and refine but for now here's the basics.

I can confirm the crash is reproduced on -1062, though it took 2 users, each running 200 copies of openloop.py, and it took 8-10 minutes before crashing (I am unsure whether the 'ls' got involved in my case). Perhaps with more threads it would crash sooner.
The system was a VM with 4 cpus and 1.5 GiB of memory. openloop.py is as previously, with the counter and counter output removed:

    #!/usr/bin/env python
    while True:
        with open('testfile', 'r') as myfile:
            content = myfile.read()

*** Bug 1752315 has been marked as a duplicate of this bug. ***

Patch(es) committed on kernel-3.10.0-1106.el7

Ok, after looking at this closer, I agree we need to take 1d4dff701fcb481eebe50b9fec01b10d23bf09ed, because if we don't, it will introduce another race and possible use-after-free due to the second patch above changing from tcon->open_file_lock to cinode->open_file_lock inside cifs_get_writable_file. The patch solved one problem and created another.

Ronnie, how do we handle this? Should we open another bug and submit the patch, or is there some way to add the patch onto this bug? We also need to address it in the rhel8 bug (in POST but not yet committed): https://bugzilla.redhat.com/show_bug.cgi?id=1757865

FWIW, I think the race is a lot smaller and harder to hit for the new problem fixed by Pavel's patch. That probably explains why the customer, our testing, and the upstream buildbot on 5.4-rc3 did not find the problem. I guess we could just take the wait-and-see approach with a new bug filed.

I'm going to post the followup patch from Pavel on this bug as a followup, because we don't want to introduce another problem.

Final upstream commit id for the patch needing to be posted:

    commit 1a67c415965752879e2e9fad407bc44fc7f25f23
    Author: Pavel Shilovsky <pshilov>
    Date:   Wed Oct 23 15:37:19 2019 -0700

        CIFS: Fix use after free of file info structures

        Currently the code assumes that if a file info entry belongs to lists
        of open file handles of an inode and a tcon then it has non-zero
        reference. The recent changes broke that assumption when putting the
        last reference of the file info. There may be a situation when a file
        is being deleted but nothing prevents another thread to reference it
        again and start using it.
        This happens because we do not hold the inode list lock while checking
        the number of references of the file info structure. Fix this by doing
        the proper locking when doing the check.

        Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo")
        Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic")
        Cc: Stable <stable.org>
        Reviewed-by: Ronnie Sahlberg <lsahlber>
        Signed-off-by: Pavel Shilovsky <pshilov>
        Signed-off-by: Steve French <stfrench>

*** Bug 1685457 has been marked as a duplicate of this bug. ***

Adding Pavel from Microsoft so he can see when this gets fixed.

Patch(es) committed on kernel-3.10.0-1109.el7

*** Bug 1770284 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1016
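As an aside, the bug class described in Pavel's commit message above can be sketched in userspace. The following is a minimal illustrative sketch (hypothetical names such as FileInfo, get_ref, and put_ref are inventions for this example, not the actual kernel code): the fix amounts to decrementing and checking the reference count under the same lock that readers take when re-referencing an entry on the inode's open-file list, so no other thread can grab a new reference between the check and the teardown.

```python
import threading

class Inode:
    """Stand-in for cifsInodeInfo: an open-file list guarded by a lock."""
    def __init__(self):
        self.open_file_lock = threading.Lock()
        self.open_files = []

class FileInfo:
    """Stand-in for cifsFileInfo (illustrative only)."""
    def __init__(self, inode):
        self.count = 1      # reference count; starts at 1 for the opener
        self.inode = inode  # back-pointer to the owning "inode"

def get_ref(inode):
    # Readers take the list lock, so any entry still on the list
    # can safely gain a new reference.
    with inode.open_file_lock:
        for f in inode.open_files:
            f.count += 1
            return f
    return None

def put_ref_buggy(f):
    # BUG (the pre-fix shape): the count is decremented and checked
    # without the list lock held, so another thread can call get_ref()
    # in the window between the check and the removal, then keep using
    # an entry that is about to be freed.
    f.count -= 1
    if f.count == 0:
        with f.inode.open_file_lock:
            f.inode.open_files.remove(f)
        # ... free f here: too late, it may already be re-referenced ...

def put_ref_fixed(f):
    # FIX (the shape of the upstream change): drop the reference and
    # check it under the same lock get_ref() takes, closing the window.
    inode = f.inode
    with inode.open_file_lock:
        f.count -= 1
        if f.count:
            return  # still referenced by someone; nothing to tear down
        inode.open_files.remove(f)
    # Safe to free f here: it is off the list and unreachable.
```

The real fix (commit 1a67c41596575 above) moves the reference-count check in the CIFS file-info put path under the inode's open-file lock; this sketch only mirrors the shape of that change, not its details.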