Bug 1701077 - NFS4 Server can crash whole server / cause data loss
Summary: NFS4 Server can crash whole server / cause data loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-18 00:57 UTC by Slawomir Pryczek
Modified: 2019-04-25 23:24 UTC (History)

Fixed In Version: kernel-5.0.9-301.fc30
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-25 19:33:39 UTC
Type: Bug
Embargoed:


Attachments
Crash log (38.62 KB, text/plain)
2019-04-18 00:57 UTC, Slawomir Pryczek

Description Slawomir Pryczek 2019-04-18 00:57:29 UTC
Created attachment 1556046
Crash log

The machine becomes unstable or locks up completely when a high volume of specific NFS4 file operations takes place. NFS3 seems unaffected. Any user account with access to an NFS4 share seems to be able to crash the server completely within about 5-30 seconds. This was tested by the admins at the company I work for, on VMs and bare-metal servers, and by me on VMware; we were able to crash several (virtual) machines running recent Fedora releases within seconds. Sometimes files were lost after the crash, or the whole volume could not be mounted. Strangely, this sometimes also affected files that were never written to (source code files, or the app binary itself, which should not get damaged because the app was running).

Version-Release number of the kernel:
- 5.0.7-200.fc29.x86_64 #1 SMP (Fedora 29, also other sub-versions)
- 5.0.6-200.fc29.x86_64 #1 SMP (Fedora 29 Server Edition)

I'm not sure when the bug first appeared, but everything works OK on FC26. I'm also not very familiar with Fedora. The crash usually occurs at posix_lock_inode+0x4cf/0x8c0, and all of these (except one) were fresh system installs with only the NFS client and NFS server added.
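
For readers unfamiliar with the function named in the trace: posix_lock_inode is the kernel routine that applies POSIX byte-range locks, the kind a program requests from user space with fcntl(). The sketch below only illustrates what such a lock request looks like. It is not the reproducer, and this report does not state which operation triggers the crash. The helper name lock_first_byte and any path passed to it are hypothetical.

```python
import fcntl
import os


def lock_first_byte(path):
    """Take and release an exclusive POSIX lock on byte 0 of `path`.

    Returns True if the lock was granted, False if the lock request failed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # lockf() issues an fcntl() byte-range lock: 1 byte starting at offset 0.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
        # Release the same range again.
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

On a file that lives on an NFS4 mount, a request like this is forwarded to the server, which is where the locking code in the trace runs.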

Logs are attached; here is also a VMware screenshot in case it helps:
https://www.screencast.com/t/w06VPrBap

I wrote an app that crashes the server and reproduces this bug. I'm not sure whether I should release it, given the severity and the possible data loss. I can also share the complete VMware VM. If you think that's OK, I can post it on GitHub. I have also identified a single, specific file operation that causes the crash when run alongside other file operations. I'm not sure whether I should disclose it publicly, because then anyone could reproduce the crash.

The crashing component is the NFS server (the client works fine; when tested on two machines, it just locks up after the server is gone).

Reporting as urgent because anyone with access to any NFS share on a recent Fedora install can crash the server or cause data loss. We discovered this bug while working with an SQLite database over an NFS share, but it takes around 5-10 minutes to crash our production server(s), probably because the filesystem load was not as high as the load generated by the app. So servers can also crash on their own when load is high and this specific file operation takes place.

Comment 1 Slawomir Pryczek 2019-04-18 10:08:43 UTC
Some more info about the issue. I tested this on Debian 9 (Linux debian 4.9.0-8-amd64 #1 SMP), and it definitely seems to be a kernel issue, because there the bug manifests exactly as it did when we used NFSv4 on Fedora 26. So instead of kernel errors or a dead machine, the NFS server loses the ability to write anything to the share. After issuing "echo 123 > t.txt" the shell hangs indefinitely, and some files can be read but not all (maybe because some cache is still working properly). This has happened to us in production just a couple of times...

The only error seen in dmesg is "nfs4_reclaim_open_state: Lock reclaim failed!". It kills NFS 4.2 and 4.1: NFS 4.2 takes around 5-10 seconds to be killed and 4.1 around a minute. After that the server needs a restart; the /etc/init.d scripts (stop/start/reload) seem to have no effect.

Very strangely, if the client connects using NFS 4.0, everything works fine ("mount -t nfs -o vers=4.0 127.0.0.1:/home /homenfs/") and there are no errors in dmesg. Maybe that is because 4.0 is slower: the server also seems harder to kill when nfsd debugging is enabled...
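
Since only the 4.1 and 4.2 minor versions are reported as affected here, it can help to see which version each client mount actually uses. The following is a minimal sketch that parses text in the /proc/mounts format. It assumes the options column carries a "vers=" entry, which may not hold on every kernel, and the function names are made up for illustration.

```python
def nfs_versions(mounts_text):
    """Map each NFS mount point to the value of its vers= mount option."""
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mount_point = fields[1]
        options = fields[3].split(",")
        # None when no vers= option is present (assumption: it usually is).
        vers = next(
            (o.split("=", 1)[1] for o in options if o.startswith("vers=")),
            None,
        )
        versions[mount_point] = vers
    return versions


def affected(versions):
    """Mount points using the minor versions reported as affected (4.1, 4.2)."""
    return sorted(mp for mp, v in versions.items() if v in ("4.1", "4.2"))
```

A possible use is to read /proc/mounts on the client, pass its contents to nfs_versions(), and remount anything returned by affected() with vers=4.0 until a fixed kernel is installed on the server.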

Also this seems related: https://bugzilla.kernel.org/show_bug.cgi?id=115521

Comment 2 Slawomir Pryczek 2019-04-19 07:20:20 UTC
Another update on the issue. I tested this on Fedora 30 Server Edition (5.0.7-300.fc30.x86_64 #1 SMP), and the same thing happens.

I am posting the test/exploit code; there are more logs on GitHub (compilation instructions are at the top):
https://github.com/slawomir-pryczek/drbd_kill

NFS 4.1 and 4.2 are affected. NFS 4.0 and NFS 3 work fine.

Comment 3 Slawomir Pryczek 2019-04-20 21:09:51 UTC
I reported this to the kernel bug tracker; it seems a patch is ready:
https://bugzilla.kernel.org/show_bug.cgi?id=203363

Comment 4 Fedora Update System 2019-04-22 16:43:59 UTC
kernel-tools-5.0.9-300.fc30 kernel-headers-5.0.9-300.fc30 kernel-5.0.9-300.fc30 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-e84f6c34da

Comment 5 Fedora Update System 2019-04-22 16:45:50 UTC
kernel-tools-5.0.9-200.fc29 kernel-headers-5.0.9-200.fc29 kernel-5.0.9-200.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-1e8a4c6958

Comment 6 Fedora Update System 2019-04-22 16:47:06 UTC
kernel-tools-5.0.9-100.fc28 kernel-headers-5.0.9-100.fc28 kernel-5.0.9-100.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2019-1b986880ea

Comment 7 Fedora Update System 2019-04-23 14:55:44 UTC
kernel-5.0.9-300.fc30, kernel-headers-5.0.9-300.fc30, kernel-tools-5.0.9-300.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-e84f6c34da

Comment 8 Fedora Update System 2019-04-23 19:33:36 UTC
kernel-5.0.9-100.fc28, kernel-headers-5.0.9-100.fc28, kernel-tools-5.0.9-100.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-1b986880ea

Comment 9 Fedora Update System 2019-04-23 21:15:50 UTC
kernel-5.0.9-200.fc29, kernel-headers-5.0.9-200.fc29, kernel-tools-5.0.9-200.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-1e8a4c6958

Comment 10 Fedora Update System 2019-04-24 05:30:09 UTC
kernel-5.0.9-301.fc30 kernel-headers-5.0.9-300.fc30 kernel-tools-5.0.9-300.fc30 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-e84f6c34da

Comment 11 Fedora Update System 2019-04-24 20:27:42 UTC
kernel-5.0.9-301.fc30, kernel-headers-5.0.9-300.fc30, kernel-tools-5.0.9-300.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-e84f6c34da

Comment 12 Fedora Update System 2019-04-25 01:33:36 UTC
kernel-5.0.9-200.fc29, kernel-headers-5.0.9-200.fc29, kernel-tools-5.0.9-200.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 13 Fedora Update System 2019-04-25 19:33:39 UTC
kernel-5.0.9-301.fc30, kernel-headers-5.0.9-300.fc30, kernel-tools-5.0.9-300.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2019-04-25 23:24:29 UTC
kernel-5.0.9-100.fc28, kernel-headers-5.0.9-100.fc28, kernel-tools-5.0.9-100.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
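
To check whether a given server already runs one of the updated kernels, the release string can be compared against 5.0.9. This is a rough sketch under one stated assumption: that any Fedora build at upstream version 5.0.9 or later carries the fix. The bug only lists kernel-5.0.9-301.fc30 under "Fixed In Version", so the package release suffix (for example -200.fc29) is deliberately ignored here and should be checked by hand if in doubt. The function names are hypothetical.

```python
import re

# Assumption: the fix is present from upstream 5.0.9 onwards.
FIXED_UPSTREAM = (5, 0, 9)


def upstream_version(release):
    """Extract (major, minor, patch) from a string like '5.0.7-200.fc29.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def has_update(release):
    """True if the upstream part of the release is at least FIXED_UPSTREAM."""
    return upstream_version(release) >= FIXED_UPSTREAM
```

On a running system the release string is available from platform.release() in Python or from "uname -r" in a shell.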

