Bug 1455086
| Summary: | nfs client causes kernel BUG at ./include/linux/mm.h:432! | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Trevor Cordes <trevor> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 24 | CC: | gansalmon, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-08 19:49:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Trevor Cordes
2017-05-24 09:04:33 UTC
Bug is still there, tested 4.11.5-100.fc24; bug output slightly different: kernel BUG at ./include/linux/mm.h:462!
I can make it freeze the mount within 5s just by playing a video off of the exported fs.

Tried 4.11.10-100.fc24 and bug persists. Took about 5 mins to crash this time playing video... I think it crashes faster when I seek all around the video rather than just watching it linearly.
I'm going to try to start a bisect.
FYI, my mount options in fstab (in case I'm doing something weird triggering this bug):
192.168.100.2:/data /mnt/data nfs4 rw,bg,async,hard,intr,noatime,nodiratime,nosuid,proto=tcp,timeo=15,rsize=8192,wsize=8192 0 0
from mount:
192.168.100.2:/data on /mnt/data type nfs4 (rw,nosuid,noatime,nodiratime,vers=4.2,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=15,retrans=2,sec=sys,clientaddr=192.168.100.1,local_lock=none,addr=192.168.100.2)

I have bisected this bug to:
c21b48cc1bbf2f5af3ef54ada559f7fadf8b508b is the first bad commit
commit c21b48cc1bbf2f5af3ef54ada559f7fadf8b508b
Author: Eric Dumazet <edumazet>
Date: Wed Apr 26 09:07:46 2017 -0700
net: adjust skb->truesize in ___pskb_trim()
Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
skb_try_coalesce() using syzkaller and a filter attached to a TCP
socket.
As we did recently in commit 158f323b9868 ("net: adjust skb->truesize in
pskb_expand_head()") we can adjust skb->truesize from ___pskb_trim(),
via a call to skb_condense().
If all frags were freed, then skb->truesize can be recomputed.
This call can be done if skb is not yet owned, or destructor is
sock_edemux().
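For anyone following along, the idea behind the bisect can be sketched in a few lines of Python. This is only a simplified model, assuming a strictly linear history in which the bug never goes away once introduced; real `git bisect` also copes with merges. The commit names, the history length and the culprit position below are all made up for illustration.

```python
def first_bad(commits, is_bad):
    """Return (commit, tests_run) for the earliest commit where is_bad() is True.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the culprit is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known-good, hi: oldest known-bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return commits[hi], tests


# Hypothetical history of 1000 commits with the bug introduced at index 637.
history = [f"commit-{i:04d}" for i in range(1000)]
culprit_index = 637


def crashes(commit):
    # Stand-in for "build this kernel, boot it, play the video, see if it hangs".
    return history.index(commit) >= culprit_index


found, tests = first_bad(history, crashes)
print(found, "found after", tests, "test builds")
```

Each step halves the remaining range, which is why a quick and reliable reproducer makes a bisect across many commits practical.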
I am currently rpmbuilding a custom test kernel of 4.11.10 with the following patch to undo that commit:
diff -uNr a/net/core/skbuff.c b/net/core/skbuff.c
--- a/net/core/skbuff.c 2017-07-23 06:48:50.714654762 -0500
+++ b/net/core/skbuff.c 2017-07-23 06:53:31.441060810 -0500
@@ -1576,8 +1576,6 @@
 		skb_set_tail_pointer(skb, len);
 	}
 
-	if (!skb->sk || skb->destructor == sock_edemux)
-		skb_condense(skb);
 	return 0;
 }
 EXPORT_SYMBOL(___pskb_trim);
It will take all day to rebuild so I will get back later, but the bisect was pretty easy & clean (can make it crash in about 10s every time) so I'm pretty sure that will be the bug.
I'm not sure why I'm the only one in the world with this bug, or why heavy NFS access triggers it.
I'll report back when my custom 4.11.10 is booted and doesn't have the bug!
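On the mount options quoted earlier: one quick way to see what differs between the fstab entry and what `mount` reports is to compare the two option strings. A minimal Python sketch, with both strings copied from the description; the helper name `parse_opts` is made up for illustration.

```python
def parse_opts(option_string):
    """Split a comma-separated mount option string into a dict.

    Bare flags such as 'hard' map to None; 'key=value' options map to the value.
    """
    opts = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else None
    return opts


# Option strings as quoted in the description.
FSTAB = ("rw,bg,async,hard,intr,noatime,nodiratime,nosuid,"
         "proto=tcp,timeo=15,rsize=8192,wsize=8192")
MOUNTED = ("rw,nosuid,noatime,nodiratime,vers=4.2,rsize=8192,wsize=8192,"
           "namlen=255,hard,proto=tcp,port=0,timeo=15,retrans=2,sec=sys,"
           "clientaddr=192.168.100.1,local_lock=none,addr=192.168.100.2")

requested = parse_opts(FSTAB)
effective = parse_opts(MOUNTED)

# Options that appear on only one side, and options whose values differ.
only_requested = sorted(set(requested) - set(effective))
only_effective = sorted(set(effective) - set(requested))
changed = {
    key: (requested[key], effective[key])
    for key in requested.keys() & effective.keys()
    if requested[key] != effective[key]
}

print("only in fstab:", only_requested)
print("only in mount output:", only_effective)
print("changed values:", changed)
```

Only `bg`, `async` and `intr` are missing from the mount output, and no requested value was changed. Per nfs(5), `bg` only affects how the mount command retries and `intr` is ignored on kernels after 2.6.25, so neither would be expected to show up there.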
I have confirmed the patch in comment #4 fixes the bug. My bug is definitely caused by commit c21b48cc1bbf2f5af3ef54ada559f7fadf8b508b. My patched stock Fedora 24 4.11.10 works perfectly fine.
I'm trying to get some attention to this on LKML, we'll see how that goes.

Eric Dumazet <edumazet> on the netdev mailing list has notified me of a potential "proper" patch to fix my bug:
e44699d2c28067f69698ccb68dd3ddeacfebc434 ("net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()")
Indeed that patch also fixes my problem (after I took out my kludge c21b48 back-out patch), when applied to stock F24 4.11.10 src rpm.
I hadn't tested e44699 as it isn't in any Fedora 24 kernel yet. I'm not sure what vanilla version it is in but the patch is only 1 month old. Is it possible to find out if/when/what F24 kernel version will get this patch? Thanks!
Thanks for tracking it down. F24 is going to go EOL very soon so it's going to be getting very sparse updates. e44699d2c28067f69698ccb68dd3ddeacfebc434 is in 4.12 though so F26 and eventually F25 will be getting that patch.

Thanks! This might be a dumb question, but how can I tell what version of (a) vanilla kernel and/or (b) Fedora's kernel a specific commit is in?
I'll be switching to F25 soon and will look forward to that commit.

This message is a reminder that Fedora 24 is nearing its end of life. Approximately 2 (two) weeks from now Fedora will stop maintaining and issuing updates for Fedora 24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '24'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 24 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

(In reply to Trevor Cordes from comment #8)
> Thanks! This might be a dumb question, but how can I tell what version of
> (a) vanilla kernel and/or (b) Fedora's kernel a specific commit is in?
>
> I'll be switching to F25 soon and will look forward to that commit.
It's not a dumb question.
For vanilla kernels, if you have a git checkout of the tree you can do:
git describe --contains <commit sha1sum>
For a Fedora kernel, it's somewhat harder depending on things. We have an exploded tree, but if it's a backport the upstream sha1sum can't be used. So for specific patches or bugs, it's probably best just to ask. Also, I would recommend moving to Fedora 26.

I upgraded to F26 and got kernel-4.12.5-300.fc26.x86_64 from koji and my bug appears gone, so that commit must be in there. Note: current non-testing F26 kernel 4.11.11 still has the bug, that's why I had to grab koji.
I guess I close this bug as UPSTREAM(?) or wait until 4.12 gets out of testing and into updates proper?
P.S. These updates seem to also fix a similar bug (unreported) where my cifs mounts would hang in the exact same way... perhaps they were related.
Thanks!

Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.
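The `git describe --contains` suggestion in the reply above can be wrapped in a small script that also compares the resulting tag against a kernel release string. This is a rough sketch, not an authoritative check: the function names are made up, the sample describe output is hypothetical (the thread only says the commit is in 4.12), and, as noted in the reply, a Fedora or stable backport carries a different sha1sum, so a lower version number does not prove the fix is absent.

```python
import re
import subprocess


def describe_contains(commit, repo="."):
    """Run `git describe --contains <commit>` in repo and return its output."""
    result = subprocess.run(
        ["git", "-C", repo, "describe", "--contains", commit],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def tag_of(describe_output):
    """Drop the ~N / ^N ancestry suffix, leaving only the tag name."""
    return re.split(r"[~^]", describe_output, maxsplit=1)[0]


def version_tuple(text):
    """Extract (major, minor, patch) from a tag or a `uname -r` style string."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no kernel version found in {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def release_at_least(kernel_release, tag):
    """True if the release version is numerically >= the tag's version.

    Ignores -rc suffixes and distro backports, so treat the answer as a hint.
    """
    return version_tuple(kernel_release) >= version_tuple(tag)


# Hypothetical describe output; in a real checkout it would come from
# describe_contains("e44699d2c28067f69698ccb68dd3ddeacfebc434").
sample = "v4.12-rc1~10^2~5"
tag = tag_of(sample)

for release in ("4.12.5-300.fc26.x86_64", "4.11.11"):
    print(release, "->", release_at_least(release, tag))
```

With a kernel git checkout available, `describe_contains()` replaces the hard-coded sample; the comparison then matches what was observed in this bug, where 4.12.5 has the fix and 4.11.11 does not.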