Bug 1369600
| Summary: | Console spammed with receive_cb_reply log and sunrpc/sched.c warnings until OOM | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jason Tibbitts <j> |
| Component: | kernel | Assignee: | nfs-maint |
| kernel sub component: | NFS | QA Contact: | Filesystem QE <fs-qe> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | bfields, cww, dwysocha, jiyin, orion, pasik, spacewar, xzhou, yoyang |
| Version: | 7.2 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-26 21:07:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Jason Tibbitts
2016-08-23 22:45:34 UTC
Please make this bug public if possible; I can't seem to do that myself.

This has happened on another server this morning, so I know it isn't specific to that one machine. The hardware and OS install should be identical, however. Fortunately systemctl restart nfs seems to have cleared it up without any issues, and the clients seem to have recovered. So at least the workaround is relatively simple, if I can manage to catch the problem before the machine dies from OOM.

> net/sunrpc/sched.c:692
That's the RPC_IS_ASSASSINATED(task) check in rpc_exit_task. Maybe that means some cleanup isn't being done on rpc exit (e.g., rpc_release not getting called?), causing the leak?
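As a rough illustration of the exit path and the leak hypothesis described here, the sketch below models it in plain user-space C. The struct and function names are simplified stand-ins, not the actual net/sunrpc/sched.c code, and the "leak" is just an allocation that nothing frees unless a release hook runs:

```c
/* Toy model: cc -Wall -o exit_model exit_model.c && ./exit_model
 * Everything here is a simplified stand-in for the sunrpc structures,
 * written to illustrate the hypothesis above, not kernel source. */
#include <stdio.h>
#include <stdlib.h>

struct task_model {
    void (*tk_action)(struct task_model *); /* next step; NULL means "done" */
    void (*release)(struct task_model *);   /* analogue of rpc_release() */
    void *calldata;                         /* per-task allocation */
    int   assassinated;                     /* analogue of RPC_IS_ASSASSINATED(task) */
};

/* A completion callback that decides the task needs another round trip,
 * i.e. it re-arms tk_action instead of letting the task finish. */
static void done_but_restart(struct task_model *t)
{
    t->tk_action = done_but_restart;
}

/* Loose analogue of rpc_exit_task(): if the completion callback restarted
 * a task that has already been killed, warn -- loosely the condition
 * behind the net/sunrpc/sched.c:692 warning mentioned above. */
static void exit_task_model(struct task_model *t,
                            void (*call_done)(struct task_model *))
{
    t->tk_action = NULL;
    if (call_done) {
        call_done(t);
        if (t->tk_action && t->assassinated)
            fprintf(stderr, "WARN: killed task was restarted\n");
    }
}

int main(void)
{
    struct task_model t = { .calldata = malloc(64), .assassinated = 1 };

    exit_task_model(&t, done_but_restart); /* exercises the warning path */

    /* The leak hypothesis: if no release hook ever runs for the task,
     * calldata is simply lost; repeated often enough, that is the kind
     * of slow memory growth that ends in OOM. */
    if (t.release)
        t.release(&t);
    else
        free(t.calldata);
    return 0;
}
```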
And I wonder if there's something wrong with the backchannel locking that's letting the XIDs get out of sync.
If you have any suggestions I'm happy to try them, but sadly I cannot reproduce this at will. I know I can drop back to NFSv4.0 (which worked for years before v4.1 was enabled), but I was keeping things this way in the hope of being able to debug a bit. Of course it happens just often enough to cause real problems but not often enough to present obvious debugging opportunities.

I have a feeling we did actually fix some sort of callback race there recently, but I'm not finding it at the moment.... Of course, you could just run on an upstream kernel for a few days and see if that helps, but I don't know if that's practical for you.

I can do that, no problem, assuming that the userspace would actually work. I'm not at all sure how to generate a fresh kernel using the RHEL configs, though. I can try building a current Fedora kernel on EL7, but I have my doubts that what I'd get at the end would be useful. RHEL7 is pretty old now in Fedora terms.

Some further notes: Sometimes I can catch a client in the act of spamming the server. On such a client, this is generally the only thing in dmesg:

[100711.117385] NFS: state manager: check lease failed on NFSv4 server nas01 with error 13
[101050.513502] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[...repeated...]
[101050.542845] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[101050.547814] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[104950.859195] Callback slot table overflowed
[...repeated...]
[104971.209458] Callback slot table overflowed
[104971.210004] RPC: Could not send backchannel reply error: -105
[104971.243881] Callback slot table overflowed
[104971.289402] Callback slot table overflowed

I managed to capture some traffic from such a host and looked at the XIDs going over the wire, and what's odd is that _none_ of them correspond to the XIDs spewed in the receive_cb_reply lines. (The dumps from the captures are huge and they're made on a krb5p export, but I can supply one if that would be useful.) I've started exporting everything as krb5i, and for whatever reason the clients appear to prefer that, so I should be able to get some more useful captures in the future if necessary. In the meantime, elrepo makes upstream kernels easy to get, so I'll try that as soon as I can squeeze in a reboot.

OK, I installed 4.7.2 from elrepo and it fell over in about two minutes with:

kernel BUG at mm/slub.c:3657

So I'm now running 4.4.19 from elrepo and it seems to be OK, but it's only been fifteen minutes. The other official CentOS kernels fail to boot due to some problem with dracut which causes it to not include the proper modules. I need to rebuild the boot images before I'll be able to try them.

And, sadly, 4.4.19 seemed to have worked but appears to have some incompatibility somewhere. I had to modify /etc/idmapd.conf to add Domain = whatever to get something to stop complaining about not being able to map uids, and then users somehow can't actually see their home directories at login even though they appear to mount just fine. So that's out.

Chuck Lever on IRC suggested that some networking issue could be to blame, and while that hardware has transferred petabytes of information over its lifetime, I notice that the 327.28.2 kernel has a number of changes to the e1000 driver. I need to stick a 10G card in this machine at some point anyway, so I'll go ahead and do that tonight and see if the situation changes.

I installed a 10G Ethernet card and switched over to it; no change in behavior.
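As an aside, the /etc/idmapd.conf change mentioned a few paragraphs up is the Domain key in the [General] section. A minimal example of that kind of entry follows; the domain value is a placeholder, not the reporter's actual setting:

```
# /etc/idmapd.conf on the client -- only the Domain line needs adding.
# "example.com" is a placeholder; it should match the NFSv4 idmapping
# domain used by the server.
[General]
Domain = example.com
```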
I took a hard look at my logs and found that the problem started on two hosts on the morning and early afternoon of August 18. Now, I updated pretty much all of my clients from Fedora 23 to 24 the previous evening, and the log spew on the servers started pretty much when people started working. (A faculty file server at 9AM and then a graduate student server at noon, which makes a lot of sense.)

In the F24 update, the kernel didn't really change; the version's the same and I don't think the config is any different. nfs-utils was updated, though, and that strikes me as the most suspicious thing. The version before the updates was 1.3.3-8.rc5, which I think was really a release candidate of 1.3.4. (The nfs-utils versioning in Fedora seems pretty broken.) After the update, it's 1.3.4-1.rc1 (which I think is actually a release candidate of 1.3.5, aargh). So... could a difference in nfs-utils on the client possibly have anything to do with this? The client only really uses rpc.gssd and mount.nfs, doesn't it? rpc.gssd especially sticks out.

So I'll concentrate on the clients now, and try a bit to get 4.4.19 actually working on the server. But any ideas would be greatly appreciated.

(In reply to Jason Tibbitts from comment #11)
> In the F24 update, the kernel didn't really change; the version's the same
> and I don't think the config is any different. nfs-utils was updated,
> though, and that strikes me as the most suspicious thing. The version before
> the updates was 1.3.3-8.rc5, which I think was really a release candidate of
> 1.3.4. (The nfs-utils versioning in Fedora seems pretty broken.) After the
> update, it's 1.3.4-1.rc1 (which I think is actually a release candidate of
> 1.3.5, aargh).

Hm. From a quick look, there might have been bugfixes to version negotiation, so maybe the protocol minor version used by those clients changed? (Also, if you want to argue with SteveD about nfs-utils versioning, open a bug against nfs-utils and that'll go to him.)

Well, sadly, this morning I downgraded nfs-utils on all client machines and restarted rpc.gssd, and the problem still occurred. Of course, there were existing active mounts, so it could have been some residual thing. I'll leave it that way and weed out problematic clients.

Really the only thing I can think of doing now is booting the clients into the F23 kernel. Sadly I can't use the F23 version of nfs-utils due to libtirpc being updated. (Which I guess is something else to look at.) I'm so out of ideas that I'm even considering subtle miscompilation of things by gcc6. (F23 had gcc5.)

But really what I need to do is just disable nfs4.1 on the server. If that still shows the problem then something is seriously off. I've just been avoiding it because I wanted to better understand the 4.1 problem before just punting. Anyway, that's a job for tonight. Good thing it's Friday as the load will be lower.

I was just going to bug steved the next time I see him on IRC. That grousing was just me with no sleep, and the fact that I maintain those Fedora packaging guidelines, so I like to see packages following them.

And the latest. From a packet capture on a krb5i mount, the client gets into a state where it spams the server mercilessly with BIND_CONN_TO_SESSION. The server responds with NFS4_OK to each, and eventually dies.
Chuck found that the description of upstream kernel commit bdc54d8e3cb4a41dddcabfd86d9eb3aa5f622b75, which fixes a client problem that went into 4.6 rc1, seems very related:

> If there were less than 2 entries in the multipath list, then
> xprt_iter_next_entry_multiple() would never advance beyond the first entry,
> which is correct for round robin behaviour, but not for the list iteration.
> The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
> as we would never see the xprt == NULL condition fulfilled.

I have no idea why clients running Fedora 23 never had this problem given that they're running a 4.6 kernel, but I've deployed a kernel with this patch to all of my clients, so we'll see how things work come Monday. With luck we'll at least know what client behavior is causing the server to die.

Everything seems to be fine after that client patch. At least, the clients are no longer tickling the server issue. Of course, this doesn't actually fix the server issue, but for now at least I can get on with life.

Thanks for the report. Unfortunately I think it's likely then that there are multiple issues:

- the client bug causing the excessive BIND_CONN_TO_SESSION calls (sounds like you've identified the bug and the fix, thanks);
- a server leak in the case of excessive BIND_CONN_TO_SESSION. It could be as simple as the server lacking any limit on the number of bound connections, so setting some per-session (or per-client) limit may be all that's necessary;
- the "unrecognized reply" messages: we don't know if this was just another consequence of the client bug, or a separate client or server bug causing the callback XID stream to get out of sync in the presence of multiple connections. A network trace might help sort that out.

I was having other NFS problems (client-side crashes under heavy NFS load), and upgraded my CentOS 7 file server and Fedora 24 clients on the same day. That eliminated the client-side crashes, but I experienced the server-side connection failures (with "Got unrecognized reply" continuously spewing in the log) about 20-40 times a day. I mistakenly thought it was the file server upgrade that caused the problem, and wasted a bunch of time trying to debug that. Now that I've upgraded the kernel on the F24 clients to 4.7.4-200, I've gone a day without seeing the issue.

It would still be nice to get the server-side issue fixed. In my experience it is really easy to provoke the issue with clients running the F24 4.6.x kernels, including 4.6.4-301 and 4.6.5-300, when the clients have home directories NFS-mounted from the file server.

Closing INSUFFICIENT_DATA. Please re-open if you have new information, and especially if you can reproduce on a later RHEL7 kernel (such as 3.10.0-957.el7 or above).
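For reference, the failure mode described in the commit message quoted earlier in this report can be illustrated with a small self-contained sketch: a "next entry" helper that behaves like a round-robin picker never returns NULL for a one-entry list, so a caller that loops until NULL never terminates. The names below are simplified stand-ins for xprt_iter_next_entry_multiple() and rpc_clnt_iterate_for_each_xprt(), not the actual sunrpc code:

```c
/* Build: cc -Wall -o iter_demo iter_demo.c && ./iter_demo */
#include <stdio.h>

struct entry { const char *name; struct entry *next; };
struct iter  { struct entry *head; struct entry *pos; };

/* Round-robin style "next": for a one-entry list it keeps returning that
 * same entry forever, which is the behaviour the commit message describes. */
static struct entry *next_round_robin(struct iter *it)
{
    it->pos = (it->pos && it->pos->next) ? it->pos->next : it->head;
    return it->pos;
}

/* List-iteration style "next": advances once and returns NULL at the end,
 * which is what a caller that loops until NULL actually needs. */
static struct entry *next_listiter(struct iter *it)
{
    it->pos = it->pos ? it->pos->next : it->head;
    return it->pos;
}

/* Analogue of the iterate-for-each caller: loops until "next" yields NULL.
 * With the round-robin helper and a single entry this would never stop,
 * so the demo bails out after a few rounds to show the problem. */
static void iterate_for_each(struct iter *it,
                             struct entry *(*next)(struct iter *))
{
    int guard = 0;
    for (struct entry *e = next(it); e != NULL; e = next(it)) {
        printf("visiting %s\n", e->name);
        if (++guard == 5) {
            printf("  ...still looping, giving up (infinite loop)\n");
            return;
        }
    }
    printf("  iteration terminated normally\n");
}

int main(void)
{
    struct entry only = { "xprt0", NULL };   /* fewer than 2 entries */
    struct iter it = { &only, NULL };

    puts("broken (round-robin next used for plain iteration):");
    iterate_for_each(&it, next_round_robin);

    it.pos = NULL;                           /* reset the iterator */
    puts("fixed (list-iteration next):");
    iterate_for_each(&it, next_listiter);
    return 0;
}
```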