Description of problem:
On some of our NFS clients (computing servers) we get many messages like the following, several times a minute but not at a fixed interval:

kernel: NFS: state manager: check lease failed on NFSv4 server mynfsserver with error 13

My colleagues told me that their processes were aborted. We also got the following message twice on one of the NFS clients:

kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!

On one NFS server we got the following message twice. It occurred during the period in which the NFS clients reported many of the "check lease failed" messages above, but I have not found exactly matching time stamps on the NFS clients.

kernel: rpc-srv/tcp: nfsd: got error -104 when sending 24 bytes

I have no idea what could cause these messages. They started suddenly yesterday evening, at nearly the same time on all affected NFS clients. dnf-automatic runs at night, so the NFS problem did not occur immediately after package updates.

I have configured nfsd on the file server to use 256 threads, which I thought should be enough. Can too few threads cause the listed error messages?

I am sorry that I have no more messages about the error; journalctl shows no more information than this.

Version-Release number of selected component (if applicable):
kernel-4.4.4-301.fc23.x86_64
nfs-utils-1.3.3-7.rc4.fc23.x86_64

How reproducible:
I don't know how to reproduce it.
I just want to note that we have rebooted one NFS client and one server. Both now run kernel-4.4.6-300.fc23.x86_64 with all updates applied, and they have been running this kernel for about half a day. So far the error has not occurred again, but I don't know whether the problem is solved or will just occur later...
Another note: we first rebooted only the NFS client with the new kernel. This did not solve the problem; the error occurred again. The NFS file system was still mounted (according to "mount"), but it was not listed in the output of "df", and accessing the path with "ls" printed an error (permission denied). After rebooting the NFS server, the remote file system was mounted and accessible again on the NFS client, as if there had never been a problem. Rebooting the NFS server solves the problem (for the moment, at least). I don't know whether this is due to the new kernel or to the reboot itself. We use autofs to access the remote file systems.
The number of threads shouldn't be an issue. "kernel: NFS: state manager: check lease failed on NFSv4 server mynfsserver with error 13" Haven't checked what check_lease does, but probably a RENEW or SEQUENCE. And 13 is NFS4ERR_ACCESS. That's odd. Are you using kerberos? What version of the NFS protocol? (v4, v4.1, v4.2?)
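For reference, the numeric codes in the two kernel messages can be looked up with a few lines of Python. This is only a sketch: the tables are small hand-copied subsets (Linux errno values and NFSv4 status codes as listed in RFC 7530), and the helper name decode is made up for illustration. Note that 13 means an access error in both tables (EACCES and NFS4ERR_ACCESS).

```python
# Linux errno values, hard-coded because the numbers differ on other platforms.
LINUX_ERRNO = {
    13: "EACCES",
    104: "ECONNRESET",
}

# Subset of the NFSv4 status codes defined in RFC 7530.
NFS4_STATUS = {
    0: "NFS4_OK",
    1: "NFS4ERR_PERM",
    13: "NFS4ERR_ACCESS",
    10011: "NFS4ERR_EXPIRED",
    10022: "NFS4ERR_STALE_CLIENTID",
}


def decode(code, table):
    """Return the symbolic name for a code; kernel logs often print it negated."""
    return table.get(abs(code), "unknown (%d)" % code)


# Client side: "check lease failed ... with error 13"
print("client, check lease failed:", decode(13, NFS4_STATUS))
# Server side: "nfsd: got error -104 when sending 24 bytes"
print("server, nfsd send error:   ", decode(-104, LINUX_ERRNO))
```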
We use NIS for hosts, netgroups, etc., and for the automount table too. We are planning to move to FreeIPA, but currently we don't use Kerberos; we use "sys" authentication over NFSv4. We use the Fedora 23 default, that is, NFS version 4.2. Here is one line of "mount" output from an NFS client (IP addresses and names substituted for privacy):

mynfsserver:/fs/arbeitsdaten30 on /mount/arbeitsdaten30 type nfs4 (rw,nosuid,relatime,seclabel,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.0.2.2,local_lock=none,addr=192.0.2.1)
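To compare the negotiated options across several clients, a "mount" line like the one above can be split into a dictionary. A minimal sketch, assuming the "SOURCE on TARGET type FSTYPE (OPTIONS)" layout shown above; the function name parse_mount_line is hypothetical and not part of any NFS tool.

```python
import re

# Layout printed by mount: "<source> on <target> type <fstype> (<options>)"
MOUNT_RE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<opts>.*)\)$"
)


def parse_mount_line(line):
    """Split one line of `mount` output into its fields and an options dict."""
    match = MOUNT_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised mount line: %r" % (line,))
    options = {}
    for item in match.group("opts").split(","):
        key, sep, value = item.partition("=")
        # Flags such as "rw" or "hard" carry no value; store True for them.
        options[key] = value if sep else True
    fields = match.groupdict()
    del fields["opts"]
    fields["options"] = options
    return fields


LINE = (
    "mynfsserver:/fs/arbeitsdaten30 on /mount/arbeitsdaten30 type nfs4 "
    "(rw,nosuid,relatime,seclabel,vers=4.2,rsize=1048576,wsize=1048576,"
    "namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,"
    "clientaddr=192.0.2.2,local_lock=none,addr=192.0.2.1)"
)

info = parse_mount_line(LINE)
# Print the options most relevant to this report.
for key in ("vers", "sec", "proto", "hard", "timeo", "retrans"):
    print(key, "=", info["options"][key])
```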
"kernel: rpc-srv/tcp: nfsd: got error -104 when sending 24 bytes" Error 104 ECONNRESET. Maybe a connection problem, and the client is then trying to re-connect and getting EACCES back. It would be interesting to see a network capture between the client and server during this problem. Any selinux avc denials during that time?
(In reply to Benjamin Coddington from comment #5)
> Any selinux avc denials during that time?

I have checked the audit logs with ausearch, selecting a time range around the listed error messages and filtering out lines with result=success. The only remaining lines (apart from time stamp lines) are the following:

type=LOGIN msg=audit(1459266608.930:35660): pid=31564 uid=0 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 old-auid=4294967295 auid=12344 old-ses=4294967295 ses=6 res=1
type=USER_AVC msg=audit(1459266609.212:35664): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='Unknown permission start for class system exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'

I am sorry, but I have no network capture from the time of those events. These messages have not appeared again since yesterday.
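The same kind of check can be scripted when the raw audit text is saved to a file. The sketch below is an illustration only, not the ausearch invocation used above: it keeps records whose type ends in "AVC" and converts the audit time stamp to UTC so it can be compared with the kernel messages. The function name avc_records is hypothetical, and the sample text is the two records quoted above.

```python
import re
from datetime import datetime, timezone

# Raw audit records start with "type=<TYPE> msg=audit(<epoch>.<ms>:<serial>):"
RECORD_RE = re.compile(
    r"^type=(?P<type>\S+) msg=audit\((?P<ts>\d+\.\d+):(?P<serial>\d+)\):"
)


def avc_records(text):
    """Return (utc_time, record_type, serial) for every AVC-type record."""
    hits = []
    for line in text.splitlines():
        match = RECORD_RE.match(line)
        if match and match.group("type").endswith("AVC"):
            when = datetime.fromtimestamp(float(match.group("ts")), tz=timezone.utc)
            hits.append((when, match.group("type"), int(match.group("serial"))))
    return hits


SAMPLE = """\
type=LOGIN msg=audit(1459266608.930:35660): pid=31564 uid=0 subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 old-auid=4294967295 auid=12344 old-ses=4294967295 ses=6 res=1
type=USER_AVC msg=audit(1459266609.212:35664): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='Unknown permission start for class system exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
"""

hits = avc_records(SAMPLE)
for when, record_type, serial in hits:
    print(when.isoformat(), record_type, serial)
```

In this sample the only match is the USER_AVC record emitted by systemd.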
(In reply to Edgar Hoch from comment #1)
> Now both run with kernel-4.4.6-300.fc23.x86_64 and all updates applied.

We have rebooted another NFS client and server and started I/O-intensive jobs. The error messages occurred again, and the NFS-mounted partition was not accessible for a while (as described in comment #2). After some minutes, when the jobs had finished, the NFS partition was accessible again without rebooting, unmounting, etc. So kernel-4.4.6-300.fc23.x86_64 does not solve the problem.
We had the idea that the NFS server denies the NFS client access to the mounted partition because it may have a problem with authentication or authorization of the client.

Since Fedora 23 we use sssd instead of nscd for caching our NIS data. nscd still has a problem with initgroups, see bug #1294574. Yesterday I temporarily stopped sssd on the NFS server and started nscd. My colleague started her I/O-intensive computing jobs, and she told me that they finished without problems. The logs do not contain the error message. It may be coincidence that the error did not appear, or it may be due to nscd; I don't know.

The problem with sssd is that it does not cache the hosts table. In the NIS netgroup table we use short host names that match the short host names in the NIS hosts table. Could it be that, under high network traffic, access to a remote NIS server is too slow, so that the NFS server does not know the host name of the NFS client or cannot match it in the netgroup? We use netgroups in the NFS export list.

But with nscd our user processes have only their primary group; the other groups are not assigned to the processes. So this is not a solution for us.

I have noticed that sssd logged a message that /etc/netgroup does not exist. It seems that this file does not exist by default in Fedora. We don't need it because we enter all netgroups in the NIS database. I have now created an empty /etc/netgroup, and the sssd warning has disappeared. I don't know whether the missing file switched off the caching of netgroups from NIS. I think these are different databases that should not affect each other, but I am not sure. We will now run another test with this change.

Another idea is to set up a NIS slave server on each of our NFS file servers. Then ypbind can connect to ypserv on localhost, so no real network traffic and no caching is necessary. We may try this next.
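One way to test the slow-lookup idea would be to time name lookups on the NFS server while the I/O-intensive jobs run. The sketch below assumes that the export check depends on resolving the client's address to a host name through the system resolver (and therefore NIS, as configured in nsswitch.conf). The helper name timed_call is made up for illustration, and the client addresses to probe are passed on the command line.

```python
import socket
import sys
import time


def timed_call(func, *args):
    """Call func(*args); return (result or None on OSError, elapsed seconds)."""
    start = time.monotonic()
    try:
        result = func(*args)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        result = None
    return result, time.monotonic() - start


# Usage: python3 lookup_time.py 192.0.2.2 [more client addresses...]
# With no arguments the loop does nothing.
for address in sys.argv[1:]:
    result, elapsed = timed_call(socket.gethostbyaddr, address)
    name = result[0] if result else "lookup failed"
    print("%s -> %s (%.3f s)" % (address, name, elapsed))
```

Running this in a loop during a job and comparing the timings with the time stamps of the "check lease failed" messages would show whether slow or failing lookups coincide with the errors.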
I just want to note that on two NFS servers I found a message like this (only once between boots):

kernel: perf interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

I don't know whether it is related to the problem, but it occurred in the same time range as the error messages listed above on the NFS clients.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.

Fedora 23 has now been rebased to 4.7.4-100.fc23. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 24 or 25 and are still experiencing this issue, please change the version to Fedora 24 or 25. If you experience different issues, please open a new bug report for those.
We have upgraded to Fedora 24 in the meantime. I can say that the error occurred on kernels 4.6.4-301.fc24.x86_64 and 4.6.5-300.fc24.x86_64; this was some weeks ago. I have now rebooted some NFS clients with kernel 4.7.5-200.fc24.x86_64. I cannot reboot the NFS server at the moment because of running processes. I will monitor our NFS clients and server to see whether the error occurs again. Perhaps we can close this bug report, and I will reopen it if I find new occurrences of the problem.
Thank you for letting us know. I'm going to close this bug and we can re-open if the problem shows up again.