Description of problem:
We have 20 computers and use autofs for NFS mounts (/home etc.). Almost every day since the autofs upgrade, automount freezes on some of the computers, so users can't log in, and ls on the mount points freezes too. The automount process can only be killed with kill -9; after a restart it freezes again, and a full system reboot is needed before it works again. I downgraded the package and it works fine now.

Version-Release number of selected component:
5.1.4-5.fc26.x86_64

Additional info:
Please ask me for additional info; I'm not sure what is relevant.
It doesn't look like I've missed any bug fixes in the package, so enabling debug logging and posting the result is probably the best way to start the investigation.

Things have changed now that systemd is used. If you have a busy site, logging may get throttled, causing a loss of log information, but let's see whether that's a problem before worrying about it.

With systemd, all you should need to do is set "logging = debug" in /etc/autofs.conf (and restart autofs) to get the logging. Then produce the log after you see a hang by using journalctl. I'm not sure what options to use with journalctl; I usually use something like "journalctl -ae" or "journalctl -xe" and pipe that to "grep automount", since the autofs log entries will always have "automount" present in them. The systemd journal might be really large (it may have been saving log information for a long time); the only thing I can see is to use "-n <number of log lines>" with journalctl to restrict output to a more recent time frame. A rough sketch of the whole sequence follows below.

I hope this is enough to get you started.
Ian
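Putting that together, a minimal sketch of the sequence (the journalctl options are a judgment call, and the -n count is arbitrary; adjust it to your journal size):

  # /etc/autofs.conf, in the [ autofs ] section
  logging = debug

  # restart the daemon so the setting takes effect
  systemctl restart autofs

  # after a hang, pull the recent automount entries out of the journal
  journalctl -ae -n 50000 | grep automount > autofs-debug.log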
Created attachment 1405310 [details] [06-03-2018 19:05:23] SERVICE ALERT: nymfe01;SNMP
Created attachment 1405311 [details] [06-03-2018 19:05:23] SERVICE ALERT: nymfe07;SNMP
Created attachment 1405313 [details] [06-03-2018 18:05:43] SERVICE ALERT: nymfe13;SNMP
Created attachment 1405315 [details] [07-03-2018 10:05:43] SERVICE ALERT: nymfe09;SNMP
Created attachment 1405316 [details] OK nymfe22
Thanks, I have enabled debug logging. I used "journalctl -au autofs". See the attached logs.

Some remarks on when automount stopped working: we use Nagios to check snmpd once per hour, and snmpd hangs when it enters an automounted directory. So when snmpd hangs, automount has already stopped working. For example, Nagios found at 19:05 that snmpd wasn't responding, which means snmpd stopped working between 18:05 and 19:05, and automount stopped working before snmpd did.

I have attached logs for nymfe22, which is working correctly, so you can compare.

You can see that automount really isn't responding at 22:15 when the computers shut down:

Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Killing process 29283 (automount) with signal SIGKILL.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: Stopped Automounts filesystems on demand.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Unit entered failed state.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Failed with result 'timeout'.
-- Reboot --
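For illustration, the kind of manual check that exposes the hang (a sketch only; the path and timeout are arbitrary, since ls on a wedged automount otherwise blocks forever):

  # bound the access with a timeout so the shell doesn't hang too
  timeout 10 ls /home/xsladovn && echo ok || echo "hung or failed"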
(In reply to matousejem from comment #7)
> Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> See the attached logs.
>
> Some remarks on when automount stopped working: we use Nagios to check
> snmpd once per hour, and snmpd hangs when it enters an automounted
> directory. So when snmpd hangs, automount has already stopped working.
> For example, Nagios found at 19:05 that snmpd wasn't responding, which
> means snmpd stopped working between 18:05 and 19:05, and automount
> stopped working before snmpd did.

Maybe I need more information.

You can see that:
Mar 06 18:55:12 /home/xsladovn gets mounted
Mar 06 19:15:03 /home/xbures3 gets mounted

So mounts are happening around this time.

> I have attached logs for nymfe22, which is working correctly, so you
> can compare.
>
> You can see that automount really isn't responding at 22:15 when the
> computers shut down:
>
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Killing process 29283 (automount) with signal SIGKILL.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: Stopped Automounts filesystems on demand.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Unit entered failed state.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Failed with result 'timeout'.
> -- Reboot --

But it's clear that automount didn't shut down at reboot.

There aren't that many log messages in that area, but I'll try to work out where automount is stopping. Perhaps that will give me a lead to follow.

Ian
(In reply to Ian Kent from comment #8)
> (In reply to matousejem from comment #7)
> > Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> > See the attached logs.
> >
> > Some remarks on when automount stopped working: we use Nagios to check
> > snmpd once per hour, and snmpd hangs when it enters an automounted
> > directory. So when snmpd hangs, automount has already stopped working.
> > For example, Nagios found at 19:05 that snmpd wasn't responding, which
> > means snmpd stopped working between 18:05 and 19:05, and automount
> > stopped working before snmpd did.
>
> Maybe I need more information.
>
> You can see that:
> Mar 06 18:55:12 /home/xsladovn gets mounted
> Mar 06 19:15:03 /home/xbures3 gets mounted
>
> So mounts are happening around this time.

But you did get alerts at this time.

And the log claims that both these mounts were successfully mounted. Is it possible to get a copy of /proc/mounts at the time of the alert, to verify the mount did in fact succeed? Can you work out what directory the monitoring system was attempting to access?

Once something gets mounted, autofs doesn't (shouldn't) come into play in subsequent accesses to the mounted file system.
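For example, something along these lines at alert time would do (the snapshot file name is just an illustration):

  # snapshot the kernel's view of current mounts when the alert fires
  cat /proc/mounts > /tmp/mounts-$(hostname)-$(date +%s)

  # and check the suspect home directory directly
  grep /home/xsladovn /proc/mounts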
(In reply to Ian Kent from comment #9)
> (In reply to Ian Kent from comment #8)
> > (In reply to matousejem from comment #7)
> > > Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> > > See the attached logs.
> > >
> > > Some remarks on when automount stopped working: we use Nagios to check
> > > snmpd once per hour, and snmpd hangs when it enters an automounted
> > > directory. So when snmpd hangs, automount has already stopped working.
> > > For example, Nagios found at 19:05 that snmpd wasn't responding, which
> > > means snmpd stopped working between 18:05 and 19:05, and automount
> > > stopped working before snmpd did.
> >
> > Maybe I need more information.
> >
> > You can see that:
> > Mar 06 18:55:12 /home/xsladovn gets mounted
> > Mar 06 19:15:03 /home/xbures3 gets mounted
> >
> > So mounts are happening around this time.
>
> But you did get alerts at this time.
>
> And the log claims that both these mounts were successfully mounted.
> Is it possible to get a copy of /proc/mounts at the time of the alert,
> to verify the mount did in fact succeed?
> Can you work out what directory the monitoring system was attempting
> to access?
>
> Once something gets mounted, autofs doesn't (shouldn't) come into
> play in subsequent accesses to the mounted file system.

No, it doesn't get mounted. There is a difference between the logs of successful and unsuccessful mounts.

A successful mount looks like this:
Mar 06 15:01:40 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): calling mount -t nfs -s -o rw,soft,intr,actimeo=30,noatime,vers=4.1,sec=sys home.fi.muni.cz:/export/home/xkoprna /home/xkoprna
Mar 06 15:01:40 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): mounted home.fi.muni.cz:/export/home/xkoprna on /home/xkoprna

An unsuccessful one:
Mar 06 18:55:12 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): calling mount -t nfs -s -o rw,soft,intr,actimeo=30,noatime,vers=4.1,sec=sys home.fi.muni.cz:/export/home/xsladovn /home/xsladovn

In this case the message isn't followed by anything like "mounted home.fi.muni.cz:/export/home/[...]". The same applies to xbures3.
Created attachment 1407310 [details]
Nymfe12 logs

root@nymfe12:~# last
root     [...]
xkopeck5 :0           Mon Mar 12 14:02 - 14:28  (00:26)

[12-03-2018 14:34:15] SERVICE ALERT: nymfe12;SNMP

At 14:48 I killed snmpd with 'kill -9'.

At 14:54 I tried to log in, i.e. to mount /home/xejem. It hung; no success.

[12-03-2018 15:36:45] SERVICE ALERT: nymfe12;SNMP

An strace of snmpd hangs at 'statfs("/home/xkopeck5", '.

At 16:11 I did 'cat /proc/mounts > nymfe12mounts'.
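For reference, the strace was along these lines (a reconstruction; the exact flags may have differed):

  # attach to the running snmpd and follow forked children
  strace -f -p $(pidof snmpd)
  ...
  statfs("/home/xkopeck5",     <- blocks here and never returns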
Created attachment 1407311 [details] nymfe12mounts
(In reply to matousejem from comment #11)
> Created attachment 1407310 [details]
> Nymfe12 logs
>
> root@nymfe12:~# last
> root     [...]
> xkopeck5 :0           Mon Mar 12 14:02 - 14:28  (00:26)
>
> [12-03-2018 14:34:15] SERVICE ALERT: nymfe12;SNMP
>
> At 14:48 I killed snmpd with 'kill -9'.
>
> At 14:54 I tried to log in, i.e. to mount /home/xejem. It hung; no success.
>
> [12-03-2018 15:36:45] SERVICE ALERT: nymfe12;SNMP
>
> An strace of snmpd hangs at 'statfs("/home/xkopeck5", '.
>
> At 16:11 I did 'cat /proc/mounts > nymfe12mounts'.

Right, so you're saying that snmpd tries to access /home/xejem and hangs (and the mount fails to complete), and further accesses to /home automounts hang.

I don't understand why /home/xkopeck5 gets mounted, apparently after automount is hung, and yet still blocks on access. That's an NFS problem, not an automount problem. I've seen something similar a couple of times fairly recently, but not on any Fedora versions ...

I'd like to see a list of running processes when this happens as well.

Please provide a core of the hung automount process (taken with gcore) and a list of all installed packages, so I can duplicate your install closely enough to get backtraces of the automount threads from the provided core.

Alternatively, to save time, you could try to get the backtraces yourself. If automount is deadlocked in some way you might not be able to, and getting a core might be the only way to run gdb against the blocked automount.

If you do run gdb on the running automount, or on a core taken with gcore, please make sure you have installed any debug packages that are needed. Most of these should be installed by running "debuginfo-install autofs". If the needed debuginfo packages aren't installed, the backtraces will be useless.

To get the backtraces use "thr a a bt" within gdb, assuming you can get a prompt; a sketch of the sequence follows below.
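A minimal sketch of the whole sequence (assuming the daemon binary lives at /usr/sbin/automount; gcore appends the pid to the output file name):

  # install debug symbols first, otherwise the backtraces are useless
  debuginfo-install autofs

  # take a core of the hung daemon without killing it
  gcore -o /tmp/automount.core $(pidof automount)

  # load the core and dump backtraces of all threads
  gdb /usr/sbin/automount /tmp/automount.core.<pid>
  (gdb) thr a a bt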
Important update: I can confirm it is an NFS problem. The freezing happens even after downgrading to the previous version of autofs; subjectively it happens less often, but it still happens.

We found a solution to our problems: after we switched the NFS version from 4.1 to 4.0, everything works fine. See the option change below.

Note 1: 'vers=4' probably means "4.whatever the client negotiates", so the version has to be set explicitly to 4.0.

Note 2: When a client machine freezes, the logs on the server fill with messages like this one, tens per second (the timestamp is in the Czech locale; bře = March):
bře 15 11:39:43 anxur.fi.muni.cz kernel: RPC request reserved 84 but used 344
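For the record, the change amounts to one mount option, taken from the mount command visible in our logs (the map layout itself is omitted):

  # hangs:
  rw,soft,intr,actimeo=30,noatime,vers=4.1,sec=sys
  # works:
  rw,soft,intr,actimeo=30,noatime,vers=4.0,sec=sys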
This message is a reminder that Fedora 26 is nearing its end of life. Approximately four weeks from now Fedora will stop maintaining and issuing updates for Fedora 26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '26'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 26 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 26 changed to end-of-life (EOL) status on 2018-05-29. Fedora 26 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen it against that version. If you are unable to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
Our Fedora 28 server today got log-spammed with hundreds of "RPC request reserved 116 but used 268". We are on 4.17.19-200.fc28.x86_64. Are there any other logs I can provide?
(In reply to RobbieTheK from comment #17)
> Our Fedora 28 server today got log-spammed with hundreds of "RPC request
> reserved 116 but used 268". We are on 4.17.19-200.fc28.x86_64. Are there
> any other logs I can provide?

As in comment #14, this sounds like a kernel NFS problem. It's probably worth looking for NFS bugs to see if it's being worked on.
The exact error is mentioned in a kernel.org thread at https://lore.kernel.org/lkml/20170427173943.GD29622@ZenIV.linux.org.uk/, but I'm not sure a fix was ever made. There's a mention on the Debian bug page at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898060 and on Ubuntu's Launchpad at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1785788, all sounding very similar (just different reserved numbers).

The only kernel errors I'm seeing from the boot are unrelated:
Sep 04 15:59:47 ourdomain.edu kernel: ACPI: SPCR: Unexpected SPCR Access Width. Defaulting to byte size
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: No handler for Region [SYSI] (000000000e7692ae) [IPMI] (20180313/evregion-132)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Region IPMI (ID=7) has no handler (20180313/exfldio-265)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._GHL, AE_NOT_EXIST (20180313/psparse-516)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._PMC, AE_NOT_EXIST (20180313/psparse-516)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20180313/power_meter-755)