Bug 1552037 - Automount sometimes freezes until reboot
Summary: Automount sometimes freezes until reboot
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: autofs
Version: 26
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ian Kent
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-06 11:44 UTC by matousejem
Modified: 2018-11-28 16:45 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-05-29 11:51:27 UTC
Type: Bug
Embargoed:


Attachments
[06-03-2018 19:05:23] SERVICE ALERT: nymfe01;SNMP (614.95 KB, text/x-vhdl)
2018-03-07 12:37 UTC, matousejem
[06-03-2018 19:05:23] SERVICE ALERT: nymfe07;SNMP (472.50 KB, text/x-vhdl)
2018-03-07 12:38 UTC, matousejem
[06-03-2018 18:05:43] SERVICE ALERT: nymfe13;SNMP (483.11 KB, text/x-vhdl)
2018-03-07 12:39 UTC, matousejem
[07-03-2018 10:05:43] SERVICE ALERT: nymfe09;SNMP (232.92 KB, text/x-vhdl)
2018-03-07 12:39 UTC, matousejem
OK nymfe22 (521.66 KB, text/x-vhdl)
2018-03-07 12:40 UTC, matousejem
Nymfe12 logs (89.13 KB, text/plain)
2018-03-12 17:53 UTC, matousejem
nymfe12mounts (3.66 KB, text/plain)
2018-03-12 17:55 UTC, matousejem

Description matousejem 2018-03-06 11:44:34 UTC
Description of problem:
We have 20 computers and we use autofs for NFS mounts (/home etc.). Almost every day since upgrading autofs, automount freezes on some computers, so users can't log in and ls on the mountpoints freezes too. The automount process can only be killed with kill -9; after a restart it freezes again, and a system reboot is needed before it works again. I downgraded the package and it works fine now.

Version-Release number of selected component:
5.1.4-5.fc26.x86_64

Additional info:
Please ask me for additional info; I'm not sure what is relevant.

Comment 1 Ian Kent 2018-03-06 12:22:53 UTC
It doesn't look like I've missed any bug fixes in the package.

Enabling debug logging and posting it is probably the best way
to start the investigation.

Things have changed now that systemd is used.

If you have a busy site, logging may get throttled, causing a
loss of log information, but let's see if that's a problem
before worrying about it.

With systemd all you should need to do is set "logging = debug"
in /etc/autofs.conf to get the logging (and restart autofs).

Then produce the log after you see a hang by using journalctl.

I'm not sure what options to use with journalctl.

I usually use something like "journalctl -ae" or "journalctl -xe"
and pipe that to "grep automount" since the autofs log entries
will always have automount present in them.

The systemd journal might be really large (it may have been
saving log information for a long time). The only option I can
see is to use "-n <number of log lines>" with journalctl to
restrict the output to a more recent time frame.
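
Putting that together, a minimal example (the line count and the
output file name are just what I'd try first):

    # /etc/autofs.conf: set "logging = debug", then restart the daemon
    systemctl restart autofs

    # after you see a hang, pull recent autofs entries from the journal
    journalctl -ae -n 10000 | grep automount > autofs-debug.log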

I hope this is enough to get you started.
Ian

Comment 2 matousejem 2018-03-07 12:37:41 UTC
Created attachment 1405310 [details]
[06-03-2018 19:05:23] SERVICE ALERT: nymfe01;SNMP

Comment 3 matousejem 2018-03-07 12:38:55 UTC
Created attachment 1405311 [details]
[06-03-2018 19:05:23] SERVICE ALERT: nymfe07;SNMP

Comment 4 matousejem 2018-03-07 12:39:28 UTC
Created attachment 1405313 [details]
[06-03-2018 18:05:43] SERVICE ALERT: nymfe13;SNMP

Comment 5 matousejem 2018-03-07 12:39:55 UTC
Created attachment 1405315 [details]
[07-03-2018 10:05:43] SERVICE ALERT: nymfe09;SNMP

Comment 6 matousejem 2018-03-07 12:40:27 UTC
Created attachment 1405316 [details]
OK nymfe22

Comment 7 matousejem 2018-03-07 12:41:47 UTC
Thanks, I have enabled debug logging. I used "journalctl -au autofs".

See the logs attached. Some remarks on when automount stopped working:
We use Nagios to check snmpd once per hour. snmpd hangs when entering an automounted directory, so when snmpd hangs, it means automount has already stopped working.

So, for example, Nagios found at 19:05 that snmpd wasn't working. That means snmpd stopped working between 18:05 and 19:05, and that automount stopped working before snmpd did.

I have attached logs for nymfe22, which is working correctly, so you can compare.

You can see that automount really isn't responding at 22:15 when the computers shut down:
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Killing process 29283 (automount) with signal SIGKILL.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: Stopped Automounts filesystems on demand.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Unit entered failed state.
Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Failed with result 'timeout'.
-- Reboot --

Comment 8 Ian Kent 2018-03-09 00:38:58 UTC
(In reply to matousejem from comment #7)
> Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> 
> See the logs attached. Some remarks when the automount stopped working:
> We use Nagios to check snmpd once per hour. Snmpd hangs when entering
> automounted directory. When snmpd hangs, it means automount already doesn't
> work.
> 
> So for example, Nagios found out at 19:05, that snmpd isn't working. It
> means snmpd stopped working between 18:05 and 19:05. It means automount
> stopped working before snmpd.

Maybe I need more information.

You can see that:
Mar 06 18:55:12 /home/xsladovn gets mounted
Mar 06 19:15:03 /home/xbures3 gets mounted

So mounts are happening around this time. 

> 
> I have attached logs for nymfe22, which is working correctly, so you could
> compare.
> 
> You can see that automount really isn't responding at 22:15 when the
> computers shut down:
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: State
> 'stop-sigterm' timed out. Killing.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Killing
> process 29283 (automount) with signal SIGKILL.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Main process
> exited, code=killed, status=9/KILL
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: Stopped Automounts
> filesystems on demand.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Unit entered
> failed state.
> Mar 06 22:18:02 nymfe01.fi.muni.cz systemd[1]: autofs.service: Failed with
> result 'timeout'.
> -- Reboot --

But it's clear that automount didn't shut down at reboot.

There aren't that many log messages in that area, but I'll try to
work out where automount is stopping.

Perhaps that will give me a lead to follow.

Ian

Comment 9 Ian Kent 2018-03-09 00:52:09 UTC
(In reply to Ian Kent from comment #8)
> (In reply to matousejem from comment #7)
> > Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> > 
> > See the logs attached. Some remarks when the automount stopped working:
> > We use Nagios to check snmpd once per hour. Snmpd hangs when entering
> > automounted directory. When snmpd hangs, it means automount already doesn't
> > work.
> > 
> > So for example, Nagios found out at 19:05, that snmpd isn't working. It
> > means snmpd stopped working between 18:05 and 19:05. It means automount
> > stopped working before snmpd.
> 
> Maybe I need more information.
> 
> You can see that:
> Mar 06 18:55:12 /home/xsladovn gets mounted
> Mar 06 19:15:03 /home/xbures3 gets mounted
> 
> So mounts are happening around this time. 
> 

But you did get alerts at this time.

And the log claims that both these mounts were successfully mounted.
Is it possible to get a copy of /proc/mounts at the time of the alert
to verify the mount did in fact succeed?
Can you work out what directory the monitoring system was attempting
to access?

Once something gets mounted autofs doesn't (shouldn't) come into
play in subsequent accesses to the mounted file system.
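
Something like this, captured at the time of an alert, would help
(the output file name is just an example):

    # snapshot the kernel's view of what is actually mounted
    cat /proc/mounts > /tmp/mounts-$(date +%s)

    # see which path the monitoring process is blocked on
    strace -f -p $(pidof snmpd)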

Comment 10 matousejem 2018-03-12 17:52:28 UTC
(In reply to Ian Kent from comment #9)
> (In reply to Ian Kent from comment #8)
> > (In reply to matousejem from comment #7)
> > > Thanks, I have enabled debug logging. I used "journalctl -au autofs".
> > > 
> > > See the logs attached. Some remarks when the automount stopped working:
> > > We use Nagios to check snmpd once per hour. Snmpd hangs when entering
> > > automounted directory. When snmpd hangs, it means automount already doesn't
> > > work.
> > > 
> > > So for example, Nagios found out at 19:05, that snmpd isn't working. It
> > > means snmpd stopped working between 18:05 and 19:05. It means automount
> > > stopped working before snmpd.
> > 
> > Maybe I need more information.
> > 
> > You can see that:
> > Mar 06 18:55:12 /home/xsladovn gets mounted
> > Mar 06 19:15:03 /home/xbures3 gets mounted
> > 
> > So mounts are happening around this time. 
> > 
> 
> But you did get alerts at this time.
> 
> And the log claims that both these mounts were successfully mounted.
> Is it possible to get a copy of /proc/mounts at the time of the alert
> to verify the mount did in fact succeed?
> Can you work out what directory the monitoring system was attempting
> to access?
> 
> Once something gets mounted autofs doesn't (shouldn't) come into
> play in subsequent accesses to the mounted file system.

No, it doesn't get mounted. There is a difference between the logs of successful and unsuccessful mounts.

Successful mount is this:
    Mar 06 15:01:40 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): calling mount -t nfs -s -o rw,soft,intr,actimeo=30,noatime,vers=4.1,sec=sys home.fi.muni.cz:/export/home/xkoprna /home/xkoprna
    Mar 06 15:01:40 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): mounted home.fi.muni.cz:/export/home/xkoprna on /home/xkoprna

Unsuccessful one:
    Mar 06 18:55:12 nymfe01.fi.muni.cz automount[29283]: mount_mount: mount(nfs): calling mount -t nfs -s -o rw,soft,intr,actimeo=30,noatime,vers=4.1,sec=sys home.fi.muni.cz:/export/home/xsladovn /home/xsladovn

In this case, the message isn't followed by something like "mounted home.fi.muni.cz:/export/home/[...]". The same goes for xbures3.
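
A quick way to spot these in the journal (the pattern is taken from
the log lines above):

    journalctl -u autofs | grep 'mount(nfs)'
    # a "calling mount -t nfs ..." line with no matching
    # "mounted ... on ..." line afterwards is a mount that never completed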

Comment 11 matousejem 2018-03-12 17:53:55 UTC
Created attachment 1407310 [details]
Nymfe12 logs

root@nymfe12:~# last
root [...]
xkopeck5 :0    Mon Mar 12 14:02 - 14:28  (00:26)

[12-03-2018 14:34:15] SERVICE ALERT: nymfe12;SNMP

At 14:48 I killed snmpd with 'kill -9'.

At 14:54 I tried to log in, i.e. mount /home/xejem. It hung; no success.

[12-03-2018 15:36:45] SERVICE ALERT: nymfe12;SNMP
An strace of snmpd hangs at 'statfs("/home/xkopeck5", '.

At 16:11 I did 'cat /proc/mounts > nymfe12mounts'.

Comment 12 matousejem 2018-03-12 17:55:59 UTC
Created attachment 1407311 [details]
nymfe12mounts

Comment 13 Ian Kent 2018-03-13 00:47:09 UTC
(In reply to matousejem from comment #11)
> Created attachment 1407310 [details]
> Nymfe12 logs
> 
> root@nymfe12:~# last
> root [...]
> xkopeck5 :0    Mon Mar 12 14:02 - 14:28  (00:26)
> 
> [12-03-2018 14:34:15] SERVICE ALERT: nymfe12;SNMP
> 
> At 14:48 I 'kill -9' snmpd.
> 
> At 14:54 I tried to login, i.e. mount /home/xejem. It hangs. No success.
> 
> [12-03-2018 15:36:45] SERVICE ALERT: nymfe12;SNMP
> Strace of snmpd hangs at 'statfs("/home/xkopeck5", '.
> 
> At 16:11 I did 'cat /proc/mounts > nymfe12mounts'.

Right, so you're saying that snmpd tries to access /home/xejem and
hangs (and the mount never completes), and further accesses to
/home automounts hang.

I don't understand why /home/xkopeck5 gets mounted, apparently
after automount is hung, and yet still blocks on access. That's
an NFS problem, not an automount problem.

I've seen something similar a couple of times fairly recently
but not on any Fedora versions ...

I'd like to see a list of running processes when this happens
as well.

Provide a core of the hung automount process (taken with gcore)
and a list of all installed packages, so I can try to duplicate
your install well enough to get backtraces of the automount
threads from the provided core.

Alternatively, to save time, you could try to get the backtraces
yourself.

If automount is deadlocked in some way you might not be able to
do that, and taking a core might be the only way to run gdb
against the blocked automount.

If you do try to run gdb on the running automount, or on a core
taken with gcore, please make sure you have installed any debug
packages that are needed. Most of these should be installed by
running "debuginfo-install autofs". If the needed debuginfo
packages aren't installed, the backtraces will be useless.

To get the backtraces use "thr a a bt" within gdb, assuming
you can get a prompt ....
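
For example (the core file path and the PID lookup are illustrative):

    # install the debug symbols first
    debuginfo-install autofs

    # take a core of the hung daemon; this writes /tmp/automount.core.<pid>
    gcore -o /tmp/automount.core $(pidof automount)

    # load it into gdb and dump every thread's backtrace
    # ("thr a a bt" is the shorthand for the last command)
    gdb /usr/sbin/automount /tmp/automount.core.<pid>
    (gdb) thread apply all bt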

Comment 14 matousejem 2018-04-10 12:26:30 UTC
Important update:
I confirm it is an NFS problem. The freezing happens even after downgrading to the previous version of autofs; subjectively less often, but it still happens.
We found a solution to our problem: after we switched the NFS version from 4.1 to 4.0, everything works fine.

Note 1: 'vers=4' probably selects the highest 4.x minor version available, so 4.0 has to be set explicitly.
Note 2: When a client machine freezes, the logs on the server fill with messages like this one (tens per second):
Mar 15 11:39:43 anxur.fi.muni.cz kernel: RPC request reserved 84 but used 344
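
For reference, the fix amounts to pinning the minor version in the map
entry's mount options (the map file and the wildcard entry below are
illustrative; the options are the ones from our logs):

    # /etc/auto.home: force NFS 4.0 instead of letting 4.x be negotiated
    *    -rw,soft,intr,actimeo=30,noatime,vers=4.0,sec=sys    home.fi.muni.cz:/export/home/&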

Comment 15 Fedora End Of Life 2018-05-03 08:08:41 UTC
This message is a reminder that Fedora 26 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 26. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '26'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 26 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 16 Fedora End Of Life 2018-05-29 11:51:27 UTC
Fedora 26 changed to end-of-life (EOL) status on 2018-05-29. Fedora 26
is no longer maintained, which means that it will not receive any
further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 17 RobbieTheK 2018-09-04 20:44:03 UTC
Our Fedora 28 server got spammed today with hundreds of "RPC request reserved 116 but used 268" log messages. We are on kernel 4.17.19-200.fc28.x86_64. Are there any other logs I can provide?

Comment 18 Ian Kent 2018-09-04 23:28:10 UTC
(In reply to RobbieTheK from comment #17)
> Our Fedora 28 server today got log-spammed with hundreds of "RPC request
> reserved 116 but used 268". We are on 4.17.19-200.fc28.x86_64. Are there any
> other logs I can provide?

As in comment #14, this sounds like a kernel NFS problem.
It's probably worth searching the existing NFS bug reports
to see if it's being worked on.

Comment 19 RobbieTheK 2018-09-05 15:59:01 UTC
The exact error is mentioned in a kernel.org thread at https://lore.kernel.org/lkml/20170427173943.GD29622@ZenIV.linux.org.uk/, but I'm not sure a fix was ever made. There are mentions on the Debian bug page at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898060 and on Ubuntu's Launchpad at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1785788, all sounding very similar (just with different reserved numbers).

The only kernel errors I'm seeing from the boot are unrelated:
Sep 04 15:59:47 ourdomain.edu kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: No handler for Region [SYSI] (000000000e7692ae) [IPMI] (20180313/evregion-132)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Region IPMI (ID=7) has no handler (20180313/exfldio-265)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._GHL, AE_NOT_EXIST (20180313/psparse-516)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._PMC, AE_NOT_EXIST (20180313/psparse-516)
Sep 04 15:59:53 ourdomain.edu kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20180313/power_meter-755)

