Description of problem:
A cron job experiences unexpected delays, and the kernel logs "nfs: server not responding" messages. By inserting "date" timestamp commands into the shell script, it appears that the delay occurs when the shell tries to find a command in the PATH, which contains several directories mounted via NFS over TCP from a Netapp fileserver. It pauses for 2 or 3 minutes while trying to resolve the command.

Version-Release number of selected component (if applicable):
kernel-4.0.4-202.fc21.x86_64

How reproducible:
I am not certain. This cron job runs every afternoon, and it is sometimes subject to delays. It may depend on server load and result from a botched retry timer mechanism. This same cron job runs on 8 other machines running a mix of Fedora 19 and Fedora 21 with the 3.19.7 kernel, and none of those machines experiences a similar delay.

I found the problem by editing the shell script to start with a call to "/usr/bin/date" followed by a call to "date". Today, there was a 3-minute delay between the results of "/usr/bin/date" and "/date". This happened in 2 different cron jobs (using the same script with different arguments).

In journalctl, I see these messages:

Jun 16 16:41:00 ti139 kernel: nfs: server netapp not responding, still trying
Jun 16 16:41:00 ti139 kernel: nfs: server netapp not responding, still trying
Jun 16 16:41:00 ti139 kernel: nfs: server netapp OK
Jun 16 16:41:00 ti139 kernel: nfs: server netapp OK

The timestamp matches the output of the 2nd (path-resolved) date command. The first call to /usr/bin/date returned "Tue Jun 16 16:38:00 EDT 2015", and the next line in the script which called "date" returned "Tue Jun 16 16:41:00 EDT 2015".
Here are the NFS mount options:

netapp:/home on /nfs/netapp/home type nfs (rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.53.251,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=192.168.53.251)

The server is a Netapp FAS250.

Steps to Reproduce:
1. Set up a cron job on 9 servers, each running the same job twice at 16:38 with a PATH containing directories mounted from a Netapp FAS250.
2. Notice that some jobs have inexplicable 2- or 3-minute delays.
3. Notice occasional "nfs: server not responding" messages in the logs.

Actual results:
Inexplicable delays.

Expected results:
Jobs should complete promptly.

Additional info:
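The absolute-path-then-bare-name diagnostic described above can be sketched as a small script. This is an illustrative sketch only (the 60-second threshold and the warning text are my own, not from the report); it assumes `date` lives in the local `/usr/bin` while earlier PATH entries sit on NFS:

```shell
# Sketch of the diagnostic used in the report: call date once by absolute
# path (no PATH search) and once by bare name (the shell walks every PATH
# directory, including any NFS-mounted ones, before finding /usr/bin).
# A large gap between the two timestamps implicates the PATH lookup,
# not the command itself.
t0=$(/usr/bin/date +%s)   # absolute path: no PATH search needed
t1=$(date +%s)            # bare name: resolved by searching PATH
elapsed=$((t1 - t0))
echo "PATH lookup took ${elapsed}s"
if [ "$elapsed" -gt 60 ]; then
    echo "WARNING: possible NFS stall during PATH resolution" >&2
fi
```

On a healthy system the two timestamps are effectively identical; during one of the stalls described above, the gap would show up directly in `elapsed`.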
Oops, please excuse the typo above. The script says:

/usr/bin/date
date

and the results were:

Tue Jun 16 16:38:00 EDT 2015
Tue Jun 16 16:41:00 EDT 2015

It looks like there were a bunch of NFS changes in 4.0. Something is not right.

Regards,
Andy
Today, the delay between the 2 date commands was 15 seconds, and no kernel errors were logged. None of the other hosts has any noticeable delay at all. Does anybody have any ideas on how to troubleshoot this?
You might start by collecting a wire capture of the event between your NFS client and the server. It sounds like the connection may sit idle for a while, and then the client has trouble doing RPC with the server. Are there any network devices between the client and server, like routers or firewalls?
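A capture along these lines could look like the following. This is a sketch under assumptions: the server name "netapp" is borrowed from the original report, the interface, output path, and rotation sizes are illustrative, and 2049 is the standard NFS port. The command string is built and printed here (a dry run); the printed command would need to be run as root on the client:

```shell
# Build a long-running ring-buffer capture of NFS traffic (dry run).
# -C 100 -W 10 rotates across 10 files of ~100 MB each so the capture
# does not fill the disk while waiting for an intermittent stall; the
# filter keeps only traffic to/from the NFS server on port 2049.
# SERVER, the interface, and the output path are assumptions.
SERVER=netapp
CMD="tcpdump -i any -s 0 -C 100 -W 10 -w /var/tmp/nfs-trace.pcap host $SERVER and port 2049"
echo "$CMD"
```

When the "not responding" message fires, the pcap files around that timestamp should show whether the client retransmitted, whether the TCP connection was torn down and re-established, and how long the server took to reply.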
+1 to the bug. We have the same issue running Fedora 21 with the 4.0.7-200.fc21.x86_64 kernel and NFS v3.

Mount options: -vers=3,hard,intr

A few recent messages from /var/log/messages:

Jul 19 08:05:41 walnut dbus[650]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 19 08:06:06 walnut dbus[650]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 19 08:06:07 walnut kernel: [245063.159083] nfs: server chopin OK
Jul 19 08:06:07 walnut kernel: [245063.159101] nfs: server chopin OK
Jul 19 08:06:07 walnut kernel: nfs: server chopin OK
Jul 19 08:06:07 walnut kernel: nfs: server chopin OK
Jul 19 08:06:07 walnut systemd: systemd-logind.service: main process exited, code=dumped, status=6/ABRT
Jul 19 08:06:07 walnut systemd: Unit systemd-logind.service entered failed state.
Jul 19 08:06:07 walnut systemd: systemd-logind.service failed.
Jul 19 08:06:07 walnut systemd-logind: Watching system buttons on /dev/input/event1 (Power Button)
Jul 19 08:06:07 walnut systemd-logind: Watching system buttons on /dev/input/event0 (Power Button)
Jul 19 08:06:07 walnut systemd-logind: New seat seat0.
Jul 19 08:06:07 walnut systemd-logind: New session c1 of user gdm.
Jul 19 08:06:07 walnut systemd-logind: New session c3 of user gdm.
Jul 19 08:20:26 walnut systemd-logind: Failed to start user slice: Connection timed out
Jul 19 08:20:51 walnut systemd-logind: Failed to start user service: Connection timed out
Jul 19 08:20:51 walnut systemd-logind: Assertion 's->user->slice' failed at ../src/login/logind-session.c:496, function session_start_scope(). Aborting.
Jul 19 08:20:51 walnut systemd-coredump: Process 15824 (systemd-logind) of user 0 dumped core.
Stack trace of thread 15824:
#0  0x00007f94428b78d7 raise (libc.so.6)
#1  0x00007f94428b953a abort (libc.so.6)
#2  0x00007f944390f6b2 log_assert_failed (systemd-logind)
#3  0x00007f944391d79f session_start (systemd-logind)
#4  0x00007f94438f34e5 method_create_session.lto_priv.335 (systemd-logind)
#5  0x00007f94438d73d2 object_find_and_run.lto_priv.172 (systemd-logind)
#6  0x00007f944390882a bus_process_internal.constprop.102 (systemd-logind)
#7  0x00007f9443911f91 io_callback (systemd-logind)
#8  0x00007f9443927ca0 source_dispatch.lto_priv.228 (systemd-logind)
#9  0x00007f94438f1d9e sd_event_dispatch (systemd-logind)
#10 0x00007f94438dad14 main (systemd-logind)
#11 0x00007f94428a2fe0 __libc_start_main (libc.so.6)
#12 0x00007f94438db6cc _start (systemd-logind)

Jul 19 08:23:01 walnut kernel: [246076.844492] nfs: server chopin not responding, still trying
Jul 19 08:23:01 walnut kernel: [246076.844511] nfs: server chopin not responding, still trying
Jul 19 08:23:01 walnut kernel: nfs: server chopin not responding, still trying
Jul 19 08:23:01 walnut kernel: nfs: server chopin not responding, still trying
Jul 19 08:28:08 walnut kernel: [246383.645927] nfs: server chopin OK
Jul 19 08:28:08 walnut kernel: [246383.645942] nfs: server chopin OK
Jul 19 08:28:08 walnut kernel: nfs: server chopin OK
Jul 19 08:28:08 walnut kernel: nfs: server chopin OK

We have switched back to kernel 3.19.5-200.fc21.x86_64, and no such problem appears. We don't have any routers or firewalls between the machines.

Thanks.
The most recent kernel, 4.1.3-100.fc21.x86_64, has the same problem. /etc/fstab has the following NFS v3 mount entry:

129.82.45.182:/fedora64-local /usr/local nfs nfsvers=3,hard,intr 0 0

Commenting out this entry solves the issue (no lag in the system, sshing works, etc). Having this mount in /etc/fstab causes the following messages in dmesg:

[209837.916317] nfs: server 129.82.45.182 OK
[216290.444952] nfs: server 129.82.45.182 not responding, still trying
[216290.444972] nfs: server 129.82.45.182 not responding, still trying
[217036.392527] nfs: server 129.82.45.182 OK
[217036.392545] nfs: server 129.82.45.182 OK
[224089.921945] nfs: server 129.82.45.182 not responding, still trying
[224089.921965] nfs: server 129.82.45.182 not responding, still trying
[224237.306955] nfs: server 129.82.45.182 OK
[224237.307212] nfs: server 129.82.45.182 OK
[227689.784180] nfs: server 129.82.45.182 not responding, still trying
[227689.784199] nfs: server 129.82.45.182 not responding, still trying
[227837.169430] nfs: server 129.82.45.182 OK
[227837.169449] nfs: server 129.82.45.182 OK
[231289.670881] nfs: server 129.82.45.182 not responding, still trying
[231289.670900] nfs: server 129.82.45.182 not responding, still trying
[231437.056151] nfs: server 129.82.45.182 OK
[231437.056168] nfs: server 129.82.45.182 OK
[234889.517507] nfs: server 129.82.45.182 not responding, still trying
[234889.517526] nfs: server 129.82.45.182 not responding, still trying

Trying to ssh as root to a machine with 4.1.3-100.fc21.x86_64 shows the following errors in /var/log/messages:

Aug 10 09:41:34 walnut systemd-logind: New session c2 of user gdm.
Aug 10 09:41:34 walnut systemd: Received SIGRTMIN+24 from PID 2262 (kill).
Aug 10 09:41:34 walnut systemd-coredump: Process 1007 (systemd-logind) of user 0 dumped core.
Stack trace of thread 1007:
#0  0x00007f8e043dd8d7 raise (libc.so.6)
#1  0x00007f8e043df53a abort (libc.so.6)
#2  0x0000559d23b4a6b2 log_assert_failed (systemd-logind)
#3  0x0000559d23b5879f session_start (systemd-logind)
#4  0x0000559d23b2e4e5 method_create_session.lto_priv.335 (systemd-logind)
#5  0x0000559d23b123d2 object_find_and_run.lto_priv.172 (systemd-logind)
#6  0x0000559d23b4382a bus_process_internal.constprop.102 (systemd-logind)
#7  0x0000559d23b4cf91 io_callback (systemd-logind)
#8  0x0000559d23b62ca0 source_dispatch.lto_priv.228 (systemd-logind)
#9  0x0000559d23b2cd9e sd_event_dispatch (systemd-logind)
#10 0x0000559d23b15d14 main (systemd-logind)
#11 0x00007f8e043c8fe0 __libc_start_main (libc.so.6)
#12 0x0000559d23b166cc _start (systemd-logind)

Thank you.
Thanks, Mikhail, for providing this info. Just above, you said:

"Commenting out this option would solve the issue (no lag in system, sshing works, etc)."

By that, do you mean that removing the non-standard NFS mount options fixes the problems? In effect, are you saying that the problems go away when using NFS version 4?

If you could show the corresponding entries in /proc/mounts for the non-working and working cases, that would be really helpful.

Thanks,
Andy
Andy,

(In reply to Andrew J. Schorr from comment #6)
> Thanks Mikhail for providing this info. Just above, you said:
>
> "Commenting out this option would solve the issue (no lag in system, sshing
> works, etc)."
>
> By that, do you mean that removing the non-standard NFS mount options fixes
> the problems?

Yes, commenting out / removing "129.82.45.182:/fedora64-local /usr/local nfs nfsvers=3,hard,intr 0 0" in /etc/fstab seems to help avoid lag in the system. However, I must add that our client Fedora 21 systems use two mount types: one record (a "hard" mount) from /etc/fstab and multiple autofs records from /etc/auto.master. I spent some time playing around with both fstab and autofs and found that the fstab mount causes the problem (lag).

> In effect, are you saying that the problems go away when
> using NFS version 4?

I cannot confirm NFS v4 behavior because we use NFS v3 (clients are Fedora 21 and the server is Red Hat 6.7) in our systems.

> If you could show the corresponding entries in
> /proc/mounts for the non-working and working cases, that would be really
> helpful.

Sure.
Here is /proc/mounts with the fstab and autofs mounts enabled:

129.82.45.182:/fedora64-local /usr/local nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=129.82.45.182,mountvers=3,mountport=53726,mountproto=udp,local_lock=none,addr=129.82.45.182 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=6,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,relatime,fd=12,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.bach /s/bach autofs rw,relatime,fd=18,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.chopin /s/chopin autofs rw,relatime,fd=24,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.fir /s/fir autofs rw,relatime,fd=30,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.jawar /s/jawar autofs rw,relatime,fd=36,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.parsons /s/parsons autofs rw,relatime,fd=42,pgrp=961,timeout=300,minproto=5,maxproto=5,indirect 0 0

Here is /proc/mounts with only the autofs mounts enabled:

/etc/auto.misc /misc autofs rw,relatime,fd=6,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,relatime,fd=12,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.bach /s/bach autofs rw,relatime,fd=18,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.chopin /s/chopin autofs rw,relatime,fd=24,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.fir /s/fir autofs rw,relatime,fd=30,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.jawar /s/jawar autofs rw,relatime,fd=36,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0
/etc/auto.indirect.parsons /s/parsons autofs rw,relatime,fd=42,pgrp=950,timeout=300,minproto=5,maxproto=5,indirect 0 0

Thanks.
OK, thanks, now I understand. I originally thought you were saying that the problem is fixed by mounting the filesystem with different options. But in fact, you seem to be saying that you removed the mount entirely. That's not an option for us, since we need to mount the filesystem.

Regards,
Andy
Another update on this bug.

Fedora 21 NFS client: kernel 4.1.5-100.fc21.x86_64, nfs-utils-1.3.1-6.4.fc21.x86_64
Red Hat 6.7 NFS server: kernel 2.6.32-573.3.1.el6.x86_64, nfs-utils-1.2.3-64.el6.x86_64

The problem is still there. However, we found a "workaround" (in quotes) that prevents the Fedora 21 client freezes and NFS timeouts: use the "udp" transport protocol instead of the default tcp in the NFS mount options. For example, "nfsvers=3,hard,intr,bg,udp" does not cause the hanging.

Also, we tested a Fedora 21 NFS server <-> Fedora 21 NFS client pair with the same kernel (4.1.5) and the same nfs-utils over tcp, with "nfsvers=3,hard,intr,bg" mount options, and we did not find any hanging/issues.

Thanks.
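For reference, the UDP workaround described above could be expressed as an /etc/fstab entry like this. This is a sketch, not a recommendation: the server address and mount point are borrowed from the earlier comments and are illustrative only, and UDP transport has its own trade-offs (no congestion control, risk of corruption from fragment reassembly on lossy networks).

```
# NFSv3 over UDP instead of the default TCP transport (the "workaround"
# described in this comment). Server address and mount point are taken
# from comment 3 and are illustrative only.
129.82.45.182:/fedora64-local  /usr/local  nfs  nfsvers=3,hard,intr,bg,udp  0  0
```

That the stall disappears over UDP but persists over TCP is itself a useful data point: it suggests the problem sits in the client's TCP transport handling (e.g. reconnecting an idle connection) rather than in the NFS protocol layer.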
"server not responding" just means what it says, and could have any number of causes, so I wouldn't assume that Mikhail Strizhov's problem necessarily has the same underlying cause as the original reporter's problem.

> We have the same issue running Fedora 21 with 4.0.7-200.fc21.x86_64 kernel

Is this a recent regression for you? (Was there a previous kernel that worked OK for you?)
FYI, I am continuing to try to catch a tcpdump of this in action. When I initially reported it, it was happening pretty regularly. Since I added the netsniff-ng job, it hasn't happened. I hope to catch it one of these days.
(In reply to J. Bruce Fields from comment #10)
> "server not responding" just means what it says, and could have any number
> of causes, so I wouldn't necessarily assume that Mikhail Strizhov's problem
> necessarily has the same underlying cause as the original reporter's problem.
>
> > We have the same issue running Fedora 21 with 4.0.7-200.fc21.x86_64 kernel
>
> Is this a recent regression for you? (Was there a previous kernel that
> worked OK for you?)

The problem started with kernel 4.0.4 (mid June) and persists through 4.1.5. The last working kernel is kernel-3.19.5-200.fc21, which we are still running on 350 Fedora 21 client machines.
Up. Switching 300+ F21 machines (with the latest updates) from tcp to udp in the NFS options seems to have "solved" the problem: no hanging issues in the last 3 weeks of uptime. We have left 1 F21 machine with the tcp option on, and it continues to show errors in /var/log/messages and periodically hangs.

One more note: we have installed an F22 machine (with the latest updates) with a tcp-enabled NFS client. No errors/hanging found; it has been up for a few weeks already.

Thanks.
This message is a reminder that Fedora 21 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 21. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '21'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 21 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug.

Thank you for reporting this bug, and we are sorry it could not be fixed.