Bug 1042778

Summary: processes accessing CIFS mounts hang in cifs_get_smb_ses
Product: Fedora
Component: kernel
Version: 19
Status: CLOSED EOL
Reporter: Martin Wilck <martin.wilck>
Assignee: Sachin Prabhu <sprabhu>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: chorn, francis.turgeon.boutin, gansalmon, info, itamar, jonathan, kernel-maint, madhu.chinakonda, simonandric5, sprabhu, steved
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-02-17 19:37:52 UTC
Attachments: dmesg

Description Martin Wilck 2013-12-13 10:46:44 UTC
Created attachment 836274 [details]
dmesg

Description of problem:

$ ps -Lww -C mount.cifs -o pid,stat,nwchan,wchan,comm,args
  PID STAT  WCHAN WCHAN  COMMAND         COMMAND
19236 D    5c9470 cifs_g mount.cifs      /sbin/mount.cifs //pdbisdocs.pdb.fsc.net/Projects /cifs/pdbisdocs.pdb.fsc.net/Projects -n -o rw,multiuser,cruid=1004,sec=krb5i

$ cat /proc/19236/stack

[<ffffffffa05c9470>] cifs_get_smb_ses+0x200/0x4c0 [cifs]
[<ffffffffa05c9cdd>] cifs_mount+0x5ad/0xad0 [cifs]
[<ffffffffa05b8f50>] cifs_do_mount+0xa0/0x4d0 [cifs]
[<ffffffff811ac239>] mount_fs+0x39/0x1b0
[<ffffffff811c51df>] vfs_kern_mount+0x5f/0xf0
[<ffffffff811c75ae>] do_mount+0x23e/0xa20
[<ffffffff811c7e13>] SyS_mount+0x83/0xc0
[<ffffffff81656859>] system_call_fastpath+0x16/0x1b

The bash session that still had its CWD on a CIFS mount hangs with a similar stack:

[<ffffffffa05dac23>] wait_for_response.isra.6+0xc3/0x120 [cifs]
[<ffffffffa05dbb9f>] SendReceive+0x12f/0x300 [cifs]
[<ffffffffa05ba3b6>] CIFSSMBNegotiate+0x116/0x6a0 [cifs]
[<ffffffffa05e2dd7>] cifs_negotiate+0x17/0x60 [cifs]
[<ffffffffa05c901f>] cifs_negotiate_protocol+0x7f/0xe0 [cifs]
[<ffffffffa05c947b>] cifs_get_smb_ses+0x20b/0x4c0 [cifs]
[<ffffffffa05ca6c9>] cifs_sb_tlink+0x4a9/0x850 [cifs]
[<ffffffffa05d4f91>] cifs_get_inode_info+0x51/0x7b0 [cifs]
[<ffffffffa05d6f3f>] cifs_revalidate_dentry_attr+0xef/0x1b0 [cifs]
[<ffffffffa05d70d7>] cifs_getattr+0x57/0x130 [cifs]
[<ffffffff811acf36>] vfs_getattr+0x46/0x80
[<ffffffff811ad032>] vfs_fstatat+0x62/0xa0
[<ffffffff811ad21f>] SYSC_newstat+0x1f/0x40
[<ffffffff811ad66e>] SyS_newstat+0xe/0x10
[<ffffffff81656859>] system_call_fastpath+0x16/0x1b
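Stacks like the two above can be gathered for every uninterruptible task in one pass. A minimal sketch (assumes a Linux system exposing /proc/<pid>/stack, which generally requires root):

```shell
#!/bin/sh
# Dump the kernel stack of every task currently in uninterruptible sleep ("D").
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    comm=$(cat /proc/"$pid"/comm 2>/dev/null)
    echo "=== PID $pid ($comm) ==="
    cat /proc/"$pid"/stack 2>/dev/null || echo "  (unreadable; run as root)"
done
```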


Version-Release number of selected component (if applicable):
kernel-3.11.9-200.fc19.x86_64

How reproducible:
Happens sometimes when CIFS file systems have been idle for a while; I haven't done exhaustive reproducibility testing.
In the case at hand I had a bash session open on one CIFS FS, so autofs didn't time out.

Steps to Reproduce:
1. configure CIFS filesystems using autofs
2. cd to a CIFS file system in a terminal session
3. wait a while

Actual results:
Accesses to the CIFS mounts hang indefinitely (processes stuck in D state).

Expected results:


Additional info:
While I was typing up this bug, the CIFS accesses apparently timed out after an estimated 10-15 minutes. Now everything is fine again.

However, I verified that while the CIFS session was stalled, the server was alive and well: I could easily access it using smbclient from the same host.

During the error situation, I had set cifsFYI to 15 and saw the following output:


[39990.240053] fs/cifs/connect.c: Received no data or error: expecting 4 got -110
[39990.240057] fs/cifs/connect.c: Reconnecting tcp session
[39990.240060] fs/cifs/connect.c: cifs_reconnect: marking sessions and tcons for reconnect
[39990.240063] fs/cifs/connect.c: cifs_reconnect: tearing down socket
[39990.240065] fs/cifs/connect.c: State: 0x3 Flags: 0x0
[39990.240068] fs/cifs/connect.c: Post shutdown state: 0x3 Flags: 0x0
[39990.240083] fs/cifs/connect.c: cifs_reconnect: moving mids to private list
[39990.240086] fs/cifs/connect.c: cifs_reconnect: issuing mid callbacks
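For reference, output like the above comes from the cifsFYI debug switch. A sketch of toggling it (needs root and the cifs module loaded; the value is a bitmask whose exact semantics vary by kernel version — 15 is what was used here):

```shell
# Enable verbose CIFS debug logging.
echo 15 > /proc/fs/cifs/cifsFYI
# Follow the resulting kernel messages:
dmesg -w | grep 'fs/cifs'
# Disable again when done:
echo 0 > /proc/fs/cifs/cifsFYI
```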

Full dmesg is attached.

Comment 1 Jeff Layton 2013-12-13 14:26:57 UTC
(In reply to Martin Wilck from comment #0)
> Created attachment 836274 [details]
> dmesg
> [...]
> [39990.240053] fs/cifs/connect.c: Received no data or error: expecting 4 got -110

It certainly seems like the server wasn't responding:

[jlayton@tlielax ~]$ grep 110 /usr/include/asm-generic/errno*
/usr/include/asm-generic/errno.h:#define	ETIMEDOUT	110	/* Connection timed out */
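The same lookup works without grepping headers, e.g. via Python's errno module (on Linux):

```shell
python3 -c 'import errno, os; print(errno.errorcode[110], "-", os.strerror(110))'
# ETIMEDOUT - Connection timed out
```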

> [39990.240057] fs/cifs/connect.c: Reconnecting tcp session
> [39990.240060] fs/cifs/connect.c: cifs_reconnect: marking sessions and tcons
> for reconnect
> [39990.240063] fs/cifs/connect.c: cifs_reconnect: tearing down socket
> [39990.240065] fs/cifs/connect.c: State: 0x3 Flags: 0x0
> [39990.240068] fs/cifs/connect.c: Post shutdown state: 0x3 Flags: 0x0
> [39990.240083] fs/cifs/connect.c: cifs_reconnect: moving mids to private list
> [39990.240086] fs/cifs/connect.c: cifs_reconnect: issuing mid callbacks
> 

A network capture might be interesting...

Comment 2 Martin Wilck 2013-12-13 14:36:28 UTC
> It certainly seems like the server wasn't responding:

True. The question is why it didn't. As I noted, a new connection could be set up easily with that server. My conjecture is that the kernel side was trying to refresh (?) an existing connection that had long since been dropped on the server, and therefore got no response.

The next question, if I'm right, is why the server dropped the connection while the client did not. Would it be possible to use a shorter timeout value at this point?

Both Windows and NetApp CIFS servers were stalled in the problem situation, but maybe that one stalled connection was just blocking everything else in the kernel's CIFS layer (you can judge better than I can whether that's possible). The traces above suggest that the bash process was stuck and mount.cifs was merely waiting for it to finish; the bash process was on a share on the NetApp server.

> A network capture might be interesting...

Sure. Unfortunately I have no deterministic procedure to reproduce this. It just seems to happen once in a while.

Comment 3 Christian Horn 2013-12-13 14:39:47 UTC
(In reply to Martin Wilck from comment #2)
> Sure. Unfortunately I have no deterministic procedure to reproduce this. It
> just seems to happen once in a while.

One could run a variant of
  $ tcpdump -C 1 -i em1 -nv -w tcpdump.bin -W 3
which only keeps the most recent traffic and deletes older dumps. Together with a port or IP filter it should not slow down the system too much.
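A filtered variant along these lines would confine the ring buffer to SMB/CIFS traffic (em1 and <server-ip> are placeholders for the actual interface and file-server address):

```shell
# Rotating ring buffer: 3 files of 10 MB each, SMB/CIFS ports only.
tcpdump -i em1 -n -w cifs-ring.pcap -C 10 -W 3 \
    'host <server-ip> and (port 445 or port 139)'
```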

Comment 4 infove 2013-12-14 21:27:42 UTC
Having the same issue; I can reproduce it 100%.
Not only shares mounted with mount.cifs are affected, but also KDE KIO.
Strangely, I simultaneously lost the ability to send emails to the SMTP server hosted on that same Windows box.

Dec 14 16:13:55 localhost.localdomain kernel: CIFS VFS: Server 10.10.10.2 has not responded in 120 seconds. Reconnecting...
Dec 14 16:16:01 localhost.localdomain kernel: CIFS VFS: Server 10.10.10.2 has not responded in 120 seconds. Reconnecting...
Dec 14 16:18:07 localhost.localdomain kernel: CIFS VFS: Server 10.10.10.2 has not responded in 120 seconds. Reconnecting...

Suspecting it was caused by one of the below updates:

http://paste.fedoraproject.org/61856/87054993/

Comment 5 infove 2013-12-14 22:11:51 UTC
Interestingly, I can copy some files over to CIFS, but not others.
I can also send some emails, but not all: very short plain-text emails send fine, but as soon as I try to reply, or send HTML emails with links and different fonts, they time out.
From the current 3.11.10 I booted into the 3.11.9 and 3.8.5 kernels, and the issue is still there.

Comment 6 infove 2013-12-14 22:38:27 UTC
I was wrong about being able to copy some files: only zero-size files copied. Essentially they were created on the destination, but no actual data transfer occurred. Any non-empty file freezes the copying program, no matter which one it is. There also seems to be a difference in the stalled program's behaviour depending on the CIFS mount options: when only a username and password are provided, I don't observe high CPU usage in iowait; with a full set of options, one CPU core sits at 100% iowait.

Comment 7 infove 2013-12-15 01:08:03 UTC
Using a Windows machine, I can pull a file from the F19 Samba share.
Copying files in a Windows VM running on the F19 machine using shared folders works too.

Comment 8 infove 2013-12-15 15:25:14 UTC
Is there any other info you are still waiting for?

Comment 9 Justin M. Forbes 2014-01-03 22:05:45 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 10 Fedora End Of Life 2015-01-09 20:51:01 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not
able to fix it before Fedora 19 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Martin Wilck 2015-01-12 11:08:38 UTC
I haven't seen this any more since upgrading to F20.

Comment 12 Fedora End Of Life 2015-02-17 19:37:52 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.