Bug 717735 - NFS mount preventing suspend
Summary: NFS mount preventing suspend
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jeff Layton
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 710539 712088 759703
Depends On:
Blocks:
Reported: 2011-06-29 17:51 UTC by Adam Williamson
Modified: 2016-05-10 11:31 UTC

Fixed In Version: kernel-3.1.5-2.fc16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-12-14 23:39:48 UTC
Type: ---
Embargoed:


Attachments
patch -- allow cifs and nfs TASK_KILLABLE sleeps to freeze (5.51 KB, patch)
2011-09-15 18:52 UTC, Jeff Layton
patch -- allow cifs and nfs TASK_KILLABLE sleeps to freeze (5.84 KB, patch)
2011-09-23 15:08 UTC, Jeff Layton
patch -- don't have freezer count processes sleeping in NFS/RPC code (6.28 KB, patch)
2011-11-04 13:03 UTC, Jeff Layton
patch -- nfs/sunrpc: make TASK_KILLABLE sleeps attempt to freeze (4.58 KB, patch)
2011-11-06 00:21 UTC, Jeff Layton
patch -- cifs/nfs/sunrpc client freezer patch against v3.1.1 (5.50 KB, patch)
2011-11-15 21:59 UTC, Jeff Layton
patch -- nfs/sunrpc client freezer patch against v3.2-rc3 or so (5.22 KB, patch)
2011-11-29 12:40 UTC, Jeff Layton
patch -- attempt to freeze while looping on a receive attempt (998 bytes, patch)
2011-12-01 19:30 UTC, Jeff Layton
patch -- cifs/nfs/sunrpc client freezer patch against v3.1.4 (15.10 KB, patch)
2011-12-08 12:44 UTC, Jeff Layton
patch -- cifs/nfs/sunrpc client freezer patch against v3.2-rc4 (or so) (7.78 KB, patch)
2011-12-08 12:48 UTC, Jeff Layton

Description Adam Williamson 2011-06-29 17:51:54 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=712088, which is a very similar bug with CIFS.

I currently have a share on my NAS mounted via NFS:

nas:/mnt/HD_a2/ on /share/data type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.13,mountvers=3,mountport=2049,mountproto=udp,local_lock=none,addr=192.168.1.13)

Trying to suspend this system last night failed, twice, with this trace:

Jun 29 00:31:36 adam kernel: [14989.522021] Freezing user space processes ... 
Jun 29 00:31:36 adam kernel: [15009.506600] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Jun 29 00:31:36 adam kernel: [15009.506695] umount          D ffff88041a0a03d0  4992 24634  24613 0x00800084
Jun 29 00:31:36 adam kernel: [15009.506700]  ffff8803830e5b78 0000000000000046 ffff8803830e5c28 0000000000000296
Jun 29 00:31:36 adam kernel: [15009.506704]  ffff88041a0a0000 ffff8803830e5fd8 ffff8803830e5fd8 00000000001d2d00
Jun 29 00:31:36 adam kernel: [15009.506708]  ffff880440788000 ffff88041a0a0000 ffff88045f6e3ea8 0000000000000082
Jun 29 00:31:36 adam kernel: [15009.506712] Call Trace:
Jun 29 00:31:36 adam kernel: [15009.506724]  [<ffffffffa03e5d46>] ? rpc_queue_empty+0x31/0x31 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506730]  [<ffffffffa03e5d7a>] rpc_wait_bit_killable+0x34/0x38 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506734]  [<ffffffff814f1873>] __wait_on_bit+0x48/0x7b
Jun 29 00:31:36 adam kernel: [15009.506737]  [<ffffffff814f1918>] out_of_line_wait_on_bit+0x72/0x7d
Jun 29 00:31:36 adam kernel: [15009.506743]  [<ffffffffa03e6a30>] ? __rpc_execute+0xb2/0x257 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506748]  [<ffffffffa03e5d46>] ? rpc_queue_empty+0x31/0x31 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506752]  [<ffffffff81074d89>] ? autoremove_wake_function+0x3d/0x3d
Jun 29 00:31:36 adam kernel: [15009.506758]  [<ffffffffa03e6a70>] __rpc_execute+0xf2/0x257 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506761]  [<ffffffff81074ab4>] ? wake_up_bit+0x25/0x2a
Jun 29 00:31:36 adam kernel: [15009.506766]  [<ffffffffa03e6c42>] rpc_execute+0x3f/0x43 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506770]  [<ffffffffa03dfef7>] rpc_run_task+0x86/0x8e [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506775]  [<ffffffffa03dffec>] rpc_call_sync+0x45/0x66 [sunrpc]
Jun 29 00:31:36 adam kernel: [15009.506785]  [<ffffffffa045c982>] nfs3_rpc_wrapper.constprop.7+0x2c/0x64 [nfs]
Jun 29 00:31:36 adam kernel: [15009.506793]  [<ffffffffa045da54>] nfs3_proc_getattr+0x5d/0x83 [nfs]
Jun 29 00:31:36 adam kernel: [15009.506799]  [<ffffffffa044f24b>] __nfs_revalidate_inode+0xb4/0x1a2 [nfs]
Jun 29 00:31:36 adam kernel: [15009.506805]  [<ffffffffa044f492>] nfs_revalidate_inode+0x4a/0x51 [nfs]
Jun 29 00:31:36 adam kernel: [15009.506811]  [<ffffffffa044f571>] nfs_getattr+0x92/0xc3 [nfs]
Jun 29 00:31:36 adam kernel: [15009.506815]  [<ffffffff8113b7d9>] vfs_getattr+0x45/0x63
Jun 29 00:31:36 adam kernel: [15009.506817]  [<ffffffff8113b84f>] vfs_fstatat+0x58/0x6e
Jun 29 00:31:36 adam kernel: [15009.506820]  [<ffffffff8113b8a0>] vfs_stat+0x1b/0x1d
Jun 29 00:31:36 adam kernel: [15009.506823]  [<ffffffff8113b99f>] sys_newstat+0x1a/0x33
Jun 29 00:31:36 adam kernel: [15009.506825]  [<ffffffff81140b21>] ? path_put+0x1f/0x23
Jun 29 00:31:36 adam kernel: [15009.506829]  [<ffffffff810ac3c9>] ? audit_syscall_entry+0x11c/0x148
Jun 29 00:31:36 adam kernel: [15009.506833]  [<ffffffff8125cbae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
Jun 29 00:31:36 adam kernel: [15009.506836]  [<ffffffff814fa182>] system_call_fastpath+0x16/0x1b
Jun 29 00:31:36 adam kernel: [15009.506839] 
Jun 29 00:31:36 adam kernel: [15009.506840] Restarting tasks ... done.

Suspend bugs are a bit more important now we're in the Glorious Era Of GNOME 3...
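
The freeze-failure report above has a regular shape, so the offending task can be picked out of a log mechanically. The following is a minimal sketch of that, not part of any kernel tooling: the regular expressions and the stuck_tasks helper are illustrative names, and the pattern is assumed from the log lines quoted in this bug only.

```python
import re

# Start of a freezer failure report in the kernel log.
FAILED = re.compile(r"Freezing of tasks failed after [\d.]+ seconds")

# Task header line printed for each task that refused to freeze, e.g.
# "umount          D ffff88041a0a03d0  4992 24634  24613 0x00800084".
# Column layout is an assumption based on the traces in this report.
TASK_LINE = re.compile(
    r"\]\s+(?P<comm>\S+)\s+(?P<state>[A-Z])\s+[0-9a-f]{16}"
    r"\s+\d+\s+(?P<pid>\d+)\s+\d+\s+0x[0-9a-f]+\s*$"
)


def stuck_tasks(log_text):
    """Return (comm, state, pid) for each task listed in a freeze failure."""
    found = []
    in_report = False
    for line in log_text.splitlines():
        if FAILED.search(line):
            in_report = True
            continue
        if "Restarting tasks" in line:
            in_report = False
            continue
        if in_report:
            match = TASK_LINE.search(line)
            if match:
                found.append(
                    (match["comm"], match["state"], int(match["pid"]))
                )
    return found
```

Feeding it the trace above should report a single umount task in state D, which is the uninterruptible-looking sleep inside rpc_wait_bit_killable.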

Comment 1 Jeff Guerdat 2011-07-11 23:56:30 UTC
Same problem here.  I also tried automounting the NAS: the first suspend attempt failed but left the share unmounted, so a second suspend attempt succeeded.

Comment 2 Jeff Guerdat 2011-07-11 23:57:34 UTC
OOPS!  Just noticed this was for rawhide - I'm on F15.

Comment 3 Adam Williamson 2011-07-12 00:31:51 UTC
probably still the same bug.

Comment 4 Jan Willies 2011-08-01 07:16:44 UTC
same here on F15:

Aug  1 08:58:55 jan kernel: [ 1525.755153] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Aug  1 08:58:55 jan kernel: [ 1525.755275] umount          D ffff8800bf683b50     0  2176   2156 0x00800084
Aug  1 08:58:55 jan kernel: [ 1525.755283]  ffff880131035b68 0000000000000082 0000000000000246 ffff88010f454590
Aug  1 08:58:55 jan kernel: [ 1525.755290]  ffff880131035fd8 ffff880131035fd8 0000000000013840 0000000000013840
Aug  1 08:58:55 jan kernel: [ 1525.755297]  ffff88012fbcc590 ffff88010f454590 ffff880131035b38 ffffffff81478454
Aug  1 08:58:55 jan kernel: [ 1525.755304] Call Trace:
Aug  1 08:58:55 jan kernel: [ 1525.755317]  [<ffffffff81478454>] ? _raw_spin_unlock_irqrestore+0x17/0x19
Aug  1 08:58:55 jan kernel: [ 1525.755339]  [<ffffffffa031107b>] ? rpc_wait_bit_killable+0x0/0x38 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755355]  [<ffffffffa03110af>] rpc_wait_bit_killable+0x34/0x38 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755361]  [<ffffffff814773cd>] __wait_on_bit+0x48/0x7b
Aug  1 08:58:55 jan kernel: [ 1525.755367]  [<ffffffff8106acc7>] ? queue_work_on+0x37/0x45
Aug  1 08:58:55 jan kernel: [ 1525.755373]  [<ffffffff81477472>] out_of_line_wait_on_bit+0x72/0x7d
Aug  1 08:58:55 jan kernel: [ 1525.755388]  [<ffffffffa031107b>] ? rpc_wait_bit_killable+0x0/0x38 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755393]  [<ffffffff8106f2ab>] ? wake_bit_function+0x0/0x31
Aug  1 08:58:55 jan kernel: [ 1525.755409]  [<ffffffffa0311c80>] __rpc_execute+0xf2/0x295 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755414]  [<ffffffff8106f021>] ? wake_up_bit+0x25/0x2a
Aug  1 08:58:55 jan kernel: [ 1525.755429]  [<ffffffffa0311e90>] rpc_execute+0x3f/0x43 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755442]  [<ffffffffa030bdde>] rpc_run_task+0xeb/0xf7 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755454]  [<ffffffffa030bed7>] rpc_call_sync+0x45/0x66 [sunrpc]
Aug  1 08:58:55 jan kernel: [ 1525.755479]  [<ffffffffa0483b0a>] nfs3_rpc_wrapper.constprop.7+0x2c/0x64 [nfs]
Aug  1 08:58:55 jan kernel: [ 1525.755501]  [<ffffffffa0484bdf>] nfs3_proc_getattr+0x5d/0x83 [nfs]
Aug  1 08:58:55 jan kernel: [ 1525.755518]  [<ffffffffa0476c14>] __nfs_revalidate_inode+0xb4/0x1a2 [nfs]
Aug  1 08:58:55 jan kernel: [ 1525.755534]  [<ffffffffa0476e52>] nfs_revalidate_inode+0x4a/0x51 [nfs]
Aug  1 08:58:55 jan kernel: [ 1525.755550]  [<ffffffffa0476f31>] nfs_getattr+0x92/0xc4 [nfs]
Aug  1 08:58:55 jan kernel: [ 1525.755557]  [<ffffffff81124fb7>] vfs_getattr+0x45/0x63
Aug  1 08:58:55 jan kernel: [ 1525.755562]  [<ffffffff8104708d>] ? pick_next_task_fair+0xae/0xc1
Aug  1 08:58:55 jan kernel: [ 1525.755567]  [<ffffffff81125022>] vfs_fstatat+0x4d/0x63
Aug  1 08:58:55 jan kernel: [ 1525.755572]  [<ffffffff81047bda>] ? pick_next_task+0x2a/0x4e
Aug  1 08:58:55 jan kernel: [ 1525.755576]  [<ffffffff81125073>] vfs_stat+0x1b/0x1d
Aug  1 08:58:55 jan kernel: [ 1525.755581]  [<ffffffff81125172>] sys_newstat+0x1a/0x33
Aug  1 08:58:55 jan kernel: [ 1525.755586]  [<ffffffff81129e2d>] ? path_put+0x1f/0x23
Aug  1 08:58:55 jan kernel: [ 1525.755592]  [<ffffffff8109fa68>] ? audit_syscall_entry+0x145/0x171
Aug  1 08:58:55 jan kernel: [ 1525.755598]  [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
Aug  1 08:58:55 jan kernel: [ 1525.755602] 
Aug  1 08:58:55 jan kernel: [ 1525.755604] Restarting tasks ... done.
Aug  1 08:58:55 jan kernel: [ 1525.859058] video LNXVIDEO:00: Restoring backlight state

Comment 5 Jan Willies 2011-08-14 07:34:04 UTC
also confirmed for F16

Comment 6 Jan Willies 2011-09-14 18:35:20 UTC
seems to be fixed in 3.1.0-0.rc6.git0.0.fc16.x86_64

Comment 7 Chuck Ebbert 2011-09-15 13:23:00 UTC
Changing this to F15 so we can get it fixed there.

Comment 8 Jeff Layton 2011-09-15 18:04:17 UTC
This problem is almost certainly the NFS equivalent of bug 712088.

What I've found with the cifs equivalent bug is that there is some raciness involved here. If the fs ends up getting cleanly unmounted before the network interfaces go down, then suspending will work. If it does not though, then it'll generally fail. 

My suspicion is that the timing may have changed here and the umount is finishing before the network interfaces go down. I see nothing right offhand that would fix this in any recent kernel. How sure are you that it's fixed? What might be interesting is to run some network-heavy activity on an NFS mount and try repeatedly testing suspends.
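
For anyone wanting to automate that kind of repeated test, here is a rough sketch, offered as one possible approach rather than a recommended procedure. It assumes util-linux's rtcwake is installed and run as root, and it detects a failed suspend by looking for new "Freezing of tasks failed" lines in dmesg; the function names are made up for illustration. Generating the NFS traffic is left to the tester.

```python
import subprocess

FAIL_MARKER = "Freezing of tasks failed"


def rtcwake_suspend(seconds=30):
    """Suspend to RAM and wake again via the RTC alarm.

    Assumption: util-linux rtcwake is available and we are root.
    """
    subprocess.run(["rtcwake", "-m", "mem", "-s", str(seconds)], check=False)


def read_dmesg():
    """Return the current kernel ring buffer as text."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def count_failed_suspends(cycles, suspend=rtcwake_suspend, read_log=read_dmesg):
    """Run `cycles` suspend attempts and count those that logged a freeze failure."""
    failures = 0
    seen = read_log().count(FAIL_MARKER)
    for _ in range(cycles):
        suspend()
        now = read_log().count(FAIL_MARKER)
        # A new marker since the previous attempt means this attempt failed.
        # If the ring buffer wrapped, the count can drop; that is not counted.
        if now > seen:
            failures += 1
        seen = now
    return failures
```

The suspend and log-reading steps are passed in as callables so the counting logic can be exercised without actually suspending a machine.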

Assuming that it's not really fixed, I suspect that we might be able to fix this with a similar approach to the patch in bug 712088. If so, this will require changes similar to that one, so I'll go ahead and grab this bug too.

Comment 9 Jeff Layton 2011-09-15 18:52:50 UTC
Created attachment 523426 [details]
patch -- allow cifs and nfs TASK_KILLABLE sleeps to freeze

Here's a mostly untested set of patches that I think will probably fix this. The fix is twofold...

First, we have to allow the freezer to wake processes that are in TASK_KILLABLE sleep. This is probably the most controversial part of the patchset, but I think it'll probably be harmless. We'll see what the linux-pm folks think though...

Next we have to teach the wait_bit_killable variants in the NFS and RPC code to try_to_freeze when they are woken up without a fatal signal.

I also threw the cifs patch for this problem in for good measure.

If you can test this set and let me know if it really fixes the issue then I'll see about getting these in for 3.2...
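
To make the two halves of the fix described above easier to follow, here is a toy model in Python. It is emphatically not kernel code: Task, freezer_pass and the freezes_on_wake flag are invented for illustration, and the model only captures the idea that a TASK_KILLABLE sleeper must both be woken by the freezer and then call try_to_freeze in its wait loop.

```python
from dataclasses import dataclass

INTERRUPTIBLE = "TASK_INTERRUPTIBLE"
KILLABLE = "TASK_KILLABLE"
FROZEN = "frozen"


@dataclass
class Task:
    name: str
    state: str
    # Toy stand-in for "the sleep loop calls try_to_freeze() when woken
    # without a fatal signal".
    freezes_on_wake: bool


def freezer_pass(tasks, wake_killable):
    """Model one freeze attempt; return names of tasks refusing to freeze."""
    stuck = []
    for task in tasks:
        # Part one of the fix: does the freezer wake this sleeper at all?
        woken = task.state == INTERRUPTIBLE or (
            wake_killable and task.state == KILLABLE
        )
        # Part two: once woken, does the task actually enter the freezer?
        if woken and task.freezes_on_wake:
            task.state = FROZEN
        else:
            stuck.append(task.name)
    return stuck
```

In this model a umount sleeping in TASK_KILLABLE stays stuck if only one of the two changes is present, and freezes only when both are, which is why the patchset needs both pieces.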

Comment 10 Jeff Layton 2011-09-23 15:08:30 UTC
Created attachment 524636 [details]
patch -- allow cifs and nfs TASK_KILLABLE sleeps to freeze

Revised patch. This also fixes some cases where the NFS layer will sleep during locking and NFSERR_JUKEBOX sort of events. It also properly includes freezer.h.

Comment 11 Adam Williamson 2011-09-23 17:07:02 UTC
Sorry I haven't tested this yet, Jeff, Beta has just been too busy for me to spend time on anything not-Beta really :( I'll get to it ASAP. suspend is still working for me most of the time with non-debug kernels, so they're definitely changing the timing a bit - I'll test the patch both with a debug-enabled and a debug-disabled kernel.

Comment 12 John Brier 2011-10-17 04:09:04 UTC
I have a similar problem but the hung task is fuser, not umount:

Oct 16 23:51:30 farina kernel: [21003.895160] PM: Syncing filesystems ... done.
Oct 16 23:51:30 farina kernel: [21004.042552] Freezing user space processes ... 
Oct 16 23:51:30 farina kernel: [21024.051046] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Oct 16 23:51:30 farina kernel: [21024.051174] fuser           D 0000000000000000     0  7082   6840 0x00800084
Oct 16 23:51:30 farina kernel: [21024.051178]  ffff880098b25ae8 0000000000000082 ffffffff8148858a ffff880100000000
Oct 16 23:51:30 farina kernel: [21024.051181]  ffff8800b1ae4590 ffff880098b25fd8 ffff880098b25fd8 0000000000012540
Oct 16 23:51:30 farina kernel: [21024.051184]  ffffffff81a0b020 ffff8800b1ae4590 ffff88012ffb3b80 0000000100000246
Oct 16 23:51:30 farina kernel: [21024.051187] Call Trace:
Oct 16 23:51:30 farina kernel: [21024.051194]  [<ffffffff8148858a>] ? _raw_spin_lock_irqsave+0x12/0x2f
Oct 16 23:51:30 farina kernel: [21024.051208]  [<ffffffffa0db7847>] ? rpc_queue_empty+0x2e/0x2e [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051212]  [<ffffffff8104fbd6>] schedule+0x5a/0x5c
Oct 16 23:51:30 farina kernel: [21024.051220]  [<ffffffffa0db787b>] rpc_wait_bit_killable+0x34/0x38 [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051223]  [<ffffffff814875a4>] __wait_on_bit+0x48/0x7b
Oct 16 23:51:30 farina kernel: [21024.051226]  [<ffffffff8105a76b>] ? _local_bh_enable_ip+0x25/0x8e
Oct 16 23:51:30 farina kernel: [21024.051229]  [<ffffffff81487649>] out_of_line_wait_on_bit+0x72/0x7d
Oct 16 23:51:30 farina kernel: [21024.051237]  [<ffffffffa0db7847>] ? rpc_queue_empty+0x2e/0x2e [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051239]  [<ffffffff81070687>] ? autoremove_wake_function+0x3d/0x3d
Oct 16 23:51:30 farina kernel: [21024.051247]  [<ffffffffa0db841c>] __rpc_execute+0xf0/0x293 [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051255]  [<ffffffffa0db862c>] rpc_execute+0x3f/0x43 [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051261]  [<ffffffffa0db1e0f>] rpc_run_task+0x86/0x8e [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051267]  [<ffffffffa0db1f04>] rpc_call_sync+0x45/0x66 [sunrpc]
Oct 16 23:51:30 farina kernel: [21024.051284]  [<ffffffffa0e104ba>] _nfs4_call_sync+0x21/0x23 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051296]  [<ffffffffa0e0dd75>] nfs4_call_sync+0x16/0x18 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051308]  [<ffffffffa0e0ebe3>] _nfs4_proc_getattr+0x97/0xa5 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051321]  [<ffffffffa0e11ddb>] nfs4_proc_getattr+0x36/0x55 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051330]  [<ffffffffa0dfcffd>] __nfs_revalidate_inode+0xb4/0x1a2 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051338]  [<ffffffffa0dfd309>] nfs_getattr+0x81/0xc4 [nfs]
Oct 16 23:51:30 farina kernel: [21024.051342]  [<ffffffff8112abb3>] vfs_getattr+0x45/0x63
Oct 16 23:51:30 farina kernel: [21024.051344]  [<ffffffff8112ac29>] vfs_fstatat+0x58/0x6e
Oct 16 23:51:30 farina kernel: [21024.051346]  [<ffffffff8112ac7a>] vfs_stat+0x1b/0x1d
Oct 16 23:51:30 farina kernel: [21024.051349]  [<ffffffff8112ad79>] sys_newstat+0x1a/0x33
Oct 16 23:51:30 farina kernel: [21024.051351]  [<ffffffff8112fba4>] ? path_put+0x20/0x24
Oct 16 23:51:30 farina kernel: [21024.051354]  [<ffffffff810a0f88>] ? audit_syscall_entry+0x145/0x171
Oct 16 23:51:30 farina kernel: [21024.051358]  [<ffffffff8148ed02>] system_call_fastpath+0x16/0x1b
Oct 16 23:51:30 farina kernel: [21024.051360] 
Oct 16 23:51:30 farina kernel: [21024.051361] Restarting tasks ... done.


fuser was monitoring an NFS mount point I have

192.168.2.17:/efserv/ /efserv nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.2.11,minorversion=0,local_lock=none,addr=192.168.2.17 0 0


[root@farina ~]# uname -a
Linux farina.dj.edm 2.6.40.6-0.fc15.x86_64 #1 SMP Tue Oct 4 00:39:50 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Though here I tried to reproduce and it's umount now:

Oct 17 00:07:21 farina kernel: [21944.859957] PM: Syncing filesystems ... done.
Oct 17 00:07:21 farina kernel: [21944.866209] Freezing user space processes ... 
Oct 17 00:07:21 farina kernel: [21964.875048] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Oct 17 00:07:21 farina kernel: [21964.875182] umount          D 0000000000000000     0  8070   8050 0x00800084
Oct 17 00:07:21 farina kernel: [21964.875186]  ffff880125ea1ae8 0000000000000086 ffffffff8148858a ffff880100000000
Oct 17 00:07:21 farina kernel: [21964.875190]  ffff88010615c590 ffff880125ea1fd8 ffff880125ea1fd8 0000000000012540
Oct 17 00:07:21 farina kernel: [21964.875193]  ffffffff81a0b020 ffff88010615c590 ffff88012ffb3b80 0000000100000246
Oct 17 00:07:21 farina kernel: [21964.875195] Call Trace:
Oct 17 00:07:21 farina kernel: [21964.875202]  [<ffffffff8148858a>] ? _raw_spin_lock_irqsave+0x12/0x2f
Oct 17 00:07:21 farina kernel: [21964.875217]  [<ffffffffa0db7847>] ? rpc_queue_empty+0x2e/0x2e [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875221]  [<ffffffff8104fbd6>] schedule+0x5a/0x5c
Oct 17 00:07:21 farina kernel: [21964.875229]  [<ffffffffa0db787b>] rpc_wait_bit_killable+0x34/0x38 [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875232]  [<ffffffff814875a4>] __wait_on_bit+0x48/0x7b
Oct 17 00:07:21 farina kernel: [21964.875235]  [<ffffffff8105a76b>] ? _local_bh_enable_ip+0x25/0x8e
Oct 17 00:07:21 farina kernel: [21964.875238]  [<ffffffff81487649>] out_of_line_wait_on_bit+0x72/0x7d
Oct 17 00:07:21 farina kernel: [21964.875245]  [<ffffffffa0db7847>] ? rpc_queue_empty+0x2e/0x2e [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875248]  [<ffffffff81070687>] ? autoremove_wake_function+0x3d/0x3d
Oct 17 00:07:21 farina kernel: [21964.875256]  [<ffffffffa0db841c>] __rpc_execute+0xf0/0x293 [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875264]  [<ffffffffa0db862c>] rpc_execute+0x3f/0x43 [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875270]  [<ffffffffa0db1e0f>] rpc_run_task+0x86/0x8e [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875276]  [<ffffffffa0db1f04>] rpc_call_sync+0x45/0x66 [sunrpc]
Oct 17 00:07:21 farina kernel: [21964.875293]  [<ffffffffa0e104ba>] _nfs4_call_sync+0x21/0x23 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875305]  [<ffffffffa0e0dd75>] nfs4_call_sync+0x16/0x18 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875317]  [<ffffffffa0e0ebe3>] _nfs4_proc_getattr+0x97/0xa5 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875330]  [<ffffffffa0e11ddb>] nfs4_proc_getattr+0x36/0x55 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875338]  [<ffffffffa0dfcffd>] __nfs_revalidate_inode+0xb4/0x1a2 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875347]  [<ffffffffa0dfd309>] nfs_getattr+0x81/0xc4 [nfs]
Oct 17 00:07:21 farina kernel: [21964.875350]  [<ffffffff8112abb3>] vfs_getattr+0x45/0x63
Oct 17 00:07:21 farina kernel: [21964.875353]  [<ffffffff8112ac29>] vfs_fstatat+0x58/0x6e
Oct 17 00:07:21 farina kernel: [21964.875355]  [<ffffffff8112ac7a>] vfs_stat+0x1b/0x1d
Oct 17 00:07:21 farina kernel: [21964.875357]  [<ffffffff8112ad79>] sys_newstat+0x1a/0x33
Oct 17 00:07:21 farina kernel: [21964.875360]  [<ffffffff8112fba4>] ? path_put+0x20/0x24
Oct 17 00:07:21 farina kernel: [21964.875363]  [<ffffffff810a0f88>] ? audit_syscall_entry+0x145/0x171
Oct 17 00:07:21 farina kernel: [21964.875365]  [<ffffffff8113048d>] ? putname+0x34/0x36
Oct 17 00:07:21 farina kernel: [21964.875368]  [<ffffffff8148ed02>] system_call_fastpath+0x16/0x1b
Oct 17 00:07:21 farina kernel: [21964.875371] 
Oct 17 00:07:21 farina kernel: [21964.875372] Restarting tasks ... done.

Comment 13 Jeff Layton 2011-10-17 10:26:00 UTC
Looks like the same problem. As I mentioned before, there is some raciness involved here depending on when the network interfaces come down. At this point it looks like the patchset should make 3.2 upstream...

Comment 14 Adam Williamson 2011-10-25 00:46:11 UTC
Been running kernels with this patch on my laptop and desktop for a week or so, noticed no problems and have been able to suspend both systems first time, every time. Looks good to me!

Comment 15 Jeff Layton 2011-11-01 13:55:30 UTC
*** Bug 710539 has been marked as a duplicate of this bug. ***

Comment 16 Jeff Layton 2011-11-01 13:57:15 UTC
Tejun Heo has proposed that a key piece of this patchset be reverted. Since he knows a lot more about task state handling, I'm inclined to believe him that it'll be problematic.

At this point we're still discussing what the correct fix is, but for now it's
looking like this will not make 3.2. Hopefully we can get it fixed for 3.3.

Comment 17 Jeff Layton 2011-11-04 13:03:46 UTC
Created attachment 531763 [details]
patch -- don't have freezer count processes sleeping in NFS/RPC code

I've moved my laptop to f16 and have not been able to consistently reproduce
this since. I suspect that the problem is actually still there however. If
anyone here is still able to reproduce this, could you test the attached
patch? This is apparently the scheme that the upstream scheduler gurus would
prefer that we use for now.

It should work on f15-ish kernels too, but it may need some wiggling to get it
to apply there.

Comment 18 Adam Williamson 2011-11-04 19:09:07 UTC
The problem's definitely still there - I've been rebuilding kernels with your older patch, and both my systems always suspend first time with the rebuilt kernels, but only about 50% success without the patch.

I'll try the new patch soon, thanks!



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 19 Adam Williamson 2011-11-05 16:30:58 UTC
*** Bug 712088 has been marked as a duplicate of this bug. ***

Comment 20 Adam Williamson 2011-11-05 16:40:50 UTC
The patch fails to apply against current F16 kernel, all hunks fail:

Patch21080: cifs_nfs_suspend.patch
+ case "$patch" in
+ patch -p1 -F1 -s
1 out of 1 hunk FAILED -- saving rejects to file include/linux/freezer.h.rej
2 out of 2 hunks FAILED -- saving rejects to file include/linux/freezer.h.rej

Comment 21 Jeff Layton 2011-11-06 00:21:52 UTC
Created attachment 531916 [details]
patch -- nfs/sunrpc: make TASK_KILLABLE sleeps attempt to freeze

Can you try this patch? It's against 3.1.0-7.fc16.x86_64.

Comment 22 Adam Williamson 2011-11-11 19:41:39 UTC
With the new patch I just had one failure to suspend due to CIFS on my laptop; with the old patch I don't think I ever had a failure.

Comment 23 Jeff Layton 2011-11-11 19:49:14 UTC
Did it happen to spew anything to the ring buffer when it failed? Also, have you had any problems suspending when NFS is mounted?

FWIW, I'm not that thrilled with the new scheme for doing this as it doesn't seem as robust. So, I'm interested in any problems you may be seeing with it...

Comment 24 Adam Williamson 2011-11-15 18:38:02 UTC
Jeff: here's the failure from my laptop, running kernel 3.1.1-1 with the new patch applied:

Nov 15 10:36:17 vaioz kernel: [  517.877192] Freezing user space processes ... 
Nov 15 10:36:17 vaioz kernel: [  537.850179] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Nov 15 10:36:17 vaioz kernel: [  537.850227] umount          D 0000000000000000     0  2421   2397 0x00800084
Nov 15 10:36:17 vaioz kernel: [  537.850235]  ffff88013a3f5cb8 0000000000000086 ffff88013801d180 ffff880000000000
Nov 15 10:36:17 vaioz kernel: [  537.850242]  ffff880036ba0000 ffff88013a3f5fd8 ffff88013a3f5fd8 0000000000012f80
Nov 15 10:36:17 vaioz kernel: [  537.850248]  ffffffff81a0d020 ffff880036ba0000 ffff88013a3f5c88 00000001814b7194
Nov 15 10:36:17 vaioz kernel: [  537.850254] Call Trace:
Nov 15 10:36:17 vaioz kernel: [  537.850266]  [<ffffffff814b5cc7>] schedule+0x5a/0x5c
Nov 15 10:36:17 vaioz kernel: [  537.850287]  [<ffffffffa0411d80>] wait_for_response+0xbc/0xbe [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850292]  [<ffffffff810733ce>] ? remove_wait_queue+0x3a/0x3a
Nov 15 10:36:17 vaioz kernel: [  537.850299]  [<ffffffffa0412850>] SendReceive2+0x162/0x29d [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850306]  [<ffffffffa04129c0>] SendReceiveNoRsp+0x35/0x37 [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850311]  [<ffffffffa03f8e8f>] CIFSSMBTDis+0x8d/0xcf [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850316]  [<ffffffffa04017b6>] cifs_put_tcon+0xc7/0xf0 [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850321]  [<ffffffffa0404c7c>] cifs_put_tlink+0x4f/0x5c [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850326]  [<ffffffffa0405fd3>] cifs_umount+0x4b/0x92 [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850330]  [<ffffffffa03f71d0>] cifs_kill_sb+0x1f/0x23 [cifs]
Nov 15 10:36:17 vaioz kernel: [  537.850336]  [<ffffffff8112adfd>] deactivate_locked_super+0x37/0x68
Nov 15 10:36:17 vaioz kernel: [  537.850339]  [<ffffffff8112b66b>] deactivate_super+0x37/0x3b
Nov 15 10:36:17 vaioz kernel: [  537.850342]  [<ffffffff811403ec>] mntput_no_expire+0xcc/0xd1
Nov 15 10:36:17 vaioz kernel: [  537.850344]  [<ffffffff81140fa9>] sys_umount+0x2ac/0x2da
Nov 15 10:36:17 vaioz kernel: [  537.850349]  [<ffffffff814bd8c2>] system_call_fastpath+0x16/0x1b
Nov 15 10:36:17 vaioz kernel: [  537.850351] 




Comment 25 Jeff Layton 2011-11-15 20:57:46 UTC
Sorry for the confusion. This bug was for the NFS problem, so I didn't bother to roll in the patches for CIFS. They have been committed upstream so that might have made it difficult to merge.

I'll respin this patchset against 3.1.1 and add in the cifs piece. Stay tuned...

Comment 26 Jeff Layton 2011-11-15 21:56:23 UTC
Going ahead and moving this to F16 bug...

Comment 27 Jeff Layton 2011-11-15 21:59:23 UTC
Created attachment 533853 [details]
patch -- cifs/nfs/sunrpc client freezer patch against v3.1.1

Ok, this patch is against 3.1.1 and seems to do the right thing. I tested several suspend/resume cycles with NFS and it seemed to work correctly. Can you test this one when you get time and let me know how it goes?

The CIFS part of this patch is already upstream. If the NFS parts work for you too, I'll plan to post the NFS/sunrpc parts upstream within the next few weeks so it'll be ready for 3.3.

Comment 28 Adam Williamson 2011-11-28 17:04:21 UTC
been running the 'big patch' against 3.1 on my laptop for a while and it seems to work fine, I've had no failed suspends. thanks!




Comment 29 Adam Williamson 2011-11-29 02:22:08 UTC
jeff: can you provide an up-to-date patch against 3.2 for use on my desktop (which is on f17)? I've tried using https://bugzilla.redhat.com/attachment.cgi?id=531763, which as I understood things ought to contain the NFS stuff while 3.2 itself already contains the CIFS stuff; it applies (at least last I tried), but doesn't seem to resolve the issue entirely, I do get suspend failures. So I figure there must be something missing from the combination of that patch plus whatever's in 3.2, compared to 3.1 plus the 'big patch'?




Comment 30 Jeff Layton 2011-11-29 12:40:59 UTC
Created attachment 537948 [details]
patch -- nfs/sunrpc client freezer patch against v3.2-rc3 or so

Yes, this patch applies with a little offset to 3.2-pre kernels. Note that the cifs piece of this is already merged into 3.2. The plan is to merge the nfs/sunrpc parts for 3.3.

Comment 31 Adam Williamson 2011-12-01 19:10:11 UTC
Jeff: are you sure the CIFS bit is in 3.2.0-rc3.git0? I'm running:

3.2.0-0.rc3.git0.1.2.fc17.x86_64

which is a personal build: it's the Fedora 3.2.0-0.rc3.git0.1 kernel with the patch from comment #30 applied (and also some debug config changes, but that doesn't matter to this bug). I tried to suspend and got a failure from CIFS:

Nov 30 12:06:45 adam kernel: [ 4459.733121] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
Nov 30 12:06:45 adam kernel: [ 4459.733167] cifsd           W ffff880417683540  4528  1690      2 0x00800000
Nov 30 12:06:45 adam kernel: [ 4459.733175]  ffff88042e95f9a0 0000000000000046 ffff880400000000 ffffffff812ec9e2
Nov 30 12:06:45 adam kernel: [ 4459.733183]  ffff880417683160 ffff88042e95ffd8 ffff88042e95ffd8 ffff88042e95ffd8
Nov 30 12:06:45 adam kernel: [ 4459.733189]  ffff880444fae2c0 ffff880417683160 000000000000128d 0000000100000082
Nov 30 12:06:45 adam kernel: [ 4459.733196] Call Trace:
Nov 30 12:06:45 adam kernel: [ 4459.733206]  [<ffffffff812ec9e2>] ? __debug_object_init+0x202/0x410
Nov 30 12:06:45 adam kernel: [ 4459.733214]  [<ffffffff8162519f>] schedule+0x3f/0x60
Nov 30 12:06:45 adam kernel: [ 4459.733218]  [<ffffffff816256ea>] schedule_timeout+0x19a/0x380
Nov 30 12:06:45 adam kernel: [ 4459.733228]  [<ffffffff81083f60>] ? lock_timer_base+0x70/0x70
Nov 30 12:06:45 adam kernel: [ 4459.733231]  [<ffffffff814fac41>] sk_wait_data+0xd1/0xe0
Nov 30 12:06:45 adam kernel: [ 4459.733233]  [<ffffffff81099740>] ? remove_wait_queue+0x50/0x50
Nov 30 12:06:45 adam kernel: [ 4459.733235]  [<ffffffff81555b75>] tcp_recvmsg+0x545/0xca0
Nov 30 12:06:45 adam kernel: [ 4459.733238]  [<ffffffff8106a525>] ? load_balance+0x105/0x890
Nov 30 12:06:45 adam kernel: [ 4459.733240]  [<ffffffff8157a1bb>] inet_recvmsg+0x8b/0xa0
Nov 30 12:06:45 adam kernel: [ 4459.733242]  [<ffffffff814f52cd>] sock_recvmsg+0x11d/0x140
Nov 30 12:06:45 adam kernel: [ 4459.733243]  [<ffffffff812ec7ae>] ? free_object+0x8e/0xc0
Nov 30 12:06:45 adam kernel: [ 4459.733245]  [<ffffffff812ed0f8>] ? debug_object_free+0xe8/0x140
Nov 30 12:06:45 adam kernel: [ 4459.733247]  [<ffffffff8109d885>] ? destroy_hrtimer_on_stack+0x15/0x20
Nov 30 12:06:45 adam kernel: [ 4459.733249]  [<ffffffff81626aba>] ? schedule_hrtimeout_range_clock+0xca/0x160
Nov 30 12:06:45 adam kernel: [ 4459.733250]  [<ffffffff8109dde4>] ? hrtimer_start_range_ns+0x14/0x20
Nov 30 12:06:45 adam kernel: [ 4459.733253]  [<ffffffff8112c685>] ? mempool_alloc_slab+0x15/0x20
Nov 30 12:06:45 adam kernel: [ 4459.733255]  [<ffffffff814f5336>] kernel_recvmsg+0x46/0x60
Nov 30 12:06:45 adam kernel: [ 4459.733260]  [<ffffffffa04a8e7e>] cifs_readv_from_socket+0x1ae/0x280 [cifs]
Nov 30 12:06:45 adam kernel: [ 4459.733262]  [<ffffffff8112c9a9>] ? mempool_alloc+0x59/0x150
Nov 30 12:06:45 adam kernel: [ 4459.733264]  [<ffffffff8105a433>] ? __wake_up+0x53/0x70
Nov 30 12:06:45 adam kernel: [ 4459.733267]  [<ffffffffa04a8f77>] cifs_read_from_socket+0x27/0x30 [cifs]
Nov 30 12:06:45 adam kernel: [ 4459.733269]  [<ffffffffa04a914d>] cifs_demultiplex_thread+0x15d/0xdc0 [cifs]
Nov 30 12:06:45 adam kernel: [ 4459.733271]  [<ffffffff81624904>] ? __schedule+0x3e4/0x940
Nov 30 12:06:45 adam kernel: [ 4459.733274]  [<ffffffffa04a8ff0>] ? dequeue_mid+0x70/0x70 [cifs]
Nov 30 12:06:45 adam kernel: [ 4459.733276]  [<ffffffff81098e1c>] kthread+0x8c/0xa0
Nov 30 12:06:45 adam kernel: [ 4459.733279]  [<ffffffff81631eb4>] kernel_thread_helper+0x4/0x10
Nov 30 12:06:45 adam kernel: [ 4459.733280]  [<ffffffff81098d90>] ? kthread_worker_fn+0x1c0/0x1c0
Nov 30 12:06:45 adam kernel: [ 4459.733282]  [<ffffffff81631eb0>] ? gs_change+0x13/0x13
Nov 30 12:06:45 adam kernel: [ 4459.733287] 
Nov 30 12:06:45 adam kernel: [ 4459.733288] Restarting tasks ... done.




Comment 32 Jeff Layton 2011-12-01 19:30:31 UTC
Created attachment 539374 [details]
patch -- attempt to freeze while looping on a receive attempt

It is, but this is a *different* bug that we introduced in 3.2. We've been discussing this upstream since yesterday. My thinking was that this patch would fix it, but the one person who tested it said it didn't work for them. Can you test it and let me know if it does for you?

Comment 33 Jeff Layton 2011-12-08 12:44:47 UTC
Created attachment 542512 [details]
patch -- cifs/nfs/sunrpc client freezer patch against v3.1.4

This set of patches should resolve the issue and is against v3.1.4. Adam, could you see about putting the above patchset into F16?

Comment 34 Jeff Layton 2011-12-08 12:48:59 UTC
Created attachment 542513 [details]
patch -- cifs/nfs/sunrpc client freezer patch against v3.2-rc4 (or so)

This patch is against the tip of Linus' tree. It should be suitable for f17 kernels. The relevant changes are slated for v3.3, so once f17 moves to a 3.3-based kernel we shouldn't need any patches.

Comment 35 Jeff Layton 2011-12-08 12:49:44 UTC
Adam can you see about getting the above two patches into Fedora's kernels?

Comment 36 Josh Boyer 2011-12-08 14:49:36 UTC
(In reply to comment #35)
> Adam can you see about getting the above two patches into Fedora's kernels?

We'll bring in the rawhide version for now and let it settle there for a few days.  If it looks good, we'll grab the f16 version after we get the 3.1.5 stable update pushed out.

Adam, in the meantime, if you want to respin your f16 kernel and test it locally that would also be helpful.

Comment 37 Adam Williamson 2011-12-08 18:11:23 UTC
Josh: I've been using the slightly older version on F16 for weeks without issue, I'll confirm with the new one too.




Comment 38 Fedora Update System 2011-12-13 13:38:18 UTC
kernel-3.1.5-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.1.5-2.fc16

Comment 39 Jeff Layton 2011-12-13 15:43:46 UTC
*** Bug 759703 has been marked as a duplicate of this bug. ***

Comment 40 Fedora Update System 2011-12-13 21:51:14 UTC
Package kernel-3.1.5-2.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.1.5-2.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-17052/kernel-3.1.5-2.fc16
then log in and leave karma (feedback).

Comment 41 Fedora Update System 2011-12-14 23:39:48 UTC
kernel-3.1.5-2.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 42 Fedora Update System 2011-12-15 18:53:36 UTC
kernel-2.6.41.5-4.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.41.5-4.fc15

Comment 43 Fedora Update System 2011-12-22 01:17:18 UTC
kernel-2.6.41.6-1.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.41.6-1.fc15

Comment 44 mooz 2014-08-17 07:33:15 UTC
This issue exists in Redhat EPLC 6.5 kernel version 2.6.32-431.23.3.el6.x86_64 (latest stable release).

Comment 45 wwp 2016-05-10 11:31:42 UTC
The issue still exists in RHEL 6.7 kernel 2.6.32-573.26.1.el6.x86_64.

