Bug 712088
| Summary: | cifs causing suspend to fail | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
| Component: | kernel | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED DUPLICATE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 15 | CC: | aurorauddeborg, claus, dcbw, gansalmon, itamar, jiali, jonathan, kernel-maint, lpoetter, madhu.chinakonda, notting, redhat-bugzilla, sprabhu, steved, subscript |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-11-05 16:30:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Adam Williamson
2011-06-09 13:34:18 UTC
Forgot to mention - the offending CIFS mount should be: //192.168.1.13/Volume_1 /share/data cifs rsize=8192,wsize=8192,nosuid,soft,user=guest,noauto,comment=systemd.automount 0 0 note the use of systemd automounting. Patch sent upstream and should make its way into stable kernels soon. I think we also need to fix the "send" codepath as well, and make the sleep we do in wait_for_response freezable, but it's a little tricky as there is not yet a wait_event_killable variant that is also freezable. Just to note, I've noticed a very similar problem with NFS in current Rawhide: https://bugzilla.redhat.com/show_bug.cgi?id=717735 Still happening with 3.0-0.rc6.git0.1.1.fc16.x86_64 (that's just 3.0-0.rc6.git0.1.fc16.x86_64 with the patch to fix https://bugzilla.redhat.com/show_bug.cgi?id=717060 applied). Traceback is now: Jul 6 12:21:27 adam kernel: [ 6352.278471] PM: Syncing filesystems ... done. Jul 6 12:21:47 adam kernel: [ 6352.379145] Freezing user space processes ... Jul 6 12:21:47 adam kernel: [ 6372.363733] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0): Jul 6 12:21:47 adam kernel: [ 6372.363839] umount D ffff88044623a7a0 4224 3728 3711 0x00800084 Jul 6 12:21:47 adam kernel: [ 6372.363843] ffff8803bec7db38 0000000000000046 0000000000000292 ffff880400000000 Jul 6 12:21:47 adam kernel: [ 6372.363848] ffff88044623a3d0 ffff8803bec7dfd8 ffff8803bec7dfd8 00000000001d2d00 Jul 6 12:21:47 adam kernel: [ 6372.363852] ffff88044631c7a0 ffff88044623a3d0 ffff8803bec7db38 0000000181074c76 Jul 6 12:21:47 adam kernel: [ 6372.363856] Call Trace: Jul 6 12:21:47 adam kernel: [ 6372.363864] [<ffffffffa04eb8e0>] wait_for_response+0xbc/0xbe [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363869] [<ffffffff81074d4c>] ? __init_waitqueue_head+0x4b/0x4b Jul 6 12:21:47 adam kernel: [ 6372.363875] [<ffffffffa04ec6fe>] SendReceive+0x19b/0x2b5 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363879] [<ffffffffa04d67e7>] CIFSSMBUnixQPathInfo+0x167/0x212 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363883] [<ffffffff814f36e5>] ? _raw_spin_unlock+0x28/0x2c Jul 6 12:21:47 adam kernel: [ 6372.363888] [<ffffffffa04e76ab>] cifs_get_inode_info_unix+0x91/0x16b [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363893] [<ffffffffa04e082b>] ? build_path_from_dentry+0xda/0x1c2 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363897] [<ffffffffa04e082b>] ? build_path_from_dentry+0xda/0x1c2 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363901] [<ffffffff8112623e>] ? __kmalloc+0xfa/0x10c Jul 6 12:21:47 adam kernel: [ 6372.363905] [<ffffffffa04e082b>] ? build_path_from_dentry+0xda/0x1c2 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363909] [<ffffffffa04e08de>] ? build_path_from_dentry+0x18d/0x1c2 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363914] [<ffffffffa04e8972>] cifs_revalidate_dentry_attr+0x111/0x178 [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363919] [<ffffffffa04e8ab7>] cifs_getattr+0x83/0x10b [cifs] Jul 6 12:21:47 adam kernel: [ 6372.363922] [<ffffffff8113b7d9>] vfs_getattr+0x45/0x63 Jul 6 12:21:47 adam kernel: [ 6372.363925] [<ffffffff8113b84f>] vfs_fstatat+0x58/0x6e Jul 6 12:21:47 adam kernel: [ 6372.363928] [<ffffffff8113b8a0>] vfs_stat+0x1b/0x1d Jul 6 12:21:47 adam kernel: [ 6372.363930] [<ffffffff8113b99f>] sys_newstat+0x1a/0x33 Jul 6 12:21:47 adam kernel: [ 6372.363933] [<ffffffff81140b21>] ? path_put+0x1f/0x23 Jul 6 12:21:47 adam kernel: [ 6372.363937] [<ffffffff810ac3c9>] ? audit_syscall_entry+0x11c/0x148 Jul 6 12:21:47 adam kernel: [ 6372.363940] [<ffffffff8125cbbe>] ? trace_hardirqs_on_thunk+0x3a/0x3f Jul 6 12:21:47 adam kernel: [ 6372.363944] [<ffffffff814fa242>] system_call_fastpath+0x16/0x1b Jul 6 12:21:47 adam kernel: [ 6372.363947] Jul 6 12:21:47 adam kernel: [ 6372.363948] Restarting tasks ... done. Yes. The problem is that wait_event_killable doesn't handle the freezer. I think we also need to add a way to handle task freezing when in TASK_KILLABLE sleep. This is doable, but means adding a new wait_event_killable_freezable() sleep (or something along those lines). Created attachment 514019 [details]
patch: add wait_event_freezekillable and have cifs use it
Ok, here's a potential patch. I'm not 100% convinced that it doesn't have a race if a freeze event comes in, but it's probably good enough for an initial test. Adam, can you test this out and let me know if it helps?
The potential race I'm thinking about is here in __wait_event_killable:
if (condition) \
break; \
if (!fatal_signal_pending(current)) { \
schedule(); \
continue; \
} \
...I'm not sure if we'd continue to hang if we got a freeze event after checking the condition but before the schedule(). I'll probably need to run this by Rafal or someone else upstream to be sure.
Still, it's an unlikely race, so testing this patch would be helpful.
roger, will test - thanks jeff. Still fails with the patch. Running 3.0.0-3.1.fc16.x86_64 - that's 3.0.0-3.fc16.x86_64 with the patch applied - I get this trace: Aug 3 03:54:16 adam kernel: [151854.705496] PM: Syncing filesystems ... done. Aug 3 03:54:36 adam kernel: [151854.932420] Freezing user space processes ... Aug 3 03:54:36 adam kernel: [151874.917385] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0): Aug 3 03:54:36 adam kernel: [151874.917488] umount D ffff88040b5c4b90 4944 30573 30556 0x00800084 Aug 3 03:54:36 adam kernel: [151874.917493] ffff8803c57e7b38 0000000000000046 0000000000000292 ffff880400000000 Aug 3 03:54:36 adam kernel: [151874.917498] ffff88040b5c47c0 ffff8803c57e7fd8 ffff8803c57e7fd8 00000000001d3080 Aug 3 03:54:36 adam kernel: [151874.917502] ffff880446308000 ffff88040b5c47c0 ffff8803c57e7b38 00000001810764ae Aug 3 03:54:36 adam kernel: [151874.917506] Call Trace: Aug 3 03:54:36 adam kernel: [151874.917515] [<ffffffffa04d77c0>] wait_for_response+0xbc/0xbe [cifs] Aug 3 03:54:36 adam kernel: [151874.917520] [<ffffffff81076584>] ? __init_waitqueue_head+0x4b/0x4b Aug 3 03:54:36 adam kernel: [151874.917525] [<ffffffffa04d85de>] SendReceive+0x19b/0x2b5 [cifs] Aug 3 03:54:36 adam kernel: [151874.917530] [<ffffffffa04c25e7>] CIFSSMBUnixQPathInfo+0x167/0x212 [cifs] Aug 3 03:54:36 adam kernel: [151874.917535] [<ffffffffa04d358b>] cifs_get_inode_info_unix+0x91/0x16b [cifs] Aug 3 03:54:36 adam kernel: [151874.917539] [<ffffffff8108abfe>] ? lock_acquire+0xdb/0x103 Aug 3 03:54:36 adam kernel: [151874.917543] [<ffffffffa04cc4ec>] ? cifs_ci_compare+0x70/0x70 [cifs] Aug 3 03:54:36 adam kernel: [151874.917546] [<ffffffff8108aafa>] ? lock_release+0x173/0x19c Aug 3 03:54:36 adam kernel: [151874.917550] [<ffffffffa04cc542>] ? rcu_read_unlock+0x21/0x23 [cifs] Aug 3 03:54:36 adam kernel: [151874.917555] [<ffffffffa04cc78f>] ? build_path_from_dentry+0x1e6/0x21b [cifs] Aug 3 03:54:36 adam kernel: [151874.917560] [<ffffffffa04d4852>] cifs_revalidate_dentry_attr+0x111/0x178 [cifs] Aug 3 03:54:36 adam kernel: [151874.917565] [<ffffffffa04d4997>] cifs_getattr+0x83/0x10b [cifs] Aug 3 03:54:36 adam kernel: [151874.917569] [<ffffffff8113eb65>] vfs_getattr+0x45/0x63 Aug 3 03:54:36 adam kernel: [151874.917572] [<ffffffff8113ebdb>] vfs_fstatat+0x58/0x6e Aug 3 03:54:36 adam kernel: [151874.917575] [<ffffffff8113ec2c>] vfs_stat+0x1b/0x1d Aug 3 03:54:36 adam kernel: [151874.917577] [<ffffffff8113ed2b>] sys_newstat+0x1a/0x33 Aug 3 03:54:36 adam kernel: [151874.917580] [<ffffffff81143ecd>] ? path_put+0x1f/0x23 Aug 3 03:54:36 adam kernel: [151874.917584] [<ffffffff810aff85>] ? audit_syscall_entry+0x11c/0x148 Aug 3 03:54:36 adam kernel: [151874.917587] [<ffffffff812452be>] ? trace_hardirqs_on_thunk+0x3a/0x3f Aug 3 03:54:36 adam kernel: [151874.917591] [<ffffffff814e4fc2>] system_call_fastpath+0x16/0x1b Aug 3 03:54:36 adam kernel: [151874.917594] Aug 3 03:54:36 adam kernel: [151874.917595] Restarting tasks ... done. jeff: any news here? this has been making it impossible for me to suspend for weeks :/ Sorry, I got sidetracked on some other stuff, but I took another look at this today. I added some debug code in the __wait_event_killable macro, but I think there's a crucial hint here in the stack trace you posted as well. Here's the one from my machine: [18662.635383] sky2 0000:09:00.0: eth0: disabling interface [18662.790079] 9718: __wait_event_killable wakeup, freezing=0 [18664.024830] PM: Syncing filesystems ... done. [18664.041172] PM: Preparing system for mem sleep [18664.192465] Freezing user space processes ... [18684.202257] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0): [18684.202372] umount D ffff88010fe94700 0 9718 9701 0x00800084 [18684.202376] ffff8801169c3b08 0000000000000082 ffff8801169c3b40 ffff88003796e140 [18684.202380] ffff880117dd8000 ffff8801169c3fd8 ffff8801169c3fd8 0000000000012540 [18684.202383] ffff88010af92e60 ffff880117dd8000 ffff8801169c3b18 ffff8801169c3ad8 [18684.202387] Call Trace: ...so what happened here is that process 9718 got woken up just after eth0 went down, but not by the freezer. I suspect it got woken up by cifsd or something (though I'd need to do some more debugging to discern that). The crucial thing though here is the actual process name -- "umount". It looks like what's happening here is that the suspend code is trying to unmount the cifs mount *after* taking down the network interfaces. I think that's a big part of the problem here. If the userspace suspend code wants to unmount the filesystem, it ought to wait to take down the interfaces until after that. If not, then it probably ought not try to unmount the fs at all. I suspect that if that's fixed, then this problem will go away. I'm just a little out of the loop on how suspending in userspace works. Why does it try to unmount the filesystems here in the first place? I guess, to summarize my point -- why attempt to unmount the fs at all if you're going to rip out the network interfaces before that can complete? At that point you might as well just leave them mounted. In the meantime I'll keep looking at what's happening with the hung process, but something seems wrong in this design... Created attachment 523204 [details]
ring buffer log with cifsFYI turned up and extra waitqueue debugging
Here we can see what's happening (at least on my laptop). Something downs the network interface, and nearly simultaneously attempts to unmount the cifs fs. That unmount can't complete though (since the interface is down).
Eventually after waking back up, it does complete, but by then the suspend has given up.
Note that about half the time, suspending in this situation works on this host. When that occurs though, it looks like the unmount finishes before the network interface goes down.
The thing I'm trying to determine now is -- why is the umount process not being woken up after this printk pops?
[18664.192465] Freezing user space processes ...
so, the user-space side of things is handled in pm-suspend, which is part of pm-utils. pm-suspend itself is a script, and it appears to source scripts in /usr/lib64/pm-utils/sleep.d/ . One of those is /usr/lib64/pm-utils/sleep.d/55NetworkManager , which does the 'take down the network interface' bit. I am having trouble finding out what's doing the 'umount filesystems' bit, though. still looking. it looks like setting PM_DEBUG = true might get more debugging output from pm_utils, in case that might help. could it be systemd's automount stuff which is trying to take down the mount? let's throw some CCs at the problem! lennart, bill - any idea what in userspace would be trying to umount a CIFS share (configured as systemd.automount) at the same time as the network is being taken down, on suspend? netfs init script is called from a NetworkManager dispatcher script. See /etc/NetworkManager/dispatcher.d/05-netfs. okay. I notice that /etc/rc.d/init.d/netfs excepts mount points specified as 'noauto' from its 'start' action but for 'stop' it uses any point listed in /etc/mtab :
CIFSMTAB=$(LC_ALL=C awk '$3 == "cifs" { print $2 }' /proc/mounts | tac)
and that awk command does indeed return my /share/data partition. So it looks likely that indeed is what's making it get umounted.
so, presumably it should be up to NM to make sure it doesn't take the interface down until the umount is actually complete? let's add dcbw to the CC party.
The question I have is -- why unmount these in the first place? Sure, the netif might go away but you might have the fs pinned in place. For instance, a user process could be sitting with its cwd inside a cifs or nfs mount. A umount will then fail... It seems to me that we'd be better off not trying to unmount those on suspend at all. It's fine when the mount is idle, but when it isn't it's not likely to work. That does mean that we'll need to ensure that processes that are hung waiting on I/O can still be frozen though. I think that's something we can achieve, but I'm still working through the details. Created attachment 523251 [details]
patch -- add wait_event_freezekillable and have cifs use it
A more highly experimental patch than the last one. This one also modifies fake_signal_wake_up to wake up TASK_KILLABLE tasks. I don't *think* this would break anything but I'm far from certain.
This seems to fix the problem for me, but it's quite possible that it'll get shot down upstream. Still, results testing with this would be helpful...
To clarify...even if you fix the suspend userspace code not to try and unmount we still will likely need to fix this, regardless. It's possible that when the network interfaces are taken down for the suspend that there's another process on the box that's doing something on the cifs mount. If there is, then we need to allow it to freeze while it's waiting on the reply. When it wakes back up, the cifs code should deal with the reconnect... Note that someone mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=717735 - the highly similar bug for NFS - that it was working for them in rc6, and indeed I just tried a suspend last night with rc6, and it worked. Has there been a change recently that might affect this? I've only had success once, so it'd be premature to close the bug, but it's interesting. Probably worth noting I had a process actively using the mount when I suspended, which is not usually the case. Been running kernels with this patch on my laptop and desktop for a week or so, noticed no problems and have been able to suspend both systems first time, every time. Looks good to me! -- Fedora Bugzappers volunteer triage team https://fedoraproject.org/wiki/BugZappers *** This bug has been marked as a duplicate of bug 717735 *** |