Red Hat Bugzilla – Bug 442789
oops in cifs module while trying to stop a thread (kthread_stop) during filesystem mount
Last modified: 2011-01-24 17:54:56 EST
Description of problem:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
System panic with this stack trace:
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
pc: c0000000000a1f74: .kfree+0x8c/0xfc
lr: c00000000005d8d8: .free_task+0x30/0x60
current = 0xc0000000a07ef520
paca = 0xc000000000404800
pid = 23924, comm = mount.cifs
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace
Systems tests/testsuite keeps running.
This happens in connect.c, in the cifs_mount function, at this piece of code:
tsk = srvTcp->tsk;
Created attachment 302668
proposed upstream patch
Proposed patch -- only lightly tested.
I think that the problem here is that cifs_demultiplex_thread is allowed to
exit when signalled or if kthread_should_stop returns true. It should actually
only be allowed to exit when kthread_should_stop returns true. That should
prevent this panic.
Shaggy asked whether this patch might cause us to hang on the second pass into
kernel_recvmsg. I don't think that it will since the signal should still be
pending when we return from the first kernel_recvmsg call, so the next call
into it should return quickly.
The light testing I've done seems to indicate that that is the case. A umount
proceeded quickly and didn't hang.
I don't think that patch is what we want. We want to allow the thread to
start coming down in some cases, but not to actually exit until after
kthread_stop is called.
I'm working on a patchset for upstream that should (hopefully) close these races.
This doesn't really appear to be a regression, AFAICT. This looks to be a
long-standing problem that Shirish just now happened to hit.
I've sent an initial patchset upstream that I think will fix this, awaiting
comments on it there...
Actually, I think this is a regression. I think this problem was introduced when
cifsd was changed to use the kthread interface instead of kernel_thread.
It's not something someone is likely to hit frequently. Shirish only hit it when
attempting to mount a share while the box was being stressed.
Either way, once we settle on an upstream patch, I'll see about getting this
proposed for RHEL.
Created attachment 304304
patch -- don't allow cifsd to exit until kthread_stop is called
This is the patch I'm currently working with. This just makes cifsd go to sleep
until kthread_stop is actually called after it exits the main while loop.
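For anyone following along, the ordering problem can be sketched in user space. This is a rough pthread analogy, not the patch and not kernel code: struct worker, park_until_stop and exits_before_stop() are made-up names, stop_requested stands in for kthread_should_stop(), and signalled stands in for a pending signal. With park_until_stop unset, the thread body returns before the stop request is made, which is the ordering that led to the oops. With it set, the thread sleeps after the main loop until the stop request arrives.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the cifsd thread state. */
struct worker {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool signalled;          /* models a signal delivered to the thread */
    bool stop_requested;     /* models kthread_should_stop() */
    bool left_main_loop;     /* thread has started coming down */
    bool exited_before_stop; /* the ordering that led to the oops */
    bool park_until_stop;    /* true = sleep until stop is requested */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(&w->lock);
    /* Main loop: a signal or a stop request lets the thread come down. */
    while (!w->signalled && !w->stop_requested)
        pthread_cond_wait(&w->cond, &w->lock);
    w->left_main_loop = true;
    pthread_cond_broadcast(&w->cond);

    /* Parked variant: sleep here until the stop request arrives. */
    if (w->park_until_stop)
        while (!w->stop_requested)
            pthread_cond_wait(&w->cond, &w->lock);

    /* Record whether we are finishing before anyone asked us to stop. */
    w->exited_before_stop = !w->stop_requested;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Returns true if the thread finished before the stop request was made. */
bool exits_before_stop(bool park_until_stop)
{
    struct worker w = { .park_until_stop = park_until_stop };
    pthread_t tid;
    int rc;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.cond, NULL);
    rc = pthread_create(&tid, NULL, worker_main, &w);
    assert(rc == 0);

    pthread_mutex_lock(&w.lock);
    w.signalled = true;                 /* deliver the "signal" */
    pthread_cond_broadcast(&w.cond);
    while (!w.left_main_loop)           /* let the thread react first */
        pthread_cond_wait(&w.cond, &w.lock);
    w.stop_requested = true;            /* the late stop request */
    pthread_cond_broadcast(&w.cond);
    pthread_mutex_unlock(&w.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.lock);
    return w.exited_before_stop;
}
```

Calling exits_before_stop(false) reports the early exit and exits_before_stop(true) does not. In user space the early exit is harmless because pthread_join() can still collect a finished thread; the stack trace in this bug suggests kthread_stop was not so forgiving.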
Mirroring Business Justification manually, as automated system is still being fixed:
------- Additional Comment #57 From Shirish S. Pargaonkar 2008-05-30 10:41 EDT ------- Internal Only
This bug manifests as a system crash which can result in unexpected
interruption for customers who are expecting systems to run smoothly.
It can result in data loss and jeopardise data integrity, as filesystem
operations can get interrupted due to system crash.
This is a race condition problem, so any environment which has operations
going on that are stressing the system can run into the race condition.
Operations such as mount and unmount and accessing data over remote file
systems over cifs/smb filesystems can race to cause system crashes.
On a highly active or a stressed system, race conditions are unpredictable,
so the crash can happen within a short period of time or it can take longer.
The race is between cifsd daemon and mount and unmount operations.
This problem is not specific to a particular platform or release level;
the bug has been present all along, can manifest as a system crash, and
should be fixed in all release levels.
------- Additional Comment #60 From Emily J. Ratliff 2008-05-30 14:22 EDT
Given Shirish's excellent description of the customer impact, can we boost the
priority of the RH IT from medium to high?
How often is this being hit? I'd prefer to let this have more time upstream
before we add it to RHEL, but if you're hitting it regularly then I'll
reconsider that...
Also, it would be helpful to know whether IBM has tested this patch and can
confirm that it fixes the problem they've seen.
Mirroring system still appears broken:
------- Additional Comment #65 From Shirish S. Pargaonkar 2008-06-02 07:53 EDT ------- Internal Only
(In reply to comment #62)
> How often is this being hit? I'd prefer to let this have more time
> upstream before we add it to RHEL, but if you're hitting it regularly then
> I'll reconsider that...
This problem can be recreated fairly consistently. IBM bugzilla has another
bug open for this race-condition-related crash; this patch fixes that
problem as well.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
A few hours after I posted this patch internally, Steve French found a problem
with it. If the server just closes the connection on a Negotiate Protocol error,
then the thread can hang indefinitely without coming down. There's a one line
fix that's been pushed upstream to Linus, and we should probably also take it
for RHEL. I plan to repost this in the next day or so.
Created attachment 309852
updated patch -- wake up response_q before going to sleep
Updated patch. Wake up the response_q before going to sleep. This prevents
deadlock when a server just closes the connection during session setup.
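The lost wake-up can be modelled in user space too. Again this is only a pthread sketch under assumptions, not the kernel change: struct conn, wake_before_sleep and mount_side_woken() are invented names, response_q here is a plain condition variable, and a one-second timeout stands in for "hangs indefinitely". A sleeping waiter only rechecks its condition when something wakes it, so the model counts wake-ups explicitly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical user-space model; these names are not from the CIFS code. */
struct conn {
    pthread_mutex_t lock;
    pthread_cond_t response_q; /* where the mount side waits for a reply */
    pthread_cond_t ctl;        /* handshake and stop requests */
    bool waiter_asleep;        /* mount side is blocked on response_q */
    bool stop_requested;       /* models kthread_should_stop() */
    int wakeups;               /* wake-ups delivered on response_q */
    bool wake_before_sleep;    /* true = wake waiters before parking */
};

static void *demux_main(void *arg)
{
    struct conn *c = arg;

    pthread_mutex_lock(&c->lock);
    /* Make the race reproducible: wait until the mount side is asleep. */
    while (!c->waiter_asleep)
        pthread_cond_wait(&c->ctl, &c->lock);

    /* The server closed the connection, so no reply will ever arrive. */
    if (c->wake_before_sleep) {
        c->wakeups++;
        pthread_cond_broadcast(&c->response_q);
    }

    /* Park until the stop request arrives. */
    while (!c->stop_requested)
        pthread_cond_wait(&c->ctl, &c->lock);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Returns true if the mount side was woken, false if it timed out. */
bool mount_side_woken(bool wake_before_sleep)
{
    struct conn c = { .wake_before_sleep = wake_before_sleep };
    struct timespec deadline;
    pthread_t tid;
    bool woken;
    int rc;

    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.response_q, NULL);
    pthread_cond_init(&c.ctl, NULL);
    rc = pthread_create(&tid, NULL, demux_main, &c);
    assert(rc == 0);

    /* The timeout stands in for a wait that would otherwise never end. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;

    pthread_mutex_lock(&c.lock);
    c.waiter_asleep = true;
    pthread_cond_broadcast(&c.ctl);
    rc = 0;
    while (c.wakeups == 0 && rc == 0)
        rc = pthread_cond_timedwait(&c.response_q, &c.lock, &deadline);
    woken = c.wakeups > 0;

    /* Only after waking (or giving up) can the mount side stop the thread. */
    c.stop_requested = true;
    pthread_cond_broadcast(&c.ctl);
    pthread_mutex_unlock(&c.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&c.ctl);
    pthread_cond_destroy(&c.response_q);
    pthread_mutex_destroy(&c.lock);
    return woken;
}
```

In this model mount_side_woken(true) returns promptly, while mount_side_woken(false) only returns because of the artificial timeout. Without the timeout the two sides would wait on each other: the thread for a stop request, the mount side for a wake-up that never comes.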
Committed in 75.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
~ Final Notice ~ Testing Phase Ending Soon
This bug should have been fixed in the latest RHEL 4.7 Release Candidate,
available **now** on partners.redhat.com.
If you have already verified that this bug has been properly fixed or have found
any issues, please provide detailed feedback of your testing results, including
the package version and snapshot tested. The bug status will be updated for you,
based on the returned testing results.
Contact your Partner Manager with any additional questions.
PartnerVerified. See IT#181036.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
Partners, I would like to thank you all for your participation in assuring the
quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.