Bug 444865 - oops in cifs module while trying to stop a thread (kthread_stop) during filesystem mount
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: ppc64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 446932
Depends On:
Blocks: 442789
 
Reported: 2008-05-01 11:29 UTC by Jeff Layton
Modified: 2018-10-19 18:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 19:56:44 UTC
Target Upstream Version:
Embargoed:


Attachments
updated patch (3.92 KB, patch)
2008-06-19 15:29 UTC, Jeff Layton


Links
System | ID | Private | Priority | Status | Summary | Last Updated
IBM Linux Technology Center | 43953 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2009:0225 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update | 2009-01-20 16:06:24 UTC

Description Jeff Layton 2008-05-01 11:29:05 UTC
+++ This bug was initially created as a clone of Bug #442789 +++

Description of problem:

System panic

Version-Release number of selected component (if applicable):

cifs 1.50cRH

How reproducible:

Happened once.

Steps to Reproduce:
(none provided)
  
Actual results:

System Panic with this stack trace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
    pc: c0000000000a1f74: .kfree+0x8c/0xfc
    lr: c00000000005d8d8: .free_task+0x30/0x60
    sp: c00000002e50b610
   msr: 8000000000001032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000000a07ef520
  paca    = 0xc000000000404800
    pid   = 23924, comm = mount.cifs
0:mon> t
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace


Expected results:

System tests/testsuite keep running.

Additional info:

This happens in connect.c, in the cifs_mount function, at this piece of code:

                                force_sig(SIGKILL, srvTcp->tsk);
                                tsk = srvTcp->tsk;
                                if (tsk)
                                        kthread_stop(tsk);            <---

-- Additional comment from jlayton on 2008-04-16 16:20 EST --
Created an attachment (id=302668)
proposed upstream patch

Proposed patch -- only lightly tested.

I think that the problem here is that cifs_demultiplex_thread is allowed to
exit when signalled or if kthread_should_stop returns true. It should actually
only be allowed to exit when kthread_should_stop returns true. That should
prevent this panic.

Shaggy asked whether this patch might cause us to hang on the second pass into
kernel_recvmsg. I don't think that it will since the signal should still be
pending when we return from the first kernel_recvmsg call, so the next call
into it should return quickly.

The light testing I've done seems to indicate that that is the case. A umount
proceeded quickly and didn't hang.


-- Additional comment from jlayton on 2008-04-17 15:11 EST --
I don't think that patch is what we want. We want to allow the thread to
start coming down in some cases, but not to actually exit until after
kthread_stop is called.

I'm working on a patchset for upstream that should (hopefully) close these races.
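
For illustration only, here is a minimal user-space sketch of the rule described above: a worker may start coming down on its own (here, when "signalled"), but it parks and does not return until its owner explicitly asks it to stop. This is a pthread analogue, not the cifs patch. The struct and every function name (worker_start, worker_signal, worker_wait_parked, worker_stop) are hypothetical; the signalled flag stands in for a pending SIGKILL and the should_stop flag stands in for kthread_should_stop(). Joining an already-exited pthread is safe, unlike the kernel case in this report, so the sketch shows only the ordering, not the oops.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kernel thread and its owner. */
struct worker {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool signalled;    /* stands in for a pending SIGKILL */
    bool should_stop;  /* stands in for kthread_should_stop() */
    bool parked;       /* teardown begun, waiting for the stop request */
    bool finished;     /* set just before the thread function returns */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(&w->lock);

    /* Main loop: either a signal or a stop request ends useful work. */
    while (!w->signalled && !w->should_stop)
        pthread_cond_wait(&w->wake, &w->lock);

    /* Teardown has begun, but do not return yet: park until the owner asks. */
    w->parked = true;
    pthread_cond_broadcast(&w->wake);
    while (!w->should_stop)
        pthread_cond_wait(&w->wake, &w->lock);

    w->finished = true;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

static void worker_start(struct worker *w)
{
    w->signalled = w->should_stop = w->parked = w->finished = false;
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    int rc = pthread_create(&w->thread, NULL, worker_main, w);
    assert(rc == 0);
}

/* Analogue of force_sig(): ask the worker to begin coming down. */
static void worker_signal(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->signalled = true;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Block until the worker has left its main loop and parked. */
static void worker_wait_parked(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    while (!w->parked)
        pthread_cond_wait(&w->wake, &w->lock);
    pthread_mutex_unlock(&w->lock);
}

/* Analogue of kthread_stop(): request the exit, then wait for it. */
static void worker_stop(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->should_stop = true;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
}
```

A caller would run worker_start(), then worker_signal(), and could observe via worker_wait_parked() that the worker is parked but not finished; only worker_stop() lets it return. The point of the design is that the owner's handle to the worker stays valid for as long as the owner might still call the stop routine.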


-- Additional comment from jlayton on 2008-04-17 15:15 EST --
This doesn't really appear to be a regression, AFAICT. This looks to be a
long-standing problem that Shirish just now happened to hit.

-- Additional comment from jlayton on 2008-04-18 06:50 EST --
I've sent an initial patchset upstream that I think will fix this, awaiting
comments on it there...

Comment 1 Jeff Layton 2008-05-20 13:17:44 UTC
*** Bug 446932 has been marked as a duplicate of this bug. ***

Comment 3 RHEL Program Management 2008-06-10 14:07:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Jeff Layton 2008-06-11 17:32:03 UTC
A few hours after I posted this patch internally, Steve French found a problem
with it. If the server just closes the connection on a Negotiate Protocol error,
then the thread can hang indefinitely without coming down. There's a one line
fix that's been pushed upstream to Linus, and we should probably also take it
for RHEL. I plan to repost this in the next day or so.


Comment 7 Jeff Layton 2008-06-19 15:29:19 UTC
Created attachment 309854 [details]
updated patch

Updated patch. Wake up the response_q before going to sleep. This prevents
deadlock when a server just closes the connection during session setup.
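
As a rough illustration of "wake up the response_q before going to sleep", the following user-space sketch models one plausible reading of the deadlock: a caller sleeps waiting for a reply, the receiver sees the peer close the connection and parks until it is told to stop, and the only party that would tell it to stop is the sleeping caller. Everything here is an assumption made for the example: the pthread condition variables standing in for kernel wait queues, the CONN_GOOD/CONN_EXITING states, and the names server_start, wait_for_response, and server_stop. Only the name response_q comes from the comment above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum conn_state { CONN_GOOD, CONN_EXITING };  /* hypothetical state names */

struct server {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  response_q;  /* callers waiting for a reply sleep here */
    pthread_cond_t  stop_q;      /* the receiver parks here until stopped */
    enum conn_state state;
    bool have_response;
    bool should_stop;
};

static void *receiver_main(void *arg)
{
    struct server *s = arg;

    /* Pretend the first read returned 0: the peer closed the connection. */
    pthread_mutex_lock(&s->lock);
    s->state = CONN_EXITING;

    /* Wake everyone on response_q before going to sleep, so that a caller
     * still waiting for a reply notices the state change and can proceed
     * to stop this thread. */
    pthread_cond_broadcast(&s->response_q);

    /* Park until the owner explicitly requests the exit. */
    while (!s->should_stop)
        pthread_cond_wait(&s->stop_q, &s->lock);

    pthread_mutex_unlock(&s->lock);
    return NULL;
}

static void server_start(struct server *s)
{
    s->state = CONN_GOOD;
    s->have_response = s->should_stop = false;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->response_q, NULL);
    pthread_cond_init(&s->stop_q, NULL);
    int rc = pthread_create(&s->thread, NULL, receiver_main, s);
    assert(rc == 0);
}

/* Returns 0 if a reply arrived, -1 if the connection went away first. */
static int wait_for_response(struct server *s)
{
    int rc;

    pthread_mutex_lock(&s->lock);
    while (!s->have_response && s->state != CONN_EXITING)
        pthread_cond_wait(&s->response_q, &s->lock);
    rc = s->have_response ? 0 : -1;
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Analogue of kthread_stop(): request the exit, then wait for it. */
static void server_stop(struct server *s)
{
    pthread_mutex_lock(&s->lock);
    s->should_stop = true;
    pthread_cond_signal(&s->stop_q);
    pthread_mutex_unlock(&s->lock);
    pthread_join(s->thread, NULL);
}
```

In this model, a caller that is already asleep in wait_for_response() when the connection closes returns -1 only because of the broadcast; without it, the caller would sleep forever and server_stop() would never be reached.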

Comment 8 Brad Peters 2008-07-11 22:05:27 UTC
Jeff, please let me know if anything is needed from IBM to get this patch into
5.3.  Thanks!

Comment 9 Don Zickus 2008-07-23 18:55:20 UTC
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 errata-xmlrpc 2009-01-20 19:56:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

