Bug 444865 - oops in cifs module while trying to stop a thread (kthread_stop) during filesystem mount
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel   
Version: 5.2
Hardware: ppc64 Linux
Target Milestone: rc
Assignee: Jeff Layton
QA Contact: Martin Jenner
Duplicates: 446932
Depends On:
Blocks: 442789
Reported: 2008-05-01 11:29 UTC by Jeff Layton
Modified: 2018-10-19 18:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-01-20 19:56:44 UTC

Attachments
updated patch (3.92 KB, patch)
2008-06-19 15:29 UTC, Jeff Layton

External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2009:0225 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update | 2009-01-20 16:06:24 UTC
IBM Linux Technology Center | 43953 | None | None | None | Never

Description Jeff Layton 2008-05-01 11:29:05 UTC
+++ This bug was initially created as a clone of Bug #442789 +++

Description of problem:

System panic

Version-Release number of selected component (if applicable):

cifs 1.50cRH

How reproducible:

Happened once.

Steps to Reproduce:
Actual results:

System Panic with this stack trace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
    pc: c0000000000a1f74: .kfree+0x8c/0xfc
    lr: c00000000005d8d8: .free_task+0x30/0x60
    sp: c00000002e50b610
   msr: 8000000000001032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000000a07ef520
  paca    = 0xc000000000404800
    pid   = 23924, comm = mount.cifs
0:mon> t
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace

Expected results:

System tests/test suite keep running.

Additional info:

This happens in connect.c, in the cifs_mount function, at this piece of code:

                                force_sig(SIGKILL, srvTcp->tsk);
                                tsk = srvTcp->tsk;
                                if (tsk)
                                        kthread_stop(tsk);            <---

-- Additional comment from jlayton@redhat.com on 2008-04-16 16:20 EST --
Created an attachment (id=302668)
proposed upstream patch

Proposed patch -- only lightly tested.

I think that the problem here is that cifs_demultiplex_thread is allowed to
exit when signalled or if kthread_should_stop returns true. It should actually
only be allowed to exit when kthread_should_stop returns true. That should
prevent this panic.

Shaggy asked whether this patch might cause us to hang on the second pass into
kernel_recvmsg. I don't think that it will since the signal should still be
pending when we return from the first kernel_recvmsg call, so the next call
into it should return quickly.

The light testing I've done seems to indicate that that is the case. A umount
proceeded quickly and didn't hang.

-- Additional comment from jlayton@redhat.com on 2008-04-17 15:11 EST --
I don't think that patch is what we want. We want to allow the thread to
start coming down in some cases, but not to actually exit until after
kthread_stop is called.

I'm working on a patchset for upstream that should (hopefully) close these races.
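
The lifetime rule discussed in the comments above (the thread may start coming down on a signal, but must not actually exit until kthread_stop has been called) can be illustrated with a small single-threaded toy model. This is only a sketch under loud assumptions: it is plain user-space C, not kernel code, and every name in it (toy_task, toy_thread_step, toy_kthread_stop, toy_mount_error_path, the two policy constants) is hypothetical. It models the outcome of the race deterministically rather than reproducing the real scheduler, task_struct reference counting, or the cifs patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel task; NOT the real struct task_struct. */
struct toy_task {
    bool signal_pending;  /* models a pending SIGKILL */
    bool should_stop;     /* models what kthread_should_stop() would report */
    bool exited;          /* the thread function has returned */
    bool freed;           /* the task memory has been released */
};

enum exit_policy {
    EXIT_ON_SIGNAL_OR_STOP,  /* the behaviour described as racy */
    EXIT_ON_STOP_ONLY        /* the behaviour the comments ask for */
};

/* One scheduling step of the toy thread under the given policy. */
void toy_thread_step(struct toy_task *t, enum exit_policy policy)
{
    bool may_exit;

    if (t->exited)
        return;
    may_exit = t->should_stop ||
               (policy == EXIT_ON_SIGNAL_OR_STOP && t->signal_pending);
    if (may_exit) {
        t->exited = true;
        /* Assumption of the model: if nobody has asked for a stop yet,
         * nobody holds the task, so it is released as soon as it exits. */
        if (!t->should_stop)
            t->freed = true;
    }
}

/* Returns 0 if the stop was safe, -1 if it touched an already-freed task. */
int toy_kthread_stop(struct toy_task *t, enum exit_policy policy)
{
    if (t->freed)
        return -1;               /* models the oops: the task is already gone */
    t->should_stop = true;
    toy_thread_step(t, policy);  /* the thread sees the stop request and exits */
    t->freed = true;             /* the stopper releases the task */
    return 0;
}

/* Replays the mount error path: signal the thread, let it run, then stop it. */
int toy_mount_error_path(enum exit_policy policy)
{
    struct toy_task t = { false, false, false, false };

    t.signal_pending = true;       /* stands in for force_sig(SIGKILL, ...) */
    toy_thread_step(&t, policy);   /* thread is scheduled before the stop */
    return toy_kthread_stop(&t, policy);
}

/* Self-check: the racy policy trips the model, the stop-only policy does not. */
void toy_self_check(void)
{
    assert(toy_mount_error_path(EXIT_ON_SIGNAL_OR_STOP) == -1);
    assert(toy_mount_error_path(EXIT_ON_STOP_ONLY) == 0);
}
```

In this model the only difference between the two policies is whether a pending signal alone lets the thread exit. With the stop-only policy the task is still present when the stop routine runs, which is the property the comments say the upstream patchset is meant to guarantee.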

-- Additional comment from jlayton@redhat.com on 2008-04-17 15:15 EST --
This doesn't really appear to be a regression, AFAICT. This looks to be a
long-standing problem that Shirish just now happened to hit.

-- Additional comment from jlayton@redhat.com on 2008-04-18 06:50 EST --
I've sent an initial patchset upstream that I think will fix this, awaiting
comments on it there...

Comment 1 Jeff Layton 2008-05-20 13:17:44 UTC
*** Bug 446932 has been marked as a duplicate of this bug. ***

Comment 3 RHEL Product and Program Management 2008-06-10 14:07:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Jeff Layton 2008-06-11 17:32:03 UTC
A few hours after I posted this patch internally, Steve French found a problem
with it. If the server just closes the connection on a Negotiate Protocol error,
then the thread can hang indefinitely without coming down. There's a one line
fix that's been pushed upstream to Linus, and we should probably also take it
for RHEL. I plan to repost this in the next day or so.

Comment 7 Jeff Layton 2008-06-19 15:29:19 UTC
Created attachment 309854 [details]
updated patch

Updated patch. Wake up the response_q before going to sleep. This prevents
deadlock when a server just closes the connection during session setup.
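
One way to picture why the wake-up matters is another toy model, again in user-space C and again only a sketch. It assumes (this is a reading of the comment above, not something taken from the patch itself) that the mount task is asleep waiting for a response when the server closes the connection, and that the receive thread then sleeps until it is told to stop. All names (wq_model, wq_demux_on_close, wq_requester_run, wq_session_setup_fails) are hypothetical, and the real response_q handling in cifs is not reproduced here.

```c
#include <stdbool.h>

/* Toy state for one mount attempt; names are illustrative, not cifs fields. */
struct wq_model {
    bool requester_asleep;  /* mount task sleeping on the response queue */
    bool demux_parked;      /* receive thread sleeping until told to stop */
    bool stop_requested;    /* set once the mount task asks for a stop */
};

/* Receive thread reacts to the server closing the connection. */
void wq_demux_on_close(struct wq_model *m, bool wake_before_sleep)
{
    if (wake_before_sleep)
        m->requester_asleep = false;  /* models waking the response_q waiters */
    m->demux_parked = true;           /* then wait for the stop request */
}

/* Mount task: it can only make progress if something woke it up. */
void wq_requester_run(struct wq_model *m)
{
    if (m->requester_asleep)
        return;                       /* still asleep, cannot request a stop */
    m->stop_requested = true;         /* sees the failure, tears the thread down */
    m->demux_parked = false;          /* the parked thread notices and exits */
}

/* Returns true if teardown completes, false if both sides stay asleep. */
bool wq_session_setup_fails(bool wake_before_sleep)
{
    struct wq_model m = { true, false, false };

    wq_demux_on_close(&m, wake_before_sleep);
    wq_requester_run(&m);
    return m.stop_requested && !m.demux_parked;
}
```

Calling wq_session_setup_fails(false) leaves both sides asleep in the model, which corresponds to the hang described in comment 5; calling it with true lets the mount task run and finish the teardown.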

Comment 8 Brad Peters 2008-07-11 22:05:27 UTC
Jeff, please let me know if anything is needed from IBM to get this patch into
5.3.  Thanks!

Comment 9 Don Zickus 2008-07-23 18:55:20 UTC
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 errata-xmlrpc 2009-01-20 19:56:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

