Bug 444865

Summary: oops in cifs module while trying to stop a thread (kthread_stop) during filesystem mount
Product: Red Hat Enterprise Linux 5 Reporter: Jeff Layton <jlayton>
Component: kernel Assignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 5.2 CC: shirishp, staubach, steved, tao
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 19:56:44 UTC Type: ---
Bug Depends On:    
Bug Blocks: 442789    
Attachments:
Description Flags
updated patch none

Description Jeff Layton 2008-05-01 11:29:05 UTC
+++ This bug was initially created as a clone of Bug #442789 +++

Description of problem:

System panic

Version-Release number of selected component (if applicable):

cifs 1.50cRH

How reproducible:

Happened once.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

System Panic with this stack trace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
    pc: c0000000000a1f74: .kfree+0x8c/0xfc
    lr: c00000000005d8d8: .free_task+0x30/0x60
    sp: c00000002e50b610
   msr: 8000000000001032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000000a07ef520
  paca    = 0xc000000000404800
    pid   = 23924, comm = mount.cifs
0:mon> t
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace


Expected results:

System tests/testsuite keep running.

Additional info:

This happens in connect.c, in the cifs_mount function, at this piece of code:

                                force_sig(SIGKILL, srvTcp->tsk);
                                tsk = srvTcp->tsk;
                                if (tsk)
                                        kthread_stop(tsk);            <---

-- Additional comment from jlayton on 2008-04-16 16:20 EST --
Created an attachment (id=302668)
proposed upstream patch

Proposed patch -- only lightly tested.

I think that the problem here is that cifs_demultiplex_thread is allowed to
exit when signalled or if kthread_should_stop returns true. It should actually
only be allowed to exit when kthread_should_stop returns true. That should
prevent this panic.

Shaggy asked whether this patch might cause us to hang on the second pass into
kernel_recvmsg. I don't think that it will since the signal should still be
pending when we return from the first kernel_recvmsg call, so the next call
into it should return quickly.

The light testing I've done seems to indicate that that is the case. A umount
proceeded quickly and didn't hang.
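
A minimal userspace sketch of the exit condition discussed in this comment, under stated assumptions. This is not the attached patch and not kernel code: `demux_flags`, `may_exit_before_patch`, `may_exit_after_patch` and `task_alive_at_kthread_stop` are hypothetical names used only to model the worst-case ordering of the `force_sig()` / `kthread_stop()` pair quoted in the description.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code: the two inputs the thread checks. */
struct demux_flags {
    bool signal_pending;  /* stands in for a pending SIGKILL from force_sig() */
    bool stop_requested;  /* stands in for kthread_should_stop() returning true */
};

/* Exit rule described as the problem: a signal alone is enough to exit. */
bool may_exit_before_patch(const struct demux_flags *f)
{
    return f->signal_pending || f->stop_requested;
}

/* Exit rule the proposed patch aims for: only a stop request counts. */
bool may_exit_after_patch(const struct demux_flags *f)
{
    return f->stop_requested;
}

/*
 * Assumed worst-case ordering of the cifs_mount snippet: the thread gets
 * to run between force_sig() and kthread_stop(). Returns true if the task
 * would still be around when kthread_stop() is reached.
 */
bool task_alive_at_kthread_stop(bool (*may_exit)(const struct demux_flags *))
{
    struct demux_flags f = { .signal_pending = false, .stop_requested = false };

    f.signal_pending = true;  /* force_sig(SIGKILL, tsk) */
    return !may_exit(&f);     /* thread is scheduled before kthread_stop() */
}
```

With the first rule the modelled task is already gone by the time `kthread_stop()` would run, which matches the kind of failure in the stack trace; with the second rule it is still there.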


-- Additional comment from jlayton on 2008-04-17 15:11 EST --
I don't think that patch is what we want. We want to allow the thread to
start coming down in some cases, but not to actually exit until after
kthread_stop is called.
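
The behaviour described here can be sketched as a small state machine. This is only an illustration, assuming three phases; `demux_phase` and `demux_next_phase` are hypothetical names and do not come from the upstream patchset.

```c
#include <stdbool.h>

/* Hypothetical phases of the demultiplex thread in this model. */
enum demux_phase { DEMUX_RUNNING, DEMUX_PARKED, DEMUX_EXITED };

/*
 * One step of the model. The thread may begin tearing down on its own
 * (teardown_wanted), but it only leaves the parked phase once a stop has
 * been requested, i.e. once kthread_stop() has been called on it.
 */
enum demux_phase demux_next_phase(enum demux_phase phase,
                                  bool teardown_wanted,
                                  bool stop_requested)
{
    switch (phase) {
    case DEMUX_RUNNING:
        /* Either event starts the teardown, but neither exits directly. */
        return (teardown_wanted || stop_requested) ? DEMUX_PARKED
                                                   : DEMUX_RUNNING;
    case DEMUX_PARKED:
        /* Teardown is done; wait here until the stop request arrives. */
        return stop_requested ? DEMUX_EXITED : DEMUX_PARKED;
    default:
        return DEMUX_EXITED;
    }
}
```

In this model a signal or connection error moves the thread to the parked phase and no further, so the task is guaranteed to exist when the stop request is finally made.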

I'm working on a patchset for upstream that should (hopefully) close these races.


-- Additional comment from jlayton on 2008-04-17 15:15 EST --
This doesn't really appear to be a regression, AFAICT. This looks to be a
long-standing problem that Shirish just now happened to hit.

-- Additional comment from jlayton on 2008-04-18 06:50 EST --
I've sent an initial patchset upstream that I think will fix this, awaiting
comments on it there...

Comment 1 Jeff Layton 2008-05-20 13:17:44 UTC
*** Bug 446932 has been marked as a duplicate of this bug. ***

Comment 3 RHEL Program Management 2008-06-10 14:07:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Jeff Layton 2008-06-11 17:32:03 UTC
A few hours after I posted this patch internally, Steve French found a problem
with it. If the server just closes the connection on a Negotiate Protocol error,
then the thread can hang indefinitely without coming down. There's a one line
fix that's been pushed upstream to Linus, and we should probably also take it
for RHEL. I plan to repost this in the next day or so.


Comment 7 Jeff Layton 2008-06-19 15:29:19 UTC
Created attachment 309854
updated patch

Updated patch. Wake up the response_q before going to sleep. This prevents
deadlock when a server just closes the connection during session setup.
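
A rough model of why the wake-up matters, under assumptions that are not spelled out in this report: it assumes the mount path is asleep on response_q waiting for a reply, and that it is the mount path that eventually calls kthread_stop(). `teardown_finishes` is a hypothetical name and the code is a userspace illustration, not the attached patch.

```c
#include <stdbool.h>

/*
 * Returns true if the modelled teardown completes, false if both sides
 * end up waiting on each other.
 */
bool teardown_finishes(bool wake_response_q_first)
{
    bool mounter_asleep = true;   /* assumed: mount path waits on response_q */
    bool stop_requested = false;  /* stands in for kthread_should_stop() */

    /* The server closes the connection; the thread starts coming down. */
    if (wake_response_q_first)
        mounter_asleep = false;   /* the wake-up added by the updated patch */

    /* The thread now sleeps until a stop is requested. */
    if (!mounter_asleep)
        stop_requested = true;    /* mount path gives up and calls kthread_stop() */

    return stop_requested;        /* the thread can exit only once this is set */
}
```

Without the wake-up the modelled waiter never runs, the stop request is never made, and the thread sleeps forever, which is the hang described in comment 5.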

Comment 8 Brad Peters 2008-07-11 22:05:27 UTC
Jeff, please let me know if anything is needed from IBM to get this patch into
5.3.  Thanks!

Comment 9 Don Zickus 2008-07-23 18:55:20 UTC
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 15 errata-xmlrpc 2009-01-20 19:56:44 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html