Bug 442789 - oops in cifs module while trying to stop a thread (kthread_stop) during filesystem mount
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: ppc64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jeff Layton
Keywords: OtherQA
Depends On: 444865
Blocks:
Reported: 2008-04-16 15:57 EDT by Shirish S. Pargaonkar
Modified: 2011-01-24 17:54 EST

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Last Closed: 2008-07-24 15:29:03 EDT

Attachments
proposed upstream patch (960 bytes, patch)
2008-04-16 16:20 EDT, Jeff Layton
patch -- don't allow cifsd to exit until kthread_stop is called (4.07 KB, patch)
2008-05-01 07:36 EDT, Jeff Layton
updated patch -- wake up response_q before going to sleep (3.97 KB, patch)
2008-06-19 11:28 EDT, Jeff Layton


External Trackers
Tracker                        ID     Priority  Status  Summary  Last Updated
IBM Linux Technology Center    43875  None      None    None     Never

Description Shirish S. Pargaonkar 2008-04-16 15:57:48 EDT
Description of problem:

System panic

Version-Release number of selected component (if applicable):

cifs 1.50cRH

How reproducible:

Happened once.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

System Panic with this stack trace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
    pc: c0000000000a1f74: .kfree+0x8c/0xfc
    lr: c00000000005d8d8: .free_task+0x30/0x60
    sp: c00000002e50b610
   msr: 8000000000001032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000000a07ef520
  paca    = 0xc000000000404800
    pid   = 23924, comm = mount.cifs
0:mon> t
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace


Expected results:

The system test suite keeps running.

Additional info:

This happens in connect.c, in the cifs_mount function, at this piece of code:

                                force_sig(SIGKILL, srvTcp->tsk);
                                tsk = srvTcp->tsk;
                                if (tsk)
                                        kthread_stop(tsk);            <---
Comment 1 Jeff Layton 2008-04-16 16:20:43 EDT
Created attachment 302668
proposed upstream patch

Proposed patch -- only lightly tested.

I think that the problem here is that cifs_demultiplex_thread is allowed to
exit when signalled or if kthread_should_stop returns true. It should actually
only be allowed to exit when kthread_should_stop returns true. That should
prevent this panic.

Shaggy asked whether this patch might cause us to hang on the second pass into
kernel_recvmsg. I don't think that it will since the signal should still be
pending when we return from the first kernel_recvmsg call, so the next call
into it should return quickly.

The light testing I've done seems to indicate that that is the case. A umount
proceeded quickly and didn't hang.
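
For illustration, the exit condition described in this comment can be modelled in userspace with POSIX threads. This is only an analogy under stated assumptions, not the attached patch: `struct worker`, `worker_main`, and `run_stop_demo` are hypothetical names, the `signalled` flag stands in for a pending SIGKILL, and the `should_stop` flag stands in for kthread_should_stop() returning true. The worker notices the signal but leaves its loop only once a stop has been requested.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the state of the cifsd thread. */
struct worker {
    atomic_bool signalled;   /* models a pending SIGKILL */
    atomic_bool should_stop; /* models kthread_should_stop() */
    atomic_bool saw_signal;  /* worker has noticed the signal */
    atomic_bool exited;      /* worker is about to return */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    assert(w != NULL);
    /* Only a stop request ends the loop; a signal alone never does. */
    while (!atomic_load(&w->should_stop)) {
        if (atomic_load(&w->signalled))
            atomic_store(&w->saw_signal, true); /* note it, keep running */
        sched_yield();
    }
    atomic_store(&w->exited, true);
    return NULL;
}

/* Returns 0 if the worker survived the signal and exited after the stop. */
int run_stop_demo(void)
{
    struct worker w;
    pthread_t thread;
    bool alive_after_signal;

    atomic_init(&w.signalled, false);
    atomic_init(&w.should_stop, false);
    atomic_init(&w.saw_signal, false);
    atomic_init(&w.exited, false);

    if (pthread_create(&thread, NULL, worker_main, &w) != 0)
        return -1;

    atomic_store(&w.signalled, true);        /* plays the role of force_sig() */
    while (!atomic_load(&w.saw_signal))
        sched_yield();                       /* wait until the worker saw it */
    alive_after_signal = !atomic_load(&w.exited);

    atomic_store(&w.should_stop, true);      /* plays the role of kthread_stop() */
    pthread_join(thread, NULL);

    return (alive_after_signal && atomic_load(&w.exited)) ? 0 : 1;
}
```

Calling `run_stop_demo()` from a test program should return 0. The point of the sketch is that the side requesting the stop can rely on the thread still existing when the request is made.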
Comment 2 Jeff Layton 2008-04-17 15:11:08 EDT
I don't think that patch is what we want. We want to allow the thread to
start coming down in some cases, but not to actually exit until after
kthread_stop is called.

I'm working on a patchset for upstream that should (hopefully) close these races.
Comment 3 Jeff Layton 2008-04-17 15:15:48 EDT
This doesn't really appear to be a regression, AFAICT. This looks to be a
long-standing problem that Shirish just now happened to hit.
Comment 4 Jeff Layton 2008-04-18 06:50:51 EDT
I've sent an initial patchset upstream that I think will fix this, awaiting
comments on it there...
Comment 5 Jeff Layton 2008-05-01 07:34:24 EDT
Actually, I think this is a regression. I think this problem was introduced when
cifsd was changed to use the kthread interface instead of kernel_thread.

It's not something someone is likely to hit frequently. Shirish only hit it when
attempting to mount a share while the box was being stressed.

Either way, once we settle on an upstream patch, I'll see about getting this
proposed for RHEL.
Comment 6 Jeff Layton 2008-05-01 07:36:44 EDT
Created attachment 304304
patch -- don't allow cifsd to exit until kthread_stop is called

This is the patch I'm currently working with. This just makes cifsd go to sleep
until kthread_stop is actually called after it exits the main while loop.
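
As a rough userspace sketch of the "sleep until stopped" idea in this comment, the following uses a POSIX mutex and condition variable. It is an analogy, not the attached patch: `struct parked_worker`, `parked_worker_main`, and `run_park_demo` are hypothetical names, and `stop_requested` stands in for kthread_stop() having been called. In kernel code the same effect is usually obtained by looping on kthread_should_stop() around schedule() with the task state set via set_current_state().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a thread that parks after its main loop. */
struct parked_worker {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool loop_done;      /* main loop has ended, e.g. the connection was lost */
    bool stop_requested; /* models kthread_stop() having been called */
    bool exited;         /* worker is about to return */
};

static void *parked_worker_main(void *arg)
{
    struct parked_worker *w = arg;

    assert(w != NULL);
    pthread_mutex_lock(&w->lock);
    /* The real main loop would run here; this sketch ends it at once. */
    w->loop_done = true;
    pthread_cond_broadcast(&w->cond);

    /* Park: do not return until the owner has asked us to stop. */
    while (!w->stop_requested)
        pthread_cond_wait(&w->cond, &w->lock);

    w->exited = true;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Returns 0 if the worker was still alive after its loop ended. */
int run_park_demo(void)
{
    struct parked_worker w = { .loop_done = false };
    pthread_t thread;
    bool alive_while_parked;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.cond, NULL);

    if (pthread_create(&thread, NULL, parked_worker_main, &w) != 0)
        return -1;

    pthread_mutex_lock(&w.lock);
    while (!w.loop_done)
        pthread_cond_wait(&w.cond, &w.lock);
    alive_while_parked = !w.exited;   /* loop is over, thread has not exited */
    w.stop_requested = true;          /* plays the role of kthread_stop() */
    pthread_cond_broadcast(&w.cond);
    pthread_mutex_unlock(&w.lock);

    pthread_join(thread, NULL);
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.lock);
    return (alive_while_parked && w.exited) ? 0 : 1;
}
```

`run_park_demo()` should return 0: the worker finishes its loop early but is still present when the stop request arrives.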
Comment 7 Brad Peters 2008-05-30 11:28:14 EDT
Mirroring Business Justification manually, as automated system is still being fixed:

 ------- Additional Comment #57 From Shirish S. Pargaonkar  2008-05-30 10:41 EDT
 -------     Internal Only

This bug manifests as a system crash, which can cause unexpected
interruptions for customers who expect their systems to run smoothly.
It can result in data loss and jeopardize data integrity, because filesystem
operations can be interrupted by the crash.

This is a race condition, so any environment running operations that
stress the system can hit it. Operations such as mount, unmount, and
data access on remote cifs/smb filesystems can race and crash the system.
On a highly active or stressed system, race conditions are unpredictable,
so the crash can happen within a short period of time or it can take longer.

The race is between the cifsd daemon and mount and unmount operations.
This problem is not specific to a particular platform or release level.
The bug has been present all along and can manifest as a system crash, so it
should be fixed in all release levels.
Comment 8 Brad Peters 2008-05-30 16:16:08 EDT
 ------- Additional Comment #60 From Emily J. Ratliff  2008-05-30 14:22 EDT -------

Hi Ben,

Given Shirish's excellent description of the customer impact, can we boost the
priority of the RH IT from medium to high?

Emily
Comment 9 Jeff Layton 2008-06-02 07:06:12 EDT
How often is this being hit? I'd prefer to let this have more time upstream
before we add it to RHEL, but if you're hitting it regularly then I'll
reconsider that...
Comment 10 Jeff Layton 2008-06-02 07:09:05 EDT
Also, it would be helpful to know whether IBM has tested this patch and can
confirm that it fixes the problem they've seen.
Comment 11 Brad Peters 2008-06-09 11:38:17 EDT
Mirroring system still appears broken:

 ------- Additional Comment #65 From Shirish S. Pargaonkar  2008-06-02 07:53 EDT
 -------     Internal Only

(In reply to comment #62)
>  How often is this being hit? I'd prefer to let this have more time
> upstream before we add it to RHEL, but if you're hitting it regularly then 
> I'll reconsider that...

This problem can be recreated fairly consistently. IBM bugzilla has another
bug open for a crash related to this race condition, and this patch fixes that
problem as well.
Comment 14 RHEL Product and Program Management 2008-06-10 10:26:45 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 16 Jeff Layton 2008-06-11 13:31:45 EDT
A few hours after I posted this patch internally, Steve French found a problem
with it. If the server just closes the connection on a Negotiate Protocol error,
then the thread can hang indefinitely without coming down. There's a one line
fix that's been pushed upstream to Linus, and we should probably also take it
for RHEL. I plan to repost this in the next day or so.
Comment 20 Jeff Layton 2008-06-19 11:28:37 EDT
Created attachment 309852
updated patch -- wake up response_q before going to sleep

Updated patch. Wake up the response_q before going to sleep. This prevents
deadlock when a server just closes the connection during session setup.
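
The ordering in this comment (wake the waiters first, then sleep) can be sketched in userspace with two condition variables. This is an assumption-laden analogy rather than the attached patch: `struct conn`, `demux_main`, and `run_wakeup_demo` are hypothetical names, `response_q` models the queue a requester sleeps on while waiting for a reply, and `stop_q` models the thread sleeping until it is told to stop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a connection shared by two threads. */
struct conn {
    pthread_mutex_t lock;
    pthread_cond_t response_q; /* requester sleeps here awaiting a reply */
    pthread_cond_t stop_q;     /* demux thread parks here awaiting a stop */
    bool conn_dead;            /* server closed the connection */
    bool stop_requested;       /* models kthread_stop() having been called */
};

static void *demux_main(void *arg)
{
    struct conn *c = arg;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    /* The server closed the connection, so the main loop is over. */
    c->conn_dead = true;

    /* Wake anyone waiting for a reply BEFORE parking. Without this the
     * requester could sleep forever and never issue the stop request. */
    pthread_cond_broadcast(&c->response_q);

    while (!c->stop_requested)
        pthread_cond_wait(&c->stop_q, &c->lock);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Returns 0 once the requester was woken and the demux thread was stopped. */
int run_wakeup_demo(void)
{
    struct conn c = { .conn_dead = false };
    pthread_t thread;

    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.response_q, NULL);
    pthread_cond_init(&c.stop_q, NULL);

    if (pthread_create(&thread, NULL, demux_main, &c) != 0)
        return -1;

    pthread_mutex_lock(&c.lock);
    /* Requester: wait for a reply that will never arrive. */
    while (!c.conn_dead)
        pthread_cond_wait(&c.response_q, &c.lock);
    /* Woken with the connection marked dead: give up and stop the thread. */
    c.stop_requested = true;
    pthread_cond_signal(&c.stop_q);
    pthread_mutex_unlock(&c.lock);

    pthread_join(thread, NULL);
    pthread_cond_destroy(&c.stop_q);
    pthread_cond_destroy(&c.response_q);
    pthread_mutex_destroy(&c.lock);
    return 0;
}
```

`run_wakeup_demo()` should return 0 rather than hang. If the broadcast on `response_q` were removed, a requester that began waiting before `conn_dead` was set would have nothing to wake it.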
Comment 22 Vivek Goyal 2008-06-25 17:20:04 EDT
Committed in 75.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 24 Chris Ward 2008-07-14 07:09:16 EDT
~ Final Notice ~ Testing Phase Ending Soon

This bug should have been fixed in the latest RHEL 4.7 Release Candidate,
available **now** on partners.redhat.com.

If you have already verified that this bug has been properly fixed or have found
any issues, please provide detailed feedback of your testing results, including
the package version and snapshot tested. The bug status will be updated for you,
based on the returned testing results.

Contact your Partner Manager with any additional questions.
Comment 26 Chris Ward 2008-07-21 06:00:26 EDT
PartnerVerified. See IT#181036.
Comment 28 errata-xmlrpc 2008-07-24 15:29:03 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
Comment 29 Chris Ward 2008-07-29 03:27:48 EDT
Partners, I would like to thank you all for your participation in assuring the
quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.
