Description of problem:
System panic

Version-Release number of selected component (if applicable):
cifs 1.50cRH

How reproducible:
Happened once.

Steps to Reproduce:
1.
2.
3.

Actual results:
System Panic with this stack trace

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000002e50b390]
    pc: c0000000000a1f74: .kfree+0x8c/0xfc
    lr: c00000000005d8d8: .free_task+0x30/0x60
    sp: c00000002e50b610
   msr: 8000000000001032
   dar: 100100
 dsisr: 40000000
  current = 0xc0000000a07ef520
  paca    = 0xc000000000404800
    pid   = 23924, comm = mount.cifs
0:mon> t
[c00000002e50b6b0] c00000000005d8d8 .free_task+0x30/0x60
[c00000002e50b740] c00000000007f1f4 .kthread_stop+0xf0/0x168
[c00000002e50b7e0] d00000000058dad0 .cifs_mount+0xde0/0x1070 [cifs]
[c00000002e50b990] d00000000057908c .cifs_read_super+0x8c/0x1fc [cifs]
[c00000002e50ba30] d000000000579918 .cifs_get_sb+0x9c/0x124 [cifs]
[c00000002e50bad0] c0000000000ce7d8 .do_kern_mount+0xfc/0x29c
[c00000002e50bb80] c0000000000ef074 .do_new_mount+0x90/0xf0
[c00000002e50bc30] c0000000000efb88 .do_mount+0x1e4/0x22c
[c00000002e50bd60] c0000000000ff4a8 .compat_sys_mount+0x188/0x258
[c00000002e50be30] c000000000011280 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 000000000ff53758
SP (ffffe680) is in userspace

Expected results:
Systems tests/testsuite keeps running.

Additional info:
This happens in connect.c, in cifs_mount function at this piece of code

    force_sig(SIGKILL, srvTcp->tsk);
    tsk = srvTcp->tsk;
    if (tsk)
        kthread_stop(tsk);   <---
Created attachment 302668 [details]
proposed upstream patch

Proposed patch -- only lightly tested. I think the problem here is that cifs_demultiplex_thread is allowed to exit either when it is signalled or when kthread_should_stop returns true. It should only be allowed to exit when kthread_should_stop returns true. That should prevent this panic.

Shaggy asked whether this patch might cause us to hang on the second pass into kernel_recvmsg. I don't think it will, since the signal should still be pending when we return from the first kernel_recvmsg call, so the next call into it should return quickly. The light testing I've done seems to indicate that this is the case: a umount proceeded quickly and didn't hang.
I don't think that patch is what we want. We want to allow the thread to start coming down in some cases, but not to actually exit until after kthread_stop is called. I'm working on a patchset for upstream that should (hopefully) close these races.
This doesn't really appear to be a regression, AFAICT. This looks to be a long-standing problem that Shirish just now happened to hit.
I've sent an initial patchset upstream that I think will fix this, awaiting comments on it there...
Actually, I think this is a regression. I think this problem was introduced when cifsd was changed to use the kthread interface instead of kernel_thread. It's not something someone is likely to hit frequently. Shirish only hit it when attempting to mount a share while the box was being stressed. Either way, once we settle on an upstream patch, I'll see about getting this proposed for RHEL.
Created attachment 304304 [details] patch -- don't allow cifsd to exit until kthread_stop is called This is the patch I'm currently working with. This just makes cifsd go to sleep until kthread_stop is actually called after it exits the main while loop.
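For readers following the ordering argument, here is a minimal single-threaded model of it in plain userspace C. It is only a sketch of the reasoning, not the attached patch: the type `model_cifsd`, the phase names, and both functions are hypothetical, and no kernel API is called. It replays the failing order (signal delivered, cifsd gets to run, then the stop request arrives) and reports whether the stop request still finds a live thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the cifsd lifetime; these are not kernel types. */
enum phase { PHASE_LOOP, PHASE_PARKED, PHASE_EXITED };

struct model_cifsd {
    enum phase phase;
    bool stop_requested;   /* models kthread_should_stop() returning true */
};

/* Advance the modelled thread by one step.
 * park_after_loop == true models "sleep until stop is requested";
 * false models exiting as soon as the main loop is left. */
static void cifsd_step(struct model_cifsd *d, bool signalled,
                       bool park_after_loop)
{
    switch (d->phase) {
    case PHASE_LOOP:
        if (d->stop_requested)
            d->phase = PHASE_EXITED;
        else if (signalled)
            d->phase = park_after_loop ? PHASE_PARKED : PHASE_EXITED;
        break;
    case PHASE_PARKED:
        if (d->stop_requested)
            d->phase = PHASE_EXITED;
        break;
    case PHASE_EXITED:
        break;
    }
}

/* Replay the failing order: signal, thread runs, then the stop request.
 * Returns true if the stop request found a thread that had not exited. */
bool stop_finds_live_thread(bool park_after_loop)
{
    struct model_cifsd d = { PHASE_LOOP, false };
    bool alive;

    cifsd_step(&d, true, park_after_loop);   /* signal seen in main loop */
    alive = (d.phase != PHASE_EXITED);       /* state when stop arrives */

    d.stop_requested = true;                 /* models kthread_stop() */
    cifsd_step(&d, false, park_after_loop);
    assert(d.phase == PHASE_EXITED);         /* thread is gone either way */
    return alive;
}
```

In this model, `stop_finds_live_thread(false)` corresponds to the behaviour that produced the panic (the stop request arrives after the thread is already gone), and `stop_finds_live_thread(true)` corresponds to parking the thread until the stop request.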
Mirroring Business Justification manually, as automated system is still being fixed:

------- Additional Comment #57 From Shirish S. Pargaonkar 2008-05-30 10:41 EDT [reply] -------
Internal Only

This bug manifests as a system crash, which can result in unexpected interruption for customers who are expecting systems to run smoothly. It can result in data loss and jeopardise data integrity, as filesystem operations can get interrupted due to the system crash.

This is a race condition problem, so any environment that has operations going on that are stressing the system can run into the race condition. Operations such as mount and unmount and accessing data on remote file systems over cifs/smb can race to cause system crashes. On a highly active or stressed system, race conditions are unpredictable, so the crash can happen within a short period of time or it can take longer. The race is between the cifsd daemon and mount and unmount operations.

This problem is not specific to a particular platform or release level. The bug has been there to manifest in a system crash and should be fixed in all release levels.
------- Additional Comment #60 From Emily J. Ratliff 2008-05-30 14:22 EDT [reply] Hi Ben, Given Shirish's excellent description of the customer impact, can we boost the priority of the RH IT from medium to high? Emily
How often is this being hit? I'd prefer to let this have more time upstream before we add it to RHEL, but if you're hitting it regularly then I'll reconsider that...
Also, it would be helpful to know whether IBM has tested this patch and can confirm that it fixes the problem they've seen.
Mirroring system still appears broken:

------- Additional Comment #65 From Shirish S. Pargaonkar 2008-06-02 07:53 EDT [reply] -------
Internal Only

(In reply to comment #62)
> How often is this being hit? I'd prefer to let this have more time
> upstream before we add it to RHEL, but if you're hitting it regularly then
> I'll reconsider that...

This problem can be recreated fairly consistently. IBM bugzilla has another bug opened for this race condition related crash, this patch fixes that problem as well.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
A few hours after I posted this patch internally, Steve French found a problem with it. If the server just closes the connection on a Negotiate Protocol error, then the thread can hang indefinitely without coming down. There's a one line fix that's been pushed upstream to Linus, and we should probably also take it for RHEL. I plan to repost this in the next day or so.
Created attachment 309852 [details] updated patch -- wake up response_q before going to sleep Updated patch. Wake up the response_q before going to sleep. This prevents deadlock when a server just closes the connection during session setup.
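The deadlock described here can be sketched with a small single-threaded model in plain userspace C. This is an illustration of the ordering only, written from the description in this comment and not taken from the attached patch: `model_state` and all three functions are hypothetical names, and the assumption is that the mount path is asleep waiting for a response when the server closes the connection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; these fields do not correspond to real kernel structs. */
struct model_state {
    bool mounter_waiting;  /* mount path asleep waiting for a response */
    bool cifsd_parked;     /* cifsd has left its loop and gone to sleep */
    bool stop_requested;   /* models kthread_stop() having been called */
};

/* cifsd leaves its main loop after the server drops the connection.
 * wake_waiters == true models waking the response queue first. */
static void cifsd_leave_loop(struct model_state *s, bool wake_waiters)
{
    if (wake_waiters)
        s->mounter_waiting = false;  /* waiter wakes and sees the error */
    s->cifsd_parked = true;          /* now sleeps until a stop request */
}

/* The mount path can only reach its cleanup (and request the stop)
 * once it is no longer asleep. */
static void mounter_run(struct model_state *s)
{
    if (!s->mounter_waiting)
        s->stop_requested = true;
}

/* Returns true if teardown completes, false if both sides sleep forever. */
bool teardown_completes(bool wake_waiters)
{
    struct model_state s = { true, false, false };

    cifsd_leave_loop(&s, wake_waiters);
    mounter_run(&s);
    assert(s.cifsd_parked);          /* cifsd always parks in this model */
    return s.stop_requested;         /* a parked cifsd exits only on stop */
}
```

Without the wake-up, the model ends with the mount path still asleep and cifsd parked, so nothing ever requests the stop. With the wake-up done before parking, the mount path runs its cleanup and the stop request is made.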
Committed in 75.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
~ Final Notice ~ Testing Phase Ending Soon This bug should have been fixed in the latest RHEL 4.7 Release Candidate, available **now** on partners.redhat.com. If you have already verified that this bug has been properly fixed or have found any issues, please provide detailed feedback of your testing results, including the package version and snapshot tested. The bug status will be updated for you, based on the returned testing results. Contact your Partner Manager with any additional questions.
PartnerVerified. See IT#181036.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html
Partners, I would like to thank you all for your participation in assuring the quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.