458210 – CIFS related kernel panic in find_unc

Summary: CIFS related kernel panic in find_unc

Keywords:
Status:	CLOSED DUPLICATE of bug 462150
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-08-07 00:02 UTC by vaughn skinner
Modified:	2014-06-18 07:38 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-10-07 10:45:58 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Screen dump of kernel panic output (83.68 KB, image/jpeg) 2008-08-07 00:02 UTC, vaughn skinner

Description vaughn skinner 2008-08-07 00:02:20 UTC

Created attachment 313652
Screen dump of kernel panic output

Description of problem:

A CIFS mount to a Windows server was broken.  When I unmounted and remounted it, the kernel panicked and the box locked up.

Version-Release number of selected component (if applicable):

Kernel: 2.6.18-92.1.6
CIFS: 1.50RH

How reproducible:

Not reproducible. Something similar happened once before.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

This is a weak bug report, but it is all I have. Hopefully it helps clarify another bug.

Comment 1 Jeff Layton 2008-08-07 13:13:49 UTC

Vaughn,
   Thanks for the bug report. More info is always nice, but we may be able to determine at least something from the partial stack trace. I need to confirm the exact kernel that you were using though. Are you running:

kernel-2.6.18-92.1.6.el5.i686.rpm

...on this box and not some other variant (e.g. PAE, Xen, etc.)?


Assuming that this is the case, here's the assembly around the crash:

    aeee:       74 14                   je     af04 <find_unc+0x6f>
    aef0:       ff 74 24 04             pushl  0x4(%esp)
    aef4:       ff 70 38                pushl  0x38(%eax)
    aef7:       68 fe 2b 00 00          push   $0x2bfe
    aefc:       e8 fc ff ff ff          call   aefd <find_unc+0x68>
    af01:       83 c4 0c                add    $0xc,%esp
    af04:       8b 45 24                mov    0x24(%ebp),%eax
    af07:       8b 54 24 04             mov    0x4(%esp),%edx
    af0b:       8b 40 1c                mov    0x1c(%eax),%eax
    af0e:       39 50 38                cmp    %edx,0x38(%eax)     <<<<< CRASH
    af11:       0f 85 84 00 00 00       jne    af9b <find_unc+0x106>
    af17:       f6 05 00 00 00 00 01    testb  $0x1,0x0
    af1e:       74 12                   je     af32 <find_unc+0x9d>
    af20:       53                      push   %ebx
    af21:       8d 45 28                lea    0x28(%ebp),%eax
    af24:       50                      push   %eax
    af25:       68 35 2c 00 00          push   $0x2c35

....looking now to see if I can match this up to the C code and determine where it fell down.

Comment 2 Jeff Layton 2008-08-07 14:37:33 UTC

Adding upstream CIFS maintainer to cc list in case he has any thoughts on this.

From find_unc() in the CIFS code:

--------------[snip]--------------
                        if (tcon->ses->server) {
                                cFYI(1,
                                     ("old ip addr: %x == new ip %x ?",
                                      tcon->ses->server->addr.sockAddr.sin_addr.
                                      s_addr, new_target_ip_addr));
                                if (tcon->ses->server->addr.sockAddr.sin_addr.
                                    s_addr == new_target_ip_addr) {
        /* BB lock tcon, server and tcp session and increment use count here? */
                                        /* found a match on the TCP session */
                                        /* BB check if reconnection needed */
                                        cFYI(1,
                                              ("IP match, old UNC: %s new: %s",
                                              tcon->treeName, uncName));
--------------[snip]--------------

My best guess is that it panicked while dereferencing this in the second if statement above...

    tcon->ses->server->addr.sockAddr.sin_addr.s_addr

...I suspect that means the "server" pointer here was bogus. Beyond that, I really can't tell much. We don't have the top of the oops message, so some of this is based on speculation.

It sort of looks like the server pointer here might not be adequately protected. The locking rules around it are most certainly not clear and don't seem to be consistent.

If this happens again, then getting the entire oops message (or even better a crash dump) would be most helpful.

Steve, any thoughts?

Comment 3 vaughn skinner 2008-08-07 15:24:58 UTC

(In reply to comment #1)
> Vaughn,
>    Thanks for the bug report. More info is always nice, but we may be able to
> determine at least something from the partial stack trace. I need to confirm
> the exact kernel that you were using though. Are you running:
> 
> kernel-2.6.18-92.1.6.el5.i686.rpm
> 
> ...on this box and not some other variant (i.e. PAE, xen, etc)?

Yes.  

Linux luke.ppllabs.com 2.6.18-92.1.6.el5 #1 SMP Fri Jun 20 02:36:16 EDT 2008 i686 i686 i386 GNU/Linux

[root@luke backups]# rpm -qi kernel-2.6.18-92.1.6.el5
Name        : kernel                       Relocations: (not relocatable)
Version     : 2.6.18                            Vendor: Red Hat, Inc.
Release     : 92.1.6.el5                    Build Date: Fri 20 Jun 2008 12:45:59 AM PDT
Install Date: Thu 03 Jul 2008 10:05:25 PM PDT      Build Host: hs20-bc2-3.build.redhat.com
Group       : System Environment/Kernel     Source RPM: kernel-2.6.18-92.1.6.el5.src.rpm
Size        : 39058583                         License: GPLv2
Signature   : DSA/SHA1, Tue 24 Jun 2008 08:28:27 PM PDT, Key ID 5326810137017186

Comment 4 Jeff Layton 2008-08-09 11:52:50 UTC

Thanks Vaughn. Some more questions:

1) what do you mean when you say the CIFS mount to the windows server was "broken"?

2) was there more than one CIFS mount on this host to the same server?

3) did you happen to do a lazy umount (umount -l) or anything like that?

Comment 5 vaughn skinner 2008-08-09 21:41:26 UTC

(In reply to comment #4)
> Thanks Vaughn. Some more questions:
> 
> 1) what do you mean when you say the CIFS mount to the windows server was
> "broken"?

When I tried to access the mount, nothing was returned and an error generated.  I didn't keep track of the error so I can't help more on this point.

> 2) was there more than one CIFS mount on this host to the same server?

Yes.  7 I believe

> 3) did you happen to do a lazy umount (umount -l) or anything like that?

No.  All 7 were mounted from /etc/fstab

$ # mount output
//192.168.xx.4/LSC on /mounts/192.168.xx.4/LSC type cifs (rw,mand)

Comment 6 vaughn skinner 2008-08-10 00:06:21 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Thanks Vaughn. Some more questions:
> > 
> > 1) what do you mean when you say the CIFS mount to the windows server was
> > "broken"?
> 
> When I tried to access the mount, nothing was returned and an error generated. 
> I didn't keep track of the error so I can't help more on this point.

Also, I had recently set up a new mount to the same server.  It and many (perhaps all) of the other mounts were broken at the same time.

Comment 7 vaughn skinner 2008-08-18 21:20:49 UTC

It appears that the server might be having the problem again.  I haven't unmounted the mount.  Is there any logging I can turn on that might help?

[hci@luke EKG_Results]$ ls Sent
ls: .: No such device or address

SYSLOG:
Aug 18 14:09:04 luke kernel:  CIFS VFS: Error 0xfffffffa on cifs_get_inode_info in lookup of \Sent

I see that 'echo 1 > /proc/fs/cifs/cifsFYI' gives lots of info.

Anything else?

Comment 8 Jeff Layton 2008-08-19 00:22:05 UTC

Yes, cranking up cifsFYI might give us some more info. From what you just posted:

0xfffffffa == -6 == -ENXIO

...I don't see any place in CIFS that sets this value explicitly, but there are several errors that can be returned by the server that get translated to this:

./netmisc.c:	{ERRbaddrive, -ENXIO},
./netmisc.c:	{ERRnosuchshare, -ENXIO},
./netmisc.c:	{ERRinvtid, -ENXIO},
./netmisc.c:	{ERRinvnetname, -ENXIO},
./netmisc.c:	{ERRinvdevice, -ENXIO},

...given that this was previously working, my guess would be that it's returning ERRinvtid, but the logs generated by cifsFYI might help clarify this. 
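
To make the 0xfffffffa == -6 == -ENXIO conversion above concrete, here is a minimal C sketch. It is not taken from the CIFS source. It reinterprets the 32-bit value from the syslog line as a signed return code so it can be compared with -ENXIO. The helper names decode_rc and rc_to_text are hypothetical, and the sketch assumes Linux errno numbering, where ENXIO is 6.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of CIFS): reinterpret a 32-bit value that
 * was printed as unsigned hex, e.g. 0xfffffffa, as a signed kernel-style
 * return code. */
static int decode_rc(uint32_t raw)
{
    int32_t rc;

    /* int32_t is two's complement by definition, so copying the bits
     * gives the signed value without relying on cast behaviour. */
    memcpy(&rc, &raw, sizeof(rc));
    return rc;
}

/* Hypothetical helper: map a negative errno-style code to its message. */
static const char *rc_to_text(int rc)
{
    return strerror(-rc);
}
```

On a Linux system, decode_rc(0xfffffffa) should come back as -6, which equals -ENXIO, and rc_to_text(-ENXIO) gives the errno message for that code.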

The big question is whether you'll be able to reproduce this panic when/if you try to remount this share. If you are able to reproduce the panic, getting either a crash dump or (at least) a complete oops message will be essential to fixing this.

This may be 2 (or more) separate problems:

1) the problem that causes the mount to get into this state

2) the problem that caused the earlier crash when you remounted the share

...and these problems may or may not be related.

Comment 9 vaughn skinner 2008-08-19 04:38:01 UTC

False alarm.  I waited until the maintenance window and found that another admin had deleted the cifs share on the server.

Comment 10 Jeff Layton 2008-10-07 10:45:58 UTC

I've been spending some time looking at bug 462150, and I think this problem may be a duplicate of it. CIFS VFS tries to share things like sockets and SMB sessions when mounting the same shares multiple times on a machine. Unfortunately, the refcounting is too loose and that can lead to races similar to the ones in this case.
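
As a rough illustration of what tighter refcounting on a shared object looks like, here is a minimal single-threaded C sketch. It is not the CIFS code and not the fix for bug 462150. The struct mock_server type and the mock_server_* helpers are hypothetical stand-ins for a shared per-server structure, and real kernel code would also need atomic operations or a lock around both the count and the lookup that finds the object.

```c
#include <stdlib.h>

/* Mock stand-in for a shared per-server object. This is NOT the real
 * CIFS server structure or its layout. */
struct mock_server {
    int refcount;          /* number of users currently holding the object */
    unsigned int ip_addr;  /* IPv4 address, network byte order */
};

/* Allocate a server object with one reference owned by the caller. */
static struct mock_server *mock_server_alloc(unsigned int ip_addr)
{
    struct mock_server *srv = calloc(1, sizeof(*srv));

    if (srv) {
        srv->refcount = 1;
        srv->ip_addr = ip_addr;
    }
    return srv;
}

/* Take a reference before storing or dereferencing a shared pointer. */
static struct mock_server *mock_server_get(struct mock_server *srv)
{
    if (srv)
        srv->refcount++;
    return srv;
}

/* Drop a reference and free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise. */
static int mock_server_put(struct mock_server *srv)
{
    if (!srv)
        return 0;
    if (--srv->refcount > 0)
        return 0;
    free(srv);
    return 1;
}
```

The point of the pattern is that every holder of the pointer owns a reference, so tearing down one mount cannot free the object while another mount's lookup is still comparing against its address field.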

I'm going to go ahead and mark this a duplicate of that case. We can reopen it later if it turns out to be a different problem.

*** This bug has been marked as a duplicate of bug 462150 ***
