Created attachment 313652
Screen dump of kernel panic output

Description of problem:
A CIFS mount to a Windows server was broken. When it was unmounted and remounted, the kernel panicked and the box locked up.

Version-Release number of selected component (if applicable):
Kernel: 2.6.18-92.1.6
CIFS: 1.50RH

How reproducible:
Not reproducible. Something similar happened once before.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This is a weak bug report, but it is all I have, and hopefully it helps clarify another bug.
Vaughn,
Thanks for the bug report. More info is always nice, but we may be able to determine at least something from the partial stack trace. I need to confirm the exact kernel that you were using though. Are you running:

kernel-2.6.18-92.1.6.el5.i686.rpm

...on this box and not some other variant (i.e. PAE, xen, etc)?

Assuming that this is the case, here's the assembly around the crash:

    aeee: 74 14                   je     af04 <find_unc+0x6f>
    aef0: ff 74 24 04             pushl  0x4(%esp)
    aef4: ff 70 38                pushl  0x38(%eax)
    aef7: 68 fe 2b 00 00          push   $0x2bfe
    aefc: e8 fc ff ff ff          call   aefd <find_unc+0x68>
    af01: 83 c4 0c                add    $0xc,%esp
    af04: 8b 45 24                mov    0x24(%ebp),%eax
    af07: 8b 54 24 04             mov    0x4(%esp),%edx
    af0b: 8b 40 1c                mov    0x1c(%eax),%eax
    af0e: 39 50 38                cmp    %edx,0x38(%eax)      <<<<< CRASH
    af11: 0f 85 84 00 00 00       jne    af9b <find_unc+0x106>
    af17: f6 05 00 00 00 00 01    testb  $0x1,0x0
    af1e: 74 12                   je     af32 <find_unc+0x9d>
    af20: 53                      push   %ebx
    af21: 8d 45 28                lea    0x28(%ebp),%eax
    af24: 50                      push   %eax
    af25: 68 35 2c 00 00          push   $0x2c35

....looking now to see if I can match this up to the C code and determine where it fell down.
Adding upstream CIFS maintainer to cc list in case he has any thoughts on this. From find_unc() in the CIFS code:

--------------[snip]--------------
        if (tcon->ses->server) {
                cFYI(1, ("old ip addr: %x == new ip %x ?",
                        tcon->ses->server->addr.sockAddr.sin_addr.s_addr,
                        new_target_ip_addr));
                if (tcon->ses->server->addr.sockAddr.sin_addr.s_addr ==
                    new_target_ip_addr) {
        /* BB lock tcon, server and tcp session and increment use count here? */
                        /* found a match on the TCP session */
                        /* BB check if reconnection needed */
                        cFYI(1, ("IP match, old UNC: %s new: %s",
                              tcon->treeName, uncName));
--------------[snip]--------------

My best guess is that it panicked dereferencing this in the second if statement above...

        tcon->ses->server->addr.sockAddr.sin_addr.s_addr

...I suspect that means the "server" pointer here was bogus. Beyond that, I really can't tell much. We don't have the top of the oops message, so some of this is based on speculation.

It sort of looks like the server pointer here might not be adequately protected. The locking rules around it are most certainly not clear and don't seem to be consistent.

If this happens again, then getting the entire oops message (or even better a crash dump) would be most helpful.

Steve, any thoughts?
(In reply to comment #1)
> Vaughn,
> Thanks for the bug report. More info is always nice, but we may be able to
> determine at least something from the partial stack trace. I need to confirm
> the exact kernel that you were using though. Are you running:
>
> kernel-2.6.18-92.1.6.el5.i686.rpm
>
> ...on this box and not some other variant (i.e. PAE, xen, etc)?

Yes.

Linux luke.ppllabs.com 2.6.18-92.1.6.el5 #1 SMP Fri Jun 20 02:36:16 EDT 2008 i686 i686 i386 GNU/Linux

[root@luke backups]# rpm -qi kernel-2.6.18-92.1.6.el5
Name        : kernel                       Relocations: (not relocatable)
Version     : 2.6.18                            Vendor: Red Hat, Inc.
Release     : 92.1.6.el5                    Build Date: Fri 20 Jun 2008 12:45:59 AM PDT
Install Date: Thu 03 Jul 2008 10:05:25 PM PDT      Build Host: hs20-bc2-3.build.redhat.com
Group       : System Environment/Kernel     Source RPM: kernel-2.6.18-92.1.6.el5.src.rpm
Size        : 39058583                         License: GPLv2
Signature   : DSA/SHA1, Tue 24 Jun 2008 08:28:27 PM PDT, Key ID 5326810137017186
Thanks Vaughn. Some more questions:

1) what do you mean when you say the CIFS mount to the windows server was "broken"?

2) was there more than one CIFS mount on this host to the same server?

3) did you happen to do a lazy umount (umount -l) or anything like that?
(In reply to comment #4)
> Thanks Vaughn. Some more questions:
>
> 1) what do you mean when you say the CIFS mount to the windows server was
> "broken"?

When I tried to access the mount, nothing was returned and an error was generated. I didn't keep track of the error, so I can't help more on this point.

> 2) was there more than one CIFS mount on this host to the same server?

Yes. 7, I believe.

> 3) did you happen to do a lazy umount (umount -l) or anything like that?

No. All 7 were mounted from /etc/fstab.

$ # mount output
//192.168.xx.4/LSC on /mounts/192.168.xx.4/LSC type cifs (rw,mand)
(In reply to comment #5)
> (In reply to comment #4)
> > Thanks Vaughn. Some more questions:
> >
> > 1) what do you mean when you say the CIFS mount to the windows server was
> > "broken"?
>
> When I tried to access the mount, nothing was returned and an error generated.
> I didn't keep track of the error so I can't help more on this point.

Also, I had recently set up a new mount to the same server. It and many (perhaps all) of the other mounts were broken at the same time.
It appears that the server might be having the problem again. I haven't umounted the mount. Is there any logging I can turn on that might help?

[hci@luke EKG_Results]$ ls Sent
ls: .: No such device or address

SYSLOG:
Aug 18 14:09:04 luke kernel: CIFS VFS: Error 0xfffffffa on cifs_get_inode_info in lookup of \Sent

I see that 'echo 1 > /proc/fs/cifs/cifsFYI' gives lots of info. Anything else?
Yes, cranking up cifsFYI might give us some more info. From what you just posted:

0xfffffffa == -6 == -ENXIO

...I don't see any place in CIFS that sets this value explicitly, but there are several errors that can be returned by the server that get translated to this:

./netmisc.c: {ERRbaddrive, -ENXIO},
./netmisc.c: {ERRnosuchshare, -ENXIO},
./netmisc.c: {ERRinvtid, -ENXIO},
./netmisc.c: {ERRinvnetname, -ENXIO},
./netmisc.c: {ERRinvdevice, -ENXIO},

...given that this was previously working, my guess would be that it's returning ERRinvtid, but the logs generated by cifsFYI might help clarify this.

The big question is whether you'll be able to reproduce this panic when/if you try to remount this share. If you are able to reproduce the panic, getting either a crash dump or (at least) a complete oops message will be essential to fixing this.

This may be 2 (or more) separate problems:

1) the problem that causes the mount to get into this state
2) the problem that causes the crash before when you remounted the share

...and these problems may or may not be related.
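For anyone who wants to check the decoding in the comment above (0xfffffffa == -6 == -ENXIO), here is a minimal user-space sketch in C. The function names are hypothetical helpers made up for this illustration and are not part of the CIFS code. It assumes a Linux-like system where ENXIO is 6, and it relies on the kernel convention that failures are returned as a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32-bit hex value from the log line as a signed return
 * code. memcpy gives a bit-for-bit copy without relying on an
 * implementation-defined integer conversion. */
static int32_t decode_kernel_rc(uint32_t raw)
{
    int32_t rc;
    memcpy(&rc, &raw, sizeof rc);
    return rc;
}

/* Kernel code reports failure as -errno; recover the positive errno. */
static int rc_to_errno(int32_t rc)
{
    assert(rc < 0);   /* only meaningful for failure codes */
    return -rc;
}

/* Human-readable text for the raw value, as strerror() describes it. */
static const char *describe_raw_rc(uint32_t raw)
{
    return strerror(rc_to_errno(decode_kernel_rc(raw)));
}
```

Calling describe_raw_rc(0xfffffffaU) should give the C library's text for ENXIO, which is the same kind of message that ls printed in the earlier comment.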
False alarm. I waited until the maintenance window and found that another admin had deleted the cifs share on the server.
I've been spending some time looking at bug 462150, and I think this problem may be a duplicate of it. CIFS VFS tries to share things like sockets and SMB sessions when mounting the same shares multiple times on a machine. Unfortunately, the refcounting is too loose and that can lead to races similar to the ones in this case. I'm going to go ahead and mark this a duplicate of that case. We can reopen it later if it turns out to be a different problem. *** This bug has been marked as a duplicate of bug 462150 ***
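To illustrate what stricter refcounting on a shared object could look like, here is a minimal user-space sketch in C. Everything in it is an assumption made for illustration: struct shared_server, server_get(), server_put() and the single-slot registry are hypothetical and are not the CIFS structures or the fix tracked in bug 462150. The idea it shows is that lookup and reference increment happen under the same lock, and that the object is unpublished under that lock before it is freed, so a second mount cannot pick up a pointer that is about to go away.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-server object shared between mounts. */
struct shared_server {
    int refcount;           /* protected by registry_lock */
    unsigned int ip_addr;   /* key used to find an existing entry */
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct shared_server *registry;   /* single slot, for brevity */

/* Find or create the entry for ip_addr and take a reference, all under
 * one lock so the lookup and the increment cannot be separated. */
static struct shared_server *server_get(unsigned int ip_addr)
{
    struct shared_server *srv;

    pthread_mutex_lock(&registry_lock);
    srv = registry;
    if (srv != NULL && srv->ip_addr == ip_addr) {
        srv->refcount++;
    } else if (srv == NULL) {
        srv = calloc(1, sizeof *srv);
        if (srv != NULL) {
            srv->ip_addr = ip_addr;
            srv->refcount = 1;
            registry = srv;
        }
    } else {
        srv = NULL;   /* slot holds a different server in this toy model */
    }
    pthread_mutex_unlock(&registry_lock);
    return srv;
}

/* Drop a reference; the last holder unpublishes the entry before freeing. */
static void server_put(struct shared_server *srv)
{
    int last;

    pthread_mutex_lock(&registry_lock);
    assert(srv->refcount > 0);
    last = (--srv->refcount == 0);
    if (last)
        registry = NULL;
    pthread_mutex_unlock(&registry_lock);

    if (last)
        free(srv);
}
```

In this model each mount would call server_get() once and server_put() once at unmount, and no code path would read the shared pointer without holding either the lock or its own reference.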