Bug 717060

Summary: Trying to mount a CIFS share crashes the entire system
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: unspecified
Version: rawhide
CC: aquini, gansalmon, itamar, jlayton, jonathan, kernel-maint, madhu.chinakonda, smfltc, smfrench, steved
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-07-08 10:59:32 UTC
Attachments:
Description                                                                  Flags
strace output                                                                none
trace from 3.0-0.rc6.git0.1.fc16.x86_64                                      none
patch -- remove bogus call to cifs_cleanup_volume_info                       none
patch -- fix several regressions when chasing DFS referrals at mount time    none

Description Adam Williamson 2011-06-27 20:19:44 UTC
Trying to mount a CIFS share on Rawhide - 3.0-0.rc4.git3.1.fc16.x86_64 - hangs the entire system (monitor displays part of the traceback, and system becomes entirely unresponsive, can only restart via the reset switch). Same share worked fine with F15. Trace:

Jun 27 13:12:22 adam kernel: [ 1026.522051] FS-Cache: Loaded
Jun 27 13:12:22 adam kernel: [ 1026.523794] FS-Cache: Netfs 'cifs' registered for caching
Jun 27 13:12:22 adam kernel: [ 1026.536690] CIFS VFS: default security mechanism requested.  The default security mechanism will be upgraded from ntlm to ntlmv2 in kernel release 3.1
Jun 27 13:12:22 adam kernel: [ 1026.620900] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
Jun 27 13:12:22 adam kernel: [ 1026.620934] IP: [<ffffffffa04e6280>] cifs_get_tcp_session+0x62/0x5e0 [cifs]
Jun 27 13:12:22 adam kernel: [ 1026.620958] PGD 416610067 PUD 419ccc067 PMD 0 
Jun 27 13:12:22 adam kernel: [ 1026.620975] Oops: 0000 [#1] SMP 
Jun 27 13:12:22 adam kernel: [ 1026.620987] CPU 4 
Jun 27 13:12:22 adam kernel: [ 1026.620993] Modules linked in: des_generic md4 nls_utf8 cifs fscache tcp_lp fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle tun bridge stp llc ppdev parport_pc lp parport sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio snd_seq uvcvideo snd_hwdep eeepc_wmi videodev asus_wmi media snd_usbmidi_lib btusb bluetooth snd_rawmidi sparse_keymap snd_seq_device snd_pcm v4l2_compat_ioctl32 rfkill snd_timer snd iTCO_wdt soundcore r8169 mii microcode i2c_i801 snd_page_alloc shpchp xhci_hcd iTCO_vendor_support e1000e virtio_net kvm uinput firewire_ohci firewire_core crc_itu_t usb_storage uas nouveau ttm drm_kms_helper drm i2c_algo_bit i2c_core mxm_wmi wmi video [last unloaded: scsi_wait_scan]
Jun 27 13:12:22 adam kernel: [ 1026.621318] 
Jun 27 13:12:22 adam kernel: [ 1026.621325] Pid: 19141, comm: mount.cifs Tainted: G        W   3.0-0.rc4.git3.1.fc16.x86_64 #1 System manufacturer System Product Name/P8P67 DELUXE
Jun 27 13:12:22 adam kernel: [ 1026.621352] RIP: 0010:[<ffffffffa04e6280>]  [<ffffffffa04e6280>] cifs_get_tcp_session+0x62/0x5e0 [cifs]
Jun 27 13:12:22 adam kernel: [ 1026.621375] RSP: 0018:ffff8804162a9bd8  EFLAGS: 00010246
Jun 27 13:12:22 adam kernel: [ 1026.621385] RAX: 0000000000000000 RBX: ffff88044171d918 RCX: 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621397] RDX: ffff8804162a9be0 RSI: 000000000000013d RDI: ffff8804162a9c60
Jun 27 13:12:22 adam kernel: [ 1026.621409] RBP: ffff8804162a9c98 R08: 0000000000000002 R09: 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621420] R10: 0000000000000000 R11: ffffea000edeb1c0 R12: 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621432] R13: ffff88042037ca88 R14: ffff880443876910 R15: 0000000000000005
Jun 27 13:12:22 adam kernel: [ 1026.621444] FS:  00007f011310e740(0000) GS:ffff88045ee00000(0000) knlGS:0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621459] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 27 13:12:22 adam kernel: [ 1026.621469] CR2: 0000000000000020 CR3: 0000000419fbf000 CR4: 00000000000406e0
Jun 27 13:12:22 adam kernel: [ 1026.621481] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621492] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 27 13:12:22 adam kernel: [ 1026.621505] Process mount.cifs (pid: 19141, threadinfo ffff8804162a8000, task ffff880406500000)
Jun 27 13:12:22 adam kernel: [ 1026.621517] Stack:
Jun 27 13:12:22 adam kernel: [ 1026.621523]  000000000000cc50 0000000000000000 0000000000000000 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621545]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621566]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun 27 13:12:22 adam kernel: [ 1026.621586] Call Trace:
Jun 27 13:12:22 adam kernel: [ 1026.621601]  [<ffffffffa04eadf5>] cifs_mount+0xe1/0x4de [cifs]
Jun 27 13:12:22 adam kernel: [ 1026.621615]  [<ffffffffa04dbf64>] cifs_do_mount+0x1ab/0x364 [cifs]
Jun 27 13:12:22 adam kernel: [ 1026.621631]  [<ffffffff81211929>] ? selinux_sb_copy_data+0x192/0x1ab
Jun 27 13:12:22 adam kernel: [ 1026.621646]  [<ffffffff8113a84c>] mount_fs+0x69/0x155
Jun 27 13:12:22 adam kernel: [ 1026.621658]  [<ffffffff81103f9c>] ? __alloc_percpu+0x10/0x12
Jun 27 13:12:22 adam kernel: [ 1026.621671]  [<ffffffff8114f90a>] vfs_kern_mount+0x63/0xa0
Jun 27 13:12:22 adam kernel: [ 1026.621683]  [<ffffffff811505de>] do_kern_mount+0x4d/0xdf
Jun 27 13:12:22 adam kernel: [ 1026.621695]  [<ffffffff81151c74>] do_mount+0x63c/0x69f
Jun 27 13:12:22 adam kernel: [ 1026.621706]  [<ffffffff81151f58>] sys_mount+0x88/0xc2
Jun 27 13:12:22 adam kernel: [ 1026.621718]  [<ffffffff814f9e82>] system_call_fastpath+0x16/0x1b
Jun 27 13:12:22 adam kernel: [ 1026.621728] Code: 48 ff ff ff 49 89 fc 48 89 d7 f3 ab 74 1d 49 8b 4c 24 20 49 8b 54 24 18 48 c7 c6 c1 76 50 a0 48 c7 c7 29 79 50 a0 e8 89 2b 00 e1 
Jun 27 13:12:22 adam kernel: [ 1026.621924] RIP  [<ffffffffa04e6280>] cifs_get_tcp_session+0x62/0x5e0 [cifs]
Jun 27 13:12:22 adam kernel: [ 1026.621943]  RSP <ffff8804162a9bd8>
Jun 27 13:12:22 adam kernel: [ 1026.621951] CR2: 0000000000000020
Jun 27 13:12:22 adam kernel: [ 1026.644600] ---[ end trace 33b5bdcde362acf5 ]---

Comment 1 Adam Williamson 2011-06-28 16:32:09 UTC
I hoped that one of the big list of cifs fixes in rc5 would fix this, but no joy: still crashes instantly with 3.0-0.rc5.git0.1.fc16.x86_64 .

Comment 2 Jeff Layton 2011-06-28 20:32:36 UTC
cc'ing Steve F. as I'm on vacation this week and won't have time to dig into it until next week...

What might be helpful is starting with some details of what mount options you're using and what server you're mounting. For bonus points, if you could follow the directions here to get a listing of the crash site, that would be very helpful:

    http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Oopses

Comment 3 Adam Williamson 2011-06-29 01:32:39 UTC
Thanks. At first I had a line in /etc/fstab:

//192.168.1.13/Volume_1 /share/data cifs rsize=8192,wsize=8192,nosuid,soft,user=guest,noauto,comment=systemd.automount 0 0

but after I removed that, simply mounting it manually with:

mount.cifs //192.168.1.13/Volume_1 /share/data

is enough to cause the crash, which happens instantly - like, I hit enter, and boom, I see the console with the trace on it.

The server is a D-Link DNS-323 - http://www.dlink.com/products/?pid=509 - running stock, up-to-date firmware. I just logged into it via telnet, and it appears to be running:

/ # smbd -V    
Version 3.0.24

I still have one system on Fedora 15, which is able to mount and use the exact same share just fine, using the same fstab line. That's running 2.6.38.8-32.fc15.x86_64 .

I'll try for the bonus points in a minute :)

Comment 4 Adam Williamson 2011-06-29 17:37:13 UTC
OK, for the bonus points:

(gdb) list *(cifs_get_tcp_session+0x62)
0xb211 is in cifs_get_tcp_session (fs/cifs/connect.c:1695).
1690	
1691		memset(&addr, 0, sizeof(struct sockaddr_storage));
1692	
1693		cFYI(1, "UNC: %s ip: %s", volume_info->UNC, volume_info->UNCip);
1694	
1695		if (volume_info->UNCip && volume_info->UNC) {
1696			rc = cifs_fill_sockaddr((struct sockaddr *)&addr,
1697						volume_info->UNCip,
1698						strlen(volume_info->UNCip),
1699						volume_info->port);

I wasn't sure if 0x5e0 mattered, so I did that too:

(gdb) list *(cifs_get_tcp_session+0x5e0)
0xb78f is in cifs_reconnect (fs/cifs/connect.c:79).
74	 * reconnect tcp session
75	 * wake up waiters on reconnection? - (not needed currently)
76	 */
77	static int
78	cifs_reconnect(struct TCP_Server_Info *server)
79	{
80		int rc = 0;
81		struct list_head *tmp, *tmp2;
82		struct cifs_ses *ses;
83		struct cifs_tcon *tcon;

Hope that helps.

Comment 5 Jeff Layton 2011-06-30 10:22:10 UTC
Hmmm...just to be sure, you did do the analysis on the same kernel on which you saw the oops, right? If not, the offsets might not match and you'll end up in the wrong place.

In any case...it looks like it crashed here on a NULL pointer dereference:

    if (volume_info->UNCip && volume_info->UNC) {

...which would imply that volume_info was NULL, but that's *really* odd as I don't see any way that that could occur. In any case, I'll try to recreate this when I get the chance...
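To illustrate why a fault address of 0x20 (the CR2 value in the oops) suggests a NULL struct pointer rather than a wild one: reading a member through a NULL base faults at an address equal to the member's offset. The following is only a sketch with a hypothetical stand-in struct, `vol_layout`; its padding is an assumption chosen so that `UNCip` lands at offset 0x20, and it is not the real kernel layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's volume info struct. The real
 * layout differs; the padding is chosen only so that UNCip sits at
 * offset 0x20, matching the CR2 value in the oops.
 */
struct vol_layout {
	uint64_t pad[4];	/* 4 * 8 = 32 bytes standing in for earlier members */
	char *UNCip;		/* member read by the faulting statement */
};

/*
 * Address the CPU would try to load from when evaluating base->UNCip.
 * Computed arithmetically, so nothing is actually dereferenced.
 */
static uintptr_t fault_address(uintptr_t base)
{
	return base + offsetof(struct vol_layout, UNCip);
}
```

Under these assumptions, `fault_address(0)` evaluates to 0x20, which is consistent with volume_info being NULL at the faulting line.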

Comment 6 Adam Williamson 2011-06-30 15:35:45 UTC
I wondered that too, but yes, I checked, they match.

Comment 7 Jeff Layton 2011-07-02 11:00:48 UTC
(In reply to comment #3)

> but after I removed that, simply mounting it manually with:
> 
> mount.cifs //192.168.1.13/Volume_1 /share/data
> 
> is enough to cause the crash, which happens instantly - like, I hit enter, and
> boom, I see the console with the trace on it.
> 

Strange. I tried to reproduce this with a similar set of mount options, but no luck. Can you paste in the oops from the more recent kernel? I'd like to verify that it looks similar.

Also, if you're able, could you strace the above mount.cifs command? Something like

    # strace -o /tmp/mount.cifs.strace -f -v -s 256 mount.cifs //192...

...and then attach mount.cifs.strace here? That should show me exactly what mount options are getting passed to the kernel.

Comment 8 Adam Williamson 2011-07-05 19:43:30 UTC
okay, here's the strace output, and the oops, from 3.0-0.rc6.git0.1.fc16.x86_64 . The behaviour seems a bit different now (or maybe it was like this before and somehow I missed it) - the system doesn't crash. Immediately upon hitting 'enter' I see a console with the trace on it, as before, but ctrl-alt-f2 gets me to a console, and ctrl-alt-f1 gets me back to the desktop, and everything still seems to be working (except the mount didn't happen, obviously).

Comment 9 Adam Williamson 2011-07-05 19:43:50 UTC
Created attachment 511371 [details]
strace output

Comment 10 Adam Williamson 2011-07-05 19:45:05 UTC
Created attachment 511372 [details]
trace from 3.0-0.rc6.git0.1.fc16.x86_64

Comment 11 Adam Williamson 2011-07-05 19:57:35 UTC
I just checked again and we're still at the same lines of code with that trace.

Comment 12 Adam Williamson 2011-07-05 20:18:57 UTC
one thing I forgot to mention is that it prompts for a password when run from console; I usually just hit enter. But I just tried entering a password - 'pass' - and it failed again. So I don't think the empty password is significant.

Comment 13 Jeff Layton 2011-07-05 20:48:35 UTC
Ok, I wonder if you're hitting a DFS referral here. I see one bit of suspect code in cifs_mount that might be causing this:

        if (referral_walks_count) {
                if (tcon)
                        cifs_put_tcon(tcon);
                else if (pSesInfo)
                        cifs_put_smb_ses(pSesInfo);

                cifs_cleanup_volume_info(&volume_info);
                FreeXid(xid);
        }

...the problem here though is that cifs_cleanup_volume_info will zero out the volume_info pointer and then the subsequent call to cifs_get_tcp_session will oops like this. Let me look over the code a bit more and I'll see if I can get you a test patch.
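The failure pattern described above can be modelled in userspace. This is a simplified sketch, not the kernel code: `fake_vol`, `fake_cleanup_volume_info`, `fake_get_tcp_session` and `replay_referral_walk` are hypothetical names, and the NULL check in the consumer exists only so the sketch reports the problem instead of crashing.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's volume info. */
struct fake_vol {
	const char *UNC;
	const char *UNCip;
};

/* Allocate a volume info that borrows (does not copy) the strings. */
static struct fake_vol *fake_vol_alloc(const char *unc, const char *ip)
{
	struct fake_vol *vol = calloc(1, sizeof(*vol));

	if (vol) {
		vol->UNC = unc;
		vol->UNCip = ip;
	}
	return vol;
}

/* Models the cleanup helper: frees the struct and zeroes the caller's pointer. */
static void fake_cleanup_volume_info(struct fake_vol **pvol)
{
	free(*pvol);
	*pvol = NULL;
}

/*
 * Models the consumer. The guard is an addition for this sketch; the
 * code in the oops read the members directly and faulted.
 */
static int fake_get_tcp_session(const struct fake_vol *vol)
{
	if (!vol)
		return -EINVAL;
	return (vol->UNCip && vol->UNC) ? 0 : -ENOENT;
}

/* Replays the suspect sequence: clean up first, then keep using the pointer. */
static int replay_referral_walk(void)
{
	struct fake_vol *vol = fake_vol_alloc("//192.168.1.13/Volume_1",
					      "192.168.1.13");

	if (!vol)
		return -ENOMEM;
	fake_cleanup_volume_info(&vol);		/* vol is NULL from here on */
	return fake_get_tcp_session(vol);	/* sees NULL, reports -EINVAL */
}
```

In this model `replay_referral_walk()` returns -EINVAL, which corresponds to the point where the kernel dereferenced the zeroed pointer.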

Comment 14 Jeff Layton 2011-07-05 21:04:36 UTC
Created attachment 511393 [details]
patch -- remove bogus call to cifs_cleanup_volume_info

Adam, could you test this patch and let me know if it fixes the issue?

Comment 15 Jeff Layton 2011-07-05 21:21:44 UTC
Ok, I figured out how to reproduce the panic -- just needed to have the client chase a DFS referral at mount time. I'll test the patch out now...

Comment 16 Jeff Layton 2011-07-06 12:59:59 UTC
Created attachment 511491 [details]
patch -- fix several regressions when chasing DFS referrals at mount time

If you haven't yet built a kernel with the other patch, it might be good to test this one instead. It incorporates the earlier patch but also fixes a number of other regressions I uncovered while investigating this.

This patchset has been sent upstream and I'm awaiting comment from the cifs maintainer. With luck, it should make 3.0.

Comment 17 Jeff Layton 2011-07-08 10:59:32 UTC
Adam tested these out and they fixed the issue for him. I've posted the patches upstream and they should (hopefully) make 3.0, assuming that Steve F pushes them in time.

Comment 18 Adam Williamson 2011-07-12 04:53:51 UTC
did the patches make rc6?

Comment 19 Jeff Layton 2011-07-12 10:41:55 UTC
They made -rc7