Bug 214622
Summary: oops when mounting cifs
Product: [Fedora] Fedora
Component: kernel
Version: 6
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: medium
Reporter: Need Real Name <jon>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
CC: amyagi, bjoern.robbe, cnighswonger, johnny, master, shirishp, smfrench, ugo.viti, wtogami
Fixed In Version: 2.6.19-1.2911.6.5.fc6
Doc Type: Bug Fix
Last Closed: 2007-04-18 23:46:05 UTC
Description
Need Real Name
2006-11-08 17:48:16 UTC
This is the problem I am having and reported in Bugzilla #211672. It happens on FC5 (kernel 2.6.18) as well. Up to the 2.6.17 kernel, no such lockups occurred. Also, if I do not do a cifs mount, the system is stable. The current reporter's kernel is not tainted, so the root cause is most likely unrelated to the nvidia driver as initially suspected (see Bug#211672).

I am not using Nvidia hardware nor have I installed any third-party software. This system is a fresh install of Fedora Core 6 with no outside software installed. The system is 100% stable until I mount a Windows share. If the system does not crash within 30 seconds of mounting the share, then it will continue to stay stable.

The only difference between your case and mine is that my systems (actually this happens on two different machines) crash eventually if not immediately. Probably 1 out of 5 times is an immediate lockup and other times may be a few minutes or a few hours after the cifs mount. As a workaround, I went back to FC5 kernel 2.6.17 on both machines. They are very stable.

This is just another confirmation. I installed kernel 2.6.18-1.2849.fc6 and all updates available as of 2006-11-14. The problem persisted and the system locked up a few minutes after I did a cifs mount. It is very clear now:

kernels prior to 2.6.18 with cifs mount = stable
kernel 2.6.18 without cifs mount = stable

Can you try with the work-in-progress kernel at http://people.redhat.com/davej/kernels/Fedora/ ? There are some CIFS fixes in there which *might* help.

Just tried 2856.fc6. As soon as I cifs-mounted a share, the system crashed.

Puzzling that no call stack is listed in dmesg (where is list_del being called from ...). Is it filtered out by some setting in FC? In the absence of information showing where the problem is, could you get debug information as follows:

1) load cifs module (e.g. "/sbin/modprobe cifs")
2) turn on cifs debugging flags (e.g. "echo 7 > /proc/fs/cifs/cifsFYI")
3) clear the message buffer of errors ("dmesg -c")
4) attempt the mount (e.g. "mount -t cifs //winserve/share /mnt/windows -o user,username=myname,password=xxxxx,workgroup=MYNETBIOSDOM")
5) after the failure, turn off debugging ("echo 0 > /proc/fs/cifs/cifsFYI")
6) save the debugging output ("dmesg > outputfile")
7) modify the outputfile, anonymizing the server name etc. if you wish
8) attach the outputfile (or for more privacy send directly to me and to Dave)

Hi Steve, did you have a chance to look at the attachment I posted in Bug#211672? It was done with cifs debugging turned on. I also referred to a recent post on the kernel mailing list by someone who experienced a cifs-related kernel panic. An excerpt here:

"I am seeing a kernel panic in cifs module. It seems to be a result of invalid inode entry in dentry for the file it is trying to validate. The inode->i_ino is set zero and inode->i_mapping is set to NULL in the inode pointer in the dentry (0xdf8ea200) structure. I went through the cifs code and could not find any valid case that could trigger this situation. Is there any case which can lead to this situation?"

Akemi

Question for the original poster: Is your CPU multi-processor or multi-core? If so, could you try booting your system with the clocksource=acpi_pm option as I posted in Bug#211672?

Akemi

I can think of no case in this version of cifs that could lead to the inode being null in the case described, but the bug reported here was in list_del, not a dereference of a null pointer from the (presumably freed) inode.

Yes, the solution posted in Bug 211672 (clocksource=acpi_pm) has fixed my issue as well.
CPU information: processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 3 model name : Intel(R) Pentium(R) 4 CPU 3.00GHz stepping : 4 cpu MHz : 2992.602 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr bogomips : 7485.77 processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 3 model name : Intel(R) Pentium(R) 4 CPU 3.00GHz stepping : 4 cpu MHz : 2992.602 cache size : 1024 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc up pni monitor ds_cpl cid xtpr bogomips : 7485.77 sorry, I spoke too soon. After another 45 minutes using the system and having a CIFS mount, my system had a panic: Nov 29 12:50:21 x kernel: BUG: unable to handle kernel paging request at virtual address 0080b4e4 Nov 29 12:50:21 x kernel: printing eip: Nov 29 12:50:21 x kernel: c04e0b51 Nov 29 12:50:21 x kernel: 2c0d3000 -> *pde = 00000000:14eb0001 Nov 29 12:50:21 x kernel: 292b0000 -> *pme = 00000000:14ed4067 Nov 29 12:50:21 x kernel: 292d4000 -> *pte = 00000000:00000000 Nov 29 12:50:21 x kernel: Oops: 0000 [#1] Nov 29 12:50:21 x kernel: SMP Nov 29 12:50:21 x kernel: last sysfs file: /power/state Nov 29 12:50:21 x kernel: Modules linked in: nls_utf8 cifs bridge netloop netbk blktap blkbk hidp l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video sbs i2c_ec i2c_core button battery asus_acpi ac sg ipv6 parport_pc lp parport floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore pcspkr tg3 snd_page_alloc i82875p_edac serio_raw edac_mc usb_storage ide_cd 
cdrom ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Nov 29 12:50:21 x kernel: CPU: 0 Nov 29 12:50:21 x kernel: EIP: 0061:[<c04e0b51>] Not tainted VLI Nov 29 12:50:21 x kernel: EFLAGS: 00010096 (2.6.18-1.2849.fc6xen #1) Nov 29 12:50:21 x kernel: EIP is at list_del+0x9/0x6c Nov 29 12:50:21 x kernel: eax: 0080b4e4 ebx: e8897f20 ecx: 00000006 edx: 00000000 Nov 29 12:50:21 x kernel: esi: ed7fd7c0 edi: e89b3000 ebp: c0d39dc0 esp: c0b4eefc Nov 29 12:50:21 x kernel: ds: 007b es: 007b ss: 0069 Nov 29 12:50:21 x kernel: Process events/0 (pid: 8, ti=c0b4e000 task=ed7c25e0 task.ti=c0b4e000) Nov 29 12:50:21 x kernel: Stack: c14fc400 e7f2d040 ed7fd340 e8897f20 c0462271 c0b1ca80 00000006 00000000 Nov 29 12:50:21 x kernel: ed7f8220 ed7f8220 00000006 ed7f8200 00000000 c0462374 00000000 00000000 Nov 29 12:50:21 x kernel: c0d39dc0 ed7fd7e4 ed7fd7c0 c0d39dc0 ed7c9cc0 00000000 c0463814 00000000 Nov 29 12:50:21 x kernel: Call Trace: Nov 29 12:50:21 x kernel: [<c0462271>] free_block+0x63/0xdc Nov 29 12:50:21 x kernel: [<c0462374>] drain_array+0x8a/0xb5 Nov 29 12:50:21 x kernel: [<c0463814>] cache_reap+0x85/0x117 Nov 29 12:50:21 x kernel: [<c042b210>] run_workqueue+0x83/0xc5 Nov 29 12:50:21 x kernel: [<c042bb00>] worker_thread+0xd9/0x10d Nov 29 12:50:21 x kernel: [<c042e013>] kthread+0xc0/0xed Nov 29 12:50:21 x kernel: [<c0402a69>] kernel_thread_helper+0x5/0xb Nov 29 12:50:21 x kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb Nov 29 12:50:21 x kernel: Nov 29 12:50:21 x kernel: Leftover inexact backtrace: Nov 29 12:50:21 x kernel: Nov 29 12:50:21 x kernel: ======================= Nov 29 12:50:21 x kernel: Code: 8d 46 04 e8 86 00 00 00 8d 4b 0c 8b 51 04 8d 46 0c 83 c4 14 5b 5e 5f e9 72 00 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04 <8b> 00 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 38 0e 63 c0 Nov 29 12:50:21 x kernel: EIP: [<c04e0b51>] list_del+0x9/0x6c SS:ESP 0069:c0b4eefc Nov 29 12:50:21 x kernel: <3>BUG: sleeping function called from invalid 
context at kernel/rwsem.c:20 Nov 29 12:50:21 x kernel: in_atomic():0, irqs_disabled():1 Nov 29 12:50:21 x kernel: [<c0405707>] dump_trace+0x69/0x1af Nov 29 12:50:21 x kernel: [<c0405865>] show_trace_log_lvl+0x18/0x2c Nov 29 12:50:21 x kernel: [<c0405e05>] show_trace+0xf/0x11 Nov 29 12:50:21 x kernel: [<c0405e34>] dump_stack+0x15/0x17 Nov 29 12:50:21 x kernel: [<c0430b92>] down_read+0x12/0x20 Nov 29 12:50:21 x kernel: [<c0428c41>] blocking_notifier_call_chain+0xe/0x29 Nov 29 12:50:21 x kernel: [<c041ed09>] do_exit+0x1b/0x776 Nov 29 12:50:21 x kernel: [<c0405da6>] die+0x289/0x2ae Nov 29 12:50:22 x kernel: [<c060abf0>] do_page_fault+0xabf/0xc3c Nov 29 12:50:22 x kernel: [<c040502b>] error_code+0x2b/0x30 Nov 29 12:50:22 x kernel: DWARF2 unwinder stuck at error_code+0x2b/0x30 Nov 29 12:50:22 x kernel: Nov 29 12:50:22 x kernel: Leftover inexact backtrace: Nov 29 12:50:22 x kernel: Nov 29 12:50:22 x kernel: [<c04e0b51>] list_del+0x9/0x6c Nov 29 12:50:22 x kernel: [<c0462271>] free_block+0x63/0xdc Nov 29 12:50:22 x kernel: [<c0462374>] drain_array+0x8a/0xb5 Nov 29 12:50:22 x kernel: [<c0463814>] cache_reap+0x85/0x117 Nov 29 12:50:22 x kernel: [<c042b210>] run_workqueue+0x83/0xc5 Nov 29 12:50:22 x kernel: [<c060936b>] _spin_lock_irqsave+0x12/0x17 Nov 29 12:50:22 x kernel: [<c046378f>] cache_reap+0x0/0x117 Nov 29 12:50:22 x kernel: [<c042bb00>] worker_thread+0xd9/0x10d Nov 29 12:50:22 x kernel: [<c04178a1>] default_wake_function+0x0/0xc Nov 29 12:50:22 x kernel: [<c042ba27>] worker_thread+0x0/0x10d Nov 29 12:50:22 x kernel: [<c042e013>] kthread+0xc0/0xed Nov 29 12:50:22 x kernel: [<c042df53>] kthread+0x0/0xed Nov 29 12:50:22 x kernel: [<c0402a69>] kernel_thread_helper+0x5/0xb Nov 29 12:50:22 x kernel: ======================= I spoke too soon, too. My system crashed next morning at 4 AM. This has happened before. One or more cronjobs run at this time, which apparently caused the panic. But it looks like the clocksource option makes it a bit harder to trigger the crash. 
Akemi

There was possibly corruption going on in slab.c in bug 216001 on some of these kernels, but I don't see how it could end up affecting only cifs... On the other hand it's really easy to test one of Dave's new kernels with the slab fix.

If the new kernels you referred to are the ones in Dave's message (comment#5), then that did not fix the problem I am seeing (comment#6). I am wondering if this post in LKML is related to mine:
http://lkml.org/lkml/2006/11/29/156

Dave, 2.6.19 is out. If/when you have built this version, I would like to try it.

Akemi

Hi all,

Not finding this bug first, I reported similar problems in bug 217915. I attached some portions of my 'messages' log at the following link...
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=142506&action=view

This occurs on two different hardware configurations both running FC6 but different kernel builds:

System A: 2.6.18-1.2849.fc6
System B: 2.6.18-1.2798.fc6

I realize all of this agrees with the above comments, but maybe the attached log cuts will help some.

Chris

I compiled 2.6.19 from kernel.org. The problem persisted.

Akemi

Created attachment 142683 [details]
kernel (2.6.19) message with cifs debugging on
While compiling 2.6.19, I selected the "more debugging for cifs" option. Then
enabled cifs with "echo 7 > /proc/fs/cifs/cifsFYI". The attached file is an
example of a crash log from the moment of cifs-mounting of a Windows share
through the lockup.
Akemi
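
For anyone else collecting the same data, the debug-capture steps requested earlier in this report can be sketched as a small shell function. This is only a sketch under stated assumptions: the names `run` and `capture_cifs_debug`, the `DRY_RUN` switch, and the default output file name are made up for illustration, and the share, mount point, and user name are placeholders. With `DRY_RUN=1` (the default here) it only prints the commands; running them for real needs root and may well crash an affected kernel.

```shell
#!/bin/sh
# Sketch of the cifsFYI debug capture described in this report.
# DRY_RUN=1 (the default) prints each command instead of running it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

capture_cifs_debug() {
    share=$1                    # placeholder, e.g. //winserve/share
    mnt=$2                      # placeholder, e.g. /mnt/windows
    user=$3                     # placeholder user name
    out=${4:-cifs-debug.txt}    # hypothetical default file name

    run "/sbin/modprobe cifs"                           # load the module
    run "echo 7 > /proc/fs/cifs/cifsFYI"                # debugging on
    run "dmesg -c > /dev/null"                          # clear old messages
    run "mount -t cifs $share $mnt -o username=$user"   # may prompt for a password
    run "echo 0 > /proc/fs/cifs/cifsFYI"                # debugging off, even if mount failed
    run "dmesg > $out"                                  # save the log
}
```

A dry run such as `capture_cifs_debug //winserve/share /mnt/windows myname` prints the six commands in order, so they can be reviewed before being run as root with `DRY_RUN=0`. Remember to anonymize the output file before attaching it.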
The two log files confirm to me that we are not in the middle of a cifs request, although in both cases a cifs readdir did occur after the mount but before the oops (a statfs was sandwiched in between in one of the two logs). It is remotely possible that a readdir corrupted memory, but that seems a long shot. I wish we could track someone down who understands what this cache_reap kernel thread is doing ... we need to narrow down who is corrupting this list; it is not obvious to me why cifs could affect this list.

Just for the record: I installed kernel-2.6.18-1.2860.fc6 from testing. The system crashed as soon as I did a cifs mount.

Akemi

I can confirm this bad problem: the kernel oopses (complete machine hang, hard reset needed) at random when copying data from a mounted cifs share, when unmounting a share, or at cifs module unload.

Tested on FC6 with kernels from 2.6.18-1.2798.fc6 to 2.6.18-1.2868.fc6. I also tried kernel 2.6.19-1.2877.fc7 (2.6.20-rc1) from the FC7 development tree (updated mkinitrd and nash to use this kernel) and the system continues to hang.

I was forced to install kernel-2.6.17-1.2157_FC5 on my FC6 box, and the system is rock solid now (it has never crashed) on cifs mount/umount/copy operations.

So I think this is definitely a kernel bug. I haven't tried the vanilla 2.6.18/2.6.19 kernels.

Best Regards

Just a quick note for those who are seeing this problem. Samba programmers have been working on this and will be posting a fix soon. I understand it might be a temporary fix but things are looking good now.
Akemi

This is a patch for 1.45 version of cifs. I think this should help fix the problem.

diff -u sess.c sess.c.mod
--- sess.c 2006-08-02 16:15:17.000000000 -0500
+++ sess.c.mod 2006-12-21 09:43:19.000000000 -0600
@@ -179,10 +179,9 @@
 	cFYI(1,("bleft %d",bleft));
-	/* word align, if bytes remaining is not even */
-	if(bleft % 2) {
+	/* word align, if bytes remaining is even */
+	if(!(bleft % 2)) {
 		bleft--;
-		data++;
 	}
 	words_left = bleft / 2;
@@ -506,6 +505,7 @@
 	/* and lanman response is 3 */
 	bytes_remaining = BCC(smb_buf);
 	bcc_ptr = pByteArea(smb_buf);
+	bcc_ptr++;
 	if(smb_buf->WordCount == 4) {
 		__u16 blob_len;

I have two test machines running with the patch provided by Shirish. Both used to have system lockups before the patch. After the patch was applied, I have not seen a single kernel oops/crash on either machine. This is with a number of mounts/umounts/reboots.

The test kernel was 2.6.18-1.2868.fc6 compiled with the above patch. Later, I installed the same kernel using rpm's and replaced cifs.ko with my patched version. That worked, too.

Akemi

This is a duplicate of Bug 211672. Please refer to that report because the latest patch has been posted there.

Akemi

Does anyone know if this patch has made it into FC6 xen kernels? I'm having the cifs crashing problem on my xen machines and am running a completely up to date FC6 system. To make the crash happen, I just manually run my backup jobs (they mount a Windows share).

[root@firewall2 log]# uname -a
Linux firewall2.xxxxxxxxxx.com 2.6.19-1.2895.fc6xen #1 SMP Wed Jan 10 19:47:12 EST 2007 i686 athlon i386 GNU/Linux

[root@firewall2 cron.daily]# list_del corruption. next->prev should be c69fd480, but was 0000000e
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:70!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/ram0/range
Modules linked in: nls_utf8 cifs ipv6 autofs4 hidp l2cap bluetooth iptable_raw xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype ip_nat_tftp ip_nat_snmp_basic ip_nat_pptp ip_nat_irc ip_nat_ftp ip_nat_amanda ip_conntrack_tftp ip_conntrack_pptp ip_conntrack_netbios_ns ip_conntrack_irc ip_conntrack_ftp ts_kmp ip_conntrack_amanda xt_tcpmss xt_pkttype xt_physdev bridge xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat ip_nat ip_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables sunrpc xennet parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod raid456 xor ext3 jbd xenblk
CPU: 0
EIP: 0061:[<c04e9d30>] Not tainted VLI
EFLAGS: 00010082 (2.6.19-1.2895.fc6xen #1)
EIP is at list_del+0x48/0x6c
eax: 00000048 ebx: c69fd480 ecx: c0683b30 edx: f5416000
esi: c117f6a0 edi: c0902000 ebp: c0d5cca0 esp: c0d2def0
ds: 007b es: 007b ss: 0069
Process events/0 (pid: 5, ti=c0d2d000 task=c006e030 task.ti=c0d2d000)
Stack: c0646193 c69fd480 0000000e c69fd480 c0467706 c078afc0 c0686700 c117ecc0
       00000005 00000004 c117fed0 c117fec0 00000005 c117fea0 00000000 c0467809
       00000000 00000000 c0d5cca0 c117f6c4 c117f6a0 c0d5cca0 c0d404a0 00000000
Call Trace:
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468df0>] cache_reap+0x53/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
=======================
Code: c0 e8 9a 4b f3 ff 0f 0b 41 00 82 61 64 c0 8b 03 8b 40 04 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 93 61 64 c0 e8 75 4b f3 ff <0f> 0b 46 00 82 61 64 c0 8b 13 8b 43 04 89 42 04 89 10 c7 43 04
EIP: [<c04e9d30>] list_del+0x48/0x6c SS:ESP 0069:c0d2def0
<3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1
 [<c04056ff>] dump_trace+0x69/0x1b6
 [<c0405864>] show_trace_log_lvl+0x18/0x2c
 [<c0405e4b>] show_trace+0xf/0x11
 [<c0405e7a>] dump_stack+0x15/0x17
 [<c0433252>] down_read+0x12/0x28
 [<c042aca2>] blocking_notifier_call_chain+0xe/0x29
 [<c0420d75>] do_exit+0x1b/0x787
 [<c0405dec>] die+0x2af/0x2d4
 [<c0406262>] do_invalid_op+0xa2/0xab
 [<c0619deb>] error_code+0x2b/0x30
 [<c04e9d30>] list_del+0x48/0x6c
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468df0>] cache_reap+0x53/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
=======================

OK -- I wrote a little script to mount/unmount a Windows share. Within 10 cycles, I get the oops. No reading, no writing to the Windows machine. Just a mount and unmount.

Sounds like you are hit by this bug. I am afraid it may take some more time before the fix is included in any version of FC (including FC6). In the meantime, you may have to apply the patch posted in Bug 211672.

Akemi

Even though this patch is listed in 211672, I list it here nonetheless:
http://www.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commitdiff;h=8e6f195af0e1f226e9b2e0256af8df46adb9d595

News! The cifs patch has been included in the latest kernels, which are available from the Fedora testing directory.

FC5 is: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/5/
FC6 is: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/

Thank you, Chuck and Dave.

Akemi

OK ... we are also tracking this issue in the CentOS-5 bug tracker as it affects our compiled kernel. We have created some cifs.ko kernel modules that should work on any of the 2.6.18-8.x.el5 kernels for el5 i686 and x86_64 (including xen and PAE).

So if anyone has to make this work now before an official fix makes it out, you can try our modules and/or review the CentOS bug here:
http://bugs.centos.org/view.php?id=1776

*** Bug 221610 has been marked as a duplicate of this bug. ***
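
The mount/unmount script mentioned by the Xen reporter was never posted. A minimal sketch of what such a reproducer might look like follows, for checking whether a given kernel still shows the problem. Everything here is an assumption: the function name `mount_cycle`, the `DRY_RUN` switch, the default of 10 cycles (taken from the "within 10 cycles" remark), and the use of `-o guest` instead of real credentials. With `DRY_RUN=1` (the default here) it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical reproducer: mount and unmount a CIFS share in a loop.
# DRY_RUN=1 (the default) prints the commands instead of running them.

mount_cycle() {
    share=$1          # placeholder, e.g. //winserve/share
    mnt=$2            # placeholder, e.g. /mnt/windows
    cycles=${3:-10}   # number of mount/unmount rounds

    i=1
    while [ "$i" -le "$cycles" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "cycle $i: mount -t cifs $share $mnt -o guest"
            echo "cycle $i: umount $mnt"
        else
            # No reads or writes on the share, just mount and unmount.
            mount -t cifs "$share" "$mnt" -o guest || return 1
            umount "$mnt" || return 1
        fi
        i=$((i + 1))
    done
}
```

Run it as root with `DRY_RUN=0 mount_cycle //winserve/share /mnt/windows` on a test machine only, since an unpatched kernel is expected to oops.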