Description of problem:
I updated to kernel 2.6.18-1.2200.fc5smp. The system crashes a few minutes after I start working in X (KDE). I am using the nvidia driver. Crash-related lines from the /var/adm/messages files are attached.

Version-Release number of selected component (if applicable):
2.6.18-1.2200.fc5smp

How reproducible:
Booted the system twice and crashed both times

Steps to Reproduce:
1. boot the 2.6.18 kernel
2. start X
3. work for a few minutes

Actual results:
Total system freeze

Expected results:

Additional info:
Created attachment 139017 [details] Relevant lines from /var/adm/messages
This is a bug in the nvidia module. Only nvidia can fix this.
I have uninstalled the Nvidia driver from the system and used the nv driver that came with the distribution. I still have exactly the same symptom - a hard lockup.
Since FC6 came out yesterday, I did a fresh installation of it. The system has not crashed so far (for 1 day).
After 3 days of stable operation of freshly installed FC6, the crash problem came back. It started happening when/after I created links to remote Windows shares on my Desktop (KDE). Those shares are cifs-mounted through autofs. It may be possible that my problem is related to bug# 211070 (although mounting itself is not a problem whether it is done manually or via autofs).
I am now more convinced that my system crash is related to cifs. As long as I stay away from cifs mounting, the system is stable.
I did not see cifs symbols in the call stack of the crash. Do you have another example of dmesg output we could look at? Any chance you could reproduce the crash with cifs debugging enabled ("echo 1 > /proc/fs/cifs/cifsFYI") and send the dmesg output? Since this can log a lot to dmesg even if started just before the failure, it may wrap the buffer, but it would be helpful to see the operation just before the list_del. If the target directory or a subdirectory has more than 100 files and you are not running the fixed version of FC6, it might run into the EINVAL on readdir problem, but that seems unlikely to cause list corruption.
Created attachment 140084 [details] 1st example
Created attachment 140085 [details] 2nd example
Created attachment 140086 [details] 3rd example
Created attachment 140087 [details] 4th example
I sent 4 other examples from dmesg output. As for debugging, could you tell me a bit more about how this is done? Thanks for looking into the problem.
If I understand correctly, I just need to set cifsFYI to 1 to enable debugging? No need to compile the cifs modules with additional flags?
OK, I enabled cifs debugging and "successfully" crashed the system. The attached file is from /var/log/messages.
Created attachment 140143 [details] dmesg with cifs debugging enabled
Bug#214622 apparently has the same root cause as mine.
This is just a confirmation that the 2.6.18-1 kernel in testing has the same problem as expected because the bug in this report (as well as in bug#214622) has not been addressed.
Found this post on the kernel mailing list:

Subject: Kernel panic in cifs_revalidate
From: "Chakri n" <chakriin5>
Newsgroups: gmane.linux.kernel
Date: Tue, 21 Nov 2006 00:24:40 -0800

Hi,

I am seeing a kernel panic in cifs module. It seems to be a result of invalid inode entry in dentry for the file it is trying to validate. The inode->i_ino is set zero and inode->i_mapping is set to NULL in the inode pointer in the dentry (0xdf8ea200) structure. I went through the cifs code and could not find any valid case that could trigger this situation. Is there any case which can lead to this situation?

0xed47fe70 0xc0133b30 filemap_fdatawait+0x20 (0x0, 0xe0e1c780, 0x0, 0xf5b35000, 0x0)
           kernel .text 0xc0100000 0xc0133b10 0xc0133bc0
0xed47feb8 0xf8b49855 [cifs]cifs_revalidate+0x225 (0xdf8ea200)
           cifs .text 0xf8b27060 0xf8b49630 0xf8b49af0
0xed47fec4 0xf8b3ec71 [cifs]cifs_d_revalidate+0x11 (0xdf8ea200, 0x0, 0xef47a031)
           cifs .text 0xf8b27060 0xf8b3ec60 0xf8b3ec7d
0xed47fed8 0xc0151c03 cached_lookup+0x43 (0xe8e03a00, 0xed47fefc, 0x0, 0x1, 0xe7f5b0f8)
           kernel .text 0xc0100000 0xc0151bc0 0xc0151c20
0xed47ff18 0xc01522a8 link_path_walk+0x3e8
           kernel .text 0xc0100000 0xc0151ec0 0xc0152610
0xed47ff20 0xc0152629 path_walk+0x19 (0x8002, 0x8003, 0x83141a0)
           kernel .text 0xc0100000 0xc0152610 0xc0152630
0xed47ff34 0xc015280a path_lookup+0x3a (0x0, 0x0, 0x2, 0x0, 0x0)
           kernel .text 0xc0100000 0xc01527d0 0xc0152810
0xed47ff64 0xc0152d3a open_namei+0x6a (0xef47a000, 0x8003, 0x0, 0xed47ff7c, 0xe8e03a00)
           kernel .text 0xc0100000 0xc0152cd0 0xc0153260
0xed47ffa0 0xc01448b1 filp_open+0x41 (0xef47a000, 0x8002, 0x0, 0xed47e000, 0x8002)
           kernel .text 0xc0100000 0xc0144870 0xc01448e0
0xed47ffbc 0xc0144ca1 sys_open+0x51 (0x83141a0, 0x8002, 0x0, 0x8002, 0x83141a0)

Thanks
--Chakri
I am not aware of any cases in the cifs code where a dentry could reference an invalid inode, but the oops in the posted attachment was caused by list_del, which at first glance seems unrelated. It would be useful to know which list it was trying to delete from, since there is no cifs code in the call stack there. It looks like the kernel dmesg log is overflowing. Any chance you could clear the message log ("dmesg -c") and use a kernel rebuilt with different debug config options and a larger dmesg buffer, so that the log does not drop so many entries?
This is an update on my status. In an attempt to debug the problem better, I installed FC6 as a guest in VMware. After many cifs mounts, no kernel panic has occurred. This made me wonder what the difference is between the virtual FC6 and the real one. One obvious difference is that the VM is running on a single processor, while the two machines that have the lockup problem are dual-core Athlons. After searching through postings and mailing lists on the net, I came across something that might be related to my issue. It concerns AMD processors and the choice of clocksource. Following someone's post, I booted the system with the clocksource=acpi_pm option. No crash despite cifs mounts. Then I booted again without that option. The machine locked up as soon as I did a cifs mount. I went back to the acpi_pm option again, and it has been running stable for a few hours now. With some more digging I found that the system uses hpet if I do not specify the clocksource. The VM uses TSC. But I still do not understand why and how cifs is involved. I hope this new piece of information helps resolve the problem.
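For anyone comparing machines the same way, the active clocksource can be read back at run time rather than inferred from the boot line. The following is a minimal sketch, assuming the kernel exposes the sysfs file named in `CLOCKSOURCE_PATH`; the helper `read_first_line` is a hypothetical name introduced here for illustration and is not part of any tool mentioned in this report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sysfs file that reports the active clocksource on kernels that expose it
 * (assumed path; adjust it if your kernel lays this out differently). */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/* Hypothetical helper: copy the first line of `path` into `buf`, without the
 * trailing newline. Returns 0 on success, -1 if the file cannot be read. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fgets(buf, (int)len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

Calling `read_first_line(CLOCKSOURCE_PATH, buf, sizeof buf)` after booting with and without `clocksource=acpi_pm` would show which source (for example hpet, tsc or acpi_pm) was actually selected on each boot.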
The problem also seems to be resolved for me when I set clocksource=acpi_pm. I've been running for a while with that option and remounted the volume several times with no crashes. My cpu information:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 4
cpu MHz : 2992.602
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr
bogomips : 7485.77

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 4
cpu MHz : 2992.602
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc up pni monitor ds_cpl cid xtpr
bogomips : 7485.77
Sorry, I spoke too soon. After another 45 minutes using the system and having a CIFS mount, my system had a panic:

Nov 29 12:50:21 x kernel: BUG: unable to handle kernel paging request at virtual address 0080b4e4
Nov 29 12:50:21 x kernel: printing eip:
Nov 29 12:50:21 x kernel: c04e0b51
Nov 29 12:50:21 x kernel: 2c0d3000 -> *pde = 00000000:14eb0001
Nov 29 12:50:21 x kernel: 292b0000 -> *pme = 00000000:14ed4067
Nov 29 12:50:21 x kernel: 292d4000 -> *pte = 00000000:00000000
Nov 29 12:50:21 x kernel: Oops: 0000 [#1]
Nov 29 12:50:21 x kernel: SMP
Nov 29 12:50:21 x kernel: last sysfs file: /power/state
Nov 29 12:50:21 x kernel: Modules linked in: nls_utf8 cifs bridge netloop netbk blktap blkbk hidp l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video sbs i2c_ec i2c_core button battery asus_acpi ac sg ipv6 parport_pc lp parport floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore pcspkr tg3 snd_page_alloc i82875p_edac serio_raw edac_mc usb_storage ide_cd cdrom ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Nov 29 12:50:21 x kernel: CPU: 0
Nov 29 12:50:21 x kernel: EIP: 0061:[<c04e0b51>] Not tainted VLI
Nov 29 12:50:21 x kernel: EFLAGS: 00010096 (2.6.18-1.2849.fc6xen #1)
Nov 29 12:50:21 x kernel: EIP is at list_del+0x9/0x6c
Nov 29 12:50:21 x kernel: eax: 0080b4e4 ebx: e8897f20 ecx: 00000006 edx: 00000000
Nov 29 12:50:21 x kernel: esi: ed7fd7c0 edi: e89b3000 ebp: c0d39dc0 esp: c0b4eefc
Nov 29 12:50:21 x kernel: ds: 007b es: 007b ss: 0069
Nov 29 12:50:21 x kernel: Process events/0 (pid: 8, ti=c0b4e000 task=ed7c25e0 task.ti=c0b4e000)
Nov 29 12:50:21 x kernel: Stack: c14fc400 e7f2d040 ed7fd340 e8897f20 c0462271 c0b1ca80 00000006 00000000
Nov 29 12:50:21 x kernel: ed7f8220 ed7f8220 00000006 ed7f8200 00000000 c0462374 00000000 00000000
Nov 29 12:50:21 x kernel: c0d39dc0 ed7fd7e4 ed7fd7c0 c0d39dc0 ed7c9cc0 00000000 c0463814 00000000
Nov 29 12:50:21 x kernel: Call Trace:
Nov 29 12:50:21 x kernel: [<c0462271>] free_block+0x63/0xdc
Nov 29 12:50:21 x kernel: [<c0462374>] drain_array+0x8a/0xb5
Nov 29 12:50:21 x kernel: [<c0463814>] cache_reap+0x85/0x117
Nov 29 12:50:21 x kernel: [<c042b210>] run_workqueue+0x83/0xc5
Nov 29 12:50:21 x kernel: [<c042bb00>] worker_thread+0xd9/0x10d
Nov 29 12:50:21 x kernel: [<c042e013>] kthread+0xc0/0xed
Nov 29 12:50:21 x kernel: [<c0402a69>] kernel_thread_helper+0x5/0xb
Nov 29 12:50:21 x kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Nov 29 12:50:21 x kernel:
Nov 29 12:50:21 x kernel: Leftover inexact backtrace:
Nov 29 12:50:21 x kernel:
Nov 29 12:50:21 x kernel: =======================
Nov 29 12:50:21 x kernel: Code: 8d 46 04 e8 86 00 00 00 8d 4b 0c 8b 51 04 8d 46 0c 83 c4 14 5b 5e 5f e9 72 00 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04 <8b> 00 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 38 0e 63 c0
Nov 29 12:50:21 x kernel: EIP: [<c04e0b51>] list_del+0x9/0x6c SS:ESP 0069:c0b4eefc
Nov 29 12:50:21 x kernel: <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
Nov 29 12:50:21 x kernel: in_atomic():0, irqs_disabled():1
Nov 29 12:50:21 x kernel: [<c0405707>] dump_trace+0x69/0x1af
Nov 29 12:50:21 x kernel: [<c0405865>] show_trace_log_lvl+0x18/0x2c
Nov 29 12:50:21 x kernel: [<c0405e05>] show_trace+0xf/0x11
Nov 29 12:50:21 x kernel: [<c0405e34>] dump_stack+0x15/0x17
Nov 29 12:50:21 x kernel: [<c0430b92>] down_read+0x12/0x20
Nov 29 12:50:21 x kernel: [<c0428c41>] blocking_notifier_call_chain+0xe/0x29
Nov 29 12:50:21 x kernel: [<c041ed09>] do_exit+0x1b/0x776
Nov 29 12:50:21 x kernel: [<c0405da6>] die+0x289/0x2ae
Nov 29 12:50:22 x kernel: [<c060abf0>] do_page_fault+0xabf/0xc3c
Nov 29 12:50:22 x kernel: [<c040502b>] error_code+0x2b/0x30
Nov 29 12:50:22 x kernel: DWARF2 unwinder stuck at error_code+0x2b/0x30
Nov 29 12:50:22 x kernel:
Nov 29 12:50:22 x kernel: Leftover inexact backtrace:
Nov 29 12:50:22 x kernel:
Nov 29 12:50:22 x kernel: [<c04e0b51>] list_del+0x9/0x6c
Nov 29 12:50:22 x kernel: [<c0462271>] free_block+0x63/0xdc
Nov 29 12:50:22 x kernel: [<c0462374>] drain_array+0x8a/0xb5
Nov 29 12:50:22 x kernel: [<c0463814>] cache_reap+0x85/0x117
Nov 29 12:50:22 x kernel: [<c042b210>] run_workqueue+0x83/0xc5
Nov 29 12:50:22 x kernel: [<c060936b>] _spin_lock_irqsave+0x12/0x17
Nov 29 12:50:22 x kernel: [<c046378f>] cache_reap+0x0/0x117
Nov 29 12:50:22 x kernel: [<c042bb00>] worker_thread+0xd9/0x10d
Nov 29 12:50:22 x kernel: [<c04178a1>] default_wake_function+0x0/0xc
Nov 29 12:50:22 x kernel: [<c042ba27>] worker_thread+0x0/0x10d
Nov 29 12:50:22 x kernel: [<c042e013>] kthread+0xc0/0xed
Nov 29 12:50:22 x kernel: [<c042df53>] kthread+0x0/0xed
Nov 29 12:50:22 x kernel: [<c0402a69>] kernel_thread_helper+0x5/0xb
Nov 29 12:50:22 x kernel: =======================
I spoke too soon, too. My system crashed the next morning at 4 AM. This has happened before. One or more cron jobs run at this time, which apparently triggered the panic. But it looks like the clocksource option makes it a bit harder to trigger the crash. Akemi
I'm having the same problems. From my post to the Fedora mailing list:

When trying to mount cifs shares with any 2.6.18 kernel on FC5 or FC6 the system becomes unstable, usually crashing with a kernel panic a few seconds after the mount. I've seen this on several machines, including a dual athlon box (i386), and an athlon64 (x86_64). A typical log for what goes on is attached below. I've also been having stability problems with 2.6.18 on my Debian Sid box, though not related to cifs mounts, so I'm wondering if this kernel release might have just escaped the barn a bit early. In every case I've had to revert to 2.6.17 and had no problems since.

Message from syslogd@mgl26 at Fri Dec 1 10:14:06 2006 ...
mgl26 kernel: ------------[ cut here ]------------
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: kernel BUG at lib/list_debug.c:65!
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: invalid opcode: 0000 [#1]
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: SMP
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: CPU: 0
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: EIP is at list_del+0x23/0x6c
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: eax: 00000048 ebx: d0b53da0 ecx: c067e1d0 edx: 00000086
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: esi: c176f944 edi: c176f920 ebp: c176f930 esp: ddfdff20
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: ds: 007b es: 007b ss: 0068
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: Process events/0 (pid: 5, ti=ddfdf000 task=dde00030 task.ti=ddfdf000)
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: Stack: c0641cb6 d0b53da0 d0b50080 d0b53da0 c046b6dc 00000005 ddd413e0 00000003
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: c176f920 ddd413e0 c146ff40 00000282 c046caf9 00000000 00000000 c13c7aa0
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: c13c7aa4 c0433c38 00000246 c146ff40 c146ff60 c046ca47 00000000 c146ff60
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: Call Trace:
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: [<c046b6dc>] drain_freelist+0x3b/0x7b
Message from syslogd@mgl26 at Fri Dec 1 10:14:07 2006 ...
mgl26 kernel: [<c046caf9>] cache_reap+0xb2/0x117
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: [<c0433c38>] run_workqueue+0x83/0xc5
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: [<c0434528>] worker_thread+0xd9/0x10d
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: [<c04369fb>] kthread+0xc0/0xed
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: [<c0404dab>] kernel_thread_helper+0x7/0x10
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: Leftover inexact backtrace:
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: =======================
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: Code: 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04 8b 00 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 b6 1c 64 c0 e8 dc bd f3 ff <0f> 0b 41 00 f3 1c 64 c0 8b 03 8b 40 04 39 d8 74 1c 89 5c 24 04
Message from syslogd@mgl26 at Fri Dec 1 10:14:08 2006 ...
mgl26 kernel: EIP: [<c04e99eb>] list_del+0x23/0x6c SS:ESP 0068:ddfdff2
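The "kernel BUG at lib/list_debug.c" line in the log above comes from the debug variant of list_del, which refuses to unlink an entry whose neighbours no longer point back at it. The following user-space sketch illustrates that kind of consistency check. It is not the kernel's code: `struct list_node`, `list_add_after` and `list_del_checked` are hypothetical names chosen for this example, and where the kernel calls BUG() the sketch simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, modelled on the kernel's
 * struct list_head (user-space illustration only). */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `entry` directly after `head`. */
static void list_add_after(struct list_node *entry, struct list_node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical checked delete: unlink `entry` only if both neighbours still
 * point back at it. Returns false and leaves the list untouched when the
 * links are inconsistent, which is the condition a debug list_del reports. */
static bool list_del_checked(struct list_node *entry)
{
    assert(entry != NULL);

    if (entry->prev->next != entry || entry->next->prev != entry)
        return false;               /* list corruption detected */

    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

The check only reports that the links were already damaged when the slab reaper walked the list; it says nothing about which code overwrote them. That is consistent with the traces in this report, where no cifs functions appear in the call stack.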
Just a quick note for those who are seeing this problem. Samba programmers have been working on this and will be posting a fix soon. I understand it might be a temporary fix but things are looking good now. Akemi
This is a patch for 1.45 version of cifs. I think this should help.

diff -u sess.c sess.c.mod
--- sess.c 2006-08-02 16:15:17.000000000 -0500
+++ sess.c.mod 2006-12-21 09:43:19.000000000 -0600
@@ -179,10 +179,9 @@
 
 	cFYI(1,("bleft %d",bleft));
 
-	/* word align, if bytes remaining is not even */
-	if(bleft % 2) {
+	/* word align, if bytes remaining is even */
+	if(!(bleft % 2)) {
 		bleft--;
-		data++;
 	}
 	words_left = bleft / 2;
 
@@ -506,6 +505,7 @@
 		/* and lanman response is 3 */
 		bytes_remaining = BCC(smb_buf);
 		bcc_ptr = pByteArea(smb_buf);
+		bcc_ptr++;
 
 		if(smb_buf->WordCount == 4) {
 			__u16 blob_len;
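Read purely as arithmetic, the two versions of the alignment test in the diff above differ only when the remaining byte count is even. The sketch below restates that arithmetic in isolation. It is one reading of the posted diff, not the cifs code: the function names are hypothetical, and it assumes the count passed to the patched branch still includes the pad byte that the caller steps over with bcc_ptr++, which the diff itself does not show.

```c
#include <assert.h>

/* 16-bit characters counted by the ORIGINAL test: a pad byte is dropped
 * only when the remaining byte count is odd. */
static int words_left_original(int bleft)
{
    assert(bleft >= 0);
    if (bleft % 2)
        bleft--;        /* the original code also advanced the data pointer here */
    return bleft / 2;
}

/* 16-bit characters counted by the PATCHED test: the caller is assumed to
 * have skipped one pad byte already, so an even count loses one byte here
 * and an odd count loses it in the integer division. */
static int words_left_patched(int bleft)
{
    assert(bleft >= 1);
    if (!(bleft % 2))
        bleft--;
    return bleft / 2;
}
```

Under these assumptions, a 10-byte area yields 5 words with the original test but 4 with the patched one, while a 9-byte area yields 4 words either way. In other words, the patched version always treats the first byte as padding instead of deciding from the parity of the count.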
I have two test machines running with the patch provided by Shirish. Both used to have system lockups before the patch. After the patch was applied, I have not seen a single kernel oops/crash on either machine. This is with a number of mounts/umounts/reboots. The test kernel was 2.6.18-1.2868.fc6 compiled with the above patch. Later, I installed the same kernel using rpm's and replaced cifs.ko with my patched version. That worked, too. Akemi
Has anyone tried this against Win9x (or OS/2) or anything else which is ASCII only? At first glance it looks odd that the bcc is updated even for the ASCII case.
This is the patch: http://www.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commitdiff;h=8e6f195af0e1f226e9b2e0256af8df46adb9d595 It is slightly different from the one I posted above. What is the process for getting it into existing distros such as RHEL5? It is not needed in RHEL4.
Just applied the latest patch to FC6 and it worked (although your earlier patch also worked in my case). I want to see this patch included in the kernel which would eventually be propagated to all distros. But it is more important that Fedora Core gets fixed now. With the demise of Fedora Legacy, FC5 will be EOL'd sooner than expected. Those who cannot move on to FC6 because of this cifs bug will be in trouble if the revised kernel is not made available in time. Akemi
Just noticed today that there is a new FC5 kernel in the testing directory (2.6.19-1.2287.fc5). Apparently, this does not have the cifs fix. Dave, can you tell us if/when this cifs patch gets included? Akemi
News! The cifs patch has been included in the latest kernels which are available from the Fedora testing directory. FC5 is: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/5/ FC6 is: http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/ Thank you, Chuck and Dave. Akemi
*** Bug 221787 has been marked as a duplicate of this bug. ***