Bug 214622

Summary: oops when mounting cifs
Product: [Fedora] Fedora
Reporter: Need Real Name <jon>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: urgent
Priority: medium
CC: amyagi, bjoern.robbe, cnighswonger, johnny, master, shirishp, smfrench, ugo.viti, wtogami
Hardware: i686
OS: Linux
Fixed In Version: 2.6.19-1.2911.6.5.fc6
Doc Type: Bug Fix
Last Closed: 2007-04-18 23:46:05 UTC

Attachments:

Description	Flags
kernel (2.6.19) message with cifs debugging on	none
Description Need Real Name 2006-11-08 17:48:16 UTC

Description of problem:  The system oopses when attempting to mount a Windows
file share.


Version-Release number of selected component (if applicable): Linux x
2.6.18-1.2798.fc6xen #1 SMP Mon Oct 16 15:11:19 EDT 2006 i686 i686 i386 GNU/Linux

How reproducible: Happens about 1 in 5 times


Steps to Reproduce:
1. create an fstab entry like:
//winserve/share  /mnt/windows  cifs  noauto,user,username=myname,password=xxxxx,workgroup=mydomain.com  0 0

2. type mount /mnt/windows

  
Actual results:
Nov  7 23:04:51 x kernel: BUG: unable to handle kernel paging request at virtual
address 0080b4e4
Nov  7 23:04:51 x kernel:  printing eip:
Nov  7 23:04:51 x kernel: c04e0621
Nov  7 23:04:51 x kernel: 29ac6000 -> *pde = 00000000:151f2001
Nov  7 23:04:51 x kernel: 291f2000 -> *pme = 00000000:00000000
Nov  7 23:04:51 x kernel: Oops: 0000 [#1]
Nov  7 23:04:51 x kernel: SMP
Nov  7 23:04:51 x kernel: last sysfs file: /power/state
Nov  7 23:04:51 x kernel: Modules linked in: nls_utf8 cifs bridge netloop netbk
blktap blkbk hidp l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video sbs
i2c_ec i2c_core button battery asus_acpi ac ipv6 parport_pc lp parport floppy
snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i82875p_edac edac_mc pcspkr ide_cd cdrom
serio_raw tg3 ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Nov  7 23:04:51 x kernel: CPU:    1
Nov  7 23:04:51 x kernel: EIP:    0061:[<c04e0621>]    Not tainted VLI
Nov  7 23:04:51 x kernel: EFLAGS: 00010096   (2.6.18-1.2798.fc6xen #1)
Nov  7 23:04:51 x kernel: EIP is at list_del+0x9/0x6c
Nov  7 23:04:51 x kernel: eax: 0080b4e4   ebx: dfd57b20   ecx: d9072a40   edx:
00000000
Nov  7 23:04:51 x kernel: esi: ed7fd7c0   edi: c3386000   ebp: c0d39dc0   esp:
c0b4defc
Nov  7 23:04:51 x kernel: ds: 007b   es: 007b   ss: 0069
Nov  7 23:04:51 x kernel: Process events/1 (pid: 9, ti=c0b4d000 task=ed7c2070
task.ti=c0b4d000)
Nov  7 23:04:51 x kernel: Stack: c156c3c0 c3ed28a0 ed7fd7c0 dfd57b20 c0461f61
c0b066c0 00000014 0000000d


Expected results:
A mounted filesystem

Additional info:
[root@x log]# uname -a
Linux x 2.6.18-1.2798.fc6xen #1 SMP Mon Oct 16 15:11:19 EDT 2006 i686 i686 i386
GNU/Linux
[root@x log]# cat /etc/redhat-release 
Fedora Core release 6 (Zod)
[root@x log]# rpm -q -a | grep -i kernel
kernel-headers-2.6.18-1.2798.fc6
kernel-doc-2.6.18-1.2798.fc6
kernel-2.6.18-1.2798.fc6
kernel-xen-2.6.18-1.2798.fc6
kernel-devel-2.6.18-1.2798.fc6
[root@x log]# ls -la /boot/
total 9393
drwxr-xr-x  4 root root    1024 Oct 30 16:34 .
drwxr-xr-x 25 root root    4096 Nov  8 10:01 ..
-rw-r--r--  1 root root   70411 Oct 16 14:49 config-2.6.18-1.2798.fc6
-rw-r--r--  1 root root   65249 Oct 16 15:23 config-2.6.18-1.2798.fc6xen
drwxr-xr-x  2 root root    1024 Oct 30 16:34 grub
-rw-------  1 root root 1515095 Oct 30 16:34 initrd-2.6.18-1.2798.fc6.img
-rw-------  1 root root 1515217 Oct 30 07:48 initrd-2.6.18-1.2798.fc6xen.img
drwx------  2 root root   12288 Oct 30 07:42 lost+found
-rw-r--r--  1 root root   95207 Oct 16 14:49 symvers-2.6.18-1.2798.fc6.gz
-rw-r--r--  1 root root   95032 Oct 16 15:23 symvers-2.6.18-1.2798.fc6xen.gz
-rw-r--r--  1 root root  887248 Oct 16 14:49 System.map-2.6.18-1.2798.fc6
-rw-r--r--  1 root root  865778 Oct 16 15:23 System.map-2.6.18-1.2798.fc6xen
-rw-r--r--  1 root root 1815804 Oct 16 14:49 vmlinuz-2.6.18-1.2798.fc6
-rw-r--r--  1 root root 1728127 Oct 16 15:23 vmlinuz-2.6.18-1.2798.fc6xen
-rw-r--r--  1 root root  272336 Oct 16 14:34 xen.gz-2.6.18-1.2798.fc6
-rwxr-xr-x  1 root root  607044 Oct 16 15:55 xen-syms-2.6.18-1.2798.fc6

[root@x log]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 2992.634
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 7485.48

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 2992.634
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc up pni monitor ds_cpl cid xtpr
bogomips        : 7485.48

[root@x log]# cat /proc/meminfo 
MemTotal:      1939744 kB
MemFree:         18384 kB
Buffers:         19516 kB
Cached:        1475352 kB
SwapCached:          0 kB
Active:         938900 kB
Inactive:       828940 kB
HighTotal:     1203752 kB
HighFree:         4788 kB
LowTotal:       735992 kB
LowFree:         13596 kB
SwapTotal:     3071872 kB
SwapFree:      3071728 kB
Dirty:             204 kB
Writeback:           0 kB
AnonPages:      272944 kB
Mapped:          79688 kB
Slab:           106408 kB
PageTables:       6952 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   4041744 kB
Committed_AS:   659736 kB
VmallocTotal:   114680 kB
VmallocUsed:      4996 kB
VmallocChunk:   109384 kB

Comment 1 Akemi Yagi 2006-11-08 22:56:22 UTC

This is the problem I am having and reported in Bug 211672.  It happens on
FC5 (kernel 2.6.18) as well.  Up to the 2.6.17 kernel, no such lockups occurred.
Also, if I do not do a cifs mount, the system is stable.  The current reporter's
kernel is not tainted, so the root cause is most likely unrelated to the nvidia
driver, which was initially suspected (see Bug 211672).

Comment 2 Need Real Name 2006-11-09 18:24:09 UTC

I am not using Nvidia hardware, nor have I installed any third-party software.
This system is a fresh install of Fedora Core 6 with no outside software
installed.  The system is 100% stable until I mount a Windows share.  If the
system does not crash within 30 seconds of mounting the share, then it stays
stable.

Comment 3 Akemi Yagi 2006-11-09 18:42:02 UTC

The only difference between your case and mine is that my systems (actually this
happens on two different machines) crash eventually if not immediately. 
Probably 1 out of 5 times is an immediate lockup and other times may be a few
minutes or a few hours after the cifs mount.  As a workaround, I went back to
FC5 kernel 2.6.17 on both machines.  They are very stable.

Comment 4 Akemi Yagi 2006-11-15 00:03:23 UTC

This is just another confirmation.  I installed kernel 2.6.18-1.2849.fc6 and all
updates available as of 2006-11-14.  The problem persisted and the system locked
up a few minutes after I did a cifs mount.  It is very clear now:

kernels prior to 2.6.18 with cifs mount = stable 
kernel 2.6.18 without cifs mount = stable

Comment 5 Dave Jones 2006-11-24 21:19:36 UTC

Can you try with the work-in-progress kernel at
http://people.redhat.com/davej/kernels/Fedora/

There are some CIFS fixes in there which *might* help.

Comment 6 Akemi Yagi 2006-11-24 22:17:05 UTC

Just tried 2856.fc6.  As soon as I cifs-mounted a share, system crashed.

Comment 7 Steve French 2006-11-25 17:01:55 UTC

Puzzling that no call stack is listed to dmesg (where is list_del being called
from ...). Is it filtered out by some setting in FC?

In the absence of information showing where the problem is, could you get debug
information as follows:

1) load cifs module (e.g. "/sbin/modprobe cifs")
2) turn on cifs debugging flags (e.g. "echo 7 > /proc/fs/cifs/cifsFYI")
3) clear the message buffer of errors ("dmesg -c")
4) attempt the mount (e.g. "mount -t cifs //winserve/share /mnt/windows
       -o user,username=myname,password=xxxxx,workgroup=MYNETBIOSDOM")
5) after the failure, turn off debugging ("echo 0 > /proc/fs/cifs/cifsFYI")
6) save the debugging output ("dmesg > outputfile")
7) modify the outputfile, anonymizing the server name etc. if you wish
8) attach the outputfile (or for more privacy send directly to me and to Dave)
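
Steps 1 through 6 above can be scripted so that nothing is skipped between the mount and the failure. The following is only a sketch in Python, not something supplied by the cifs maintainers: the names `capture_steps` and `run` are made up for illustration, the mount options are passed with `-o`, and it defaults to a dry run that only prints the commands. If the machine locks up hard, the final `dmesg` step never runs, so the log may still have to be captured some other way.

```python
import subprocess

# Debug switch named in step 2 of the instructions.
CIFS_FYI = "/proc/fs/cifs/cifsFYI"


def capture_steps(share, mountpoint, options, outfile="outputfile"):
    """Return the shell commands for steps 1-6, in the order given."""
    return [
        "/sbin/modprobe cifs",                               # 1) load cifs
        f"echo 7 > {CIFS_FYI}",                              # 2) debugging on
        "dmesg -c",                                          # 3) clear buffer
        f"mount -t cifs {share} {mountpoint} -o {options}",  # 4) try the mount
        f"echo 0 > {CIFS_FYI}",                              # 5) debugging off
        f"dmesg > {outfile}",                                # 6) save the log
    ]


def run(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            # check=False: keep going after a failed mount so the
            # debug output is still written to the output file.
            subprocess.run(cmd, shell=True, check=False)


if __name__ == "__main__":
    # Placeholder share and credentials taken from the example in step 4.
    run(capture_steps("//winserve/share", "/mnt/windows",
                      "username=myname,password=xxxxx,workgroup=MYNETBIOSDOM"))
```

To actually run the steps, call `run(..., dry_run=False)` as root; steps 7 and 8 (anonymizing and attaching the file) remain manual.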

Comment 8 Akemi Yagi 2006-11-25 17:18:22 UTC

Hi Steve,

Did you have a chance to look at the attachment I posted in Bug#211672 ?  It was
done with cifs debugging turned on.  I also referred to a recent post on the
kernel mailing list by someone who experienced a cifs-related kernel panic. An
excerpt here:

"I am seeing a kernel panic in cifs module. It seems to be a result of
invalid inode entry in dentry for the file it is trying to validate.

The inode->i_ino is set zero and inode->i_mapping is set to NULL in
the inode pointer in the dentry (0xdf8ea200) structure. I went through
the cifs code and could not find any valid case that could trigger
this situation. Is there any case which can lead to this situation?"

Akemi

Comment 9 Akemi Yagi 2006-11-29 01:01:17 UTC

Question for the original poster:

Is your system multi-processor or multi-core?  If so, could you try booting your
system with the clocksource=acpi_pm option as I posted in Bug#211672 ?

Akemi
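
To confirm that the clocksource=acpi_pm boot option actually took effect, the active clocksource can be read back from sysfs. This is a minimal sketch, assuming the kernel exposes the usual `clocksource0` directory under /sys/devices/system/clocksource; that path exists on current kernels but has not been verified on every 2.6.18 build, and the function name `read_clocksource` is illustrative.

```python
from pathlib import Path

# Assumed sysfs location of the clocksource controls.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksource()
    except OSError as exc:
        # Directory missing or unreadable on this kernel.
        print(f"could not read clocksource info: {exc}")
    else:
        print(f"current: {current}; available: {', '.join(available)}")
```

After booting with the option, the first value would be expected to read `acpi_pm`.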

Comment 10 Steve French 2006-11-29 02:34:47 UTC

I can think of no case in this version of cifs that could lead to the inode
being null in the case described - but the bug reported here was in list_del,
not in dereferencing a null pointer from the (presumably freed) inode.

Comment 11 Need Real Name 2006-11-29 17:12:01 UTC

Yes, the solution posted in Bug 211672 (clocksource=acpi_pm) has fixed my issue
as well.  CPU information:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 2992.602
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 7485.77

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 2992.602
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc up pni monitor ds_cpl cid xtpr
bogomips        : 7485.77

Comment 12 Need Real Name 2006-11-29 17:59:20 UTC

Sorry, I spoke too soon.  After another 45 minutes of using the system with a
CIFS mount active, my system had a panic:

Nov 29 12:50:21 x kernel: BUG: unable to handle kernel paging request at virtual
address 0080b4e4
Nov 29 12:50:21 x kernel:  printing eip:
Nov 29 12:50:21 x kernel: c04e0b51
Nov 29 12:50:21 x kernel: 2c0d3000 -> *pde = 00000000:14eb0001
Nov 29 12:50:21 x kernel: 292b0000 -> *pme = 00000000:14ed4067
Nov 29 12:50:21 x kernel: 292d4000 -> *pte = 00000000:00000000
Nov 29 12:50:21 x kernel: Oops: 0000 [#1]
Nov 29 12:50:21 x kernel: SMP
Nov 29 12:50:21 x kernel: last sysfs file: /power/state
Nov 29 12:50:21 x kernel: Modules linked in: nls_utf8 cifs bridge netloop netbk
blktap blkbk hidp l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video sbs
i2c_ec i2c_core button battery asus_acpi ac sg ipv6 parport_pc lp parport floppy
snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore pcspkr tg3 snd_page_alloc i82875p_edac serio_raw edac_mc
usb_storage ide_cd cdrom ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd
ohci_hcd uhci_hcd
Nov 29 12:50:21 x kernel: CPU:    0
Nov 29 12:50:21 x kernel: EIP:    0061:[<c04e0b51>]    Not tainted VLI
Nov 29 12:50:21 x kernel: EFLAGS: 00010096   (2.6.18-1.2849.fc6xen #1)
Nov 29 12:50:21 x kernel: EIP is at list_del+0x9/0x6c
Nov 29 12:50:21 x kernel: eax: 0080b4e4   ebx: e8897f20   ecx: 00000006   edx:
00000000
Nov 29 12:50:21 x kernel: esi: ed7fd7c0   edi: e89b3000   ebp: c0d39dc0   esp:
c0b4eefc
Nov 29 12:50:21 x kernel: ds: 007b   es: 007b   ss: 0069
Nov 29 12:50:21 x kernel: Process events/0 (pid: 8, ti=c0b4e000 task=ed7c25e0
task.ti=c0b4e000)
Nov 29 12:50:21 x kernel: Stack: c14fc400 e7f2d040 ed7fd340 e8897f20 c0462271
c0b1ca80 00000006 00000000
Nov 29 12:50:21 x kernel:        ed7f8220 ed7f8220 00000006 ed7f8200 00000000
c0462374 00000000 00000000
Nov 29 12:50:21 x kernel:        c0d39dc0 ed7fd7e4 ed7fd7c0 c0d39dc0 ed7c9cc0
00000000 c0463814 00000000
Nov 29 12:50:21 x kernel: Call Trace:
Nov 29 12:50:21 x kernel:  [<c0462271>] free_block+0x63/0xdc
Nov 29 12:50:21 x kernel:  [<c0462374>] drain_array+0x8a/0xb5
Nov 29 12:50:21 x kernel:  [<c0463814>] cache_reap+0x85/0x117
Nov 29 12:50:21 x kernel:  [<c042b210>] run_workqueue+0x83/0xc5
Nov 29 12:50:21 x kernel:  [<c042bb00>] worker_thread+0xd9/0x10d
Nov 29 12:50:21 x kernel:  [<c042e013>] kthread+0xc0/0xed
Nov 29 12:50:21 x kernel:  [<c0402a69>] kernel_thread_helper+0x5/0xb
Nov 29 12:50:21 x kernel: DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Nov 29 12:50:21 x kernel:
Nov 29 12:50:21 x kernel: Leftover inexact backtrace:
Nov 29 12:50:21 x kernel:
Nov 29 12:50:21 x kernel:  =======================
Nov 29 12:50:21 x kernel: Code: 8d 46 04 e8 86 00 00 00 8d 4b 0c 8b 51 04 8d 46
0c 83 c4 14 5b 5e 5f e9 72 00 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04
<8b> 00 39 d8 74 1c 89 5c 24 04 89 44 24 08 c7 04 24 38 0e 63 c0
Nov 29 12:50:21 x kernel: EIP: [<c04e0b51>] list_del+0x9/0x6c SS:ESP 0069:c0b4eefc
Nov 29 12:50:21 x kernel:  <3>BUG: sleeping function called from invalid context
at kernel/rwsem.c:20
Nov 29 12:50:21 x kernel: in_atomic():0, irqs_disabled():1
Nov 29 12:50:21 x kernel:  [<c0405707>] dump_trace+0x69/0x1af
Nov 29 12:50:21 x kernel:  [<c0405865>] show_trace_log_lvl+0x18/0x2c
Nov 29 12:50:21 x kernel:  [<c0405e05>] show_trace+0xf/0x11
Nov 29 12:50:21 x kernel:  [<c0405e34>] dump_stack+0x15/0x17
Nov 29 12:50:21 x kernel:  [<c0430b92>] down_read+0x12/0x20
Nov 29 12:50:21 x kernel:  [<c0428c41>] blocking_notifier_call_chain+0xe/0x29
Nov 29 12:50:21 x kernel:  [<c041ed09>] do_exit+0x1b/0x776
Nov 29 12:50:21 x kernel:  [<c0405da6>] die+0x289/0x2ae
Nov 29 12:50:22 x kernel:  [<c060abf0>] do_page_fault+0xabf/0xc3c
Nov 29 12:50:22 x kernel:  [<c040502b>] error_code+0x2b/0x30
Nov 29 12:50:22 x kernel: DWARF2 unwinder stuck at error_code+0x2b/0x30
Nov 29 12:50:22 x kernel:
Nov 29 12:50:22 x kernel: Leftover inexact backtrace:
Nov 29 12:50:22 x kernel:
Nov 29 12:50:22 x kernel:  [<c04e0b51>] list_del+0x9/0x6c
Nov 29 12:50:22 x kernel:  [<c0462271>] free_block+0x63/0xdc
Nov 29 12:50:22 x kernel:  [<c0462374>] drain_array+0x8a/0xb5
Nov 29 12:50:22 x kernel:  [<c0463814>] cache_reap+0x85/0x117
Nov 29 12:50:22 x kernel:  [<c042b210>] run_workqueue+0x83/0xc5
Nov 29 12:50:22 x kernel:  [<c060936b>] _spin_lock_irqsave+0x12/0x17
Nov 29 12:50:22 x kernel:  [<c046378f>] cache_reap+0x0/0x117
Nov 29 12:50:22 x kernel:  [<c042bb00>] worker_thread+0xd9/0x10d
Nov 29 12:50:22 x kernel:  [<c04178a1>] default_wake_function+0x0/0xc
Nov 29 12:50:22 x kernel:  [<c042ba27>] worker_thread+0x0/0x10d
Nov 29 12:50:22 x kernel:  [<c042e013>] kthread+0xc0/0xed
Nov 29 12:50:22 x kernel:  [<c042df53>] kthread+0x0/0xed
Nov 29 12:50:22 x kernel:  [<c0402a69>] kernel_thread_helper+0x5/0xb
Nov 29 12:50:22 x kernel:  =======================
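
When comparing oopses across reports, the useful fields are the faulting function with its offset and the kernel build. Below is a small sketch that pulls both out of a pasted log. The regular expressions and the name `oops_site` are illustrative assumptions, based only on the line formats visible in the traces in this report.

```python
import re

# "EIP is at list_del+0x9/0x6c" -> function, offset, function size
EIP_RE = re.compile(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")
# "EFLAGS: 00010096   (2.6.18-1.2798.fc6xen #1)" -> kernel release
RELEASE_RE = re.compile(r"EFLAGS: [0-9a-f]+\s+\((\S+)")


def oops_site(log_text):
    """Return the first crash site found in log_text, or None."""
    eip = EIP_RE.search(log_text)
    if eip is None:
        return None
    release = RELEASE_RE.search(log_text)
    return {
        "function": eip.group(1),
        "offset": int(eip.group(2), 16),
        "size": int(eip.group(3), 16),
        "kernel": release.group(1) if release else None,
    }


if __name__ == "__main__":
    first = ("Nov  7 23:04:51 x kernel: EFLAGS: 00010096   (2.6.18-1.2798.fc6xen #1)\n"
             "Nov  7 23:04:51 x kernel: EIP is at list_del+0x9/0x6c\n")
    second = ("Nov 29 12:50:21 x kernel: EFLAGS: 00010096   (2.6.18-1.2849.fc6xen #1)\n"
              "Nov 29 12:50:21 x kernel: EIP is at list_del+0x9/0x6c\n")
    a, b = oops_site(first), oops_site(second)
    same_site = (a["function"], a["offset"]) == (b["function"], b["offset"])
    print(a["kernel"], b["kernel"], "same crash site:", same_site)
```

Run on the two traces from this reporter, it shows the same function and offset on two different kernel builds, which fits a single underlying problem rather than two unrelated crashes.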

Comment 13 Akemi Yagi 2006-11-29 18:59:49 UTC

I spoke too soon, too.  My system crashed next morning at 4 AM.  This has
happened before.  One or more cronjobs run at this time, which apparently caused
the panic.  But it looks like the clocksource option makes it a bit harder to
trigger the crash.

Akemi

Comment 14 Dan Carpenter 2006-11-30 20:46:41 UTC

There was possibly corruption going on in slab.c in bug 216001 on some of these
kernels, but I don't see how it could end up affecting only cifs...  On the
other hand it's really easy to test one of Dave's new kernels with the slab fix.

Comment 15 Akemi Yagi 2006-11-30 21:22:37 UTC

If the new kernels you referred to are the ones in Dave's message (comment#5),
then that did not fix the problem I am seeing (comment#6).  I am wondering if
this post in LKML is related to mine:

http://lkml.org/lkml/2006/11/29/156

Comment 16 Akemi Yagi 2006-11-30 21:48:12 UTC

Dave,

2.6.19 is out.  If/when you have built this version, I would like to try it.

Akemi

Comment 17 Chris Nighswonger 2006-11-30 23:21:14 UTC

Hi all,
  Not finding this bug first, I reported similar problems in bug 217915. I 
attached some portions of my 'messages' log at the following link... 

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=142506&action=view

  This occurs on two different hardware configurations both running FC6 but 
different kernel builds:

System A: 2.6.18-1.2849.fc6
System B: 2.6.18-1.2798.fc6

  I realize all of this agrees with the above comments, but maybe the attached 
log cuts will help some.

Chris

Comment 18 Akemi Yagi 2006-12-01 16:04:10 UTC

I compiled 2.6.19 from kernel.org.  The problem persisted.

Akemi

Comment 19 Akemi Yagi 2006-12-03 17:06:24 UTC

Created attachment 142683
kernel (2.6.19) message with cifs debugging on

While compiling 2.6.19, I selected the "more debugging for cifs" option.  Then I
enabled cifs debugging with "echo 7 > /proc/fs/cifs/cifsFYI".  The attached file
is an example of a crash log, from the moment a Windows share is cifs-mounted
through the lockup.

Akemi

Comment 20 Steve French 2006-12-04 03:19:34 UTC

The two log files confirm to me that we are not in the middle of a cifs request,
although in both cases a cifs readdir did occur after the mount but before the
oops (a statfs was sandwiched in between in one of the two logs).  It is
remotely possible that a readdir corrupted memory but seems a long shot.

I wish we could track someone down who understands what this cache_reap kernel
thread is doing ... we need to narrow down who is corrupting this list.

Comment 21 Steve French 2006-12-04 03:20:20 UTC

The two log files confirm to me that we are not in the middle of a cifs request,
although in both cases a cifs readdir did occur after the mount but before the
oops (a statfs was sandwiched in between in one of the two logs).  It is
remotely possible that a readdir corrupted memory but seems a long shot.

I wish we could track someone down who understands what this cache_reap kernel
thread is doing ... we need to narrow down who is corrupting this list; it is
not obvious to me how cifs could affect this list.

Comment 22 Akemi Yagi 2006-12-08 02:00:36 UTC

Just for the record.  I installed kernel-2.6.18-1.2860.fc6 in testing.  The
system crashed as soon as I did a cifs-mount.

Akemi

Comment 23 Ugo Viti 2006-12-20 22:20:06 UTC

I can confirm this serious problem.

The kernel oopses (complete machine hang, hard reset needed) at random when
copying data from a mounted cifs share, when unmounting a share, or when the
cifs module is unloaded.

Tested on FC6 with kernels from 2.6.18-1.2798.fc6 to 2.6.18-1.2868.fc6.

I also tried kernel 2.6.19-1.2877.fc7 (2.6.20-rc1) from the FC7 development tree
(with mkinitrd and nash updated to use this kernel), and the system still hangs.

I force-installed kernel-2.6.17-1.2157_FC5 on my FC6 box, and the system is
rock solid now (it has never crashed) during cifs mount/umount/copy operations.

So I think this is definitely a kernel bug.

I haven't tried the vanilla 2.6.18/2.6.19 kernels.

Best Regards

Comment 24 Akemi Yagi 2006-12-20 22:29:16 UTC

Just a quick note for those who are seeing this problem.  Samba programmers have
been working on this and will be posting a fix soon.  I understand it might be a
temporary fix but things are looking good now.

Akemi

Comment 25 Shirish S. Pargaonkar 2006-12-21 18:49:17 UTC

This is a patch for version 1.45 of cifs.  I think this should help fix the problem.

diff -u sess.c sess.c.mod
--- sess.c      2006-08-02 16:15:17.000000000 -0500
+++ sess.c.mod  2006-12-21 09:43:19.000000000 -0600
@@ -179,10 +179,9 @@
        cFYI(1,("bleft %d",bleft));


-       /* word align, if bytes remaining is not even */
-       if(bleft % 2) {
+       /* word align, if bytes remaining is even */
+       if(!(bleft % 2)) {
                bleft--;
-               data++;
        }
        words_left = bleft / 2;

@@ -506,6 +505,7 @@
        /* and lanman response is 3 */
        bytes_remaining = BCC(smb_buf);
        bcc_ptr = pByteArea(smb_buf);
+       bcc_ptr++;

        if(smb_buf->WordCount == 4) {
                __u16 blob_len;

Comment 26 Akemi Yagi 2006-12-22 18:03:23 UTC

I have two test machines running with the patch provided by Shirish.  Both used
to have system lockups before the patch.  After the patch was applied, I have
not seen a single kernel oops/crash on either machine.  This is with a number of
mounts/umounts/reboots.

The test kernel was 2.6.18-1.2868.fc6 compiled with the above patch.  Later, I
installed the same kernel from the RPMs and replaced cifs.ko with my patched
version.  That worked, too.

Akemi

Comment 27 Akemi Yagi 2007-01-25 16:41:18 UTC

This is a duplicate of Bug 211672.  Please refer to that report because the
latest patch has been posted there.

Akemi

Comment 28 Dave Bradley 2007-01-25 16:59:44 UTC

Does anyone know if this patch has made it into FC6 xen kernels? I'm having the
cifs crashing problem on my xen machines and am running a completely up to date
FC6 system.

To make the crash happen, I just manually run my backup jobs (they mount a
Windows share).


[root@firewall2 log]# uname -a
Linux firewall2.xxxxxxxxxx.com 2.6.19-1.2895.fc6xen #1 SMP Wed Jan 10 19:47:12
EST 2007 i686 athlon i386 GNU/Linux

[root@firewall2 cron.daily]# list_del corruption. next->prev should be c69fd480,
but was 0000000e
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:70!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /block/ram0/range
Modules linked in: nls_utf8 cifs ipv6 autofs4 hidp l2cap bluetooth iptable_raw
xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS
ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE
ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah
ipt_addrtype ip_nat_tftp ip_nat_snmp_basic ip_nat_pptp ip_nat_irc ip_nat_ftp
ip_nat_amanda ip_conntrack_tftp ip_conntrack_pptp ip_conntrack_netbios_ns
ip_conntrack_irc ip_conntrack_ftp ts_kmp ip_conntrack_amanda xt_tcpmss
xt_pkttype xt_physdev bridge xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit
xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY
xt_tcpudp xt_state iptable_nat ip_nat ip_conntrack iptable_mangle nfnetlink
iptable_filter ip_tables x_tables sunrpc xennet parport_pc lp parport pcspkr
dm_snapshot dm_zero dm_mirror dm_mod raid456 xor ext3 jbd xenblk
CPU:    0
EIP:    0061:[<c04e9d30>]    Not tainted VLI
EFLAGS: 00010082   (2.6.19-1.2895.fc6xen #1)
EIP is at list_del+0x48/0x6c
eax: 00000048   ebx: c69fd480   ecx: c0683b30   edx: f5416000
esi: c117f6a0   edi: c0902000   ebp: c0d5cca0   esp: c0d2def0
ds: 007b   es: 007b   ss: 0069
Process events/0 (pid: 5, ti=c0d2d000 task=c006e030 task.ti=c0d2d000)
Stack: c0646193 c69fd480 0000000e c69fd480 c0467706 c078afc0 c0686700 c117ecc0 
       00000005 00000004 c117fed0 c117fec0 00000005 c117fea0 00000000 c0467809 
       00000000 00000000 c0d5cca0 c117f6c4 c117f6a0 c0d5cca0 c0d404a0 00000000 
Call Trace:
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468df0>] cache_reap+0x53/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
 =======================
Code: c0 e8 9a 4b f3 ff 0f 0b 41 00 82 61 64 c0 8b 03 8b 40 04 39 d8 74 1c 89 5c
24 04 89 44 24 08 c7 04 24 93 61 64 c0 e8 75 4b f3 ff <0f> 0b 46 00 82 61 64 c0
8b 13 8b 43 04 89 42 04 89 10 c7 43 04 
EIP: [<c04e9d30>] list_del+0x48/0x6c SS:ESP 0069:c0d2def0
 <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1
 [<c04056ff>] dump_trace+0x69/0x1b6
 [<c0405864>] show_trace_log_lvl+0x18/0x2c
 [<c0405e4b>] show_trace+0xf/0x11
 [<c0405e7a>] dump_stack+0x15/0x17
 [<c0433252>] down_read+0x12/0x28
 [<c042aca2>] blocking_notifier_call_chain+0xe/0x29
 [<c0420d75>] do_exit+0x1b/0x787
 [<c0405dec>] die+0x2af/0x2d4
 [<c0406262>] do_invalid_op+0xa2/0xab
 [<c0619deb>] error_code+0x2b/0x30
 [<c04e9d30>] list_del+0x48/0x6c
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468df0>] cache_reap+0x53/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
 =======================
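
The "list_del corruption. next->prev should be ..., but was ..." line in this trace is raised from lib/list_debug.c, as the "kernel BUG at" line shows. As far as can be told from the message, a sanity check runs before an entry is unlinked: both neighbours must still point back at it. The toy model below illustrates that kind of check in Python. It is not the kernel code; the class and function names are made up, and the stray value 0x0e simply mirrors the one in the message above.

```python
class ListCorruption(Exception):
    """Stands in for the kernel BUG() raised on a failed check."""


class Node:
    def __init__(self, name):
        self.name = name
        self.prev = self  # an empty circular list points at itself
        self.next = self


def list_add(new, head):
    """Insert new right after head."""
    nxt = head.next
    new.prev, new.next = head, nxt
    nxt.prev = new
    head.next = new


def list_del(entry):
    """Unlink entry, first checking that both neighbours point back at it."""
    if entry.prev.next is not entry:
        raise ListCorruption("list_del corruption. prev->next should be %r, "
                             "but was %r" % (entry.name, entry.prev.next))
    if entry.next.prev is not entry:
        raise ListCorruption("list_del corruption. next->prev should be %r, "
                             "but was %r" % (entry.name, entry.next.prev))
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = entry.next = None  # the kernel stores poison values here


if __name__ == "__main__":
    head, slab = Node("head"), Node("slab")
    list_add(slab, head)
    head.prev = 0x0E  # something else scribbled over the neighbour's pointer
    try:
        list_del(slab)
    except ListCorruption as exc:
        print(exc)
```

The point of the model is that the code calling list_del is only the place where the damage is noticed; whatever overwrote the neighbour's pointer ran earlier, which is why the traces show the slab reaper rather than cifs.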

Comment 29 Dave Bradley 2007-01-25 17:18:08 UTC

OK -- I wrote a little script to mount/unmount a Windows share. Within 10
cycles, I get the oops.

No reading, no writing to the Windows machine. Just a mount and unmount.
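
The script itself was not posted. A reproducer along these lines might look like the sketch below, written in Python. It is a guess, not the original script: it assumes the share has an fstab entry so that a bare `mount <mountpoint>` works, the mount point /mnt/windows is only a placeholder, and the function names are made up. It defaults to a dry run because a real run is expected to crash an affected machine.

```python
import subprocess


def cycle_commands(mountpoint, cycles=10):
    """Return the mount/umount command pairs for the given number of cycles."""
    commands = []
    for _ in range(cycles):
        commands.append(["mount", mountpoint])
        commands.append(["umount", mountpoint])
    return commands


def run_cycles(mountpoint, cycles=10, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for cmd in cycle_commands(mountpoint, cycles):
        print(" ".join(cmd))
        if not dry_run:
            # check=True: stop at the first failing mount or umount.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_cycles("/mnt/windows", cycles=10)
```

No file is read or written on the share, which matches the observation that mounting and unmounting alone is enough to trigger the oops.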

Comment 30 Akemi Yagi 2007-01-25 18:22:22 UTC

Sounds like you have hit this bug.  I am afraid it may take some more time
before the fix is included in any version of FC (including FC6).  In the
meantime, you may have to apply the patch posted in Bug 211672.

Akemi

Comment 31 Shirish S. Pargaonkar 2007-01-25 18:44:52 UTC

Even though this patch is listed in Bug 211672, I am listing it here as well:

http://www.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commitdiff;h=8e6f195af0e1f226e9b2e0256af8df46adb9d595

Comment 32 Akemi Yagi 2007-02-13 05:07:14 UTC

News! The cifs patch has been included in the latest kernels which are available
from the Fedora testing directory.

FC5 is:
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/5/

FC6 is:
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/6/

Thank you, Chuck and Dave.

Akemi

Comment 33 Johnny Hughes 2007-04-18 11:43:55 UTC

OK ... we are also tracking this issue in the CentOS-5 bug tracker as it affects
our compiled kernel.

We have created some cifs.ko kernel modules that should work on any of the
2.6.18-8.x.el5 kernels for el5 i686 and x86_64 (including xen and PAE).  So if
anyone has to make this work now before an official fix makes it out, you can
try our modules and/or review the CentOS bug here:

http://bugs.centos.org/view.php?id=1776

Comment 34 Jeff Layton 2008-01-15 11:51:27 UTC

*** Bug 221610 has been marked as a duplicate of this bug. ***