Bug 187295 - CIFS under heavy load crashes system
Summary: CIFS under heavy load crashes system
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On: 199167
Blocks: 176344
 
Reported: 2006-03-29 20:23 UTC by Jason Bradley Nance
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-08 01:03:18 UTC
Target Upstream Version:
Embargoed:


Attachments
OOPS from kernel (1.64 KB, text/plain), 2006-06-14 17:08 UTC, nathan r. hruby


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 174982 | 0 | medium | CLOSED | Kernel Oops or hang when archiving data to CIFS mount | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2007:0304 | 0 | normal | SHIPPED_LIVE | Updated kernel packages available for Red Hat Enterprise Linux 4 Update 5 | 2007-04-28 18:58:50 UTC

Description Jason Bradley Nance 2006-03-29 20:23:26 UTC
Description of problem:
When put under heavy load, the CIFS driver crashes, bringing down the rest of the
system with it.

Version-Release number of selected component (if applicable):
1.34 (kernel-smp-2.6.9-34)

How reproducible:
Always (with sufficient load)

Steps to Reproduce:
1. Mount a CIFS share.
2. Put the mount under heavy load.
3. Wait...
  
Actual results:
System crashes.

Expected results:
No crash.

Additional info:
I have tried using both Windows 2003 Server and a TeraStation NAS (old Samba) as
the server for the mount.  I have also tried using both the CIFS driver and the
(deprecated) SMBFS driver.  What I did find strange, though, was the following:

mount /mnt/cifsShare
cp -a /40 /GBs /of /data /mnt/cifsShare

This takes a long time, but does NOT result in a crash.  However:

mount /mnt/cifsShare
for a in /40 /GBs /of /data; do
  rsync -rtv --delete "$a" /mnt/cifsShare
done

OR

mount /mnt/cifsShare
for a in /40 /GBs /of /data; do
  rsync -av --delete "$a" /mnt/cifsShare
done

Both result in a crash.

Unfortunately, this is on a production server, so when the box goes down I have
to reboot it immediately and haven't been able to grab any of the console info.

fstab says:

//server/share   /mnt/cifsShare   cifs   credentials=/etc/cifs.creds,gid=users,dir_mode=0770,file_mode=0660   0 0

This seems similar to closed bug #174982.

Comment 1 Jason Baron 2006-03-30 15:09:55 UTC
is this a regression from U2 cifs behavior?

Comment 2 Jason Bradley Nance 2006-03-30 15:25:24 UTC
I experienced the same behavior in U2 and was waiting to test in U3 due to the
cifs update.

Comment 3 Jason Baron 2006-03-30 17:44:40 UTC
looks like we need another update :(

Comment 4 Jason Bradley Nance 2006-03-30 19:21:15 UTC
Then I won't report the "processes become deadlocked when cifs server goes away"
bug that I'm trying to figure out as well... :)

Comment 5 Jason Bradley Nance 2006-04-08 21:39:52 UTC
Is there an estimate on the next kernel rollout?

Comment 6 Jason Bradley Nance 2006-04-08 22:19:29 UTC
Do you know offhand if the e1000 issues have been resolved in U3?  This
machine uses the e1000 driver with an onboard card as well as a dual-port PCI-X
card.  I'd hate to blame the wrong module... =)
Guess I need to do some more testing...

Comment 7 Jason Bradley Nance 2006-05-10 19:20:45 UTC
Okay, I ran stress tests on the e1000 driver and was unable to make it crash, so
I believe this is actually a CIFS issue.

Comment 8 nathan r. hruby 2006-06-14 16:59:09 UTC
METOO

I'm seeing the same behavior as well on an HP DL380 G3 (tg3-based NICs).
Luckily (?) I seem to be able to reproduce this at will with rsync, as it
generally locks up when rsyncing a specific user's data from one system to the
CIFS filesystem, even though the load is not heavy at all (on both the system
as a whole and the filesystem).

Please let me know if there are any specific debugging steps you'd like me to take.

Comment 9 nathan r. hruby 2006-06-14 17:08:49 UTC
Created attachment 130880 [details]
OOPS from kernel

OOPS from serial console when running an rsync against user data that triggers
this issue.

Comment 10 nathan r. hruby 2006-06-14 17:14:43 UTC
Hurm.. Looking at the OOPS, the top of the call trace is cifs_rename.  

Oddly, the directory with the bad data contains two files, "Buddy.jpg" and
"buddy.jpg".  The first one syncs just duckily; the latter one is the last file
rsync spits out before the crash.  Could this be a case sensitivity issue
tickling a deadlock?
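
A minimal sketch for checking whether a source tree contains names that differ
only by letter case, which is the situation described above. The function name
find_case_collisions is hypothetical, and the sketch assumes file names without
embedded newlines. It only lists candidates (as lowercased relative paths); it
does not show that a case collision is what triggers the oops.

```shell
# Print each relative path (lowercased) that occurs more than once when
# letter case is ignored, e.g. "Buddy.jpg" and "buddy.jpg" in one directory.
# find_case_collisions is a hypothetical helper name, not part of any tool.
find_case_collisions() {
    (
        cd "${1:-.}" || exit 1
        # Lowercase every path, then keep only the lines that repeat.
        find . | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort | uniq -d
    )
}
```

Running it against the local directory that is being rsynced, before mounting
the share, avoids touching the CIFS mount at all.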

I cannot reproduce this with vi, so rsync must do something funky when copying?
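
For context: by default rsync writes each file to a temporary name in the
destination directory and then renames it into place, so every transferred file
goes through a rename on the target filesystem. The sketch below imitates that
pattern in plain shell so the rename step can be exercised without rsync. It is
an approximation under that assumption, not rsync's actual code, and
copy_via_rename is a hypothetical helper name.

```shell
# Copy SRC into DEST_DIR by writing to a hidden temporary file first and
# then renaming it over the final name (the write-then-rename pattern).
# copy_via_rename is a hypothetical helper, not an rsync or CIFS command.
copy_via_rename() {
    src=$1
    dest_dir=$2
    name=$(basename "$src")
    # Temporary file lives in the destination directory so the final
    # step is a rename within one filesystem.
    tmp=$(mktemp "$dest_dir/.$name.XXXXXX") || return 1
    if cp "$src" "$tmp"; then
        mv -f "$tmp" "$dest_dir/$name"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Calling it once for "Buddy.jpg" and once for "buddy.jpg" with a directory on
the CIFS mount as the destination would be one way to try the suspected
sequence by hand. On an affected kernel this may crash the machine, so it
belongs on a test box rather than a production server.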

Comment 11 Jason Baron 2006-06-16 15:43:04 UTC
ok, here is a pointer to a changeset which might resolve this issue:
http://marc.theaimsgroup.com/?l=git-commits-head&m=111767133714737&w=2

I've examined the backtrace in comment #9, and it is indeed falling over on
exactly the same code that is fixed in the above patch. Therefore, i think it is
likely that this patch will solve your issue.

I've created a test kernel with this patch (it doesn't quite apply cleanly to
rhel4), based on latest beta kernel: http://people.redhat.com/~jbaron/bz187295/.
I could build it on top of 34.0.1 if you want...

Also, i've contacted the upstream CIFS maintainer, and we are in discussion as
to what further CIFS improvements are appropriate for rhel4.

thanks.

Comment 12 Jason Baron 2006-06-22 17:45:02 UTC
anybody had a chance to test this? thanks.

Comment 13 Jason Bradley Nance 2006-06-26 20:28:19 UTC
That kernel wouldn't boot for me.  Hangs on PCI probing.  *shrug*

Comment 14 Jason Baron 2006-06-26 20:36:59 UTC
hmmm...sounds like a different issue...if possible could you post the boot log
up to the point where it fails...i'm also curious if the latest U4 beta kernels
work for you, located at: http://people.redhat.com/~jbaron/rhel4/  thanks.

Comment 15 Jason Bradley Nance 2006-06-27 20:18:27 UTC
2.6.9-40 has the same cifs module as 2.6.9-34(.0.1), is there any reason to test
that?

As far as the kernel you built goes... it's an i686 and I'm running an x86_64
install... *shrug*

Comment 17 Jason Baron 2006-07-10 15:47:16 UTC
ok, i've placed test kernels for x86 and x86_64 at:
http://people.redhat.com/~jbaron/bz187295/

Please let us know if these resolve the issue. thanks.

Comment 18 Jason Bradley Nance 2006-07-13 17:48:06 UTC
This issue appears to be resolved in 2.6.9-40.1.EL.cifs.1smp.

Thank you.


Comment 19 RHEL Program Management 2006-09-07 19:24:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 Jason Baron 2006-09-14 16:05:55 UTC
committed in stream U5 build 42.10. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 26 Red Hat Bugzilla 2007-05-08 01:03:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

