Bug 248824 - looping out of control and crashing
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cachefilesd
Version: 5.0
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: David Howells
Reported: 2007-07-18 20:04 EDT by Sean Laverty
Modified: 2016-07-25 08:32 EDT (History)
Doc Type: Bug Fix
Last Closed: 2014-06-02 09:16:06 EDT

Attachments
Kernel panic error from cachefilesd-0.7-6.el5.x86_64 (132.57 KB, image/jpeg)
2007-07-18 21:48 EDT, Sean Laverty

Description Sean Laverty 2007-07-18 20:04:56 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Description of problem:
I've been trying to get cachefilesd working with NFSv3 on our render farm for several months.  We have about 200 render machines each licensed for RHEL5 Desktop.  The most recent official release for RHEL5 x86_64 (cachefilesd-0.7-6.el5) would run stably under light load, but would cause the system to kernel panic when serious file I/O began.

I was hopeful that the latest code release, 0.8-16.fc7, might avoid the kernel panics, since it was supposed to address EOVERFLOW errors from the kernel.  Unfortunately, although it compiled without errors, as soon as it is started it uses 100% of a CPU and keeps doing so for as long as it runs.  As soon as it actually caches something, however, it dies.

I was hoping someone could take a look at this and let me know if there was a potential fix or a better/more appropriate version of the daemon I should be using.  We haven't asked for support for any of our render farm licenses before, and getting cachefilesd working would be extremely beneficial for our work.

Version-Release number of selected component (if applicable):
cachefilesd-0.8-16.fc7.src.rpm

How reproducible:
Always


Steps to Reproduce:
1. Run cachefilesd
2. Let it try to cache NFS data

Actual Results:
The cachefilesd daemon dies.

Expected Results:
The cachefilesd would begin caching files as expected with reasonable CPU overhead.

Additional info:
The cachefilesd dies with the following messages:

Jul 18 16:27:58 OC23A-102 cachefilesd[30337]: Failed to check object's in-use state: errno 95 (Operation not supported)
Jul 18 16:27:58 OC23A-102 kernel: CacheFiles: File cache on fd:00 unregistering
Jul 18 16:27:58 OC23A-102 kernel: FS-Cache: Withdrawing cache "mycache"

We're using an unmodified RHEL5 kernel.
The NFS server we're using is a NetApp 6070 running Ontap 7.2.2.
The Red Hat systems are HP BL460c blade servers with dual Intel Woodcrest CPUs and 10 GB RAM.
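
For reference, errno 95 in the cachefilesd message above is EOPNOTSUPP on Linux, which matches the "Operation not supported" text. The following is only a minimal sketch, assuming the syslog line format quoted above, of pulling the fields out of such a line so that logs from many render nodes could be scanned for the same failure. The regular expression and the name parse_cachefilesd_error are illustrative and are not part of cachefilesd.

```python
import errno
import re

# Hypothetical pattern, written to match the cachefilesd syslog line quoted above.
LINE_RE = re.compile(
    r"cachefilesd\[(?P<pid>\d+)\]: (?P<msg>.*): "
    r"errno (?P<errno>\d+) \((?P<text>[^)]*)\)"
)

def parse_cachefilesd_error(line):
    """Return the pid, message and errno from one log line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group("errno"))
    return {
        "pid": int(m.group("pid")),
        "message": m.group("msg"),
        "errno": code,
        "text": m.group("text"),
        # True on Linux, where EOPNOTSUPP is 95; other platforms number it differently.
        "is_eopnotsupp": code == errno.EOPNOTSUPP,
    }

sample = (
    "Jul 18 16:27:58 OC23A-102 cachefilesd[30337]: "
    "Failed to check object's in-use state: errno 95 (Operation not supported)"
)
info = parse_cachefilesd_error(sample)
print(info)
```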
Comment 1 Sean Laverty 2007-07-18 21:48:58 EDT
Created attachment 159573
Kernel panic error from cachefilesd-0.7-6.el5.x86_64

This is the console output of the kernel panic produced when we tried using the
downloaded cachefilesd-0.7-6.el5.x86_64.  It would run successfully for a
while, but eventually fail with this error.  

Because of this we were hoping that cachefilesd-0.8-16.fc7 might help us get
around our problem, but it appears to not be compatible with RHEL5 WS.
Comment 2 Sean Laverty 2007-07-25 13:59:04 EDT
Output from 'uname -a':

Linux OC23A-204 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64
x86_64 GNU/Linux

(it's pretty much the standard install from the first RHEL 5 CDs)
Comment 3 David Howells 2007-07-25 15:27:39 EDT
Is there any way you can capture the bit of text that says why there was a 
panic?  It's immediately before the register dump and so isn't in the screen 
snapshot you attached.

Comment 4 Sean Laverty 2007-07-25 15:31:07 EDT
Hi.
  I'll see what I can find.  That tends to be sent to the console of the blade
servers, and the HP ILO interface for the C-class blades doesn't have an
accessible buffer. 
Comment 5 Sean Laverty 2007-07-25 15:41:38 EDT
Something relevant:
  I'm currently running a render server with cachefilesd enabled for the NFS
volume on which image textures are stored.  The cachefilesd in use is the
cachefilesd-0.7-6.el5.x86_64 provided for RHEL5.  The textures are the primary
data that we'd like to cache, as they're read-only and exhibit high re-usability.
  When I allow two simultaneous render processes to run on the render server it
eventually kernel panics with the screen-shotted message.  However, if I only
run a single RenderMan render process it appears to have no problems.  So the
kernel panic apparently is incurred when two processes are making frequent reads
from the same files in the same cache.
  Sadly, restricting the render servers to running only a single render process
at a time is too wasteful for us.  These are dual Intel Woodcrest dual-core CPU
systems, and despite Pixar's recent efforts to improve the threaded performance
of their prman/RenderMan renderer we seldom make good use of two threads in a
render, much less four.
Comment 6 David Howells 2007-07-26 09:59:23 EDT
> I'm currently running a render server with cachefilesd enabled for the NFS
> volume on which image textures are stored.

Can I just confirm that the render server is acting as an NFS client, not an 
NFS server?
Comment 7 Sean Laverty 2007-07-26 13:39:05 EDT
  Sorry, I should have been more explicit.  The render server is a compute/render
server which is pulling all source data from and writing all resulting data back
to file servers via NFSv3/tcp.  cachefilesd is being used to cache data from only
one of several NFS-mounted file systems.  The vast majority of file accesses are
read transactions.
  We are attempting to run two simultaneous render processes per render node. 
Each render process typically reads 500MB-1GB of data from the cached NFS volume
in 50,000-100,000 NFS reads over a period of about 10 minutes.  The file server
itself is sustaining about 20,000 network file ops per second.
  Again, when a single render process is running on a compute/render server
cachefilesd is having no problems.  Running a second simultaneous process will
eventually result in a kernel panic.
  I'm still trying to get the full console output of the kernel panic itself,
but have not been able to find a tool that provides access to the full console
text.  I'm also trying to generate a dump from the kernel panic, but have not
been able to produce anything using kdump.
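
As a rough back-of-the-envelope check of the per-process figures above, the sketch below works out the implied average read size and read rate. It assumes decimal units (1 MB = 10^6 bytes), a window of exactly 10 minutes, and that the low and high ends of the two ranges pair up with each other. None of these assumptions is stated in the comment, so the results are order-of-magnitude estimates only.

```python
# Figures quoted above: 500MB-1GB in 50,000-100,000 NFS reads over ~10 minutes.
MB = 10**6                      # assumption: decimal megabytes
WINDOW_SECONDS = 10 * 60        # assumption: exactly 10 minutes

cases = {
    "low":  {"bytes": 500 * MB,  "reads": 50_000},
    "high": {"bytes": 1000 * MB, "reads": 100_000},
}

results = {}
for name, c in cases.items():
    results[name] = {
        "avg_read_bytes": c["bytes"] / c["reads"],
        "reads_per_second": c["reads"] / WINDOW_SECONDS,
    }

for name, r in results.items():
    print(f"{name}: {r['avg_read_bytes']:.0f} bytes/read, "
          f"{r['reads_per_second']:.1f} reads/s per render process")
```

Under those assumptions each render process averages about 10 KB per read at roughly 80 to 170 reads per second, which is small next to the 20,000 ops per second the file server sustains overall.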
Comment 8 Sean Laverty 2007-07-31 20:02:05 EDT
  I've been able to produce vmcore files but have been unable to find Red Hat
documentation on how to compile or obtain a kernel with debug information to
proceed with analyzing the core or an appropriate debuginfo file.
  There also does not appear to be a feature in our render servers' ILO to allow
us to access any previous lines from the kernel panic.
  How do you suggest we proceed with this debugging effort?
Comment 9 RHEL Product and Program Management 2007-10-19 16:27:36 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 10 Sean Laverty 2007-10-23 15:05:52 EDT
One of our admins has been able to configure a serial console connection to one of
our render servers.  The following was written to the console during one of the
cachefilesd-induced kernel panics:

Kernel BUG at fs/cachefiles/cf-namei.c:53
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0 
Modules linked in: nfs lockd nfs_acl cpqci(U) ipmi_si(U) ipmi_devintf(U)
ipmi_msghandler(U) autofs4 sunrpc cachefiles fscache video sbs i2c_ec i2c_core
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp bnx2(U)
serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage cciss(U)
sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 16670, comm: mtor Tainted: P      2.6.18-8.el5 #1
RIP: 0010:[<ffffffff8822cc22>]  [<ffffffff8822cc22>]
:cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
RSP: 0018:ffff81029c45bcf8  EFLAGS: 00010246
RAX: ffff8102a3cdf3b8 RBX: ffff81029c45bd18 RCX: ffff8102a3cdf3b8
RDX: ffff8102a3cdf208 RSI: ffff810199e758e8 RDI: ffff8102acd506a8
RBP: ffff810199e758e8 R08: ffff81013df2eaf0 R09: ffff810189cfe400
R10: 000000013df2eaf0 R11: 0000000000000000 R12: ffff8102a3cdf400
R13: ffff810199e759c0 R14: ffff8102a3cdf400 R15: ffff8102acd50600
FS:  0000000000000000(0000) GS:ffffffff8038a000(0063) knlGS:00000000f1db16c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f0dd0008 CR3: 00000001cf66c000 CR4: 00000000000006e0
Process mtor (pid: 16670, threadinfo ffff81029c45a000, task ffff8102ab80d860)
Stack:  0000000000000cdd ffff810189cfe400 0000000000000000 0000271000002728
 00000031e38e9197 ffff8102ab17de44 0000000046f856d0 0000000000000028
 ffff810189cfe400 ffff8102a3cdf400 ffff8102a3cdf400 ffff8102a03b5810
Call Trace:
 [<ffffffff882294ca>] :cachefiles:cachefiles_lookup_object+0x23a/0x33a
 [<ffffffff8821a2cf>] :fscache:fscache_lookup_object+0xef/0x189
 [<ffffffff8821a90e>] :fscache:__fscache_acquire_cookie+0x1ac/0x209
 [<ffffffff882eca51>] :nfs:nfs_open+0x222/0x270
 [<ffffffff882e8966>] :nfs:nfs_opendir+0x0/0x52
 [<ffffffff882e89ab>] :nfs:nfs_opendir+0x45/0x52
 [<ffffffff8001df03>] __dentry_open+0xd9/0x1dc
 [<ffffffff80026f20>] do_filp_open+0x2a/0x38
 [<ffffffff8000ce95>] dput+0x2c/0x113
 [<ffffffff80019358>] do_sys_open+0x44/0xbe
 [<ffffffff8005f013>] sysenter_do_call+0x1b/0x67


Code: 0f 0b 68 3c fb 22 88 c2 35 00 48 8b 02 48 85 c0 75 d4 49 8d 
RIP  [<ffffffff8822cc22>] :cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
 RSP <ffff81029c45bcf8>
 <0>Kernel panic - not syncing: Fatal exception
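
When comparing panics captured from several render nodes, it can help to reduce each console dump to the BUG location and the chain of symbols in the call trace. The following is a minimal sketch, assuming the 2.6.18-era oops layout shown above. The regular expressions and the name summarize_oops are illustrative only and do not belong to any kernel tool.

```python
import re

# Hypothetical patterns, written against the oops text shown above.
BUG_RE = re.compile(r"Kernel BUG at (?P<file>\S+):(?P<line>\d+)")
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"          # module prefix such as ":cachefiles:" is optional
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

def summarize_oops(text):
    """Extract the BUG file/line and (module, symbol, offset) frames from oops text."""
    bug = BUG_RE.search(text)
    frames = [
        (m.group("module") or "kernel", m.group("symbol"), int(m.group("offset"), 16))
        for m in FRAME_RE.finditer(text)
    ]
    return {
        "bug_file": bug.group("file") if bug else None,
        "bug_line": int(bug.group("line")) if bug else None,
        "frames": frames,
    }

# A few lines copied from the console output above.
sample = """Kernel BUG at fs/cachefiles/cf-namei.c:53
Call Trace:
 [<ffffffff882294ca>] :cachefiles:cachefiles_lookup_object+0x23a/0x33a
 [<ffffffff8821a2cf>] :fscache:fscache_lookup_object+0xef/0x189
 [<ffffffff8001df03>] __dentry_open+0xd9/0x1dc
"""
summary = summarize_oops(sample)
print(summary["bug_file"], summary["bug_line"])
for module, symbol, offset in summary["frames"]:
    print(f"{module:12s} {symbol}+{offset:#x}")
```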
Comment 14 RHEL Product and Program Management 2008-03-11 15:44:01 EDT
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.
Comment 15 RHEL Product and Program Management 2014-03-07 08:41:35 EST
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Comment 16 RHEL Product and Program Management 2014-06-02 09:16:06 EDT
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
Comment 17 David Howells 2016-07-25 08:32:30 EDT
A solution has been implemented upstream and backported to RHEL-7 and RHEL-6, but those rest on a later upstream evolution of fscache than is available in RHEL-5, so the effort to backport that far would be significant.
