Red Hat Bugzilla – Bug 248824
looping out of control and crashing
Last modified: 2016-07-25 08:32:30 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:126.96.36.199) Gecko/20070515 Firefox/188.8.131.52
Description of problem:
I've been trying to get cachefilesd working with NFSv3 on our render farm for several months. We have about 200 render machines each licensed for RHEL5 Desktop. The most recent official release for RHEL5 x86_64 (cachefilesd-0.7-6.el5) would run stably under light load, but would cause the system to kernel panic when serious file I/O began.
I was hopeful that the latest code release, 0.8-16.fc7, might avoid the kernel panics since it was supposed to address EOVERFLOW errors from the kernel. Unfortunately, although it compiled without errors, once started it immediately uses 100% of a CPU for as long as it runs. As soon as it actually caches something, however, it dies.
I was hoping someone could take a look at this and let me know if there was a potential fix or a better/more appropriate version of the daemon I should be using. We haven't asked for support for any of our render farm licenses before, and getting cachefilesd working would be extremely beneficial for our work.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
2. Let it try to cache NFS data
Actual Results:
The cachefilesd daemon dies.
Expected Results:
cachefilesd would begin caching files as expected, with reasonable CPU overhead.
Additional info:
The cachefilesd daemon dies with the following messages:
Jul 18 16:27:58 OC23A-102 cachefilesd: Failed to check object's in-use state: errno 95 (Operation not supported)
Jul 18 16:27:58 OC23A-102 kernel: CacheFiles: File cache on fd:00 unregistering
Jul 18 16:27:58 OC23A-102 kernel: FS-Cache: Withdrawing cache "mycache"
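With about 200 render machines, it may help to pick this failure out of syslog automatically. The following is a minimal sketch, assuming the cachefilesd message always has the shape of the line quoted above (the wording in other cachefilesd versions may differ). The function name parse_cachefilesd_error is hypothetical.

```python
import re

# Pattern modelled on the single cachefilesd line quoted in this report;
# that other versions of the daemon use the same wording is an assumption.
CACHEFILESD_ERR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\scachefilesd:\s"
    r"(?P<what>.*?): errno (?P<errno>\d+) \((?P<text>[^)]*)\)$"
)


def parse_cachefilesd_error(line):
    """Return a dict of fields for a cachefilesd errno line, else None."""
    m = CACHEFILESD_ERR.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["errno"] = int(fields["errno"])
    return fields


sample = ("Jul 18 16:27:58 OC23A-102 cachefilesd: Failed to check object's "
          "in-use state: errno 95 (Operation not supported)")
info = parse_cachefilesd_error(sample)
print(info["host"], info["errno"], info["text"])
```

Lines from other sources, such as the kernel messages that follow the failure, return None and can be skipped.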
We're using an unmodified RHEL5 kernel.
The NFS server we're using is a NetApp 6070 running Ontap 7.2.2.
The Red Hat systems are HP BL460c blade servers with dual Intel Woodcrest CPUs and 10 GB RAM.
Created attachment 159573
Kernel panic error from cachefilesd-0.7-6.el5.x86_64
This is the console output of the kernel panic produced when we tried using the
downloaded cachefilesd-0.7-6.el5.x86_64. It would run successfully for a
while, but eventually fail with this error.
Because of this we were hoping that cachefilesd-0.8-16.fc7 might help us get
around our problem, but it appears to not be compatible with RHEL5 WS.
Output from 'uname -a'
Linux OC23A-204 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64
(it's pretty much the standard install from the first RHEL 5 CDs)
Is there any way you can capture the bit of text that says why there was a
panic? It's immediately before the register dump and so isn't in the screen
snapshot you attached.
I'll see what I can find. That tends to be sent to the console of the blade
servers, and the HP ILO interface for the C-class blades doesn't have an
I'm currently running a render server with cachefilesd enabled for the NFS
volume on which image textures are stored. The cachefilesd in use is the
cachefilesd-0.7-6.el5.x86_64 provided for RHEL5. The textures are the primary
data that we'd like to cache, as they're read-only and exhibit high re-usability.
When I allow two simultaneous render processes to run on the render server it
eventually kernel panics with the message shown in the attached screenshot.
However, if I only run a single RenderMan render process it appears to have no
problems. So the kernel panic is apparently triggered when two processes are
making frequent reads from the same files in the same cache.
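The condition described above (two processes reading the same cached files at once) could be approximated without RenderMan by a small stand-in workload. This is only a sketch: the helper names are hypothetical, the "fork" start method assumes Linux, and there is no guarantee that plain sequential reads trigger the same panic as the real render jobs.

```python
import multiprocessing
import os
import sys


def read_all(paths, chunk=64 * 1024):
    """Read every file in `paths` to the end; return total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            while True:
                block = fh.read(chunk)
                if not block:
                    break
                total += len(block)
    return total


def concurrent_read(directory, workers=2):
    """Have `workers` processes read the same files at the same time."""
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # "fork" is assumed here because the report concerns Linux clients.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        return pool.map(read_all, [paths] * workers)


if __name__ == "__main__" and len(sys.argv) > 1:
    # The argument is a directory on the cached NFS mount (site-specific).
    print(concurrent_read(sys.argv[1]))
```

Running it with workers=1 and then workers=2 against a directory of textures would mirror the single-process and two-process cases described above.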
Sadly, restricting the render servers to running only a single render process
at a time is too wasteful for us. These are dual Intel Woodcrest dual-core CPU
systems, and despite Pixar's recent efforts to improve the threaded performance
of their prman/RenderMan renderer we seldom make good use of two threads in a
render, much less four.
> I'm currently running a render server with cachefilesd enabled for the NFS
> volume on which image textures are stored.
Can I just confirm that the render server is acting as an NFS client, not an
Sorry, should have been more explicit. The render server is a compute/render
server which is pulling all source data from and writing all resulting data back
to file servers via NFS3/tcp. cachefilesd is being used to cache data from only
one of several NFS-mounted file systems. The vast majority of file accesses are
We are attempting to run two simultaneous render processes per render node.
Each render process typically reads 500MB-1GB of data from the cached NFS volume
in 50,000-100,000 NFS reads over a period of about 10 minutes. The file server
itself is sustaining about 20,000 network file ops per second.
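For scale, the figures above can be turned into per-process and per-node rates. This is a back-of-the-envelope sketch, assuming decimal units (1 MB = 10**6 bytes), a window of exactly 10 minutes, and that the low and high ends of the two ranges pair up. The function name per_process_load is hypothetical.

```python
def per_process_load(total_bytes, reads, seconds):
    """Average read size (bytes), reads per second, and bytes per second."""
    return total_bytes / reads, reads / seconds, total_bytes / seconds


WINDOW_S = 10 * 60  # "about 10 minutes", taken as exactly 600 s

# Decimal units are an assumption; the report does not say MB vs MiB.
low = per_process_load(500 * 10**6, 50_000, WINDOW_S)
high = per_process_load(10**9, 100_000, WINDOW_S)

for label, (size, rps, bps) in (("low", low), ("high", high)):
    print(f"{label}: {size / 1000:.0f} kB/read, {rps:.0f} reads/s, "
          f"{bps / 10**6:.2f} MB/s per process, "
          f"{2 * rps:.0f} reads/s per node with two processes")
```

Under these assumptions each read averages about 10 kB, and a node running two render processes issues on the order of 170 to 330 reads per second against the cached volume.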
Again, when a single render process is running on a compute/render server,
cachefilesd has no problems. Running a second simultaneous process will
eventually result in a kernel panic.
I'm still trying to get the full console output of the kernel panic itself,
but have not been able to find a tool that provides access to the full console
text. I'm also trying to generate a dump from the kernel panic, but have not
been able to produce anything using kdump.
I've been able to produce vmcore files but have been unable to find Red Hat
documentation on how to compile or obtain a kernel with debug information to
proceed with analyzing the core or an appropriate debuginfo file.
There also does not appear to be a feature in our render servers' ILO to allow
us to access any previous lines from the kernel panic.
How do you suggest we proceed with this debugging effort?
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
One of our admins has been able to configure a serial console connection to one of
our render servers. The following was written to the console during one of the
cachefilesd-induced kernel panics:
Kernel BUG at fs/cachefiles/cf-namei.c:53
invalid opcode: 0000  SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
Modules linked in: nfs lockd nfs_acl cpqci(U) ipmi_si(U) ipmi_devintf(U)
ipmi_msghandler(U) autofs4 sunrpc cachefiles fscache video sbs i2c_ec i2c_core
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp bnx2(U)
serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage cciss(U)
sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 16670, comm: mtor Tainted: P 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff8822cc22>] [<ffffffff8822cc22>]
RSP: 0018:ffff81029c45bcf8 EFLAGS: 00010246
RAX: ffff8102a3cdf3b8 RBX: ffff81029c45bd18 RCX: ffff8102a3cdf3b8
RDX: ffff8102a3cdf208 RSI: ffff810199e758e8 RDI: ffff8102acd506a8
RBP: ffff810199e758e8 R08: ffff81013df2eaf0 R09: ffff810189cfe400
R10: 000000013df2eaf0 R11: 0000000000000000 R12: ffff8102a3cdf400
R13: ffff810199e759c0 R14: ffff8102a3cdf400 R15: ffff8102acd50600
FS: 0000000000000000(0000) GS:ffffffff8038a000(0063) knlGS:00000000f1db16c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f0dd0008 CR3: 00000001cf66c000 CR4: 00000000000006e0
Process mtor (pid: 16670, threadinfo ffff81029c45a000, task ffff8102ab80d860)
Stack: 0000000000000cdd ffff810189cfe400 0000000000000000 0000271000002728
00000031e38e9197 ffff8102ab17de44 0000000046f856d0 0000000000000028
ffff810189cfe400 ffff8102a3cdf400 ffff8102a3cdf400 ffff8102a03b5810
Code: 0f 0b 68 3c fb 22 88 c2 35 00 48 8b 02 48 85 c0 75 d4 49 8d
RIP [<ffffffff8822cc22>] :cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
<0>Kernel panic - not syncing: Fatal exception
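When collecting console output like this from several machines, the two most useful fields are the BUG location and the faulting symbol. The following is a minimal sketch of extracting them, assuming the console text follows the same layout as the dump above. The regular expressions and the function name summarize_oops are illustrative, not part of any kernel tooling.

```python
import re

# Both patterns are modelled on the console capture in this report.
BUG_RE = re.compile(r"Kernel BUG at (?P<file>\S+):(?P<line>\d+)")
RIP_RE = re.compile(
    r"RIP\s+\[<(?P<addr>[0-9a-f]+)>\]\s+:(?P<module>\w+):"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def summarize_oops(text):
    """Pull the BUG location and faulting symbol out of console text."""
    summary = {}
    bug = BUG_RE.search(text)
    if bug:
        summary["file"] = bug.group("file")
        summary["line"] = int(bug.group("line"))
    rip = RIP_RE.search(text)
    if rip:
        summary["module"] = rip.group("module")
        summary["symbol"] = rip.group("symbol")
        # Offset of the faulting instruction within the function, and the
        # function's total size, both printed in hex by the kernel.
        summary["offset"] = int(rip.group("offset"), 16)
        summary["size"] = int(rip.group("size"), 16)
    return summary


console = """Kernel BUG at fs/cachefiles/cf-namei.c:53
RIP [<ffffffff8822cc22>] :cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
"""
print(summarize_oops(console))
```

For the dump above this identifies fs/cachefiles/cf-namei.c line 53 and cachefiles_walk_to_object in the cachefiles module.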
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time. This request will be
reviewed for a future Red Hat Enterprise Linux release.
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
A solution has been implemented upstream and backported to RHEL-7 and RHEL-6, but those rest on a later upstream evolution of fscache than is available in RHEL-5, so the effort to backport that far would be significant.