From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4

Description of problem:
I've been trying to get cachefilesd working with NFSv3 on our render farm for several months. We have about 200 render machines each licensed for RHEL5 Desktop. The most recent official release for RHEL5 x86_64 (cachefilesd-0.7-6.el5) would run stably under light load, but would cause the system to kernel panic when serious file I/O began. I was hopeful that the latest code release, 0.8-16.fc7, might avoid the kernel panics since it was supposed to address EOVERFLOW errors from the kernel. Unfortunately, although it compiled without errors, when run it immediately begins using 100% of a CPU as long as it runs. As soon as it actually caches something, however, it dies. I was hoping someone could take a look at this and let me know if there was a potential fix or a better/more appropriate version of the daemon I should be using. We haven't asked for support for any of our render farm licenses before, and getting cachefilesd working would be extremely beneficial for our work.

Version-Release number of selected component (if applicable):
cachefilesd-0.8-16.fc7.src.rpm

How reproducible:
Always

Steps to Reproduce:
1. run cachefilesd
2. let it try to cache NFS data
3.

Actual Results:
The cachefilesd daemon dies.

Expected Results:
The cachefilesd would begin caching files as expected with reasonable CPU overhead.

Additional info:
The cachefilesd dies with the following messages:

Jul 18 16:27:58 OC23A-102 cachefilesd[30337]: Failed to check object's in-use state: errno 95 (Operation not supported)
Jul 18 16:27:58 OC23A-102 kernel: CacheFiles: File cache on fd:00 unregistering
Jul 18 16:27:58 OC23A-102 kernel: FS-Cache: Withdrawing cache "mycache"

We're using an unmodified RHEL5 kernel. The NFS server we're using is a NetApp 6070 running Ontap 7.2.2.
The Red Hat systems are HP BL460c blade servers with dual Intel Woodcrest CPUs and 10 GB RAM.
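For anyone triaging similar logs, here is a minimal sketch of pulling the errno out of the daemon's syslog line quoted above. The regular expression and the function name `parse_daemon_error` are illustrative assumptions, not part of cachefilesd. On Linux, errno 95 is EOPNOTSUPP, which suggests the kernel side did not recognise the in-use check that the newer daemon issued.

```python
import os
import re

# The syslog line quoted in the report above.
LOG_LINE = (
    "Jul 18 16:27:58 OC23A-102 cachefilesd[30337]: Failed to check object's "
    "in-use state: errno 95 (Operation not supported)"
)

# Hypothetical pattern: "<program>[<pid>]: <message>: errno <n> (<text>)".
ERRNO_RE = re.compile(
    r"(?P<prog>\w+)\[(?P<pid>\d+)\]: (?P<msg>.*): errno (?P<errno>\d+) \((?P<text>[^)]*)\)"
)


def parse_daemon_error(line):
    """Return the program, pid, message, errno and text of a daemon error line, or None."""
    m = ERRNO_RE.search(line)
    if m is None:
        return None
    return {
        "program": m["prog"],
        "pid": int(m["pid"]),
        "message": m["msg"],
        "errno": int(m["errno"]),
        "text": m["text"],
    }


if __name__ == "__main__":
    result = parse_daemon_error(LOG_LINE)
    print(result)
    # The local platform's own description of that errno (platform dependent).
    print(os.strerror(result["errno"]))
```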
Created attachment 159573 [details] Kernel panic error from cachefilesd-0.7-6.el5.x86_64 This is the console output of the kernel panic produced when we tried using the downloaded cachefilesd-0.7-6.el5.x86_64. It would run successfully for a while, but eventually fail with this error. Because of this we were hoping that cachefilesd-0.8-16.fc7 might help us get around our problem, but it appears to not be compatible with RHEL5 WS.
Output from 'uname -a': Linux OC23A-204 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux (it's pretty much the standard install from the first RHEL 5 CDs)
Is there any way you can capture the bit of text that says why there was a panic? It's immediately before the register dump and so isn't in the screen snapshot you attached.
Hi. I'll see what I can find. That tends to be sent to the console of the blade servers, and the HP ILO interface for the C-class blades doesn't have an accessible buffer.
Something relevant: I'm currently running a render server with cachefilesd enabled for the NFS volume on which image textures are stored. The cachefilesd in use is the cachefilesd-0.7-6.el5.x86_64 provided for RHEL5. The textures are the primary data that we'd like to cache, as they're read-only and exhibit high re-usability. When I allow two simultaneous render processes to run on the render server, it eventually kernel panics with the message shown in the attached screenshot. However, if I only run a single RenderMan render process it appears to have no problems. So the kernel panic is apparently triggered when two processes are making frequent reads from the same files in the same cache. Sadly, restricting the render servers to running only a single render process at a time is too wasteful for us. These are dual Intel Woodcrest dual-core CPU systems, and despite Pixar's recent efforts to improve the threaded performance of their prman/RenderMan renderer we seldom make good use of two threads in a render, much less four.
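For anyone wanting to approximate this access pattern without a renderer, here is a rough sketch that starts two separate processes which repeatedly read the same set of files. It is only a stand-in for the workload described above, not a confirmed reproducer; the function name `run_concurrent_readers`, the file names, and the use of a temporary directory in place of the cached NFS mount are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Child program: read every file under a directory tree several times over.
READER = """
import sys
from pathlib import Path
root, passes = Path(sys.argv[1]), int(sys.argv[2])
total = 0
for _ in range(passes):
    for path in sorted(root.rglob('*')):
        if path.is_file():
            total += len(path.read_bytes())
print(total)
"""


def run_concurrent_readers(root, readers=2, passes=3):
    """Start `readers` processes that read the same files; return bytes read by each."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", READER, str(root), str(passes)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(readers)
    ]
    # All children are already running before the first result is collected.
    return [int(p.communicate()[0]) for p in procs]


if __name__ == "__main__":
    # Hypothetical stand-in for the texture volume; point `root` at the
    # cached NFS mount to exercise the real path.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(4):
            Path(tmp, f"texture{i}.tex").write_bytes(b"x" * 1024)
        print(run_concurrent_readers(tmp))
```

Separate processes are used rather than threads because the failure was only seen with two render processes running at once.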
> I'm currently running a render server with cachefilesd enabled for the NFS > volume on which image textures are stored. Can I just confirm that the render server is acting as an NFS client, not an NFS server?
Sorry, I should have been more explicit. The render server is a compute/render server which pulls all source data from, and writes all resulting data back to, file servers via NFS3/tcp. cachefilesd is being used to cache data from only one of several NFS-mounted file systems. The vast majority of file accesses are read transactions. We are attempting to run two simultaneous render processes per render node. Each render process typically reads 500MB-1GB of data from the cached NFS volume in 50,000-100,000 NFS reads over a period of about 10 minutes. The file server itself is sustaining about 20,000 network file ops per second. Again, when a single render process is running on a compute/render server, cachefilesd has no problems. Running a second simultaneous process will eventually result in a kernel panic. I'm still trying to get the full console output of the kernel panic itself, but have not been able to find a tool that provides access to the full console text. I'm also trying to generate a dump from the kernel panic, but have not been able to produce anything using kdump.
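To put those figures in perspective, here is a back-of-the-envelope sketch of the per-process read profile. It assumes decimal units (1 MB = 10^6 bytes), a 600-second window, and that the low ends and the high ends of the two quoted ranges go together; none of that is stated exactly above.

```python
MB = 10**6  # assumption: decimal megabytes


def per_process_read_profile(total_bytes, reads, seconds):
    """Average read size and read/byte rates for one render process."""
    return {
        "avg_read_bytes": total_bytes / reads,
        "reads_per_sec": reads / seconds,
        "bytes_per_sec": total_bytes / seconds,
    }


# Low and high ends of the quoted ranges, over roughly 10 minutes.
low = per_process_read_profile(500 * MB, 50_000, 600)
high = per_process_read_profile(1000 * MB, 100_000, 600)

if __name__ == "__main__":
    for name, profile in (("low", low), ("high", high)):
        print(name, {key: round(value, 1) for key, value in profile.items()})
```

Under these assumptions each process issues on the order of 80 to 170 reads per second, averaging about 10 KB per read.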
I've been able to produce vmcore files but have been unable to find Red Hat documentation on how to compile or obtain a kernel with debug information to proceed with analyzing the core or an appropriate debuginfo file. There also does not appear to be a feature in our render servers' ILO to allow us to access any previous lines from the kernel panic. How do you suggest we proceed with this debugging effort?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
One of our admins has been able to configure a serial console connection to one of our render servers. The following was written to the console during one of the cachefilesd-induced kernel panics:

Kernel BUG at fs/cachefiles/cf-namei.c:53
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0
Modules linked in: nfs lockd nfs_acl cpqci(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) autofs4 sunrpc cachefiles fscache video sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp bnx2(U) serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage cciss(U) sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 16670, comm: mtor Tainted: P 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff8822cc22>] [<ffffffff8822cc22>] :cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
RSP: 0018:ffff81029c45bcf8 EFLAGS: 00010246
RAX: ffff8102a3cdf3b8 RBX: ffff81029c45bd18 RCX: ffff8102a3cdf3b8
RDX: ffff8102a3cdf208 RSI: ffff810199e758e8 RDI: ffff8102acd506a8
RBP: ffff810199e758e8 R08: ffff81013df2eaf0 R09: ffff810189cfe400
R10: 000000013df2eaf0 R11: 0000000000000000 R12: ffff8102a3cdf400
R13: ffff810199e759c0 R14: ffff8102a3cdf400 R15: ffff8102acd50600
FS: 0000000000000000(0000) GS:ffffffff8038a000(0063) knlGS:00000000f1db16c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f0dd0008 CR3: 00000001cf66c000 CR4: 00000000000006e0
Process mtor (pid: 16670, threadinfo ffff81029c45a000, task ffff8102ab80d860)
Stack: 0000000000000cdd ffff810189cfe400 0000000000000000 0000271000002728
 00000031e38e9197 ffff8102ab17de44 0000000046f856d0 0000000000000028
 ffff810189cfe400 ffff8102a3cdf400 ffff8102a3cdf400 ffff8102a03b5810
Call Trace:
 [<ffffffff882294ca>] :cachefiles:cachefiles_lookup_object+0x23a/0x33a
 [<ffffffff8821a2cf>] :fscache:fscache_lookup_object+0xef/0x189
 [<ffffffff8821a90e>] :fscache:__fscache_acquire_cookie+0x1ac/0x209
 [<ffffffff882eca51>] :nfs:nfs_open+0x222/0x270
 [<ffffffff882e8966>] :nfs:nfs_opendir+0x0/0x52
 [<ffffffff882e89ab>] :nfs:nfs_opendir+0x45/0x52
 [<ffffffff8001df03>] __dentry_open+0xd9/0x1dc
 [<ffffffff80026f20>] do_filp_open+0x2a/0x38
 [<ffffffff8000ce95>] dput+0x2c/0x113
 [<ffffffff80019358>] do_sys_open+0x44/0xbe
 [<ffffffff8005f013>] sysenter_do_call+0x1b/0x67

Code: 0f 0b 68 3c fb 22 88 c2 35 00 48 8b 02 48 85 c0 75 d4 49 8d
RIP [<ffffffff8822cc22>] :cachefiles:cachefiles_walk_to_object+0x8ad/0xbdb
 RSP <ffff81029c45bcf8>
 <0>Kernel panic - not syncing: Fatal exception
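As a reading aid for the trace above, here is a small sketch that splits the call-trace frames into module, symbol and offset, and decodes the leading bytes of the "Code:" line. The decoding assumes the x86_64 BUG() encoding believed to be used by kernels of this era (ud2, then a push of the file-name pointer, then a `ret` whose immediate is the source line number); that layout is an assumption, and the helper names are made up for illustration.

```python
import re

# One frame looks like "[<addr>] :module:symbol+0xOFF/0xSIZE"; core-kernel
# frames omit the ":module:" part.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?::(?P<module>\w+):)?"
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return (module, symbol, offset, size) for each frame found in `text`."""
    return [
        (m["module"] or "vmlinux", m["symbol"], int(m["offset"], 16), int(m["size"], 16))
        for m in FRAME_RE.finditer(text)
    ]


def decode_bug_code(code_hex):
    """Decode 'ud2; push imm32; ret imm16' (assumed BUG() layout).

    Returns (file_pointer_low32, line_number), or None if the bytes differ.
    """
    b = bytes.fromhex(code_hex)
    if len(b) < 10 or b[0:2] != b"\x0f\x0b":  # 0f 0b is the ud2 instruction
        return None
    if b[2] != 0x68 or b[7] != 0xC2:  # 68 = push imm32, c2 = ret imm16
        return None
    file_ptr = int.from_bytes(b[3:7], "little")
    line = int.from_bytes(b[8:10], "little")
    return file_ptr, line


if __name__ == "__main__":
    trace = (
        "[<ffffffff882294ca>] :cachefiles:cachefiles_lookup_object+0x23a/0x33a "
        "[<ffffffff8001df03>] __dentry_open+0xd9/0x1dc"
    )
    for frame in parse_frames(trace):
        print(frame)
    print(decode_bug_code("0f 0b 68 3c fb 22 88 c2 35 00 48 8b 02 48 85 c0 75 d4 49 8d"))
```

The decoded `ret` immediate is 0x35, i.e. 53, which matches the "cf-namei.c:53" in the first line of the oops, and the leading ud2 is what produces the "invalid opcode" trap.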
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).
A solution has been implemented upstream and backported to RHEL-7 and RHEL-6, but those rest on a later upstream evolution of fscache than is available in RHEL-5, so the effort to backport that far would be significant.