217181 – 2.6.18 kernels have trouble with isilon NFS file servers

Summary: 2.6.18 kernels have trouble with isilon NFS file servers

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2006-11-24 18:56 UTC by Henning Schmiedehausen
Modified:	2008-05-06 16:56 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-05-06 16:56:29 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Capture with working 2.6.17 kernel (last one working) (1.83 KB, application/x-bzip2) 2006-12-09 13:46 UTC, Henning Schmiedehausen
Capture with failing 2.6.18 kernel (since 1.2200 none worked) (1.81 KB, application/x-bzip2) 2006-12-09 13:50 UTC, Henning Schmiedehausen
Failed NFS lookup (both sides) (1.36 KB, application/x-bzip2) 2006-12-21 19:11 UTC, Henning Schmiedehausen
Working NFS lookup (both sides) (1.24 KB, application/x-bzip2) 2006-12-21 19:12 UTC, Henning Schmiedehausen
Working NFS lookup with workaround (both sides) (1.76 KB, application/x-bzip2) 2006-12-21 19:13 UTC, Henning Schmiedehausen

Description Henning Schmiedehausen 2006-11-24 18:56:22 UTC

Description of problem:

I have a setup where a number of web servers serve PHP pages from an NFS server.
After we upgraded the web server kernels through yum to 2.6.18-1.2200 and
1.2239, we have had serious trouble with our Isilon file servers (a FreeBSD-based
file cluster system, see http://www.isilon.com/). This is related _either_ to the
Isilon NFS server itself or to the file system size (unfortunately the smallest
Isilon filesystem I have is slightly over 2 TB).

When running <? phpinfo(); ?> through Apache httpd 2.2.2 and PHP 5.1.6, it
reports the normal PHP info when serving the files from a local file system or
from a file system mounted from a Linux 2.4 server (~1.3 TB).

On a 2 TB and on a 7.1 TB file system served by the isilon file cluster (OneFS
4.5), the following error is shown:

<br />
<b>Warning</b>:  Unknown: failed to open stream: Value too large for defined
data type in <b>Unknown</b> on line <b>0</b><br />
<br />
<b>Warning</b>:  Unknown: Failed opening '/home/www/info.php' for inclusion
(include_path='.:/usr/share/pear') in <b>Unknown</b> on line <b>0</b><br />

I can provide tcpdumps of the NFS communication if necessary.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.17-1.2174_FC5  --- This version works ok.

kernel-smp-2.6.18-1.2200.fc5 --- Does not work
kernel-2.6.18-1.2849.fc6     --- Does not work
kernel-smp-2.6.18-1.2239.fc5 --- Does not work

How reproducible:

always

Steps to Reproduce:
1. Get an isilon file cluster
2. mount a file system on a FC5 box using a 2.6.18 kernel
3. Serve files from this filesystem through apache and php
4. use <? phpinfo(); ?> for testing.
5. observe the error.
  
Actual results:

<br />
<b>Warning</b>:  Unknown: failed to open stream: Value too large for defined
data type in <b>Unknown</b> on line <b>0</b><br />
<br />
<b>Warning</b>:  Unknown: Failed opening '/home/www/info.php' for inclusion
(include_path='.:/usr/share/pear') in <b>Unknown</b> on line <b>0</b><br />


Expected results:

Should serve the phpinfo(); report
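
When repeating the steps above against several kernels, it can help to check the HTTP response automatically instead of reading it by eye. The following is a minimal sketch in Python, not part of the original report; the names `FAILURE_MARKERS` and `shows_overflow_failure` are hypothetical, and the marker strings are copied from the warning quoted above.

```python
import re

# Substrings taken from the warning text quoted in this report.
FAILURE_MARKERS = (
    "failed to open stream",
    "Value too large for defined data type",
)


def shows_overflow_failure(body):
    """Return True if the response body contains both warning fragments."""
    # Collapse runs of whitespace so a line-wrapped message still matches.
    flat = re.sub(r"\s+", " ", body)
    return all(marker in flat for marker in FAILURE_MARKERS)


# Sample body: the first warning from the report, including its line wrap.
SAMPLE_FAILURE = (
    "<b>Warning</b>:  Unknown: failed to open stream: "
    "Value too large for defined\n"
    "data type in <b>Unknown</b> on line <b>0</b><br />"
)

if __name__ == "__main__":
    print(shows_overflow_failure(SAMPLE_FAILURE))
```

The response body can come from any HTTP client; the function only inspects the text it is given.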

Additional info:

cat /proc/mounts says:
fileserver:/ifs/data/files/www /home/www nfs rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=fileserver 0 0

Changing the rsize and wsize to 32768 does not change the behaviour.
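
To compare mount options between a working and a failing client, the /proc/mounts entry can be split into its fields. This is a minimal sketch in Python, not something from the original report; `parse_mount_line` is a hypothetical helper, and `SAMPLE` is the entry quoted above joined onto one line.

```python
def parse_mount_line(line):
    """Split one /proc/mounts line into its six whitespace-separated fields."""
    device, mountpoint, fstype, options, dump, passno = line.split()
    opts = {}
    for item in options.split(","):
        # Options are either bare flags ("hard") or key=value pairs ("vers=3").
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": opts,
    }


# The mount entry from this report, on a single line.
SAMPLE = (
    "fileserver:/ifs/data/files/www /home/www nfs "
    "rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=tcp,"
    "timeo=600,retrans=2,sec=sys,addr=fileserver 0 0"
)

if __name__ == "__main__":
    entry = parse_mount_line(SAMPLE)
    print(entry["fstype"], entry["options"]["vers"], entry["options"]["rsize"])
```

Option values are kept as strings, so `rsize` has to be converted with `int()` before any numeric comparison.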

Running "php /home/www/info.php" on the command line works fine, so the problem
seems to lie in the way Apache and the embedded PHP access the file system.
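
"Value too large for defined data type" is the glibc message for errno EOVERFLOW. The stat() family can return it when a file's size, inode number, or block count does not fit the type the caller uses. This report does not confirm which field, if any, is involved here, so the following Python sketch is only a diagnostic aid under that assumption. It prints the inode number and size of a path and says whether each fits in 32 bits. `fits_in_32_bits` and `check_path` are hypothetical helper names.

```python
import os
import sys

MAX_U32 = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def fits_in_32_bits(value):
    """Return True if value can be stored in an unsigned 32-bit field."""
    return 0 <= value <= MAX_U32


def check_path(path):
    """Stat a path and report whether inode number and size fit in 32 bits."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "size": st.st_size,
        "inode_fits_32": fits_in_32_bits(st.st_ino),
        "size_fits_32": fits_in_32_bits(st.st_size),
    }


if __name__ == "__main__":
    # Usage (assumed): python check_stat.py /home/www/info.php
    for p in sys.argv[1:]:
        print(p, check_path(p))
```

Running it against the same file on the local, Linux 2.4, and Isilon mounts would show whether the reported values differ in range between servers.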

Comment 1 Steve Dickson 2006-12-04 19:52:11 UTC

Would it be possible to post a bzip2-compressed binary tethereal trace of this
problem? Something similar to:

tethereal -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap

Comment 2 Henning Schmiedehausen 2006-12-09 13:46:36 UTC

Created attachment 143214 [details]
Capture with working 2.6.17 kernel (last one working)

This is a capture of a successful lookup

Comment 3 Henning Schmiedehausen 2006-12-09 13:50:22 UTC

Created attachment 143215 [details]
Capture with failing 2.6.18 kernel (since 1.2200 none worked)

This is a failing lookup

Comment 4 Henning Schmiedehausen 2006-12-09 13:55:44 UTC

What I did was try to quiet down the NFS traffic on one of our web server
boxes (I only succeeded partially; our health checker interfered with my tests,
so please ignore all access to healthcheck.php). Then I ran

/sbin/service httpd start ; sleep 1 ; wget -O - http://srv017.dc1.thomson-webcast.net/tools/info.php

(srv017 is the web box.)

The Isilon cluster is at sto001.dc1.thomson-webcast.lan

The local IP is 10.64.0.64; the cluster IP is 10.64.1.56 or 10.64.1.62 (it
varies from mount to mount).

I also tried the latest kernel from rawhide in a UP version
(vmlinuz-2.6.18-1.2849.fc6) with no change in behaviour.


If you need more info, let me know; I can also provide you with a technical
contact at isilon (only through private mail)

Comment 5 Henning Schmiedehausen 2006-12-17 18:12:21 UTC

Any news or progress on this bug report? Are any more traces needed?

Comment 6 Henning Schmiedehausen 2006-12-20 21:43:39 UTC

This problem still persists with 2.6.18-1.2257. I'd really appreciate *any*
progress on this.

Comment 7 Steve Dickson 2006-12-21 00:35:26 UTC

First of all, I do apologize for not being a bit more responsive on this...

Looking at both traces, I'm only seeing half of the traffic:
only the requests, none of the replies. That tells me
either the server is not responding or is dropping the
replies (which I doubt), or the server has more than one
network interface and the replies are coming back from
a different IP address (which is more likely the case).

So would it be possible to re-run the traces to catch
both sides of the traffic? Maybe something like
"tethereal -w /tmp/data.pcap host 10.64.1.56 or host <2nd if>"

Also, what nfs-utils version are you using?
Finally, are there any errors or warnings in
the /var/log/messages file on the client?
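
When the server may answer from more than one address, the capture filter has to match any of them, which means joining the `host` terms with `or`. This is a minimal sketch in Python, not something from the original thread, that builds such a filter and the matching command line. `build_capture_filter` and `build_command` are hypothetical helpers, and the addresses are the two cluster IPs mentioned in comment 4.

```python
def build_capture_filter(addresses):
    """Build a capture filter that matches traffic to or from any address."""
    if not addresses:
        raise ValueError("at least one address is required")
    return " or ".join("host " + addr for addr in addresses)


def build_command(addresses, outfile="/tmp/data.pcap"):
    """Return the tethereal argument list, with the filter as trailing words."""
    return ["tethereal", "-w", outfile] + build_capture_filter(addresses).split()


if __name__ == "__main__":
    # Cluster addresses as given in comment 4 of this report.
    cluster = ["10.64.1.56", "10.64.1.62"]
    print(" ".join(build_command(cluster)))
```

Joining the terms with `and` instead would match only packets exchanged between the listed hosts themselves, which is not the traffic of interest here.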

Comment 8 Henning Schmiedehausen 2006-12-21 18:52:30 UTC

OK, first thing, for everyone who finds this bug because of the same problem:
this is a known problem on the Isilon side, and their Knowledge Base article
#1568 describes the fix. Contact Isilon Support for this fix.

Comment 9 Henning Schmiedehausen 2006-12-21 19:11:19 UTC

Created attachment 144207 [details]
Failed NFS lookup (both sides)

This is a failed File lookup through Apache. Kernel is 2.6.18-1.2257-fc5smp.

(This time I have both sides; I captured the bonding interface, not just eth0.)

Comment 10 Henning Schmiedehausen 2006-12-21 19:12:36 UTC

Created attachment 144208 [details]
Working NFS lookup (both sides)

This is a working File lookup through Apache. Kernel is 2.6.17-1.2174_FC5smp.

(This time I have both sides; I captured the bonding interface, not just eth0.)

Comment 11 Henning Schmiedehausen 2006-12-21 19:13:53 UTC

Created attachment 144209 [details]
Working NFS lookup with workaround (both sides)

This is a working NFS lookup using 2.6.18-1.2257-fc5smp. I have the proposed
workaround from Isilon in place on the cluster.

Comment 12 Bug Zapper 2008-04-04 04:54:44 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 13 Bug Zapper 2008-05-06 16:56:27 UTC

This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora, please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
