Bug 58747

Summary: Suspected kernel 2.4.9 problems with IBM ServeRaid-4x controller on x330 platform
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.1
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Mike Cooling <cooling>
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brian Brock <bbrock>
CC: shishz
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:20 UTC

Description Mike Cooling 2002-01-24 00:37:21 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (Win98; U)

Description of problem:
Please refer to bugzilla bug #58440 and TicketManager #198533 as I feel these are all related. Hardware is an IBM x330 Server Type 8654 Model 
51Y with dual 1 GHz processors, 1 GB of RAM, a ServeRaid-4x Ultra160 SCSI Controller, and mirrored 9 GB internal disks. 

After installing Redhat 7.1 (out of the box) with kernel 2.4.2-2smp, an announcement was received regarding vulnerabilities in the kernel, and 
2.4.9-12 was released. I installed the new kernel and all seemed well. After we went into production we started noticing "flaky" things: 1) using 
the dump program to back up root to disk would sometimes hang at different points and sometimes complete, regardless of whether the output 
file was on an NFS filesystem or on the internal RAID disks; 2) the system would do strange things when booting, like complaining that root was 
dirty and dropping into single-user mode, but then fsck would abend (segmentation fault, shared lib out of sync with OS, or error 11), and pressing 
Ctrl-D to continue the boot would bring the system up OK even though root was never checked.
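
For reference, the backup is a straightforward dump of the root filesystem to a file, roughly of this form (the output path here is only a placeholder):

   /sbin/dump -0u -f /mnt/backup/root.dump /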

We physically pulled the 2 raid mirror disks out of the server and moved them to another piece of hardware and all the problems moved also. 

We have 2 identical servers with identical Redhat 7.1 loads (one test and one production). I found a corrupted /bin/vi program I had to restore, and a 
corrupted /bin/cat on the other server that I also restored. I saved the corrupted files: in each case, executing the renamed (corrupted) file would fail and 
executing the restored file worked. Running "diff" reported differences between the two, yet the file sizes were identical. After a reboot of each system 
all 4 files now execute OK.

I've checked IBM's web site and the CD-ROM that came with the server. They have patches or drivers for kernels 2.4.2, 2.4.3, and 2.4.5, but nothing for 
2.4.9.

The production system is logging NFS errors against our Network Appliance file server. One error is "kernel: nfs: server not responding, still trying". The 
other is "kernel: nfs: task 9788 can't get a request slot". The application running on the NFS mount has not changed and continues to work 
with Redhat 6.2 on an IBM Netfinity 4000R dual 750 MHz server (but no RAID controller).

IBM suggests rebooting back to kernel 2.4.2smp. This has been done on the test system and will be done on production tonight after 6pm PST.  What do 
you think? Do I have a driver problem at 2.4.9, or perhaps an SMP problem?  Is there some other possibility I'm missing? It's driving me crazy.


Version-Release number of selected component (if applicable):
kernel 2.4.9-12smp

How reproducible:
Sometimes

Steps to Reproduce:
1. Keep rebooting until flakiness occurs.
2. Or keep dumping until dump fails.

Additional info:

I set the severity to "high" because of the files being corrupted and the fact that the results differ with every reboot. Since all of the problems moved 
from one hardware platform to another, this must be some kind of software, driver, or firmware problem. If by tomorrow the NFS problems have not 
recurred and I cannot get dump to fail, it will be a good indication that the kernel and the related driver were the problem.

Today I updated the BIOS on both systems. The RAID firmware level is current.

I can also be reached at (916) 278-7262.

Comment 1 Arjan van de Ven 2002-01-24 10:13:58 UTC
Last night we released kernel 2.4.9-20, which might be worth a shot. It has
driver 4.80 for the ServeRAID; I'm not 100% sure, but it might be worth checking
on the IBM site whether there's new firmware for that.

Comment 2 Mike Cooling 2002-01-24 23:31:38 UTC
All symptoms of this problem seem to have recurred with the 2.4.2 kernel. The most serious is that the NFS mount hangs and the application 
running there hangs with it. Kill commands do not work. Even a "df" command will hang, and the "intr" flag is set on the mount. A shutdown also hangs, so 
we must crash the system. This problem is serious.

I'm at home sick today, but I did see the announcement of the new kernel and will try it. The 4.80 RAID firmware appears to be the most current. I was 
on their site much of yesterday and also have an open ticket with IBM.

Should the new kernel also fail, do you feel it's worth bringing the system up on a uni-processor kernel (i.e. do you think SMP timing problems could be 
behind this)?

Comment 3 Mike Cooling 2002-01-28 19:01:01 UTC
At around 12:30 PST on Jan 25 I installed kernel 2.4.9-21smp. It's a bit early to say for sure, but thus far all problems have disappeared except the 
message "kernel: nfs: server netapp1-eth1 not responding, still trying". The OK follow-up message usually arrives within the same second, and the 
frequency of these messages has been reduced.  I'll continue to keep an eye on this issue.


Comment 4 Mike Cooling 2002-01-31 20:22:14 UTC
Last Friday I installed the latest kernel (2.4.9-21smp). The situation seemed improved; however, this morning we again logged a bunch of "can't get a request 
slot" errors. The NFS server involved is made by Network Appliance. Their support staff indicate this is typically a problem with the ethernet device drivers. I can't find newer 
drivers on the IBM site. Here are the specific errors:

Jan 31 06:54:05 webct1 kernel: nfs: server netapp1-eth1 not responding, still trying                                                                            
Jan 31 06:54:05 webct1 kernel: nfs: server netapp1-eth1 not responding, still trying                                                                            
Jan 31 06:55:54 webct1 kernel: nfs: task 49767 can't get a request slot         
Jan 31 06:55:55 webct1 kernel: nfs: task 49768 can't get a request slot         
Jan 31 06:55:56 webct1 kernel: nfs: task 49769 can't get a request slot         
Jan 31 06:55:56 webct1 kernel: nfs: task 49770 can't get a request slot         
Jan 31 06:55:56 webct1 kernel: nfs: task 49771 can't get a request slot         
Jan 31 06:55:56 webct1 kernel: nfs: task 49772 can't get a request slot         
Jan 31 06:56:28 webct1 kernel: nfs: task 49773 can't get a request slot         
Jan 31 06:56:39 webct1 kernel: nfs: task 49774 can't get a request slot         
Jan 31 06:56:51 webct1 kernel: nfs: task 49775 can't get a request slot         
Jan 31 06:56:53 webct1 kernel: nfs: task 49776 can't get a request slot         
Jan 31 06:56:53 webct1 kernel: nfs: task 49777 can't get a request slot         
Jan 31 06:57:12 webct1 kernel: nfs: task 49778 can't get a request slot         
Jan 31 06:57:50 webct1 kernel: nfs: server netapp1-eth1 OK
Jan 31 06:58:50 webct1 kernel: nfs: server netapp1-eth1 OK
Jan 31 06:58:50 webct1 last message repeated 12 times

I'm not sure what else to try and really need some help.

Comment 5 Arjan van de Ven 2002-01-31 20:25:04 UTC
OK, any idea what sort of network card this is?
The output of the lspci program will tell me if you don't know.

Comment 6 Mike Cooling 2002-01-31 20:29:20 UTC
lspci shows:

00:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)


Comment 7 Arjan van de Ven 2002-01-31 20:36:28 UTC
Ok.

You can give this a shot:

edit /etc/modules.conf

and change "eepro100" to "e100", or if it already says "e100" curese IBM and
replace it with "eepro100" 

Then restart (or stop networking, unload the old module manually, and start
networking again).
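
For reference, the lines in question normally look something like this (eth0/eth1 are assumed from your lspci output; check the file for your actual
interface names):

   alias eth0 eepro100
   alias eth1 eepro100

and after the change:

   alias eth0 e100
   alias eth1 e100

If you'd rather not reboot, something along these lines should do it:

   /etc/init.d/network stop
   /sbin/rmmod eepro100
   /sbin/modprobe e100
   /etc/init.d/network start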

Comment 8 Mike Cooling 2002-01-31 20:50:09 UTC
The driver was set to eepro100. I've made the change on our identical test system. I'll schedule production downtime and beat up the test system in 
the meantime.  Thanks for such a quick response.


Comment 9 Mike Cooling 2002-01-31 23:26:06 UTC
I have a question about the eepro100.c driver as distributed with kernel 2.4.9-21. The source package appears to contain version 1.09, dated 
9/29/99, by Donald Becker, with updates from others through 7/2000.  However, at http://www.scyld.com/network/updates.html there appears to be a 
newer rpm with version 1.17a, dated 8/7/2001, also by Donald Becker.

Should I consider installing this newer driver?  Is there a reason Redhat isn't using this driver with their kernels?


Comment 10 Arjan van de Ven 2002-02-01 09:15:21 UTC
The eepro100 driver in the kernel originated from Donald Becker; however, the
paths have since split and Donald maintains his own "fork" versus what the stock
kernel has. It's not trivial to "just merge" Donald's driver nowadays...

Comment 11 Mike Cooling 2002-02-03 19:15:26 UTC
The e100 driver seems far worse: many "can't get a request slot" messages were issued after only one night's use. I've gone back to the eepro100 driver.

In my Jan 24 post I asked about the possibility of this being SMP related, but I never received a response to that question. I've now rebooted 
with the uni-processor kernel 2.4.9-21 and will continue to monitor.


Comment 12 Arjan van de Ven 2002-02-03 19:58:43 UTC
I suspect the uniprocessor kernel just puts less load on the card :(
I don't know the physical setup, but... a $15 3Com 3c905 might be a smart
solution.

Comment 13 Mike Cooling 2002-02-05 00:53:36 UTC
The uni-processor kernel may have even made the problem worse. The next thing I tried was going back to a single-ethernet setup: I mounted the 
NFS file system via the standard eth0 card rather than the dedicated network on eth1. Although we had few or no NFS errors, the "web course 
teaching" application went nuts, spawning a multitude of processes that weren't doing much. The load average was in the 25 to 30 range even though 
the CPU was relatively idle.

I went back to dual interfaces and reduced the rsize and wsize from 32k to 4k. I did this because it appears Redhat 6.2 defaulted to something 
much smaller than 32k. The system is performing well again. We have logged the "not responding" message several times, but they are followed 
by the "OK" message within 0 to 4 seconds.

Regarding your suggestion about the 3Com card, the existing 82557 cards are part of the motherboard, so they can't be removed. We discussed 
the possibility of adding a 3Com card, but I believe only one could be added (I'll double-check this). So we would have to use one Intel and one 
3Com interface for the NFS access, or go back to a single ethernet interface.  If we were to do this, would Redhat 7.1 dynamically figure out the proper 
driver to use, or would I have to manually set something up in /etc/modules.conf (excuse my ignorance here - I haven't changed the hardware config 
much without reloading the OS)?

Comment 14 Arjan van de Ven 2002-02-05 07:57:04 UTC
Hardware detection of PCI devices is automatic (e.g. on boot you get a "new
hardware detected" dialog).

There are PCI cards with a dual ethernet interface, but I must admit that my
experience with those is "mixed", e.g. they work for some people and not for others...

Comment 15 Mike Cooling 2002-02-05 18:51:18 UTC
My co-workers have already researched the dual ethernet cards for the x330 server (around $200). If this works out we'll pursue it. In the meantime 
we have installed a 3Com 3c980 Cyclone card in our redundant server. I had to force the driver to full duplex, but it seems to be working OK now. 
I'll pursue scheduling some downtime, and we'll pull the RAID disks from the production server and install them into this server to minimize the 
downtime.

In the meantime I have been playing with rsize and wsize. The system was defaulting to 32k. We have now seen failures with 32k, 8k, and 4k. When 
4k failed last night I changed the mount to use 1k. So far we have not had any "nfs <device> not responding" messages since midnight last night.

I've been playing with rsize and wsize because that is one of the differences between Redhat 6.2 and 7.1. However, I can't tell (for sure) what the 
default size used to be. When I "cat /proc/mounts" at V6.2 those values are not shown.  Any ideas?


Comment 16 Mike Cooling 2002-02-05 19:38:26 UTC
I forgot to mention something important. With the new board and driver, I'm getting a new error at boot time as follows:
   Feb  5 10:00:13 webct99 depmod: depmod: *** Unresolved symbols in /lib/modules/2.4.9-21smp/kernel/drivers/hotplug/cpqphp.o

Should I be concerned?


Comment 17 Mike Cooling 2002-02-07 21:34:49 UTC
Since I dropped the rsize/wsize on the NFS mount to 1024, we have not had a failure. Because the NFS block size is now less than the ethernet MTU of 
1500, we are doing less fragmentation.  We feel this problem may relate to how the kernel and/or driver deals with fragmentation on ethernet.
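
Rough numbers behind that theory, assuming NFS over UDP and a 1500-byte MTU:

   8k rsize/wsize : each NFS read/write is one UDP datagram of 8 KB plus headers, split into roughly 6 IP fragments
   1k rsize/wsize : the whole request fits in a single 1500-byte ethernet frame, so no IP fragmentation at all

If any single fragment of the 8 KB datagram is lost, the entire datagram has to be retransmitted after an RPC timeout, which would fit the stalls we see.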

What do you think?


Comment 18 Arjan van de Ven 2002-02-07 21:38:24 UTC
Fragmentation requires much more memory; I wouldn't rule out the driver getting
in trouble sometimes :( However, it's odd that BOTH eepro100 and e100 handle it
badly; maybe the hardware checksumming barfs in some corner case of
fragmentation...

Comment 19 Mike Cooling 2002-02-13 23:03:36 UTC
Sunday we had a 9-second failure (nfs not responding) even with rsize/wsize set to 1k.  However, we may have found the problem and an effective workaround.

I noticed that FTP transfers were taking far too long between 2 Redhat 7.1 test systems on the same subnet (i.e. a 100 MB file was taking over 16 minutes on a 
100 Mbit network and produced over 2000 collisions). We had forced our Cisco switch and the Network Appliance ethernet ports to full duplex, but we didn't have any 
idea what Redhat was doing. I set the option for the ethernet driver to force full duplex between the 2 test systems and rebooted. The FTP transfer then took 8 
seconds and produced no collisions.

As of Monday morning I have forced full duplex for both Intel ethernet boards on our production system with the NFS problems. We seem to have evidence that the  
combination of these Intel boards, the ethernet driver, and the 7.1 kernel is not negotiating ethernet duplex mode properly.
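
For anyone hitting the same thing: the change amounts to a driver option in /etc/modules.conf of this general form (full_duplex is the eepro100 module
parameter I'm referring to, one value per board; double-check the name against the driver shipped with your kernel):

   options eepro100 full_duplex=1,1

and the negotiated link state can be double-checked with mii-tool from net-tools:

   /sbin/mii-tool eth0
   /sbin/mii-tool eth1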

We'll give this configuration a burn-in period before kicking the NFS rsize/wsize back up (to avoid disrupting our users further).


Comment 20 Mike Cooling 2002-02-25 19:00:39 UTC
After running 2 weeks we had 2 brief failures. Both failures occurred when the Network Appliance filer was doing a backup to tape. After opening 
another issue with Netapp and going through the escalation process, they finally informed us of an outstanding bug with dump. Assuming it was 
now safe, we then increased the rsize/wsize to 8k and had a failure within 2 1/2 hours (and no dump running). Clearly we are dealing with multiple 
problems here. Unfortunately our sniffer did not capture the failure this time. We are trying again today to capture a failure with the rsize/wsize set to 
8k.


Comment 21 Mike Cooling 2002-02-26 02:02:16 UTC
We had a rather nasty failure today. At an rsize/wsize of 8k, not only did we get the "not responding" messages but we also got the "can't get a request slot" errors. 
This time we had the sniffer running and have a capture of the problem. We use a "WildPackets EtherPeek" sniffer. Do you have access to this type 
of sniffer, or what file formats can your sniffer read? A zipped file is about 7 MB.


Comment 22 Mike Cooling 2002-04-25 18:02:23 UTC
We have been fighting this problem for 4 months now with still no resolution. For the last couple of months we have been trying to get Network 
Appliance to fix their box. However we have a clear sniffer trace showing the Netapp does not appear to be the problem.

The trace shows RH asks the Netapp for a packet and gets an immediate response. Then RH stops making requests for 2 1/2 minutes. While this is 
going on, RH is logging "nfs: server not responding" errors. Then all of a sudden RH starts making requests again and the Netapp immediately 
responds.

This is very similar to Bugzilla bug #51488, but they are using a 3Com PCI Tornado card with a Netapp filer.
Bug #52652 also had failures accessing a Netapp filer and claims a new kernel fixed it. I'm trying to find out exactly which one.
Bug #53646 has the same failure between 2 RH systems but with older kernels.

We still haven't tried the 3Com card mentioned earlier. After reading #51488 I'm not very optimistic that it will do any good.

I am now on 2.4.9-21smp and am going to try 2.4.9-31smp.  I haven't seen a response on this issue since Feb. It would be nice to get some 
feedback.


Comment 23 Pete Zaitcev 2002-09-10 22:17:36 UTC
So, did the 2.4.9-31 help? Actually, the AS branch is at -34 now.

Would it be asking too much to move to 2.4.18-13?
I am scared witless to think about fixing a 2.4.9.


Comment 24 Mike Cooling 2002-09-11 20:23:24 UTC
Version 2.4.9-31 didn't help at all. Shortly after installing that kernel, we ended up installing a 3com 3c980 Cyclone card for the NFS traffic. We have 
never seen the problem since. We have 2 theories on this: 1) the eepro100 driver is not re-entrant or not able to handle the load of 2 ethernet 
boards on the same system; 2) Any ethernet driver would have the same problem if used for 2 boards on the same system. I would love to know if 
anyone has had success with dual ethernet cards under a heavy load.

We are now on Redhat 7.2 and kernel 2.4.9-34smp. The application vendor recently announced support for Redhat 7.3, which has the kernel you 
refer to. I'm not sure when I'll upgrade.


Comment 25 Bugzilla owner 2004-09-30 15:39:20 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/