Bug 58747
Summary: | Suspected kernel 2.4.9 problems with IBM ServeRaid-4x controller on x330 platform | |
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Mike Cooling <cooling> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.1 | CC: | shishz |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-09-30 15:39:20 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Mike Cooling
2002-01-24 00:37:21 UTC
Last night we released kernel 2.4.9-20, which might be worth a shot. It has driver 4.80 for the ServeRAID; I'm not 100% sure, but it might also be worth checking on the IBM site whether there's new firmware for that.

All symptoms of this problem seem to have recurred with the 2.4.2 kernel. The most serious is that the NFS mount hangs and the application running there hangs with it. Kill commands do not work. Even a "df" command will hang, and the "intr" flag is set on the mount. A shutdown also hangs, so we must crash the system. This problem is serious.

I'm at home sick today, but I did see the announcement on the new kernel and will try it. The 4.80 RAID firmware appears to be the most current. I was on their site much of yesterday and also have an open ticket with IBM. Do you feel it's worth bringing the system up on a uni-processor kernel should that also fail (i.e., do you think SMP timing problems could be behind this)?

At around 12:30 PST on Jan 25 I installed kernel 2.4.9-21smp. It's a bit early to say for sure, but thus far all problems have disappeared except the message "kernel: nfs: server netapp1-eth1 not responding, still trying". The "OK" follow-up message usually occurs in the same second, and the frequency of these messages has been reduced. I'll continue to keep an eye on this issue.

Last Friday I installed the latest kernel (2.4.9-21smp). The situation seemed improved; however, this morning we again logged a bunch of "can't get a request slot" messages. The NFS server involved is made by Network Appliance. Their support staff indicate this is typically a problem with the ethernet device drivers. I can't find newer drivers on the IBM site.
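For context on the hang behaviour described above: whether processes blocked on a dead NFS server can be killed depends on the mount options. A sketch of the kind of mount involved, assuming hostnames, export path, and mount point (all illustrative, not taken from the report):

```shell
# Sketch only: an NFS mount roughly matching the setup described above.
# "hard" retries forever when the server stops responding; "intr" is
# supposed to let signals interrupt the waiting processes (the report
# shows it did not help here). "soft" would instead error out after the
# retry limit, at the risk of data corruption on interrupted writes.
mount -t nfs -o hard,intr \
    netapp1-eth1:/vol/webct /mnt/webct
```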
Here are the specific errors:

Jan 31 06:54:05 webct1 kernel: nfs: server netapp1-eth1 not responding, still trying
Jan 31 06:54:05 webct1 kernel: nfs: server netapp1-eth1 not responding, still trying
Jan 31 06:55:54 webct1 kernel: nfs: task 49767 can't get a request slot
Jan 31 06:55:55 webct1 kernel: nfs: task 49768 can't get a request slot
Jan 31 06:55:56 webct1 kernel: nfs: task 49769 can't get a request slot
Jan 31 06:55:56 webct1 kernel: nfs: task 49770 can't get a request slot
Jan 31 06:55:56 webct1 kernel: nfs: task 49771 can't get a request slot
Jan 31 06:55:56 webct1 kernel: nfs: task 49772 can't get a request slot
Jan 31 06:56:28 webct1 kernel: nfs: task 49773 can't get a request slot
Jan 31 06:56:39 webct1 kernel: nfs: task 49774 can't get a request slot
Jan 31 06:56:51 webct1 kernel: nfs: task 49775 can't get a request slot
Jan 31 06:56:53 webct1 kernel: nfs: task 49776 can't get a request slot
Jan 31 06:56:53 webct1 kernel: nfs: task 49777 can't get a request slot
Jan 31 06:57:12 webct1 kernel: nfs: task 49778 can't get a request slot
Jan 31 06:57:50 webct1 kernel: nfs: server netapp1-eth1 OK
Jan 31 06:58:50 webct1 kernel: nfs: server netapp1-eth1 OK
Jan 31 06:58:50 webct1 last message repeated 12 times

I'm not sure what else to try and really need some help.

Ok, any idea what sort of network card this is? The output of the lspci program will tell me if you don't know.

lspci shows:

00:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)

Ok. You can give this a shot: edit /etc/modules.conf and change "eepro100" to "e100", or if it already says "e100", curse IBM and replace it with "eepro100". Then restart (or stop networking, unload the old module manually, and start networking).

The driver was set to eepro100. I've made the change on our identical test system. I'll schedule production downtime and beat up the test system in the meantime.
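The driver swap suggested above can be done without a full reboot. A sketch, assuming the interface is eth0 and the init-script path used by Red Hat 7.1 (run as root; exact alias lines depend on the machine):

```shell
# 1. In /etc/modules.conf, switch the alias line, e.g.:
#      alias eth0 eepro100    ->    alias eth0 e100
#    (and likewise for eth1 on a dual-interface box)
# 2. Then cycle networking so the new module is loaded:
/etc/init.d/network stop
rmmod eepro100          # unload the old driver
modprobe e100           # load the replacement named in modules.conf
/etc/init.d/network start
```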
Thanks for such a quick response. I have a question about the eepro100.c driver as distributed with kernel 2.4.9-21. The source package appears to contain version 1.09, dated 9/29/99, by Donald Becker, with updates from others through 7/2000. However, at http://www.scyld.com/network/updates.html there appears to be a newer rpm with version 1.17a, dated 8/7/2001, also by Donald Becker. Should I consider installing this newer driver? Is there a reason Red Hat isn't using this driver with their kernels?

The eepro100 driver in the kernel originated from Donald Becker; however, the paths have since split and Donald has his own "fork" versus what the stock kernel has. It's not trivial to "just merge" Donald's driver nowadays...

The e100 driver seems far worse. Many "can't get a request slot" messages were issued after only one night's use. I've gone back to the eepro100 driver. If you refer to my Jan 24 post, I asked about the possibility of this being SMP related, but I never received a response to that question. I've rebooted with the uni-processor kernel 2.4.9-21. Will continue to monitor.

I suspect the uniprocessor kernel just puts less load on the card :( I don't know the physical setup but... a $15 3Com 3c905 might be a smart solution...

The uni-processor kernel may have even made the problem worse. The next thing I tried was going back to a single ethernet setup. I mounted the NFS file system via the standard eth0 card rather than the dedicated network on eth1. Although we had little or no NFS errors, the "web course teaching" application went nuts, spawning a multitude of processes that weren't doing much. Load average was in the 25 to 30 range even though the CPU was relatively idle. I went back to dual interfaces and reduced the rsize and wsize from 32k to 4k. I did this because it appears Red Hat 6.2 defaulted to something much smaller than 32k. The system is performing well again.
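The rsize/wsize reduction described above can be applied either persistently or on a live mount. A sketch, with server name, export, and mount point as placeholders:

```shell
# Illustrative /etc/fstab line for the reduced-transfer-size experiment
# (4 KB reads/writes instead of the 32 KB default on this setup):
#
#   netapp1-eth1:/vol/webct  /mnt/webct  nfs  rw,hard,intr,rsize=4096,wsize=4096  0 0
#
# Or change the sizes in place without editing fstab:
mount -o remount,rsize=4096,wsize=4096 /mnt/webct
```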
We have logged the "not responding" messages several times, but they are followed by the "OK" message within 0 to 4 seconds.

Regarding your suggestion about the 3Com card: the existing 82557 cards are part of the motherboard, so they can't be removed. We discussed the possibility of adding a 3Com card, but I believe only one could be added (I'll double-check this). So we would have to use one Intel and one 3Com for the NFS access, or go back to one ethernet interface. If we were to do this, would Red Hat 7.1 dynamically figure out the proper driver to use, or would I have to manually set up something in /etc/modules.conf (excuse my ignorance here; I haven't changed hardware config much without reloading the OS)?

Hardware detection of PCI devices is automatic (e.g. on boot you get a "new hardware detected" dialog). There are PCI cards with a dual ethernet interface, but I must admit that my experience with those is "mixed", e.g. works for some, doesn't work for others...

My co-workers have already researched the dual ethernet cards for the x330 server (around $200). If this works we'll pursue it. In the meantime we have installed a 3c980 Cyclone 3Com card into our redundant server. I had to force the driver to full duplex, but it seems to be working OK now. I'll pursue scheduling some downtime, and we'll pull the RAID disks from the production server and install them into this server to minimize the downtime.

In the meantime I have been playing with rsize and wsize. The system was defaulting to 32k. We have now seen failures with 32k, 8k, and 4k. When 4k failed last night I changed the mount to use 1k. So far we have not had any "nfs <device> not responding" messages as of midnight last night. I've been playing with rsize and wsize because that is one of the differences between Red Hat 6.2 and 7.1. However, I can't tell (for sure) what the default size used to be. When I "cat /proc/mounts" at V6.2, those values are not shown. Any ideas?
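On the /proc/mounts question above: the options field of a mount line only lists rsize/wsize when they were set explicitly, which is consistent with the 6.2 behaviour described. A small hypothetical helper for pulling the values out of an options string (the sample string is made up for illustration):

```shell
# Extract rsize/wsize from an NFS mount-options string, such as the
# fourth field of a /proc/mounts line. If the kernel omitted the values
# (i.e. defaults are in use), the variables simply stay empty.
opts="rw,v3,rsize=8192,wsize=8192,hard,intr,udp"
rsize=$(echo "$opts" | tr ',' '\n' | sed -n 's/^rsize=//p')
wsize=$(echo "$opts" | tr ',' '\n' | sed -n 's/^wsize=//p')
echo "rsize=$rsize wsize=$wsize"
```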
I forgot to mention something important. With the new board and driver, I'm getting a new error at boot time, as follows:

Feb 5 10:00:13 webct99 depmod: depmod: *** Unresolved symbols in /lib/modules/2.4.9-21smp/kernel/drivers/hotplug/cpqphp.o

Should I be concerned?

Since I dropped the rsize/wsize on the NFS mount to 1024, we have not had a failure. Since the NFS size is less than the ethernet MTU of 1500, we are now doing less fragmentation. We feel this problem may relate to how the kernel and/or driver deals with fragmentation on ethernet. What do you think?

Fragmentation requires much more memory; I wouldn't rule out the driver getting in trouble sometimes. However, it's odd that BOTH eepro100 and e100 handle it badly; maybe the hardware checksumming barfs in some corner case in fragmentation...

Sunday we had a 9-second failure (nfs not responding) even with rsize/wsize set to 1k. However, we may have found the problem and an effective work-around. I noticed that FTP transfers were taking far too long between two Red Hat 7.1 test systems on the same subnet (i.e. a 100 MB file was taking over 16 minutes on a 100 Mbit network and produced over 2000 collisions). We had forced our Cisco switch and the Network Appliance ethernet ports to full duplex, but we didn't have any idea what Red Hat was doing. I set the option for the ethernet driver to force full duplex between the two test systems and rebooted. The FTP transfer then took 8 seconds and produced no collisions. I have forced full duplex for both Intel ethernet boards on our production system with the NFS problems as of Monday morning. We seem to have evidence that the combination of these Intel boards, the ethernet driver, and the 7.1 kernel is not negotiating ethernet duplex mode properly. We'll give this configuration a burn-in period before kicking the NFS rsize/wsize back up (to avoid disrupting our users further). After running two weeks we had two brief failures.
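The duplex-mismatch diagnosis above (collisions on a link the switch believes is full duplex) can be checked and worked around from the command line. A sketch for the era's tooling; the module-option line in particular is an assumption about the Becker-derived eepro100 and varies by driver version:

```shell
# Inspect what the NIC actually negotiated:
mii-tool eth0
# Force 100 Mbit full duplex at runtime to match the switch port:
mii-tool -F 100baseTx-FD eth0
# To make the setting persistent across reboots, a module option in
# /etc/modules.conf was the usual route; for Becker-derived eepro100
# drivers the bitmask form was roughly (assumption, check your driver):
#   options eepro100 options=0x30   # 0x10 = 100 Mbit, 0x20 = full duplex
```

Note that forcing one end while the other autonegotiates causes exactly this kind of mismatch, so both the host and the switch port need the same fixed setting.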
Both failures occurred when the Network Appliance filer was doing a backup to tape. After opening another issue with NetApp and going through the escalation process, they finally informed us of an outstanding bug with dump. Assuming it was now safe, we then increased the rsize/wsize to 8k and had a failure within 2 1/2 hours (and no dump running). Clearly we are dealing with multiple problems here. Unfortunately our sniffer did not capture the failure this time. We are trying again today to capture a failure with the rsize/wsize set to 8k.

We had a rather nasty failure today. At an rsize/wsize of 8k, not only did we get the "not responding" but we also got the "can't get a request slot". This time we had the sniffer running and have a capture of the problem. We use a "WildPackets EtherPeek" sniffer. Do you have access to this type of sniffer, or what file formats can your sniffer read? A zipped file is about 7 MB.

We have been fighting this problem for 4 months now with still no resolution. For the last couple of months we have been trying to get Network Appliance to fix their box. However, we have a clear sniffer trace showing the NetApp does not appear to be the problem. The trace shows RH asks the NetApp for a packet and gets an immediate response. Then RH stops making requests for 2 1/2 minutes. While this is going on, RH is logging "NFS netapp not responding" errors. Then all of a sudden RH starts making requests again and the NetApp immediately responds. This is very similar to Bugzilla bug #51488, but they are using a 3Com PCI Tornado card with a NetApp filer. Bug #52652 also had failures accessing a NetApp filer, and they claim a new kernel fixed it; I'm trying to find out exactly which one. Bug #53646 has the same failure between two RH systems but with older kernels. We still haven't tried the 3Com card mentioned earlier. After reading #51488 I'm not very optimistic that it will do any good. I am now on 2.4.9-21smp and am going to try 2.4.9-31smp.
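On the file-format question above: a capture taken with tcpdump on the affected host would sidestep the sniffer-compatibility issue, since libpcap format is widely importable (EtherPeek included, to the best of my knowledge). A sketch, with interface, hostname, and file name as illustrative placeholders:

```shell
# Capture the NFS traffic to the filer for later analysis.
# -s 1500 grabs full ethernet frames; port 2049 is the NFS service.
tcpdump -i eth1 -s 1500 -w nfs-failure.pcap \
    host netapp1-eth1 and port 2049
```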
I haven't seen a response on this issue since Feb. It would be nice to get some feedback.

So, did the 2.4.9-31 help? Actually, the AS branch is at -34 now. Would it be asking too much to move to 2.4.18-13? I am scared witless to think about fixing a 2.4.9.

Version 2.4.9-31 didn't help at all. Shortly after installing that kernel, we ended up installing a 3Com 3c980 Cyclone card for the NFS traffic. We have never seen the problem since. We have two theories on this: 1) the eepro100 driver is not re-entrant, or is not able to handle the load of two ethernet boards on the same system; 2) any ethernet driver would have the same problem if used for two boards on the same system. I would love to know if anyone has had success with dual ethernet cards under a heavy load. We are now on Red Hat 7.2 and kernel 2.4.9-34smp. The application vendor recently announced support for Red Hat 7.3, which has the kernel you refer to. I'm not sure when I'll upgrade.

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/