Bug 508870
| Summary: | No network traffic when igb network interface receives arp traffic during negotiation | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Veaceslav Falico <vfalico> |
| Component: | kernel | Assignee: | Stefan Assmann <sassmann> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | urgent | Priority: | urgent |
| Version: | 5.3 | CC: | agospoda, alexander.h.duyck, dhoward, dzickus, gasmith, jjarvis, jpirko, ltroan, martin.wilck, mgahagan, peterm, rpacheco, takeshi.suzuki, tao |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | Docs Contact: | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-09-02 08:11:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Attachments: | |
| Bug Depends On: | | Bug Blocks: | 517975 |
This code has changed a bit for RHEL5.4, so it would be nice if they could test there as well. While the checks to make sure that link is up are still in the NAPI poll routine (now igb_poll), the call to check link status was removed in igb_clean_rx_irq_adv (which may or may not make a difference). Those calls were originally left in place (not removed as they were upstream) because we don't have the same capability to prevent polling in RHEL5 that we do upstream, so I'm curious whether this is something that can be resolved with a different layout of calls that are igb-driver specific. Do you have any more details on why this patch resolves the issue?

Martin, could you please respond to Gospo's questions in comment #2? He's asking for a retest on the 5.4 beta and for additional information on why this patch fixes the problem.

> While the checks to make sure that link is up are still in the NAPI poll
> routine (now igb_poll),

These checks have been removed in the latest OEM drivers we tested (1.3.23, 1.3.19.3).

> the call to check link status was removed in
> igb_clean_rx_irq_adv (which may or may not make a difference).

I can't see these checks in the mentioned OEM drivers.

> Do you have any more details on why this patch resolves the issue?

I believe the removal of the checks in the poll routine is the important part, but only Suzuki-san's engineers can give a final answer. I can tell that the OEM drivers don't show the problem under discussion (tested with >1000 reboot cycles).

(In reply to comment #7)
> > While the checks to make sure that link is up are still in the NAPI poll
> > routine (now igb_poll),
>
> These checks have been removed in the latest OEM drivers we tested (1.3.23,
> 1.3.19.3).

They are gone from upstream too, but based on the fact that we don't have the upper-layer changes that made the driver checks redundant, I didn't want to remove them.

> > the call to check link status was removed in
> > igb_clean_rx_irq_adv (which may or may not make a difference).
>
> I can't see these checks in the mentioned OEM drivers.
>
> > Do you have any more details on why this patch resolves the issue?
>
> I believe the removal of the checks in the poll routine is the important part,
> but only Suzuki-san's engineers can give a final answer. I can tell that the
> OEM drivers don't show the problem under discussion (tested with >1000 reboot
> cycles).

If some analysis has been done to determine why, I would like to understand it. If the link is being detected as 'up', then those calls should not make a difference. If the link is not 'up', then I would consider them important and would investigate whether they can be removed or whether there is another way to work around the problem.

> While the checks to make sure that link is up are still in the NAPI poll
> routine (now igb_poll),

Fujitsu Engineering told me that the link (carrier) is detected as 'down' in igb_clean_rx_ring_msix when we have the problem. If the link is detected as 'down' roughly 6 times (7 or 8 in some cases), igb_clean_rx_ring_msix is never called again, and all incoming packets are dropped forever.

Removing the check in igb_clean_rx_ring_msix keeps the incoming buffer clean, and the problem can be worked around.

> the call to check link status was removed in
> igb_clean_rx_irq_adv (which may or may not make a difference).

We are not sure if we really need this check in igb_clean_rx_irq_adv. In fact, we confirmed that the problem can be worked around in both cases, i.e. with and without the check in igb_clean_rx_irq_adv.
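For readers following the discussion, here is a minimal sketch of the kind of link check being debated, written in the old-style (pre-2.6.24) NAPI poll shape that RHEL5 uses. This is illustrative only; the function name and surrounding details are placeholders, not the actual RHEL5 igb source.

```c
/*
 * Minimal sketch, NOT the actual RHEL5 igb source; names and details
 * are placeholders.  It only illustrates the check under discussion.
 */
static int igb_poll_sketch(struct net_device *poll_dev, int *budget)
{
	int work_to_do = min(*budget, poll_dev->quota);
	int work_done = 0;

	/*
	 * The check in question: give up polling while the carrier is still
	 * reported down.  If frames already sit in the RX ring at this point
	 * and the NIC raises no further RX interrupt until the ring drains,
	 * reception can stall permanently, which is the behavior Fujitsu
	 * observed.
	 */
	if (!netif_carrier_ok(poll_dev))
		goto quit_polling;

	/* ... clean the RX ring here, updating work_done ... */

	*budget -= work_done;
	poll_dev->quota -= work_done;

	if (work_done < work_to_do) {
quit_polling:
		netif_rx_complete(poll_dev);	/* old-style NAPI completion */
		return 0;			/* polling finished */
	}
	return 1;				/* more work pending */
}
```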
> This code has changed a bit for RHEL5.4, so it would be nice
> if they could test there as well.
RHEL5.3 GA (2.6.18-128.el5) + ethtool -r eth0 ..... NG (20% failure)
RHEL5.3 GA (2.6.18-128.el5) + reboot .............. NG (20% failure)
RHEL5.4 beta (2.6.18-155.el5) + ethtool -r eth0 ... OK ( 0% failure)
RHEL5.4 beta (2.6.18-155.el5) + reboot ............ NG (15% failure)
With the proposed patch from the Description:

RHEL5.3 + igb-discard-packet + ethtool -r eth0 .... OK ( 0% failure)
RHEL5.3 + igb-discard-packet + reboot ............. -- ( no data )

> > These checks have been removed in the latest OEM drivers we tested
> > (1.3.23, 1.3.19.3).
>
> They are gone from upstream too, but based on the fact that we don't have
> the upper-layer changes that made the driver checks redundant, I didn't
> want to remove them.

Have you discussed that with John Ronciak? As you certainly know, the OEM driver packages have compatibility code for RHEL, and that code does omit the mentioned checks. I would believe they've been removed for a reason.

Side note: even the OEM driver 1.2.44 didn't have them any more. The last OEM driver in which I found the quit_polling label was 1.0.8, which is already quite old.

> If some analysis has been done to determine why, I would like to understand
> it. If the link is being detected as 'up', then those calls should not make
> a difference. If the link is not 'up', then I would consider them important
> and would investigate whether they can be removed or whether there is
> another way to work around the problem.

The argument from Suzuki-san makes sense to me. What about you?

I think the key bits here in defining the issue are 82575 and shared management. I don't have the RHEL 5.4 code in front of me, but based on the 5.3 code I have, it looks like the FIFO workaround for 82575 wasn't incorporated. There is a good chance that what is occurring is a FIFO corruption, with no packets being received as a result.

The patch below resolved the issue in the upstream driver.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=662d7205b3db0bf9ebcae31f30ed72a1bceb47af

(In reply to comment #12)
> > > These checks have been removed in the latest OEM drivers we tested
> > > (1.3.23, 1.3.19.3).
> >
> > They are gone from upstream too, but based on the fact that we don't have
> > the upper-layer changes that made the driver checks redundant, I didn't
> > want to remove them.
>
> Have you discussed that with John Ronciak? As you certainly know, the OEM
> driver packages have compatibility code for RHEL, and that code does omit
> the mentioned checks. I would believe they've been removed for a reason.

I haven't, but Intel has full access to our sources and I cannot recall them complaining about it.

> > If some analysis has been done to determine why, I would like to
> > understand it. If the link is being detected as 'up', then those calls
> > should not make a difference. If the link is not 'up', then I would
> > consider them important and would investigate whether they can be removed
> > or whether there is another way to work around the problem.
>
> The argument from Suzuki-san makes sense to me. What about you?

If you are referring to this one:

(In reply to comment #9)
> Fujitsu Engineering told me that the link (carrier) is detected as 'down'
> in igb_clean_rx_ring_msix when we have the problem. If the link is detected
> as 'down' roughly 6 times (7 or 8 in some cases), igb_clean_rx_ring_msix is
> never called again, and all incoming packets are dropped forever.
>
> Removing the check in igb_clean_rx_ring_msix keeps the incoming buffer
> clean, and the problem can be worked around.

this makes a bit of sense. There is a chance that during the 1-10 ms it takes for the link to appear up at the OS level, enough frames are received to fill up the receive ring buffer.
It also seems there is a chance that, though interrupts are enabled, the hardware will not raise any more interrupts until the ring buffer is cleared. I can't speak to whether the hardware actually works this way, but it seems plausible. I really don't have a problem removing these calls if we can be sure there is no other negative impact. As I see it, if we poll a device that doesn't have link up, there may be an extra poll from time to time, but I'm guessing we won't introduce too much extra work. I'd like to test this a bit more and target it for the next update in case it introduces a regression we don't expect, given that this is no worse than what we shipped in 5.3.

(In reply to comment #13)
> I think the key bits here in defining the issue are 82575 and shared
> management. I don't have the RHEL 5.4 code in front of me, but based on the
> 5.3 code I have, it looks like the FIFO workaround for 82575 wasn't
> incorporated. There is a good chance that what is occurring is a FIFO
> corruption, with no packets being received as a result.
>
> The patch below resolved the issue in the upstream driver.
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=662d7205b3db0bf9ebcae31f30ed72a1bceb47af

Thanks, Alexander. The commit that is being requested is this one:

commit 1a32bcfb5706d06a49904383b02f7c1d24172b96
Author: Alexander Duyck <alexander.h.duyck>
Date:   Tue Aug 26 04:25:11 2008 -0700

    igb: clean up a stray fake netdev code left in rx path

    Remove code that was in place to support fake netdev

We already have the commit you have linked in RHEL5.4 beta.

The fake netdev code was in place to handle multiple RX queues on NAPI systems prior to 2.6.24. From what I have seen, RHEL 5.4 has NAPI, but not the newer style seen in 2.6.24 and later kernels. You can remove the call to check for the real netdev as long as you make certain to disable all of the polling devices in the igb_down path.

In regards to handling a stalled queue, there should be code in igb_watchdog_task that will fire an interrupt every 2 seconds for all of the RX queues. This should handle the situation in which the RX queue is receiving packets before the netdevice reports carrier ok.
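As a rough illustration of that watchdog mechanism: the sketch below is loosely modeled on the upstream igb driver of that era, not the RHEL5 backport; field names such as eims_value and the register macros are taken from upstream and should be treated as assumptions here.

```c
/*
 * Sketch only, loosely modeled on upstream igb; not the RHEL5 backport.
 * The watchdog runs roughly every 2 seconds and forces an RX interrupt,
 * so an RX ring that filled up while the carrier was still reported down
 * gets cleaned instead of stalling forever.
 */
static void igb_watchdog_task_sketch(struct igb_adapter *adapter)
{
	struct e1000_hw *hw = &adapter->hw;
	int i;

	/* ... link-state handling, statistics update, etc. ... */

	/* Force a software interrupt so every RX ring gets polled. */
	if (adapter->msix_entries) {
		u32 eics = 0;

		for (i = 0; i < adapter->num_rx_queues; i++)
			eics |= adapter->rx_ring[i].eims_value;
		wr32(E1000_EICS, eics);		/* one MSI-X vector per RX ring */
	} else {
		wr32(E1000_ICS, E1000_ICS_RXDMT0);
	}

	/* Re-arm the watchdog about 2 seconds from now. */
	mod_timer(&adapter->watchdog_timer,
		  round_jiffies(jiffies + 2 * HZ));
}
```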
Just to keep this up to date, thanks!

5.3 GA (2.6.18-128.el5) + ethtool -r ... NG (27% fail = 8/30)
5.3 GA (2.6.18-128.el5) + reboot ....... NG (53% fail = 16/30)
5.3 GA (2.6.18-128.el5) + ethtool -r ... NG (20% fail = 4/20)
5.3 GA (2.6.18-128.el5) + reboot ....... NG (55% fail = 11/20)
5.3 GA (2.6.18-128.el5) + DC on/off .... NG (45% fail = 9/20)
5.3 GA + OEM 1.3.19.3 + ethtool -r ..... OK ( 0% fail = 0/30)
5.3 GA + OEM 1.3.19.3 + reboot ......... NG (10% fail = 3/30)
5.3 GA + OEM 1.3.19.3 + Q443040_igb_1.3.19.3.patch + ethtool -r ... OK ( 0% fail = 0/50)
5.3 GA + OEM 1.3.19.3 + Q443040_igb_1.3.19.3.patch + reboot ....... OK ( 0% fail = 0/50)

Note: Q443040_igb_1.3.19.3.patch is a patch from Intel.

The comments don't seem to auto-propagate from IT here, so I'd like to speed things up a bit. Background: we haven't told you the whole truth about all our problems with the ARP load test yet, because we were thinking that they might be in part HW or FW problems related to the shared management mode. Altogether there are 4 problems:

1. "Link up but no connection after boot or renegotiate" problem as described here.
2. With shared management LAN, no management LAN connection after DC off (shutdown -h); DC on via LAN is impossible.
3. In the DC on/off test, the igb driver sporadically reports "error -2" (E1000_ERR_PHY) during HW detection. The shared management port (eth0) is unusable in this case.
4. In a "repeat ethtool -r" test with shared management LAN, the igb driver sometimes reports "Hardware error", or negotiates a wrong speed (10 or 100 Mbit instead of 1 Gbit).

Problem 3 hasn't been reproduced in the lab so far; we currently think it's a single-system failure. All the other problems are solved with OEM driver 1.3.19.3 plus a patch that I'm going to attach right now.

Created attachment 351235 [details]
patch that fixed our problems under ARP load
With 1.3.19.3 + this patch we are not seeing any problems in our ARP storm tests any more.
Please consider this patch (and whatever bits you think are appropriate from attachment #349940) for inclusion in 5.4 beta, or at least in a test kernel, so that we can give it a try.
This patch was committed upstream a couple of days ago. The backport to the RHEL kernel should be fairly straightforward. The link to the upstream patch is:

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=19e588e7d156cc4415585edd8c27c3075f62eaf8

Just to keep this up to date, thanks!

5.4 snapshot 1 (2.6.18-156.el5) + ethtool -r ... OK ( 0% fail = 0/3480)
5.4 snapshot 1 (2.6.18-156.el5) + reboot ....... OK ( 0% fail = 0/ 589)

(In reply to comment #21)
> Just to keep this up to date, thanks!
>
> 5.4 snapshot 1 (2.6.18-156.el5) + ethtool -r ... OK ( 0% fail = 0/3480)
> 5.4 snapshot 1 (2.6.18-156.el5) + reboot ....... OK ( 0% fail = 0/ 589)

Glad to hear this is working. Does this mean we can close this? (Sounds like it to me.)

(In reply to comment #22)
> Glad to hear this is working. Does this mean we can close this?

The DC on/off test via iRMC is going to be tested in Paderborn soon. Let me ask you to wait for Martin's feedback on whether we can close this with the RHEL5.4 release.

(In reply to comment #16)
> In regards to handling a stalled queue, there should be code in
> igb_watchdog_task that will fire an interrupt every 2 seconds for all of the
> RX queues. This should handle the situation in which the RX queue is
> receiving packets before the netdevice reports carrier ok.

We confirmed that RHEL5.4 beta has the new code, which does not exist in RHEL5.3, and this works fine even though we don't have igb-discard-packet.diff in RHEL5.4 beta.

(In reply to comment #15)
> The commit that is being requested is this one:
>
> commit 1a32bcfb5706d06a49904383b02f7c1d24172b96
> Author: Alexander Duyck <alexander.h.duyck>
> Date:   Tue Aug 26 04:25:11 2008 -0700
>
>     igb: clean up a stray fake netdev code left in rx path
>
>     Remove code that was in place to support fake netdev
>
> We already have the commit you have linked in RHEL5.4 beta.

We also confirmed that RHEL5.4 beta has the code above.

(In reply to comment #20)
> This patch was committed upstream a couple of days ago. The backport to the
> RHEL kernel should be fairly straightforward. The link to the upstream patch
> is:
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=19e588e7d156cc4415585edd8c27c3075f62eaf8

We confirmed that the 'set lan id' patch is already in RHEL5.3 and RHEL5.4 snapshot 1, but we can see a difference in where the code is called, as shown below:

igb_probe()
  + igb_get_invariants_82575() ......... 'set lan id' in upstream
  + igb_get_bus_info_pcie() line 82 .... 'set lan id' in RHEL5.4 beta
  + igb_get_bus_info_pcie() line 104 ... 'set lan id' in RHEL5.3 GA

I am wondering if this difference could result in the following failure, which we have seen on RHEL5.3 with the 1024-node configuration, and whether the code in RHEL5.4 beta is called early enough.

> 3. In the DC on/off test, the igb driver sporadically reports "error -2"
> (E1000_ERR_PHY). The shared management port (eth0) is unusable.
>
> Jun 2 22:05:45 mpc0197 kernel: igb: probe of 0000:01:00.0 failed with error -2

It would be highly appreciated if you could give us your insight on this.

The failures would be due to the function ordering. 'set lan id' needs to be done before get_phy_id is called. If get_phy_id is called first, it will create a lock contention issue, as all ports will attempt to use the lock for PHY 0.

Yes, it looks like RHEL5.4 beta has the lock contention issue. Does this result in E1000_ERR_PHY when the access to the PHY times out? Maybe we get E1000_ERR_SWFW_SYNC in this case?
We think that the patch fixes most (if not all) of our current problems and would therefore very much appreciate its inclusion in RHEL5.4. It would be better to think about the E1000_ERR_PHY problem again (if it still occurs) after we have the patch in RHEL5.4.

Is Fujitsu currently noticing any problems with igb that we can be sure are fixed by:
commit 19e588e7d156cc4415585edd8c27c3075f62eaf8
Author: Alexander Duyck <alexander.h.duyck>
Date: Tue Jul 7 13:01:55 2009 +0000
igb: set lan id prior to configuring phy
The igb driver was defaulting to using the lock for pci-e function 0 for
all of the phys due to the fact that the lan id was not being set prior to
initialization. This change makes it so that the function id is set prior
to checking for the phy id.
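To make the ordering problem in the commit message concrete, here is a sketch based on the upstream igb/e1000 shared code. It is not the literal patch, and the names used (E1000_STATUS_FUNC_MASK, E1000_SWFW_PHY0_SM/PHY1_SM, igb_acquire_swfw_sync_82575) are taken from upstream, so treat them as assumptions with respect to the RHEL5 backport.

```c
/* Sketch of the ordering issue; not the literal patch. */

/* Which SW/FW semaphore a port must take depends on its LAN/function id. */
static s32 igb_acquire_phy_sketch(struct e1000_hw *hw)
{
	u16 mask = hw->bus.func ? E1000_SWFW_PHY1_SM : E1000_SWFW_PHY0_SM;

	return igb_acquire_swfw_sync_82575(hw, mask);
}

/*
 * Before the fix, hw->bus.func was still 0 on every port when the PHY was
 * first probed, so all ports (and, on a shared management port, possibly
 * the management engine too) contended for the PHY0 semaphore.  The fix is
 * to derive the function id from the STATUS register before any PHY access.
 */
static void igb_set_lan_id_sketch(struct e1000_hw *hw)
{
	u32 status = rd32(E1000_STATUS);

	hw->bus.func = (status & E1000_STATUS_FUNC_MASK)
			>> E1000_STATUS_FUNC_SHIFT;
}
```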
I would say I am 99% sure that their issue is being caused by the issue addressed by this patch. The problem essentially is that when the NICs are coming up they are competing for the PHY0 lock, since they all believe they are LAN ID 0. By setting the LAN ID beforehand this is avoided, since each LAN will go for a separate lock.

Thanks for the help, Alexander. It is greatly appreciated. If I read this correctly, this race is unlikely when there is minimal traffic on the network while the system is booting, but likely in cases where there is a large amount of traffic coming into both interfaces while booting, right?

I don't believe the traffic is even a requirement. Just unloading/reloading the driver should cause the issue. The problem is essentially that PHY semaphore 0 is used for all the ports during driver load, so if the first port is already loaded and happens to be doing something that is reading/writing the PHY as the next port loads, then that next port will fail to load since PHY 0 is already locked.

(In reply to comment #34)
> I don't believe the traffic is even a requirement. Just unloading/reloading
> the driver should cause the issue. The problem is essentially that PHY
> semaphore 0 is used for all the ports during driver load, so if the first
> port is already loaded and happens to be doing something that is
> reading/writing the PHY as the next port loads, then that next port will
> fail to load since PHY 0 is already locked.

That is interesting. I've not seen this on any of the dual-port igb cards I've used in the past, so I figured this was something specific to the load. We've also never heard anything about it before, so that makes me wonder what really causes this. I'm being particular about this because it's quite difficult to get things into 5.4 at this point, and I want to know how severe this problem is before anyone makes a push for inclusion.

The most likely contributor to all of this is the fact that the port is a shared management port. I suspect the management engine may be using the PHY lock as well, and this is contributing to the issues seen.

Ah, yes. IPMI and friends are always giving us trouble. :-) That would make sense too, as I'm often only able to test on add-on cards rather than LOMs.

Our customer is requesting a plausible explanation of why the patch from comment #30 fixes our problems. From my understanding, problem 1 from comment #18 is fixed by igb_watchdog_task (comment #16), while problems 2-4 from comment #18 are fixed by the patch from comment #30. Given the explanation in comment #31, that makes lots of sense for problem 3, which is observed at driver load time.

The one that bites us most, though, is problem 2 (management LAN dead, LEDs off, after shutdown). Alexander, can you explain why the set_lan_id patch fixes that? Comment #31 suggests lock contention at driver load time, whereas we are looking at a problem that happens when Linux is not even running. The experimental results prove that comment #30 fixes this behavior, but I can't see yet what the explanation would be.

Problem 4 from comment #18 is also strange, because here we have a fully loaded igb driver and we are just doing renegotiations. That also doesn't fit very well with the explanation in comment #31. But that's less important because it's not a customer issue.

I can't be certain, as there can be multiple factors, and I haven't done any actual work on the hardware in question.
The only issue addressed in the patch is that the wrong semaphore was being used when attempting to determine the PHY ID. Issue 3 was likely the result of a lock contention issue during initialization, which was likely due to the problem I resolved; however, I am not certain how it would have resolved issues 2 & 4. This fix may have addressed other issues I was not aware of, so I don't know how it would have resolved issues 2 & 4 specifically. Are you certain those issues are resolved by this patch as well?

> Are you certain those issues are resolved by this patch as well?
This is what our lab tests suggest, at least. We can't be 100% certain because the error was already quite rare with the driver 1.3.19.3. It was frequent with the EL5.3 native driver, though.
I think comment #36 explains why the patch fixes issue 2 and issue 3. If PHY 0 is locked by the BMC, unloading/reloading the driver would cause issue 3. If PHY 0 is locked by unloading/reloading the driver, the BMC could cause issue 2. I believe all our retest results in Germany and Japan, and all the discussion in this bugzilla, support comment #31. Without the patch, PHY 0 can be locked by mistake, and that causes a severe impact on server operation in a huge system like the customer's. Regarding the inclusion in 5.4, let me discuss it with our partner manager.

in kernel-2.6.18-164.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
Created attachment 349940 [details]
proposed patch

Description of problem:

In a DC power on/off test with a shared management/OS LAN port, the network sometimes fails to come up correctly. This was tested on Intel 82575EB NICs.

Steps to reproduce:

a) On a working system with current connectivity, reboot the machine.
b) During the reboot, flood the LAN with low-level network traffic (i.e., ARP requests for a fake IP address).
c) Once the tested server has entered runlevel 3 or 5, try to ping other systems in the subnet.
d) If it fails, run "ethtool -r ethX" to make traffic commence.

In the error situation, ifconfig reports "Link up" and ethtool reports "Link detected", but ping etc. to any remote destination fails. The system is using only one igb LAN port, which is operated in shared mode (the same port listens both for normal OS LAN connections and for IPMI traffic to the BMC). The client also suggested a patch.
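For completeness, one possible way to generate the kind of low-level ARP flood described in step b). This helper is hypothetical and not part of the original report; the reporter's actual traffic generator is not specified. It broadcasts ARP who-has requests for a fake IP address and would be run (as root) on a peer machine in the same subnet while the test system reboots.

```c
/* arp_flood.c: hypothetical test helper, not part of the original report.
 * Broadcasts ARP who-has requests for a (fake) IP as fast as possible.
 * Build: gcc -O2 -o arp_flood arp_flood.c
 * Usage: ./arp_flood <interface> <fake-target-ip>   (run as root on a peer)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <linux/if_ether.h>
#include <netinet/if_ether.h>
#include <netpacket/packet.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <interface> <fake-target-ip>\n", argv[0]);
		return 1;
	}

	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
	if (fd < 0) { perror("socket"); return 1; }

	struct ifreq ifr;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) { perror("SIOCGIFINDEX"); return 1; }
	int ifindex = ifr.ifr_ifindex;
	if (ioctl(fd, SIOCGIFHWADDR, &ifr) < 0) { perror("SIOCGIFHWADDR"); return 1; }

	unsigned char frame[42];                /* 14-byte Ethernet + 28-byte ARP */
	struct ether_header *eth = (struct ether_header *)frame;
	struct ether_arp *arp = (struct ether_arp *)(frame + sizeof(*eth));

	memset(eth->ether_dhost, 0xff, ETH_ALEN);        /* broadcast destination */
	memcpy(eth->ether_shost, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
	eth->ether_type = htons(ETH_P_ARP);

	arp->arp_hrd = htons(ARPHRD_ETHER);
	arp->arp_pro = htons(ETH_P_IP);
	arp->arp_hln = ETH_ALEN;
	arp->arp_pln = 4;
	arp->arp_op  = htons(ARPOP_REQUEST);
	memcpy(arp->arp_sha, ifr.ifr_hwaddr.sa_data, ETH_ALEN);
	memset(arp->arp_spa, 0, 4);                      /* sender IP 0.0.0.0     */
	memset(arp->arp_tha, 0, ETH_ALEN);
	if (inet_pton(AF_INET, argv[2], arp->arp_tpa) != 1) {
		fprintf(stderr, "bad IP address\n");
		return 1;
	}

	struct sockaddr_ll sll;
	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ARP);
	sll.sll_ifindex = ifindex;
	sll.sll_halen = ETH_ALEN;
	memset(sll.sll_addr, 0xff, ETH_ALEN);

	for (;;)                                         /* flood until killed    */
		if (sendto(fd, frame, sizeof(frame), 0,
			   (struct sockaddr *)&sll, sizeof(sll)) < 0)
			usleep(1000);                    /* brief back-off on error */
}
```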