Bug 495472

Summary:

[Broadcom10gb] daEth stress breaks bnx2x driver in MRG1.1

Product:

Red Hat Enterprise MRG

Reporter:

IBM Bug Proxy <bugproxy>

Component:

realtime-kernel

Assignee:

Arnaldo Carvalho de Melo <acme>

Status:

CLOSED ERRATA

QA Contact:

David Sommerseth <davids>

Severity:

medium

Docs Contact:

Priority:

low

Version:

1.1

CC:

bhu, lgoncalv, ovasik, williams

Target Milestone:

1.1.3

Target Release:

---

Hardware:

x86_64

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2009-06-03 15:37:30 UTC

Attachments:

Description	Flags
backport 2.6.29 bnx2x driver to 2.6.24.7	none
Patch to allow clean unloading of driver (and clean shutdown of machine)	none
disable preemption around a small critical section in the pulse code	none
disable preempt in critical section in pulse code (with DCO)	none
Remove the pulse code from the driver	none

Description IBM Bug Proxy 2009-04-13 11:10:31 UTC

=Comment: #0=================================================
Gowrishankar Muthukrishnan <gowrishankar.m.com> - 
Problem Description:
--------------------
DaEth network stress breaks with bnx2x (Broadcom 10GbE) driver available
in MRG1.1. 

The daEth client on which the driver is stressed breaks, as shown in the logs:

tail -3 /var/log/messages :

Apr  4 04:56:41 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x220e) != mcp_pulse (0x268)
Apr  4 04:56:42 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x220f) != mcp_pulse (0x268)
Apr  4 04:56:43 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x2210) != mcp_pulse (0x268)

The log above grows by one line every second.
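To make the pattern in these lines easier to see, here is a minimal Python sketch that extracts the two counters from each line. The regular expression is modeled only on the three lines quoted above, and the helper names (`parse_pulse_mismatches`, `mcp_is_stuck`) are hypothetical. Reading an unchanging mcp_pulse as "stuck" is an assumption for illustration, not something the driver source was checked against.

```python
import re

# Pattern modeled on the /var/log/messages lines quoted in this report.
PULSE_RE = re.compile(
    r"\[bnx2x_timer:\d+\((?P<iface>\w+)\)\]"
    r"drv_pulse \((?P<drv>0x[0-9a-fA-F]+)\) != "
    r"mcp_pulse \((?P<mcp>0x[0-9a-fA-F]+)\)"
)


def parse_pulse_mismatches(lines):
    """Return (interface, drv_pulse, mcp_pulse) for every matching line."""
    records = []
    for line in lines:
        m = PULSE_RE.search(line)
        if m:
            records.append((m["iface"], int(m["drv"], 16), int(m["mcp"], 16)))
    return records


def mcp_is_stuck(records):
    """True if drv_pulse advanced while mcp_pulse kept a single value."""
    if len(records) < 2:
        return False
    mcp_values = {mcp for _, _, mcp in records}
    return len(mcp_values) == 1 and records[-1][1] > records[0][1]


SAMPLE = [
    "Apr  4 04:56:41 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x220e) != mcp_pulse (0x268)",
    "Apr  4 04:56:42 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x220f) != mcp_pulse (0x268)",
    "Apr  4 04:56:43 elm3c196 kernel: [bnx2x_timer:3971(eth2)]drv_pulse (0x2210) != mcp_pulse (0x268)",
]

records = parse_pulse_mismatches(SAMPLE)
for iface, drv, mcp in records:
    print(f"{iface}: drv_pulse={drv:#x} mcp_pulse={mcp:#x} gap={drv - mcp}")
print("mcp_pulse stuck:", mcp_is_stuck(records))
```

In the sample, drv_pulse advances by one per line while mcp_pulse stays at 0x268.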

The network is also unreachable at this point.

[root@elm3c196 logs]# ping -I eth2  -w 10 10.1.1.197
PING 10.1.1.197 (10.1.1.197) from 10.1.1.196 eth2: 56(84) bytes of data.

--- 10.1.1.197 ping statistics ---
0 packets transmitted, 0 received


Reported OS:

kernel : 2.6.24.7-101.el5rt (MRG1.1)
base OS: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Architecture: x86_64

Reported Hardware:

Broadcom 10GbE on HS21XM
=Comment: #2=================================================
Gowrishankar Muthukrishnan <gowrishankar.m.com> - 
Also, the ping log on the daEth client shows:

[root@elm3c196 logs]# tail -3 ping.log 
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
[root@elm3c196 logs]# 

=Comment: #3=================================================
Vernon Mauery <vernux.com> - 
I was hoping that a backport of the 2.6.29 bnx2x driver would help here.  It was significantly
better than the driver currently in MRG, but it only lasted about 4-5 hours in daEth.  While this is
better than a few minutes, it is still not 80 hours.

I am going to kick off a vanilla test to see if the 2.6.29 kernel can handle the stress of daEth. 
This might show us if it is an -rt race window or if the driver really can't handle it.  If the
driver really can't handle it, we may be out of luck, since the next improvements to the bnx2x
driver are to add multiple queues, which the 2.6.24 kernel doesn't support.
=Comment: #6=================================================
Vernon Mauery <vernux.com> - 
After testing a 2.6.29 vanilla kernel (no multi-queue, but updated bnx2x driver), it seems that this
may be an RT specific issue.  In running daEth between the 2.6.29 kernel and the MRG -108 kernel,
the -108 machine's card went to lunch with lots of error messages.

The first one we see is:
[bnx2x_attn_int_deasserted3:2695(eth2)]MCP assert!

and then a firmware dump

and then lots of messages like these:
[bnx2x_timer:3971(eth2)]drv_pulse (0x96c) != mcp_pulse (0x969)
bnx2x: eth3 NIC Link is Up, 10000 Mbps full duplex, receive & transmit flow control ON

When I see these, the NIC sometimes works and sometimes doesn't.  But from that point on, the
network stats are pooched (no more recording of packets/bytes that do get sent).

To test my theory of rt/vanilla, I am currently testing 2.6.29 vs. 2.6.29.1-rt5.  I will check back
in the morning.
=Comment: #7=================================================
Vernon Mauery <vernux.com> - 
No rest for the weary.  It appears that both 2.6.29 and 2.6.29.1-rt5 are pretty solid.  Both
machines have gone 6.5 hours so far without any bnx2x errors.  This is on top of an 8 hour test they
already ran last night with roles (client/server) swapped.  In that one, I did see some OOM messages
on the 2.6.29.1-rt5 machine when it was the server.  They would appear occasionally when there was
insufficient memory for an incoming packet.  One reason I am guessing I have not seen the OOM
messages before is that I usually run 2 -rt machines against each other, whereas with the client
being a vanilla kernel, it was likely able to push packets a lot faster, thus pushing the memory a
little harder on the server.  But I have not seen any of the drv_pulse != mcp_pulse or bnx_assert
messages.

So to recap, the original -108 bnx2x driver fails very quickly under high stress.  The -108 kernel
with a backported 2.6.29 bnx2x driver takes several hours to fail.  The 2.6.29.1-rt5 kernel seems to
be pretty solid.  Let's just rebase to 2.6.29 already. :)

So something in the core networking code or in the core kernel code is making the 2.6.29 driver
more stable in its native codebase than in the backported kernel.  Not being familiar in the least with
the bnx2x driver or hardware, I am thinking a bisection would be the best way to narrow the
difference.  The only thing about this that worries me is the MTBF.  If it takes me 6 hours to
determine success and there are 56875 commits between 2.6.24 and 2.6.29, it will take 16 bisections
or about 4 solid days.   The biggest problem I foresee is that with bisections, I will end up in
random states where the -rt patch won't apply nicely, which will invariably introduce errors into
the test.  So maybe instead of git bisection, I will start with major kernel releases to see what
happens there.  After all, there are 4 major releases between the two I have tested.  But now I am
not sure that all of those have -rt patchsets.... Well, at least it will give me a place to narrow
things down a bit.
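The bisection estimate above can be checked with a short Python sketch. The commit count and the 6 hours per test come from the comment. The helper name `bisect_steps` is hypothetical, and the model assumes every bisection step needs one full test run and that no commits have to be skipped.

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of halvings to isolate one commit among n_commits."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n > 1, without float error.
    return (n_commits - 1).bit_length() if n_commits > 1 else 0


commits = 56875      # commits between 2.6.24 and 2.6.29, as quoted above
hours_per_test = 6   # time needed to call one kernel "good", as quoted above

steps = bisect_steps(commits)
total_hours = steps * hours_per_test
total_days = total_hours / 24

print(f"{steps} bisection steps, {total_hours} hours, {total_days} days")
```

This agrees with the 16 bisections and roughly 4 solid days quoted in the comment, since 2^15 < 56875 <= 2^16.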
=Comment: #8=================================================
Vernon Mauery <vernux.com> - 
Apparently the 2.6.26.8-rt16 kernel with a backported 2.6.29 bnx2x driver seems to work under high
stress as well.  So something between 2.6.24 and 2.6.26 fixed our problem.  The question that I am
looking at now is what made the change that was important?  Was it a change in the base kernel or
was it a change in the -rt patchset?  So far, the only non-rt kernel I have tested is 2.6.29 and it
has been rock solid under stress.

Still to test:
2.6.{24,25,26} mainline kernels
2.6.24-rt + some net/ or kernel/ backports

Comment 1 IBM Bug Proxy 2009-04-14 19:01:09 UTC

------- Comment From vernux.com 2009-04-14 14:59 EDT-------
I just finished a run, testing the -108 vanilla kernel.  By the way, the vanilla config does not compile the bnx2x driver.  So grabbing the source and compiling the driver for vanilla, I set that up on one of the machines.  On the other machine I set up the -108 vanilla kernel with the 2.6.29 bnx2x driver backported to it.  Neither machine showed the

[bnx2x_timer:3971(eth2)]drv_pulse (0x220e) != mcp_pulse (0x268)

messages.  Although the machine with the -108 bnx2x driver showed some other BUG messages while it was the client, it did seem to stay up and the interface was still alive (but for some reason the bnx2 driver (1GbE) went bonkers while the test was running).  While the -108 vanilla + 2.6.29 driver was the client, both machines were silent and functional.

This shows that it is something to do with races introduced by CONFIG_PREEMPT.  But, this race window was closed in the 2.6.26-rt patchset.

While I was at it, I thought I might try testing the -108 -rt kernel with msi disabled.  This had no effect at all.  With the default -108 driver, both client and server had drv_pulse != mcp_pulse messages.

Comment 2 IBM Bug Proxy 2009-04-20 18:11:05 UTC

------- Comment From vernux.com 2009-04-20 14:07 EDT-------
Just out of curiosity, I ran daEth without the -p (pktgen) option over the weekend.  This was a 68 hour run that the 2.6.29 version of the bnx2x driver on the -108 kernel completed successfully.

Given this, I am much more confident in the stability of the 2.6.24-rt kernel.  I am much less worried about pktgen's results because, it being an in-kernel packet generator, it can push packets far faster than any userspace application.  Running without pktgen, daEth was still pushing more than 1 GB/s, and for this run, I had 2 instances running, one on each interface.  So the driver still got a very good workout.

I would propose that we consider pulling in the 2.6.29 bnx2x driver for now and lowering the severity/priority of this bug.  If need be, I can still continue to hunt down a better solution, but after the weekend test, I am fairly confident in the driver.

Comment 3 IBM Bug Proxy 2009-04-20 21:50:35 UTC

Created attachment 340442 [details]
backport 2.6.29 bnx2x driver to 2.6.24.7


------- Comment (attachment only) From vernux.com 2009-04-20 17:49 EDT-------

Comment 4 Arnaldo Carvalho de Melo 2009-04-24 19:55:58 UTC

Tested here with the daEth python script, with and without -b (cpu burn), but always without running the in-kernel pktgen. Two machines ran as clients. The bad news is that one of the _client_ machines, running several kernels, died, but the 2.6.24.7-113.bnx2x.el5rt kernel built with the backport provided in this ticket survived. Thanks Vernon.

Comment 6 IBM Bug Proxy 2009-04-29 19:51:04 UTC

Created attachment 341826 [details]
Patch to allow clean unloading of driver (and clean shutdown of machine)


------- Comment on attachment From vernux.com 2009-04-29 15:45 EDT-------


Do not call napi_disable in the unload path

This patch reverses the changes ported from the 2.6.29
bnx2x driver, which cause a hang on the unload path.

The 2.6.29 driver originally called napi_del, which
is a newer API that 2.6.24 does not have.  I ported
this as napi_disable, which is incorrect.  The calls
to napi_del were added by the commit below.  The patch
below removes the ported calls.

napi_disable is already called in the unload path once
and is not needed these other three times.

Signed-off-by: Vernon Mauery <vernux.com>


commit 7cde1c8b79f913a0158bae4f4c612de2cb98e7e4
Author: Eilon Greenstein <eilong>
Date:   Thu Jan 22 06:01:25 2009 +0000

    bnx2x: Calling napi_del
    
    rmmod might hang without this patch since the reference counter is not going
    down
    
    Signed-off-by: Yitchak Gertner <gertner>
    Signed-off-by: Eilon Greenstein <eilong>
    Signed-off-by: David S. Miller <davem>

Comment 7 IBM Bug Proxy 2009-05-01 00:20:53 UTC

------- Comment From vernux.com 2009-04-30 20:11 EDT-------
I sent an email to Eilon Greenstein, the bnx2x driver maintainer, and this was his reply.

Thank you for the /var/log/messages - now I can see that all my assumptions were wrong...  :)

The FW assert that caused this mess is still related to the FW pulse but from a different angle - function 0 did not post a pulse for more than 5 seconds so the FW assumed it was dead and took over the link. Once the FW took over the link, the function received an interrupt that the link was changing and tried to access it while the FW was messing around with it - which caused the FW to receive an invalid response from the HW and hang. From this point, only a reboot will help the FW (this issue is also fixed in a later FW version).

The solution is still to disable the pulse mechanism - if the driver never uses this pulse, then the FW will not declare a timeout (the FW starts with the assumption that this mechanism is disabled - the first pulse by the driver enables it). However, the reason for this issue is different from what I originally thought - the reason is that function 0 did not send a pulse for over 5 seconds. This is possible and we are actually thinking about a mechanism to overcome this issue (have the FW send a special interrupt that the driver will acknowledge in the ISR) - but I have to say that so far, this was an issue only under Windows in some stress scenarios - and the current solution in Windows is to disable the pulse mechanism. This is the first time I have seen Linux delay the timer for so long - but it is possible.

So disabling the pulse is not just an ugly work around - it is actually addressing the root-cause of the issue. Since the pulse is not needed in your setup - it has no side effects.

I hope it helps,
Eilon

======================
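The behaviour described in the reply can be illustrated with a toy Python model. This is only a sketch of the description above, not the firmware or driver logic: the class `FirmwareWatchdog`, the function `run`, and the event timings are all hypothetical, and the only values taken from the reply are the 5 second timeout and the rule that the first pulse enables the check.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Timeout quoted in the maintainer's reply ("more than 5 seconds").
PULSE_TIMEOUT_S = 5


@dataclass
class FirmwareWatchdog:
    """Toy model of the pulse check; not the real firmware logic."""

    last_pulse: Optional[int] = None  # None: no pulse seen yet, check disabled
    took_over_link: bool = False

    def pulse(self, now: int) -> None:
        # The first pulse posted by the driver enables the mechanism.
        self.last_pulse = now

    def check(self, now: int) -> None:
        # With no pulse ever posted, no timeout is ever declared.
        if self.last_pulse is None:
            return
        if now - self.last_pulse > PULSE_TIMEOUT_S:
            self.took_over_link = True


def run(pulse_times: Iterable[int], check_times: Iterable[int]) -> bool:
    """Replay driver pulses and firmware checks in time order."""
    fw = FirmwareWatchdog()
    events = [(t, "pulse") for t in pulse_times]
    events += [(t, "check") for t in check_times]
    for t, kind in sorted(events):
        if kind == "pulse":
            fw.pulse(t)
        else:
            fw.check(t)
    return fw.took_over_link


healthy = run(range(20), range(20))           # one pulse per second
delayed = run([0, 1, 2, 9, 10], range(11))    # timer starved from t=2 to t=9
never_pulsed = run([], range(60))             # pulse code removed from driver

print(healthy, delayed, never_pulsed)
```

In this model the link is taken over only in the delayed case, while a driver that never posts a pulse never triggers the timeout, which is the reasoning behind disabling the pulse mechanism.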

I tried this and it appeared to be more stable than with the pulse enabled.  However, despite Eilon saying it is not a hack, it sounds hackish to me that it needs to be removed in -rt kernels, but not in vanilla kernels.  Upon inspection, I saw that there is an IO write and then read that look like they ought to be atomic.  When running in a vanilla kernel, they would be in soft-irq context, which is pretty safe from preemption.  In -rt, soft irqs are not so sacred.  So I threw in a preempt_disable/preempt_enable around the IO and it appears to have helped the situation.  I am running a test overnight to see if it works out.

Comment 8 IBM Bug Proxy 2009-05-01 14:10:54 UTC

Created attachment 342092 [details]
disable preemption around a small critical section in the pulse code


------- Comment on attachment From vernux.com 2009-05-01 10:00 EDT-------


According to Eilon Greenstein, the bnx2x maintainer, the pulse code is not strictly necessary, so we could cut it out altogether.  But this patch allows us to have the best of both worlds.  We get the pulse code, which will at least alert us if something really BAD has happened, and we get to run without falling over.

Comment 9 IBM Bug Proxy 2009-05-01 14:20:57 UTC

Created attachment 342093 [details]
disable preempt in critical section in pulse code (with DCO)


------- Comment on attachment From vernux.com 2009-05-01 10:11 EDT-------


I forgot to add my DCO and a header to the patch.

Comment 10 Luis Claudio R. Goncalves 2009-05-07 21:26:19 UTC

The following patches have been added to kernel 2.6.24.7-115.el5rt: 

    bnx2x_hang_on_unload.patch
    bnx2x_preempt_disable.patch

Comment 11 IBM Bug Proxy 2009-05-19 22:40:56 UTC

------- Comment From vernux.com 2009-05-19 18:38 EDT-------
Reopening bug because we found that the previous patches were not enough after running a longer test.

Comment 12 IBM Bug Proxy 2009-05-19 22:50:53 UTC

Created attachment 344725 [details]
Remove the pulse code from the driver


------- Comment on attachment From vernux.com 2009-05-19 18:41 EDT-------


This patch removes the pulse code from the driver as suggested by Eilon Greenstein, the maintainer.

I have run an 80 hour test and the interface stayed up and stable during the time.  I did see that it stopped reporting statistics during this time, but the packets still went through.  I am looking into the statistics issue now.

Comment 16 errata-xmlrpc 2009-06-03 15:37:30 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1081.html