Bug 1499321 - atlantic 10G ethernet driver blocks after a few sec's high transfer rate
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 26
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-06 18:13 UTC by Knud Christiansen
Modified: 2018-01-05 12:28 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-05 12:28:48 UTC
Type: Bug
Embargoed:


Attachments
ethtool -S server side before crash just after ifconfig <ifc> down -- up (1.10 KB, text/plain)
2018-01-01 20:59 UTC, Knud Christiansen
just after crash on server side (1.12 KB, text/plain)
2018-01-01 21:00 UTC, Knud Christiansen
clip of dmesg after crash (2.51 KB, text/plain)
2018-01-01 21:01 UTC, Knud Christiansen
patch to fix mapping leak (1.92 KB, patch)
2018-01-02 16:20 UTC, Neil Horman

Description Knud Christiansen 2017-10-06 18:13:12 UTC
Description of problem:
Aquantia atlantic 10G ethernet driver blocks after a few seconds of high transfer rate

Version-Release number of selected component (if applicable):
Kernel ver 4.13.4-200.x86_64
atlantic driver version 1.5.3  (modinfo)

How reproducible:
Always

Steps to Reproduce:
1. Run iperf3 on the 10G connection
2. After a few seconds, the transfer stops
3. Sometimes the connection can be reestablished by ifconfig down/up


Expected results:
connection stays

Additional info:
atlantic driver version 1.6.7 from the Aquantia web site works perfectly


Knud

Comment 1 Knud Christiansen 2017-10-06 19:08:16 UTC
correction:

With kernel 4.13.4 both drivers stop after a few seconds.

With kernel 4.12.14-300 driver 1.6.7 works perfectly, while driver 1.5.3 stops after a few seconds.

Knud

Comment 2 Knud Christiansen 2017-10-06 21:15:22 UTC
With F28 kernel 4.14:
Atlantic driver 1.5.3 also blocks.
Driver 1.6.7 from Aquantia works, but significantly slower than under F26 kernel 4.12.14.

Knud

Comment 3 Neil Horman 2017-12-22 18:01:44 UTC
Questions:

1) Is this tx traffic?  rx? both?

2) Do you see any errors in the message log when this occurs?  Does the nic eventually hit a dev_watchdog timeout and reset?

3) Does removing and re-insmodding the module fix the problem?

I don't have this nic handy to test with, so you will have to be my hands in debugging this.

Comment 4 Knud Christiansen 2017-12-23 08:26:04 UTC
1. Both tx and rx: in the cases where it blocks there is no traffic at all. Only one combination resulted in reduced speed.

2. dmesg shows nothing. dev_reset: no idea.

3. Re-inserting the module sometimes helps, but in general the entire system becomes unstable; a few times a kernel panic was seen somewhat later.

Somehow the problem has been solved, in that from kernel 4.14.x on F26 no problem occurs with driver 1.5.345.

The same holds for kernel 4.15.x on F28 with driver 1.5.345.

Status: for kernels older than 4.14.x on F26 and older than 4.15.x on F28, driver 1.5.345 causes trouble and a self-built driver 1.6.7 is necessary.

The only inconvenience is that you must rebuild the initramfs after every kernel update, because the stock driver pops in and replaces your own build of the driver.

HW is AMD Threadripper 1920x and Asrock x399 board
I know from a friend that AQC 108 chip (5G) and Ryzen 1700x on x370 chipset has same problem.

For me personally the problem is historic, because I need newer kernels anyway for Threadripper.

But I am willing to do tests if you want to look into the problem for other reasons.

Comment 5 Neil Horman 2017-12-27 11:48:25 UTC
I'm sorry, my first question was meant to ask about what type of traffic triggers the issue.  Is it a high transfer rate during transmission or reception of data?  Or does it happen in both directions?

Comment 6 Knud Christiansen 2017-12-27 17:46:45 UTC
I have just carefully retested:

kernel 4.13.13-200-F26, stock driver 1.5.345

Client is the failing machine.
Server is kernel 3.14.27-100-F19 with driver build 1.6.7.


Iperf3:

Client pings server: ok for approx. 40 pings, then host unreachable.
Driver dead, cannot establish any connections anymore.

Iperf3 server => client: ok for 2-3 complete runs, then dead.

Iperf3 client => server: low transfer rate in steps 1-2, going to zero for steps 3-10.

After this failure Iperf3 can make 1-2 runs server => client, then dead.

"Dead" means no connections; ifconfig <interface> down / up hangs forever,
and rmmod also hangs.

I suspect that the driver is corrupted. As mentioned earlier, the system is generally unstable.

Comment 7 Neil Horman 2017-12-28 17:03:34 UTC
That's a lot of words to not really answer the question.  Is the server the device with the aquantia card in it?

If pings cause this card to lock up, I'm inclined to think this is a hardware problem.

If ifconfig <ifc> down hangs, please do that, then record the output of a sysrq t, so that we have some idea of where the hang is.  That may help determine what's going on here.

Comment 8 Knud Christiansen 2017-12-28 19:38:17 UTC
Server has an Asus PCIe card XG-C100C with an AQC107 chip.

Client is an ASRock X399 Threadripper board with an onboard AQC107 chip.


The failure seems to happen when a high data rate is sent from client to server.

From a fresh reboot: ping in both directions seems to run forever.

Iperf3 with data (approx. 9,4 Gbit/s) from server to client can be repeated more than 10 times (approx. 100 GByte of data; not tested for longer).

One Iperf3 run in the direction client to server starts at approx. 100 MB/s and drops to 0 within a few test rounds (one Iperf3 test runs for 10 rounds).

(As root) ifconfig <ifc> down hangs forever and cannot be killed (htop as root also hangs).

htop in a user session shows 2 cpus (number 8 and 14) 100% loaded.

ksysguard shows 2 sessions with high cpu load: "ifconfig <ifc> down" and ksoftirqd/3.

I could not get sysrq t working; once I got a fullscreen session showing "sysrq show state", but nothing more.

It ends with a HARD reset; the system does not respond to Ctrl-Alt-Del or shutdown.

Comment 9 Neil Horman 2017-12-31 16:33:04 UTC
Ok, so that doesn't sound like the last comment you made, in which pinging from the client to server results in the hang, but perhaps that's ok.  100% load on rx traffic into the server suggests that the napi poll routine somehow never schedules itself off the cpu.  If I had to take a guess, the hardware is reporting some set of hardware level errors, which causes the napi poll routine for the atlantic driver to return from the poll early with 0 work done, causing the budget number to never get reduced.  Are you by any chance able to run ethtool -S on the atlantic interface on the server during the hang period?  I can write a patch to accelerate the budget count when such errors occur, but I would prefer to get some more evidence that this is the case first, and ethtool stats might provide that.
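
For anyone collecting the requested statistics, a minimal sketch of how two `ethtool -S` captures (one before the hang, one during it) could be compared. It assumes the usual "name: value" layout of `ethtool -S` output; the counter names in the sample data are hypothetical and will differ for the atlantic driver.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and any non-numeric line.
        if not sep or not value.lstrip("-").isdigit():
            continue
        stats[name.strip()] = int(value)
    return stats


def changed_counters(before, after):
    """Return {name: delta} for every counter whose value changed."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# Hypothetical sample captures; real counter names depend on the driver.
BEFORE = """NIC statistics:
     InPackets: 1000
     InErrors: 0
     OutPackets: 900
"""
AFTER = """NIC statistics:
     InPackets: 1000
     InErrors: 42
     OutPackets: 950
"""

deltas = changed_counters(parse_ethtool_stats(BEFORE), parse_ethtool_stats(AFTER))
for name, delta in sorted(deltas.items()):
    print(f"{name}: {delta:+d}")
```

In practice the two texts would come from files saved with `ethtool -S <ifc> > before.txt` and `ethtool -S <ifc> > after.txt`; only counters that moved between the captures are printed.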

Comment 10 Knud Christiansen 2018-01-01 20:59:27 UTC
Created attachment 1375370 [details]
ethtool -S server side before crash just after ifconfig <ifc> down -- up

Comment 11 Knud Christiansen 2018-01-01 21:00:17 UTC
Created attachment 1375371 [details]
just after crash on server side

Comment 12 Knud Christiansen 2018-01-01 21:01:33 UTC
Created attachment 1375372 [details]
clip of dmesg after crash

Comment 13 Knud Christiansen 2018-01-01 21:12:21 UTC
The dmesg clip is from the client; sometimes it requires 2 runs of iperf3 -c before something shows up in dmesg.

In general the system is unstable after a crash, so behavior can differ.

Comment 14 Neil Horman 2018-01-01 21:48:53 UTC
Thank you, that actually helps a lot.  Looking at the stats didn't help much, as it doesn't show anything overly relevant, but the dmesg log shows that we're getting quite a lot of io faults on the iommu from the atlantic device.  Looking at the code, I'm strongly starting to wonder if we don't have a serious memory leak in the driver, due to the fact that we constantly have space to fill in the rx ring, but we never unmap older buffers.  I'll start working on a patch for that, but could we please confirm this is the case while I do?  Please boot your system with the following kernel command line parameter:
amd_iommu=off


This won't solve the problem, but it will suppress the iommu faults, and allow your system to run a bit longer.  If my hypothesis is correct, your system will take longer to fail, and when it does so, it will be due to oom conditions, resulting from lost memory buffers that never get cleaned up.
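
Before running the test it is worth confirming that the parameter actually took effect after reboot. A minimal sketch, assuming the running kernel's command line is read from `/proc/cmdline`; the sample command line below is hypothetical.

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Hypothetical example command line, for illustration only.
SAMPLE = "BOOT_IMAGE=/vmlinuz-4.13.13-200.fc26.x86_64 ro rhgb quiet amd_iommu=off"

print(has_kernel_param(SAMPLE, "amd_iommu=off"))
```

On the test machine, `has_kernel_param(read_cmdline(), "amd_iommu=off")` should return True once the parameter is in effect.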

Comment 15 Neil Horman 2018-01-01 21:51:22 UTC
To be clear, since both systems are AMD systems with AQC NICs onboard, please apply the above parameter to both the client and server.

Comment 16 Knud Christiansen 2018-01-02 08:29:47 UTC
I will come back with the result from the amd_iommu=off test.

To clarify:

The systems in the current test setup are NOT both AMD systems.

Server setup is asrock Z77 with I5-3570 and the Asus XG-c100c PCIe card with AQC107 chip

Client setup: Threadripper Asrock x399 with AQC107 onboard


I have earlier mentioned:
"I know from a friend that AQC 108 chip (5G) and Ryzen 1700x on x370 chipset has same problem"....the AQC108 is onboard

Comment 17 Knud Christiansen 2018-01-02 09:10:05 UTC
test results with amd_iommu=off on the client:

Iperf3 -c (10 rounds per execution, average value taken at the end of each execution).
Result after each execution, all in Gbit/s:
9,3 - 6,1 - 6,2 - 9,3 - 9,3 - 6,1 - 6,0 - 6,1 - 9,3 - 6,2 - 6,1 - 9,4
Each execution is approx. 11 GByte of data transferred at 9,3 Gbit/s.

The 10 individual results behind an average of e.g. 9,3 lie between 9,17 and 9,42.

The 10 individual results behind an average of 6,1 lie between 5,92 and 6,17.


12 executions without crash, here I stopped 

Nothing in dmesg
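
The figures above can be cross-checked with a short calculation. This is only a sketch: it assumes each execution lasts 10 seconds (ten one-second rounds, the iperf3 default) and uses an arbitrary 8 Gbit/s threshold, chosen here for illustration, to split the runs into a fast and a slow group.

```python
RESULTS = "9,3 - 6,1 - 6,2 - 9,3 - 9,3 - 6,1 - 6,0 - 6,1 - 9,3 - 6,2 - 6,1 - 9,4"


def parse_results(text):
    """Convert the comma-decimal Gbit/s list into floats."""
    return [float(token.replace(",", ".")) for token in text.split(" - ")]


def gbytes_transferred(rate_gbit_s, seconds):
    """Data volume in GByte for a given rate and duration (8 bits per byte)."""
    return rate_gbit_s * seconds / 8


rates = parse_results(RESULTS)

THRESHOLD = 8.0  # assumed cut-off between the two observed groups
fast = [r for r in rates if r >= THRESHOLD]
slow = [r for r in rates if r < THRESHOLD]

print("executions:", len(rates), "fast:", len(fast), "slow:", len(slow))
print("mean fast:", round(sum(fast) / len(fast), 2))
print("mean slow:", round(sum(slow) / len(slow), 2))
# Assumes a 10 second run per execution.
print("GByte per fast run:", round(gbytes_transferred(9.3, 10), 1))
```

Under the 10 second assumption a 9,3 Gbit/s run moves a little over 11 GByte, which is consistent with the "approx. 11 GByte" reported above.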

Comment 18 Knud Christiansen 2018-01-02 09:13:48 UTC
I wonder a little why you are not looking into driver version 1.6.7 from the Aquantia web site.
It does not show this wrong behavior.

The code is completely open as far as I can see.

I also wonder why version 1.6.7 has not moved into the code tree instead of 1.5.345.

Comment 19 Neil Horman 2018-01-02 11:58:04 UTC
I'm not looking into the Aquantia driver for two reasons:

1) Without knowing what's wrong, I've no idea what to fix. That is to say, I would have no idea what code to port from the aquantia out-of-tree driver into the upstream driver.

2) The out-of-tree driver from the aquantia web site isn't open source. To port anything from that driver is a copyright violation.  As such I won't taint myself by looking at it.  I'm not sure how you see that the code is open, the driver download page for the driver clearly indicates that doing so subjects the end user to a license restriction that includes no right to copy or modify the software.

What you should do, if you want to accelerate this solution, is to open a bug with aquantia indicating that you would like them to fix their upstream driver.

Until then however, our only recourse is for me to fix this blindly.

Comment 20 Knud Christiansen 2018-01-02 12:37:48 UTC
I fully understand you with respect to 1) and 2), if you believe the 1.6.7 driver code is not free: just looking at it would make your code questionable as "open source".

I am not very familiar with the rules for open source, but a statement like this in the code:

*
 * aQuantia Corporation Network Driver
 * Copyright (C) 2014-2017 aQuantia Corporation. All rights reserved
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 */

made me believe it was open source.

But anyway, the Aquantia people contributing upstream to the driver should be the right ones to get the 1.6.7 driver on its way into the code tree.

I don't need to accelerate this solution, as the problem is gone in 4.14.x and 4.15.x and I also have my own build of 1.6.7 ready just in case.
Maybe the bug will come back, or it is still there but for some reason does not show itself.

I am willing to assist on this bug.

Comment 21 Neil Horman 2018-01-02 16:20:59 UTC
Created attachment 1375808 [details]
patch to fix mapping leak

I've not compile tested this yet, but it seems like at the very least this change is needed in the in-tree aquantia driver.  Rx descriptors never seem to get unmapped, but get overwritten on every refill of the rx ring, which is very, very bad.  Please build a kernel with this patch and confirm that your system functions as expected.
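
To illustrate why overwriting descriptors without unmapping is a leak, here is a toy model in Python. It is not the atlantic driver code and not the attached patch; `ToyIommu`, `refill` and `live_mappings_after` are hypothetical names invented for this sketch, which only counts how many mappings stay alive after repeated ring refills.

```python
class ToyIommu:
    """Toy stand-in for a DMA mapping table: tracks live mapping handles."""

    def __init__(self):
        self.live = set()
        self._next_handle = 0

    def map(self):
        handle = self._next_handle
        self._next_handle += 1
        self.live.add(handle)
        return handle

    def unmap(self, handle):
        self.live.remove(handle)


def refill(ring, iommu, unmap_old):
    """Give every ring slot a fresh mapping, optionally releasing the old one."""
    for i in range(len(ring)):
        if unmap_old and ring[i] is not None:
            iommu.unmap(ring[i])
        # Without the unmap above, the old handle is simply overwritten here.
        ring[i] = iommu.map()


def live_mappings_after(refills, ring_size, unmap_old):
    """Number of mappings still alive after `refills` full ring refills."""
    iommu = ToyIommu()
    ring = [None] * ring_size
    for _ in range(refills):
        refill(ring, iommu, unmap_old)
    return len(iommu.live)


print("leaky refill:", live_mappings_after(100, 8, unmap_old=False))
print("with unmap:  ", live_mappings_after(100, 8, unmap_old=True))
```

In the leaky variant the number of live mappings grows with every refill, while with the unmap step it stays at the ring size. That unbounded growth is the kind of behavior the hypothesis above describes.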

Comment 22 Knud Christiansen 2018-01-02 22:33:58 UTC
I will try the patch in the next few days.

Comment 23 Knud Christiansen 2018-01-04 21:48:10 UTC
I have not got through a build with your patch yet; there are some build issues because I want to use the 4.13.13-200.fc26 source, just to be sure that nothing else has changed.

But I have discovered that the Fedora 4.15.0.rc6 source now contains the Aquantia driver version 1.6.13,
even though the vanilla kernel 4.15.0-rc6 from kernel.org contains the old Aquantia 1.5.345 driver.

Someone has worked on this in the Fedora tree only.

Now you can look at the code without any problems.


But does it make sense to continue this bug?

Comment 24 Knud Christiansen 2018-01-05 06:42:02 UTC
I don't know why (my eyes or some updates), but the vanilla kernel 4.15.0-rc6 driver is now 1.6.13.

Comment 25 Neil Horman 2018-01-05 12:28:48 UTC
If the latest fedora kernel has the 1.6.13 driver, then aquantia released their code to the upstream community and we picked it up (4.15.0-rc6 suggests this was done very recently).  If it works for you, then no, I would imagine this bug is no longer relevant, just update and move along.

