Red Hat Bugzilla – Bug 1251434
alx driver doesn't work with kernel >= 4.1.2
Last modified: 2018-05-04 05:35:21 EDT
Created attachment 1060319 [details]
Description of problem:
With kernel <= 4.0.8, the alx Ethernet driver works automagically. Starting with 4.1.2 and continuing with 4.1.3, only the wireless connection works.
Version-Release number of selected component (if applicable):
$ uname -a
Linux hero.x.org 4.1.3-201.fc22.x86_64 #1 SMP Wed Jul 29 19:50:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -qa |grep NetworkManager
Actual results:
Says connected but is apparently not.
Expected results:
Work automagically like older kernels.
See attachments. Please let me know if additional information is required.
Created attachment 1060320 [details]
I have two machines with an Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08). Up to and including kernel 4.0.8, this interface was working fine. With kernel 4.1.3, it has stopped working properly. If I boot back into the 4.0.8 kernel, the interface works fine again. It is not a hardware problem.
Upon booting up with the 4.1.3 kernel, the network works for about a minute and then stops communicating. I can't ping the machine from another machine on the network. After about 25 minutes I get the message:
kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x8400, resetting
and then the network works again for about a minute and then stops working. This scenario repeats about every 25 minutes. The number sometimes is 0x0400 or 0x0401.
This bug is severe and high priority as these machines are basically unusable with the latest kernel.
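For anyone comparing the numbers in these "fatal interrupt" messages, here is a minimal shell sketch that lists which bits are set in each reported status value. It only does the arithmetic: the function name set_bits is made up for this example, and what each bit means would have to be looked up in the alx driver's register definitions, which are not quoted in this report.

```shell
# Print the positions (0 = least significant) of the bits set in a value.
# Accepts decimal or 0x-prefixed hex, e.g. set_bits 0x8400
set_bits() {
    value=$(($1))
    bit=0
    out=""
    while [ "$value" -ne 0 ]; do
        if [ $((value & 1)) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        value=$((value >> 1))
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

# The three values mentioned in the comment above.
for v in 0x8400 0x0400 0x0401; do
    printf '%s: bits %s\n' "$v" "$(set_bits "$v")"
done
```

For these three values it reports bits 10 and 15, bit 10 alone, and bits 0 and 10, so bit 10 is common to all of them.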
@Ldap Tester - thanks for posting. I changed the priority levels; hope it gets some attention now. If my system didn't have a wireless card, I'd be out of luck too.
Same issue with DELL One 27 with Atheros AR8161 PCIe 10/100/1000 Mbps Ethernet.
Currently defaulting to the old 4.0 kernel to work around the issue.
I believe Bug report 1249493 is the same bug.
Can you try a bisect using the scripts at https://pagure.io/fedbisect ? That's going to be the easiest way to narrow down the problem
Laura, I looked at the bisect procedure and it seems to be a bit over my head. I would be willing to try it, but the two machines I have with the AR8161 are in a remote location. If I were to reboot into a bad kernel, I would lose contact with the machine. Recovering is logistically problematic. Maybe one of the other posters can do the bisect.
All I can tell you now is that the AR8161 was (and is) working fine with kernel-4.0.8-200.fc21.x86_64 and fails under 4.1.3-100.fc21 or 4.1.3-200.fc22. I realize there are many kernel changes between those two versions, but can there be that many changes to the alx driver? The error message I posted could give you an additional clue. I consider this regression to be urgent in that my only work around is to run a kernel without the latest security fixes.
I have started with fedbisect and have gone through three iterations of the kernel at this point. As the computer is at home, it may take me some time to complete.
(In reply to Laura Abbott from comment #5)
> Can you try a bisect using the scripts at https://pagure.io/fedbisect ?
> That's going to be the easiest way to narrow down the problem
Some clarifications regarding the fedbisect instructions. These are simple things, but I had trouble with them as a newbie:
Where the instructions say:
git clone <location of repo>
they mean:
git clone <location of fedbisect repo>
There is a missing step after cloning the fedbisect repository:
cd <fedbisect repo>
Where the instructions say:
This will clone a kernel tree in <subdir> of the repo.
they mean:
This will clone a kernel tree in <subdir> of the fedbisect repo.
Here is what I ended up with from the bisect. It doesn't mean anything to me.
bash-4.3$ ./fedbisect.sh bad
HEAD is now at 387d375 PCI: Don't clear ASPM bits when the FADT declares it's unsupported
387d37577fdd05e9472c20885464c2a53b3c945f is the first bad commit
Author: Matthew Garrett <firstname.lastname@example.org>
Date: Tue Apr 7 11:07:00 2015 -0700
PCI: Don't clear ASPM bits when the FADT declares it's unsupported
Communications with a hardware vendor confirm that the expected behaviour
on systems that set the FADT ASPM disable bit but which still grant full
PCIe control is for the OS to leave any BIOS configuration intact and
refuse to touch the ASPM bits. This mimics the behaviour of Windows.
Signed-off-by: Matthew Garrett <email@example.com>
Signed-off-by: Bjorn Helgaas <firstname.lastname@example.org>
:040000 040000 733931aa65713217b4ee3c2ffe6952a961229e62 4e55b70cb187a3705734a04d16a675cd21d974ff M drivers
:040000 040000 100d49d0e01d4719f851e9cdb6a4ac6333b4eb01 2f3abbee84c7f4e697993078a80ea3310f3bcded M include
# first bad commit: [387d37577fdd05e9472c20885464c2a53b3c945f] PCI: Don't clear ASPM bits when the FADT declares it's unsupported
Found your commit!
I built a kernel with this commit reverted, but it didn't make any difference: still no Ethernet connection.
If the bisect didn't work then this should be reported to the maintainers upstream. There haven't been any changes to the alx driver recently.
Jay Cliburn <email@example.com> (maintainer:ATLX ETHERNET DRIVERS)
Chris Snook <firstname.lastname@example.org> (maintainer:ATLX ETHERNET DRIVERS)
email@example.com (open list:ATLX ETHERNET DRIVERS)
firstname.lastname@example.org (open list)
Found this as an interim solution. Works for me (currently)
On one of my two machines that use the alx driver, I have been testing kernel 4.1.3-200.fc22.x86_64 with MTU=9000 for the last six days, and I have seen no failures yet. I do not consider this a workaround, but rather a dodge, in that it appears to dodge the bug but we do not know why, and have no assurance that it will continue to do so.
As John said, this severe regression has already been reported to the upstream maintainers, but there has been no response. I think the Fedora maintainers should exert some pressure on the upstream maintainers, because the users aren't getting any attention from them. This regression is urgent. I cannot update my kernel to include the latest security fixes.
(In reply to Ldap Tester from comment #13)
> As John said, this severe regression has already been reported to the
> upstream maintainers, but there has been no response. I think the Fedora
> maintainers should exert some pressure on the upstream maintainers, because
> the users aren't getting any attention from them. This regression is
> urgent. I cannot update my kernel to include the latest security fixes.
Has anyone tried to directly contact the people/group Laura posted in #11? If not, I will.
I have sent email to the people mentioned, but everyone with this problem should do so also. The more voices they hear, the more likely we'll see some action.
Here's the email I received from Jay Cliburn.
From: J. K. Cliburn <email@example.com> Fri, Sep 11, 2015 at 12:40 PM
Cc: Chris Snook <firstname.lastname@example.org>, "Huang, Xiong" <email@example.com>
I'm pretty sure the alx driver is maintained by Qualcomm. Chris Snook and I (neither of us Qualcomm employees) worked to integrate the atl1 and atl2 drivers into the kernel several years ago. Since that time, however, Qualcomm has provided kernel support for all subsequent drivers (atl1c, atl1e, alx, and so on).
I've included on the cc list my only Qualcomm contact, and I'm not sure he's still there. The link below also provides a couple of email addresses that may help.
I received this patch from Matthew Garrett firstname.lastname@example.org. I really can't try the patch myself. My affected machines are in a remote location and in almost constant use. John, can you (or anyone else) test this patch?
diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index c8af3ce..fb562cc 100644
@@ -1242,6 +1242,8 @@ static int alx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
int bars, err;
+ pci_disable_link_state(pdev, PCIE_LINK_STATE_CLKPM);
err = pci_enable_device_mem(pdev);
Standby on the above patch. Matthew says it won't work. He will be sending me a new one.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.
Fedora 22 has now been rebased to 4.2.3-200.fc22. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.
If you experience different issues, please open a new bug report for those.
I can now confirm that this issue is still present under kernel-4.2.3-200.fc22.x86_64.
I also note that the MTU=9000 dodge doesn't always work. While it seems to work for kernel-4.2.3-200.fc22.x86_64, it didn't work for me under kernel-4.1.8-200.fc22.x86_64.
I can also confirm what others have reported: when this issue occurs, packets are successfully transmitted out of the machine, but none are received into the machine.
I do note that there appears to be some progress in resolving this bug, but we need a kernel maintainer to pick up the ball on this. Please see: https://bugzilla.kernel.org/show_bug.cgi?id=70761
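One way to check the transmit-but-no-receive symptom described above is to compare the interface packet counters before and after a short wait. The sketch below assumes the standard Linux sysfs counters under /sys/class/net/<interface>/statistics; the function names and the labels it prints are invented for illustration, and p5p1 is just the interface name from the log message earlier in this report.

```shell
# read_counters DIR
# Print "rx tx" packet counts from a statistics directory.
read_counters() {
    printf '%s %s\n' "$(cat "$1/rx_packets")" "$(cat "$1/tx_packets")"
}

# classify_counters RX_BEFORE TX_BEFORE RX_AFTER TX_AFTER
# "tx-only" means packets went out during the interval but none came in.
classify_counters() {
    rx_delta=$(($3 - $1))
    tx_delta=$(($4 - $2))
    if [ "$tx_delta" -gt 0 ] && [ "$rx_delta" -eq 0 ]; then
        echo "tx-only"
    elif [ "$rx_delta" -gt 0 ]; then
        echo "receiving"
    else
        echo "idle"
    fi
}

# Example on an affected machine (not run here):
#   before=$(read_counters /sys/class/net/p5p1/statistics)
#   sleep 10
#   after=$(read_counters /sys/class/net/p5p1/statistics)
#   classify_counters $before $after
```

A "tx-only" result while something else on the network is pinging the machine would be consistent with the reports here, though it is only a hint and not a diagnosis.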
Apparently openSUSE has already produced a patch for this bug and it has been reported to work. Please see https://bugzilla.kernel.org/show_bug.cgi?id=70761 Can we get a fedora kernel maintainer to do the same for us?
(In reply to Ldap Tester from comment #21)
> Apparently openSUSE has already produced a patch for this bug and it has
Actually, that isn't what is in the opensuse tracker. All Takashi did is create a one-off side repo with a separate ALX module that contains the 3 patches from pru in the kernel.org bug. Fedora doesn't provide kmod packages.
As the original patch author says in https://bugzilla.kernel.org/show_bug.cgi?id=70761#c31, the patches need to be reviewed upstream.
Well, upstream isn't doing anything at all about this bug.
It might be because upstream isn't actually aware of it. The kernel.org bugzilla service isn't mandatory and isn't widely followed by a majority of the maintainers. In any case, the patches still need to be sent to the actual maintainers on the netdev list.
On September 14, I did send email to email@example.com, advising them of this bug. I have received no response.
I'm chasing this some upstream now. I've got a laptop with a nic driven by alx, but it has no problems to speak of, so I can really only regression-test, if I can ever get a custom built kernel to actually boot -- currently tripping over a grub2 double-free whenever I try to boot a kernel I built with patches added (which are based on the ones in the upstream bug with some modifications).
I've attached a patch to the upstream bug that could use some testing. I've actually already got an F23 x86_64 build around here with the patch included that I could push somewhere for testing.
(In reply to Jarod Wilson from comment #27)
> actually already got an F23 x86_64 build around here with the patch included
> that I could push somewhere for testing.
If you can get it pushed somewhere (maybe a temporary copr repo?), I can take a look if I can test it.
(In reply to Ville Skyttä from comment #28)
> (In reply to Jarod Wilson from comment #27)
> > I've
> > actually already got an F23 x86_64 build around here with the patch included
> > that I could push somewhere for testing.
> If you can get it pushed somewhere (maybe a temporary copr repo?), I can
> take a look if I can test it.
I've never set up copr before, I just shoved the local mock build onto prc:
(In reply to Jarod Wilson from comment #29)
> I've never set up copr before, I just shoved the local mock build onto prc:
Seems to make a slight difference here, but doesn't seem to be a definitive fix. I did get network working once across several reboots and cable detach/attaches, but only once; otherwise it exhibits the same problems as 4.2.6-301.fc23.x86_64. Setting MTU to 9000 works the same as with other kernels and fixes connectivity.
Does it *only* work with a 9000 MTU, or do other things larger than 1500, smaller than 9000 also work? Wish I actually had affected hardware (and documentation) here... Kind of wondering how big the packets you receive with MTU set to 9000 actually are, might be able to figure out just how much padding might be needed. Might also need to write sans-header MTU into the hardware registers, not sure. Firing from the hip, in the dark, hoping to hit something. :)
I bisected the MTU some (with kernel-4.2.6-300.fc23.x86_64, not yours):
1500 does not work
1501 does not work
2000 does not work
2500 does not work
2999 works
3000 works
8000 works
9000 works
I can continue testing, just let me know what data you want and how I can get it. Feel also free to contact me in PM if you think that's more appropriate than discussing stuff here.
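The manual MTU search above could be automated with a plain binary search between a value known to fail and a value known to work. This is only a sketch: bisect_mtu and probe_mtu are hypothetical names, the probe needs root, and IFACE and PEER (a host that answers ping) are assumptions the caller has to supply.

```shell
# bisect_mtu BAD GOOD PROBE
# BAD is an MTU known to fail, GOOD is one known to work. PROBE is a command
# that takes an MTU and returns 0 when the link works at that MTU.
# Prints the smallest MTU for which PROBE succeeded.
bisect_mtu() {
    lo=$1
    hi=$2
    probe=$3
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(((lo + hi) / 2))
        if "$probe" "$mid"; then
            hi=$mid
        else
            lo=$mid
        fi
    done
    echo "$hi"
}

# Hypothetical probe: set the MTU, then see whether a peer answers.
probe_mtu() {
    ip link set dev "$IFACE" mtu "$1" &&
        ping -c 3 -W 2 "$PEER" >/dev/null 2>&1
}

# Example on real hardware (not run here), using 2500 as a value reported
# not to work and 9000 as a value reported to work:
#   IFACE=p5p1 PEER=192.0.2.1 bisect_mtu 2500 9000 probe_mtu
```

Since the failure is sometimes intermittent, a single ping round per step may give a misleading answer; repeating the probe a few times per MTU would be more trustworthy.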
(In reply to Ville Skyttä from comment #32)
> I bisected the MTU some (with kernel-4.2.6-300.fc23.x86_64, not yours):
> 1500 does not work
> 1501 does not work
> 2000 does not work
> 2500 does not work
> 2999 works
> 3000 works
> 8000 works
> 9000 works
> I can continue testing, just let me know what data you want and how I can
> get it. Feel also free to contact me in PM if you think that's more
> appropriate than discussing stuff here.
Exactly which hardware is it that you have? I've got two reports in the upstream bug that my patch (which is now in net-next, and should make its way into kernel 4.5) does help them considerably. One of them was an AR8162, not sure what the other was.
Something else you could try is adding that patch from net-next, and increase ALX_FRAME_PAD from 16 to larger values, see if there's a reasonable larger padding that behaves for your hardware. We could then tweak the driver code to set different padding amounts for different hardware, if that solves this particular issue for your hardware.
I've got multiple reports from the upstream bug that the patch that went into 4.5 fixes things for 8161 and 8162 users.
Good to hear. BTW I don't think I ever got a mail for comment 33, so I didn't know there was a question waiting for me. HW here is ASUS N56VZ laptop and the NIC is:
$ lspci -nnv | grep -A 1 Ethernet
04:00.0 Ethernet controller : Qualcomm Atheros AR8161 Gigabit Ethernet [1969:1091] (rev 08)
Subsystem: ASUSTeK Computer Inc. N56VZ [1043:1477]
(In reply to Ville Skyttä from comment #35)
> Good to hear. BTW I don't think I ever got a mail for comment 33, so I
> didn't know there was a question waiting for me. HW here is ASUS N56VZ
> laptop and the NIC is:
> $ lspci -nnv | grep -A 1 Ethernet
> 04:00.0 Ethernet controller : Qualcomm Atheros AR8161 Gigabit Ethernet
> [1969:1091] (rev 08)
> Subsystem: ASUSTeK Computer Inc. N56VZ [1043:1477]
Hm. Just got another report today specifically for an AR8161 that said it worked perfectly without the MTU work-around as of 4.5, so not sure why yours would still be misbehaving. :\
Can you try straight 4.5-rc2 or later, and possibly with an increased ALX_FRAME_PAD, if things still misbehave?
I could really use some hardware in front of me that reproduces the issue...
(In reply to Jarod Wilson from comment #36)
> Hm. Just got another report today specifically for an AR8161 that said it
> worked perfectly without the MTU work-around as of 4.5, so not sure why
> yours would still be misbehaving. :\
Hm, there might be a misunderstanding here. I haven't tried with anything newer than 4.3.4-300.fc23.x86_64, so no idea whether it's misbehaving with 4.5 or not.
> Can you try straight 4.5-rc2 or later, and possibly with an increased
> ALX_FRAME_PAD, if things still misbehave?
I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first, hopefully testing that is useful. Haven't tried building vanilla kernels myself in a long time and would rather stick with packaged ones.
(In reply to Ville Skyttä from comment #37)
> (In reply to Jarod Wilson from comment #36)
> > Hm. Just got another report today specifically for an AR8161 that said it
> > worked perfectly without the MTU work-around as of 4.5, so not sure why
> > yours would still be misbehaving. :\
> Hm, there might be a misunderstanding here. I haven't tried with anything
> newer than 4.3.4-300.fc23.x86_64, so no idea whether it's misbehaving with
> 4.5 or not.
Oh, yes, my mistake, I hadn't read closely enough. I thought you were still seeing the problems while using a build with that patch included.
> > Can you try straight 4.5-rc2 or later, and possibly with an increased
> > ALX_FRAME_PAD, if things still misbehave?
> I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first,
> hopefully testing that is useful. Haven't tried building vanilla kernels
> myself in a long time and would rather stick with packaged ones.
Yes, that should be useful for testing out the fix. Fingers crossed that it works! :)
(In reply to Jarod Wilson from comment #38)
> > I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first,
> > hopefully testing that is useful. Haven't tried building vanilla kernels
> > myself in a long time and would rather stick with packaged ones.
> Yes, that should be useful for testing out the fix. Fingers crossed that it
> works! :)
In its unmodified form, it doesn't help. (I'm just testing switching between MTU automatic and MTU 4000, not anything in between. 4000 works with everything I have.)
Bumping ALX_FRAME_PAD to 32 doesn't help either, but 128 seems to make things work. I'm currently doing a build with it set to 64 and will test that once it is ready. Builds tend to take quite a long time, so bisecting this way is quite slow. I'm doing it in mock (and ccache does help a lot, but anyway). What's the command I could run in the chroot just to get the modified alx.ko recompiled? I'd just modify and recompile it in the last build tree, which is still sitting there, and could work with just the recompiled module instead of doing a full rpm rebuild.
Maybe we should take the discussion to private mail, wonder if people are getting annoyed already? Anyway I don't mind Bugzilla, but whichever way, let me know the above and how much bisecting to find the right ALX_FRAME_PAD makes sense or where should I stop?
I've tested a bunch of different ALX_FRAME_PAD values now with 4.5.0-0.rc2.git2.1.fc24, results:
32, 64, 96, 112, 128, 160: Works for some time, then stops working, may start working again later, break again etc. 128 seems to work longest (IIRC even some tens of minutes or so), then 112. Others stop working almost immediately (within seconds).
16, 256: Does not seem to work at all.
I have no problem continuing the discussion in bugzilla, it could help others if we continue the discussion out in the open.
I'd have to go poke around for exact steps, but loosely... If you have kernel-devel for the running kernel installed, do an rpmbuild -bp of its src.rpm, and go down into the alx directory of the prepped tree, you can invoke make to build the alx kernel module against kernel-devel, and build an alx.ko that you can load into the running kernel.
Section 2.1 of this actually covers it pretty well:
I believe it's kernel-devel that lays down /lib/modules/`uname -r`/build for you.
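Loosely, the steps above could look like the following. This sketch only prints the commands instead of running them, since they need root and a prepared source tree. The function name and the directory argument are placeholders, and the exact location of the alx directory depends on how the src.rpm unpacks. The make line is the usual form for building an external module against /lib/modules/<version>/build.

```shell
# alx_rebuild_cmds ALX_DIR [KERNEL_VERSION]
# Print (do not run) the commands for rebuilding and reloading only alx.ko.
# ALX_DIR is the alx driver directory inside the tree prepared with
# "rpmbuild -bp"; KERNEL_VERSION defaults to the running kernel.
alx_rebuild_cmds() {
    alx_dir=$1
    kver=${2:-$(uname -r)}
    printf 'make -C /lib/modules/%s/build M=%s modules\n' "$kver" "$alx_dir"
    printf 'rmmod alx\n'
    printf 'insmod %s/alx.ko\n' "$alx_dir"
}

# Example (placeholder path):
#   alx_rebuild_cmds /path/to/prepped/tree/drivers/net/ethernet/atheros/alx
```

After editing ALX_FRAME_PAD or similar in the prepared tree, running the printed commands as root should swap in the rebuilt module without a full rpm rebuild, assuming kernel-devel for the running kernel is installed.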
(In reply to Ville Skyttä from comment #40)
> I've tested a bunch of different ALX_FRAME_PAD values now with
> 4.5.0-0.rc2.git2.1.fc24, results:
> 32, 64, 96, 112, 128, 160: Works for some time, then stops working, may
> start working again later, break again etc. 128 seems to work longest (IIRC
> even some tens of minutes or so), then 112. Others stop working almost
> immediately (within seconds).
> 16, 256: Does not seem to work at all.
Weird. I'm not quite sure yet what to make of those results... I'll try to take another look at the alx driver tomorrow with that information in mind and see if anything jumps out...
I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide. It did not solve my problem. Same symptoms as I have reported before. I have a Dell 2710 All in One with an AR8161.
Got a little sidetracked. Have been comparing some of the alx code with other drivers, and noticed many other drivers have rx buffer alignments of 1024, 2048 or 4096, while the alx driver is using only 8... So, maybe:
#define ALX_MAX_FRAME_LEN(_mtu) (ALIGN((ALX_RAW_MTU(_mtu) + ALX_FRAME_PAD), 8))
#define ALX_MAX_FRAME_LEN(_mtu) (ALIGN((ALX_RAW_MTU(_mtu) + ALX_FRAME_PAD), 1024))
(or 2048 or 4096 instead of 1024)
Other than that, I've got nothing. Wish I had affected hardware myself to poke at.
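To get a feel for what the alignment change does to the buffer size, here is a small shell sketch of the same arithmetic. It assumes that ALX_RAW_MTU adds the Ethernet header (14), FCS (4) and one VLAN tag (4) to the MTU; that definition is not quoted in this report, so treat it as an assumption. The pad of 16 is the ALX_FRAME_PAD value mentioned earlier, and the function names are made up.

```shell
# align_up X A: round X up to a multiple of A (A must be a power of two),
# the same arithmetic as the kernel's ALIGN() macro.
align_up() {
    echo $(( ($1 + $2 - 1) & ~($2 - 1) ))
}

# Assumption: raw MTU = MTU + Ethernet header (14) + FCS (4) + VLAN tag (4).
raw_mtu() {
    echo $(( $1 + 14 + 4 + 4 ))
}

# max_frame_len MTU PAD ALIGNMENT
max_frame_len() {
    align_up $(( $(raw_mtu "$1") + $2 )) "$3"
}

for a in 8 1024 2048 4096; do
    printf 'mtu=1500 pad=16 align=%s -> %s\n' "$a" "$(max_frame_len 1500 16 "$a")"
done
```

Under that assumption, an MTU of 1500 with a pad of 16 gives 1544 bytes with 8-byte alignment and 2048 bytes with 1024-byte alignment, so the proposed change is effectively a much larger pad for standard frames.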
Just noticed something else that could be relevant to the AR8161. In alx_init_sw(), there's some junk that sets alx->hw.lnk_patch for the AR8161 with device ID 0091 and revision 0, and causes some different register settings to be applied. That bit has been there since the driver first appeared, wondering if there are additional AR8161 devices that need the same treatment. Ville's appears to be an AR8161 with device ID 1091, revision 8 though, which I suspect is far newer, and probably less likely to need what look like work-arounds for flaky early chip versions, but I'm just theorizing here.
Tried with different ALX_MAX_FRAME_LEN values along with some different ALX_FRAME_PAD combinations, ditto (separately) setting alx->hw.lnk_patch unconditionally to true, but unfortunately still no joy.
(Instructions in comment 41 make testing things a breeze though, thanks for that and the continuous stream of ideas to try out!)
*** Bug 1305243 has been marked as a duplicate of this bug. ***
Found a working solution (https://bbs.archlinux.org/viewtopic.php?id=201459). It's most probably related to the jumbo frames setup. I followed the setup of MTU to 9000 and my wired connection now works without issues.
Problems persist with 4.5.5-201.fc23.x86_64
The 4.7 kernel should carry two additional patches from Intel's Feng Tang, which will hopefully finally resolve things for everyone:
Author: Feng Tang <firstname.lastname@example.org>
Date: Wed May 25 14:49:54 2016 +0800
net: alx: use custom skb allocator
Author: Feng Tang <email@example.com>
Date: Sun Jun 12 17:36:37 2016 +0800
net: alx: Work around the DMA RX overflow issue
I believe these may also be headed to -stable trees.
(In reply to Jarod Wilson from comment #51)
> The 4.7 kernel should carry two additional patches from Intel's Feng Tang,
> which will hopefully finally resolve things for everyone:
Tested 4.7.0-2.fc25.x86_64 on F-23: seems fixed for me!
Going to go ahead and close this bug then. I see at least one of the two patches already in 4.6.3 (and 4.6.6 was just released), I believe it went to older stable branches too, and it's definitely in 4.7 and later.
Ah yes, seems to work for me with at least 4.6.6-200.fc23.x86_64 too.
Just to clarify, we'll no longer need things like "MTU=9000", right?
(In reply to Dave M from comment #55)
> Just to clarify, we'll no longer need things like "MTU=9000", right?
Yes, that's what I mean by "works".
I also have the same problem with kernel version 3.10.0-693.
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V
02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 10)
Below are the logs I am getting:
Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting
Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting
Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting
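When collecting logs like the ones above, it may help to summarise which interrupt status values appear and how often. A minimal sketch, assuming the message format stays exactly as shown in this report; the function name is invented for the example.

```shell
# Read kernel log lines on stdin, pull the status value out of each alx
# "fatal interrupt" message, and count how often each value occurs.
alx_fatal_values() {
    sed -n 's/.*alx .*fatal interrupt \(0x[0-9a-fA-F]*\), resetting.*/\1/p' |
        sort | uniq -c
}

# Example (not run here):
#   journalctl -k | alx_fatal_values
```

Attaching that summary together with the lspci line for the NIC would make reports from different machines easier to compare.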