Bug 1412426

Summary:	Ethernet driver e1000e gets confused on ThinkPad T450s on suspend/resume
Product:	[Fedora] Fedora	Reporter:	Bojan Smojver <bojan>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	25	CC:	cz172638, gansalmon, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-02-04 06:54:04 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bojan Smojver 2017-01-12 02:00:08 UTC

Description of problem:
After upgrading my ThinkPad T450s to Fedora 25's latest testing kernel (4.9.2-200.fc25), the e1000e driver stops working after suspend/resume. Only a cold boot of the system gets it working again.

Switching back to kernel 4.8.16 makes everything work again.

PCI info:
-------------------
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (3) I218-LM (rev 03)
        Subsystem: Lenovo Device 2226
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 49
        Region 0: Memory at f1200000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at f123e000 (32-bit, non-prefetchable) [size=4K]
        Region 2: I/O ports at 3080 [size=32]
        Capabilities: <access denied>
        Kernel driver in use: e1000e
        Kernel modules: e1000e
-------------------

A run of ethtool when the machine resumes shows the speed of the ethernet adapter as 10Mb/s, instead of regular 1000Mb/s, which is what the switch supports and what happens with 4.8.x kernels.

Version-Release number of selected component (if applicable):
4.9.2-200

How reproducible:
Always.

Steps to Reproduce:
1. Suspend T450s with ethernet on, connected at 1 Gb/s
2. Resume.
3. Ethernet won't reconnect.

Actual results:
No ethernet.

Expected results:
Works with 4.8.16.

Additional info:
https://bugzilla.kernel.org/show_bug.cgi?id=192401

Comment 1 Bojan Smojver 2017-01-12 04:17:31 UTC

This only happens with 4.9.2, but not with 4.8.16:

Jan 11 19:18:31 <machine> kernel: e1000e 0000:00:19.0 enp0s25: 10/100 speed: disabling TSO

Comment 2 Bojan Smojver 2017-01-13 19:57:05 UTC

Ditto 4.9.3.

Comment 3 Laura Abbott 2017-01-13 20:19:15 UTC

Can you do a bisect to determine which kernel commit broke suspend for you? That's going to be the fastest way to find the issue.

Comment 4 Bojan Smojver 2017-01-13 20:24:01 UTC

(In reply to Laura Abbott from comment #3)
> Can you do a bisect to determine which kernel commit broke suspend for you?
> That's going to be the fastest way to find the issue.

I can give it a try. May take a while. Keep you posted.

Comment 5 Bojan Smojver 2017-01-15 22:09:43 UTC

(In reply to Laura Abbott from comment #3)
> Can you do a bisect to determine which kernel commit broke suspend for you?

Bisect says:
--------------------------
commit e6dce825fba05f447bd22c865e27233182ab3d79
Merge: 9929780 08bf215
Author: Linus Torvalds <torvalds>
Date:   Mon Oct 3 20:11:49 2016 -0700

    Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gre
gkh/tty
    
    Pull tty and serial updates from Greg KH:
     "Here is the big tty and serial patch set for 4.9-rc1.
    
      It also includes some drivers/dma/ changes, as those were needed by
      some serial drivers, and they were all acked by the DMA maintainer.
    
      Also in here is the long-suffering ACPI SPCR patchset, which was
      passed around from maintainer to maintainer like a hot-potato. Seems I
      was the sucker^Wlucky one. All of those patches have been acked by the
      various subsystem maintainers as well.
    
      All of this has been in linux-next with no reported issues"
    
    * tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/
tty: (111 commits)
      Revert "serial: pl011: add console matching function"
      MAINTAINERS: update entry for atmel_serial driver
      serial: pl011: add console matching function
      ARM64: ACPI: enable ACPI_SPCR_TABLE
      ACPI: parse SPCR and enable matching console
      of/serial: move earlycon early_param handling to serial
      Revert "drivers/tty: Explicitly pass current to show_stack"
      tty: amba-pl011: Don't complain on -EPROBE_DEFER when no irq
      nios2: dts: 10m50: Add tx-threshold parameter
      serial: 8250: Set Altera 16550 TX FIFO Threshold
      serial: 8250: of: Load TX FIFO Threshold from DT
      Documentation: dt: serial: Add TX FIFO threshold parameter
      drivers/tty: Explicitly pass current to show_stack
      serial: imx: Fix DCD reading
      serial: stm32: mark symbols static where possible
      serial: xuartps: Add some register initialisation to cdns_early_console_setup()
      serial: xuartps: Removed unwanted checks while reading the error conditions
      serial: xuartps: Rewrite the interrupt handling logic
      serial: stm32: use mapbase instead of membase for DMA
      tty/serial: atmel: fix fractional baud rate computation
      ...
--------------------------

No idea what tty would have with e1000e, but I don't know kernel...

Anyhow, just to be sure, I double checked that revision and indeed my ethernet adapter is stuck in 10 Mb/s mode after resume and the network stops working.

Comment 6 Bojan Smojver 2017-01-16 02:29:07 UTC

In better news, I also compiled the tip as of about a couple of hours ago (i.e. just before 4.10-rc4), at c92816275674c1491ce228ee49aa030a5fa1be04 and that seems to have this issue fixed. What fixed it, I don't know yet.

Comment 7 Laura Abbott 2017-01-16 18:58:22 UTC

That's good to hear the tip is working. Unfortunately I agree that a tty merge commit is weirdly unrelated to your e1000e suspend. It may just be a timing issue. If you can find a single commit from 4.9 to 4.10-rc4 that fixes it for you, we can evaluate if a backport is appropriate.

Comment 8 Bojan Smojver 2017-01-16 21:18:27 UTC

(In reply to Laura Abbott from comment #7)
> If you can find a single commit from 4.9 to 4.10-rc4 that
> fixes it for you, we can evaluate if a backport is appropriate.

I will give it a try by reversing good/bad meaning in bisect, but I don't have a lot of confidence in that process given my last attempt, to be perfectly honest.

In the meantime, I'll run F25 4.9.x series on my T450s, just to make sure everything else is working as expected. I can live without suspend for a few months.

Comment 9 Bojan Smojver 2017-01-17 00:08:23 UTC

In the meantime, if I had to guess, I'd say maybe:
-----------------
commit 311191297125156319be8f86d546ea1c569f7e95
Author: WANG Cong <xiyou.wangcong>
Date:   Sat Dec 10 14:22:42 2016 -0800

    e1000: use disable_hardirq() for e1000_netpoll()
    
    In commit 02cea3958664 ("genirq: Provide disable_hardirq()")
    Peter introduced disable_hardirq() for netpoll, but it is forgotten
    to use it for e1000.
    
    This patch changes disable_irq() to disable_hardirq() for e1000.
    
    Reported-by: Dave Jones <davej.uk>
    Suggested-by: Sabrina Dubroca <sd>
    Cc: Peter Zijlstra (Intel) <peterz>
    Cc: Jeff Kirsher <jeffrey.t.kirsher>
    Signed-off-by: Cong Wang <xiyou.wangcong>
    Signed-off-by: David S. Miller <davem>
-----------------

Absolutely no idea whether it could even be related to my problems, but this commit is not in stable.

The only other one not in stable is:
-----------------
commit 91c527a55664ddf4bee26673a35f91748dae4142
Author: Jarod Wilson <jarod>
Date:   Mon Oct 17 15:54:05 2016 -0400

    ethernet/intel: use core min/max MTU checking
    
    e100: min_mtu 68, max_mtu 1500
    - remove e100_change_mtu entirely, is identical to old eth_change_mtu,
      and no longer serves a purpose. No need to set min_mtu or max_mtu
      explicitly, as ether_setup() will already set them to 68 and 1500.
    
    e1000: min_mtu 46, max_mtu 16110
    
    e1000e: min_mtu 68, max_mtu varies based on adapter
    
    fm10k: min_mtu 68, max_mtu 15342
    - remove fm10k_change_mtu entirely, does nothing now
    
    i40e: min_mtu 68, max_mtu 9706
    
    i40evf: min_mtu 68, max_mtu 9706
    
    igb: min_mtu 68, max_mtu 9216
    - There are two different "max" frame sizes claimed and both checked in
      the driver, the larger value wasn't relevant though, so I've set max_mtu
      to the smaller of the two values here to retain identical behavior.
    
    igbvf: min_mtu 68, max_mtu 9216
    - Same issue as igb duplicated
    
    ixgb: min_mtu 68, max_mtu 16114
    - Also remove pointless old == new check, as that's done in dev_set_mtu
    
    ixgbe: min_mtu 68, max_mtu 9710
    
    ixgbevf: min_mtu 68, max_mtu dependent on hardware/firmware
    - Some hw can only handle up to max_mtu 1504 on a vf, others 9710
    
    CC: netdev.org
    CC: intel-wired-lan.org
    CC: Jeff Kirsher <jeffrey.t.kirsher>
    Signed-off-by: Jarod Wilson <jarod>
    Signed-off-by: David S. Miller <davem>
-----------------

Comment 10 Bojan Smojver 2017-01-17 00:35:39 UTC

Neither commit from comment #9 fixes it, so red herring. It must be some other part of the kernel.

Comment 11 Bojan Smojver 2017-01-17 08:01:11 UTC

According to my totally untrustworthy bisect between 4.9 and 4.10-rc4, this is what fixed the problem (note: bad means good here):
----------------------
a5bc01949e3b19d8a23b5eabc6fc71bb50dc820e is the first bad commit
commit a5bc01949e3b19d8a23b5eabc6fc71bb50dc820e
Author: Johan Hovold <johan>
Date:   Tue Jan 3 16:39:58 2017 +0100

    USB: serial: omninet: fix NULL-derefs at open and disconnect
    
    Fix NULL-pointer dereferences at open() and disconnect() should the
    device lack the expected bulk-out endpoints:
    
    Unable to handle kernel NULL pointer dereference at virtual address 000000b4
    ...
    [c0170ff0>] (__lock_acquire) from [<c0172f00>] (lock_acquire+0x108/0x264)
    [<c0172f00>] (lock_acquire) from [<c06a5090>] (_raw_spin_lock_irqsave+0x58/0x6c)
    [<c06a5090>] (_raw_spin_lock_irqsave) from [<c0470684>] (tty_port_tty_set+0x28/0xa4)
    [<c0470684>] (tty_port_tty_set) from [<bf08d384>] (omninet_open+0x30/0x40 [omninet])
    [<bf08d384>] (omninet_open [omninet]) from [<bf07c118>] (serial_port_activate+0x68/0x98 [usbserial])
    
    Unable to handle kernel NULL pointer dereference at virtual address 00000234
    ...
    [<bf01f418>] (omninet_disconnect [omninet]) from [<bf0016c0>] (usb_serial_disconnect+0xe4/0x100 [usbserial])
    
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Cc: stable <stable.org>
    Signed-off-by: Johan Hovold <johan>

:040000 040000 4558327df56b19a98b8e2a23fc4928017f069855 b9b4e443b280b510e6d3f890912729752cfdcf57 M      drivers
----------------------

I completely don't believe it, but I have no idea what I'm talking about anyway.

So, hope it helps, but I won't hold my breath. :-)

Comment 12 Bojan Smojver 2017-01-17 08:14:03 UTC

(In reply to Bojan Smojver from comment #11)
> According to my totally untrustworthy bisect between 4.9 and 4.10-rc4, this
> is what fixed the problem (note: bad means good here):
> ----------------------
> a5bc01949e3b19d8a23b5eabc6fc71bb50dc820e is the first bad commit
> commit a5bc01949e3b19d8a23b5eabc6fc71bb50dc820e
> Author: Johan Hovold <johan>
> Date:   Tue Jan 3 16:39:58 2017 +0100

Absolutely no chance of this being the fix, given it's already in 4.9.3.

I'm thinking this must be a weirdly intermittent issue on my hardware and my testing efforts after each bisect were just not hitting the issue consistently.

So, my claim that this was fixed in the tip are also probably false.

Comment 13 SP 2017-01-20 16:04:10 UTC

Have you tried installing the latest firmware updates from koji?  The last fedora firmware updates were September '16 and are missing some drivers now required by the 4.9 kernels.  The latest, dated December '16, have being pushed to stable.  Download from here:
https://koji.fedoraproject.org/koji/buildinfo?buildID=822926
Install then remove and reinstall the latest kernel packages.  Not guaranteed to fix this bug but worked for me on a missing amdgpu driver.

Comment 14 Bojan Smojver 2017-01-20 21:27:50 UTC

(In reply to SP from comment #13)
> Have you tried installing the latest firmware updates from koji?

Yes. Made no difference that I could detect.

Comment 15 SP 2017-01-21 02:06:46 UTC

(In reply to Bojan Smojver from comment #14)
> (In reply to SP from comment #13)
> > Have you tried installing the latest firmware updates from koji?
> 
> Yes. Made no difference that I could detect.

 Have you run a dnf update today?  I noticed a lot of wireless driver firmware among the updates.

Comment 16 Bojan Smojver 2017-01-21 11:59:37 UTC

(In reply to SP from comment #15)

> Have you run a dnf update today?  I noticed a lot of wireless driver
> firmware among the updates.

I did, but this is not a wireless adapter.

BTW, the problem is still there in 4.9.4. We'll see whether 4.9.5 is any better.

Comment 17 Bojan Smojver 2017-01-22 05:05:54 UTC

(In reply to Bojan Smojver from comment #16)

> We'll see whether 4.9.5 is any better.

Nope, same.

Comment 18 Bojan Smojver 2017-02-04 06:54:04 UTC

I'm going to close this, because nobody else appears to be able to replicate it. With 4.9.7, if I disconnect the cable, wait a while and reconnect, the connection comes back at 1 Gb/s.

So, maybe some kind of weird interaction between my switch (Asus RT-AC68U) and the ethernet hardware in T450s.