Bug 785806 - e1000e Detected Hardware Unit Hang
Summary: e1000e Detected Hardware Unit Hang
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-01-30 16:41 UTC by Scott Shambarger
Modified: 2020-07-06 22:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-06 22:54:57 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
dmesg for boot (67.19 KB, text/plain)
2012-01-30 16:41 UTC, Scott Shambarger
sos report on CentOS 8 (6.55 MB, application/x-xz)
2020-03-02 07:15 UTC, Dominik Holler


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 205047 0 None None None 2020-04-16 03:12:08 UTC

Description Scott Shambarger 2012-01-30 16:41:13 UTC
Created attachment 558405 [details]
dmesg for boot

Description of problem:
Warnings of hardware unit hang in syslog

Version-Release number of selected component (if applicable):
kernel 3.2.2-1.fc16.x86_64
Intel(R) PRO/1000 Network Driver - 1.5.1-k

How reproducible:
Seems to occur every few hours, even during periods of light traffic on the interface.

Additional info:
This may be related to 

Example error from syslog:

 e1000e 0000:02:00.0: p4p1: Detected Hardware Unit Hang:
   TDH                  <68>
   TDT                  <6a>
   next_to_use          <6a>
   next_to_clean        <68>
 buffer_info[next_to_clean]:
   time_stamp           <10b9bc45b>
   next_to_watch        <68>
   jiffies              <10b9bc8f1>
   next_to_watch.status <0>
 MAC Status             <80383>
 PHY Status             <792d>
 PHY 1000BASE-T Status  <7c00>
 PHY Extended Status    <3000>
 PCI Status             <10>
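
A dump like the one above is easier to compare across reports once the hex fields are turned into numbers. The following is a minimal Python sketch, not part of the driver or of any existing tool: the function names `parse_hang_dump` and `summarize` are hypothetical, and the ring size of 256 is an assumption (check the real value with `ethtool -g`). It reads TDT minus TDH as the number of descriptors still queued to the hardware, and jiffies minus time_stamp as how long the oldest one has been waiting.

```python
import re

# Matches lines such as "  TDH                  <68>", with or without a
# leading dmesg timestamp like "[  465.182136]".
FIELD_RE = re.compile(
    r"^\s*(?:\[\s*[\d.]+\]\s*)?([\w .\-]+?)\s+<([0-9a-fA-F]+)>\s*$"
)


def parse_hang_dump(text):
    """Return {field name: int} for every 'name <hex>' line in a dump."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields, ring_size=256):
    """Return (pending descriptors, age in jiffies) for a parsed dump.

    ring_size is an assumption used only to handle wrap-around of the
    ring indices; pass the value reported for the interface.
    """
    pending = (fields["TDT"] - fields["TDH"]) % ring_size
    age_jiffies = fields["jiffies"] - fields["time_stamp"]
    return pending, age_jiffies


# The dump from this report, copied verbatim.
SAMPLE = """
 e1000e 0000:02:00.0: p4p1: Detected Hardware Unit Hang:
   TDH                  <68>
   TDT                  <6a>
   next_to_use          <6a>
   next_to_clean        <68>
 buffer_info[next_to_clean]:
   time_stamp           <10b9bc45b>
   next_to_watch        <68>
   jiffies              <10b9bc8f1>
   next_to_watch.status <0>
 MAC Status             <80383>
 PHY Status             <792d>
 PHY 1000BASE-T Status  <7c00>
 PHY Extended Status    <3000>
 PCI Status             <10>
"""

if __name__ == "__main__":
    parsed = parse_hang_dump(SAMPLE)
    pending, age = summarize(parsed)
    print(f"pending descriptors: {pending}, age: {age} jiffies")
```

For this dump the sketch reports 2 pending descriptors that have been waiting 1174 jiffies. Converting jiffies to seconds depends on the kernel's HZ setting, which the dump itself does not state.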

lspci for device:

02:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Server Adapter
        Flags: bus master, fast devsel, latency 0, IRQ 53
        Memory at fba40000 (32-bit, non-prefetchable) [size=128K]
        Memory at fba20000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at d000 [size=32]
        Expansion ROM at fba00000 [disabled] [size=128K]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-15-17-ff-ff-e9-da-44
        Kernel driver in use: e1000e
        Kernel modules: e1000e

ethtool output:

        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on
        Supports Wake-on: pumbg
        Wake-on: d
        Current message level: 0x00000001 (1)
                               drv
        Link detected: yes

Comment 1 Scott Shambarger 2012-01-30 16:43:33 UTC
Note, this problem may be related to an issue discussed on the e1000-devel list:

http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg04658.html

There's a patch for that issue in bug 746272, but I can't view the bug to see it.

Might be a place to start though...

Comment 2 Marcelo Ricardo Leitner 2012-02-02 18:39:11 UTC
Seems to be the same issue.

Comment 3 Marcelo Ricardo Leitner 2012-02-02 18:42:54 UTC
For reference, this is fixed in upstream already:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commitdiff;h=09357b00255c233705b1cf6d76a8d147340545b8

Comment 4 Josh Boyer 2012-02-08 15:55:08 UTC
I've applied the commit from upstream to f15 and f16.  Thanks for the pointer Marcelo.

Comment 5 Scott Shambarger 2012-03-07 05:57:54 UTC
I haven't seen a 'Detected Hardware Unit Hang' for e1000 since updating to 3.2.6-3 (and beyond).  From my perspective at least, the bug appears fixed :)

Thanks! (should I close it, or should you?)

Comment 6 Josh Boyer 2012-03-07 15:30:46 UTC
Oops, yes thank you.

Comment 7 Bojan Smojver 2012-09-19 23:20:21 UTC
Fedora 17, kernel-3.5.4-1.fc17.x86_64:
-----------------------------
[  465.182136] e1000e 0000:00:19.0: eth0: Detected Hardware Unit Hang:
[  465.182136]   TDH                  <0>
[  465.182136]   TDT                  <5>
[  465.182136]   next_to_use          <5>
[  465.182136]   next_to_clean        <0>
[  465.182136] buffer_info[next_to_clean]:
[  465.182136]   time_stamp           <100027df5>
[  465.182136]   next_to_watch        <0>
[  465.182136]   jiffies              <100028749>
[  465.182136]   next_to_watch.status <0>
[  465.182136] MAC Status             <83>
[  465.182136] PHY Status             <796d>
[  465.182136] PHY 1000BASE-T Status  <3800>
[  465.182136] PHY Extended Status    <3000>
[  465.182136] PCI Status             <10>
[  467.177537] e1000e 0000:00:19.0: eth0: Detected Hardware Unit Hang:
[  467.177537]   TDH                  <0>
[  467.177537]   TDT                  <5>
[  467.177537]   next_to_use          <5>
[  467.177537]   next_to_clean        <0>
[  467.177537] buffer_info[next_to_clean]:
[  467.177537]   time_stamp           <100027df5>
[  467.177537]   next_to_watch        <0>
[  467.177537]   jiffies              <100028f18>
[  467.177537]   next_to_watch.status <0>
[  467.177537] MAC Status             <83>
[  467.177537] PHY Status             <796d>
[  467.177537] PHY 1000BASE-T Status  <3800>
[  467.177537] PHY Extended Status    <3000>
[  467.177537] PCI Status             <10>
[  469.174074] e1000e 0000:00:19.0: eth0: Detected Hardware Unit Hang:
[  469.174074]   TDH                  <0>
[  469.174074]   TDT                  <5>
[  469.174074]   next_to_use          <5>
[  469.174074]   next_to_clean        <0>
[  469.174074] buffer_info[next_to_clean]:
[  469.174074]   time_stamp           <100027df5>
[  469.174074]   next_to_watch        <0>
[  469.174074]   jiffies              <1000296e8>
[  469.174074]   next_to_watch.status <0>
[  469.174074] MAC Status             <83>
[  469.174074] PHY Status             <796d>
[  469.174074] PHY 1000BASE-T Status  <3800>
[  469.174074] PHY Extended Status    <3000>
[  469.174074] PCI Status             <10>
[  471.170695] e1000e 0000:00:19.0: eth0: Detected Hardware Unit Hang:
[  471.170695]   TDH                  <0>
[  471.170695]   TDT                  <5>
[  471.170695]   next_to_use          <5>
[  471.170695]   next_to_clean        <0>
[  471.170695] buffer_info[next_to_clean]:
[  471.170695]   time_stamp           <100027df5>
[  471.170695]   next_to_watch        <0>
[  471.170695]   jiffies              <100029eb8>
[  471.170695]   next_to_watch.status <0>
[  471.170695] MAC Status             <83>
[  471.170695] PHY Status             <796d>
[  471.170695] PHY 1000BASE-T Status  <3800>
[  471.170695] PHY Extended Status    <3000>
[  471.170695] PCI Status             <10>
[  473.167229] e1000e 0000:00:19.0: eth0: Detected Hardware Unit Hang:
[  473.167229]   TDH                  <0>
[  473.167229]   TDT                  <5>
[  473.167229]   next_to_use          <5>
[  473.167229]   next_to_clean        <0>
[  473.167229] buffer_info[next_to_clean]:
[  473.167229]   time_stamp           <100027df5>
[  473.167229]   next_to_watch        <0>
[  473.167229]   jiffies              <10002a688>
[  473.167229]   next_to_watch.status <0>
[  473.167229] MAC Status             <83>
[  473.167229] PHY Status             <796d>
[  473.167229] PHY 1000BASE-T Status  <3800>
[  473.167229] PHY Extended Status    <3000>
[  473.167229] PCI Status             <10>
[  473.174670] ------------[ cut here ]------------
[  473.174684] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x250/0x260()
[  473.174688] Hardware name: 4313CTO
[  473.174692] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[  473.174694] Modules linked in: fuse bnep bluetooth nf_conntrack_ipv4 ip6t_REJECT nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_state nf_conntrack ip6_tables binfmt_misc qcserial usb_wwan nfsd nfs_acl auth_rpcgss qmi_wwan uvcvideo lockd sunrpc cdc_wdm videobuf2_vmalloc snd_hda_codec_hdmi uinput snd_hda_codec_conexant videobuf2_memops videobuf2_core usbnet snd_hda_intel coretemp snd_hda_codec mii videodev media arc4 snd_hwdep snd_pcm snd_page_alloc snd_timer iwlwifi mac80211 mei lpc_ich kvm thinkpad_acpi snd soundcore mfd_core cfg80211 intel_ips i2c_i801 microcode e1000e rfkill crc32c_intel ghash_clmulni_intel firewire_ohci sdhci_pci sdhci firewire_core crc_itu_t mmc_core mxm_wmi wmi i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[  473.174780] Pid: 0, comm: swapper/1 Not tainted 3.5.4-1.fc17.x86_64 #1
[  473.174783] Call Trace:
[  473.174785]  <IRQ>  [<ffffffff8105867f>] warn_slowpath_common+0x7f/0xc0
[  473.174801]  [<ffffffff81058776>] warn_slowpath_fmt+0x46/0x50
[  473.174808]  [<ffffffff8151cb20>] dev_watchdog+0x250/0x260
[  473.174813]  [<ffffffff8151c8d0>] ? dev_deactivate_queue.constprop.30+0x80/0x80
[  473.174819]  [<ffffffff81068e31>] run_timer_softirq+0x141/0x340
[  473.174825]  [<ffffffff81061050>] __do_softirq+0xc0/0x1e0
[  473.174830]  [<ffffffff8101a953>] ? native_sched_clock+0x13/0x80
[  473.174838]  [<ffffffff8161621c>] call_softirq+0x1c/0x30
[  473.174844]  [<ffffffff81015225>] do_softirq+0x75/0xb0
[  473.174848]  [<ffffffff81061425>] irq_exit+0xb5/0xc0
[  473.174852]  [<ffffffff81616b5e>] smp_apic_timer_interrupt+0x6e/0x99
[  473.174857]  [<ffffffff816158ca>] apic_timer_interrupt+0x6a/0x70
[  473.174859]  <EOI>  [<ffffffff8132b83a>] ? intel_idle+0xea/0x150
[  473.174870]  [<ffffffff8132b81b>] ? intel_idle+0xcb/0x150
[  473.174877]  [<ffffffff814bad99>] cpuidle_enter+0x19/0x20
[  473.174882]  [<ffffffff814bb3b9>] cpuidle_idle_call+0xa9/0x240
[  473.174886]  [<ffffffff8101c55f>] cpu_idle+0xaf/0x120
[  473.174893]  [<ffffffff815fb565>] start_secondary+0x248/0x24a
[  473.174897] ---[ end trace d9255ac21b1ad160 ]---
[  473.174917] e1000e 0000:00:19.0: eth0: Reset adapter
[  476.563243] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
-----------------------------

Never seen this before the latest kernel.
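
The five dumps above all show the same time_stamp while jiffies keeps advancing, which suggests the same descriptor stays stuck until the watchdog resets the adapter. A small Python sketch makes that explicit. The function names are hypothetical, and the tick rate is estimated from the log itself rather than read from the kernel configuration, so treat the result as approximate.

```python
# (dmesg timestamp in seconds, jiffies) pairs copied from the five dumps above.
SAMPLES = [
    (465.182136, 0x100028749),
    (467.177537, 0x100028F18),
    (469.174074, 0x1000296E8),
    (471.170695, 0x100029EB8),
    (473.167229, 0x10002A688),
]

# time_stamp is identical in all five dumps.
TIME_STAMP = 0x100027DF5


def estimate_hz(samples):
    """Estimate jiffies per second from the first and last sample."""
    (t_first, j_first), (t_last, j_last) = samples[0], samples[-1]
    return (j_last - j_first) / (t_last - t_first)


def stall_seconds(samples, time_stamp, hz):
    """Return how long the descriptor had been pending at each dump."""
    return [(jiffies - time_stamp) / hz for _, jiffies in samples]


if __name__ == "__main__":
    # Round to the nearest hundred, since HZ is a round configured value.
    hz = round(estimate_hz(SAMPLES), -2)
    for (stamp, _), stall in zip(SAMPLES, stall_seconds(SAMPLES, TIME_STAMP, hz)):
        print(f"[{stamp:.6f}] descriptor pending for {stall:.3f} s")
```

With these values the estimate rounds to 1000 jiffies per second, and the pending time grows from about 2.4 s at the first dump to about 10.4 s at the last one, just before the NETDEV WATCHDOG warning.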

Comment 8 Bojan Smojver 2012-09-19 23:22:07 UTC
BTW, all this happened on resume from suspend to memory.

Comment 10 Scott Shambarger 2012-09-23 02:08:50 UTC
This bug does appear to be back.... Linux version 3.5.3-1.fc17.x86_64:

[202841.509024] e1000e 0000:03:00.0: p4p1: Detected Hardware Unit Hang:
[202841.509024]   TDH                  <50>
[202841.509024]   TDT                  <52>
[202841.509024]   next_to_use          <52>
[202841.509024]   next_to_clean        <50>
[202841.509024] buffer_info[next_to_clean]:
[202841.509024]   time_stamp           <10c158680>
[202841.509024]   next_to_watch        <50>
[202841.509024]   jiffies              <10c158b41>
[202841.509024]   next_to_watch.status <0>
[202841.509024] MAC Status             <80383>
[202841.509024] PHY Status             <792d>
[202841.509024] PHY 1000BASE-T Status  <3c00>
[202841.509024] PHY Extended Status    <3000>
[202841.509024] PCI Status             <10>

Comment 11 Odin Trisk 2012-11-10 20:53:41 UTC
For me, it seems every e1000 card reset causes an immediate system lockup.

So auto-recovery (or a card problem detected by the driver) causes a complete system freeze, after which the kernel eventually prints a line about tasks blocked for more than 120 seconds:

INFO: task events/1:12 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Immediately upon seeing the message "e1000 0000:02:02.0: eth2: Reset adapter"
the console does not accept/echo keyboard input.

I can reliably cause an e1000 card reset by running "ethtool -K eth2 rx off". The reset then results in a reproducible hang.

The system has been hanging for some time, either intermittently or under network load. The cause of the hang was the automatic reset after this kind of message:
 e1000 0000:00:04.0: eth0: Detected Hardware Unit Hang:


This link shows exactly the kind of kernel backtrace output I see:

https://www.centos.org/modules/newbb/print.php?form=1&topic_id=38658&forum=55&order=ASC&start=0

It looks like the e1000 driver uses a worker thread to run e1000_reset_task, which reaches e1000_down_and_stop and then goes on to call "schedule". Is this allowed? It then looks like it ends up processing an interrupt.
Might the work queue usage in the e1000 driver have a race, so that the hang occurs when schedule goes to re-evaluate it?


driver: e1000
version: 7.3.21-k8-NAPI

Comment 12 Odin Trisk 2013-01-15 04:24:01 UTC
I have also reported this recently in other places:

http://sourceforge.net/p/e1000/bugs/370/

http://bugs.centos.org/view.php?id=6187


   commit 8ce6909f77ba1b7bcdea65cc2388fd1742b6d669
   Author: Tushar Dave tushar.n.dave
   Date: Thu May 17 01:04:50 2012 +0000


http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=3a58107e4ed76e3a314002233a600234e0785aa1

This made it into 3.2.18, so it would need an explicit backport by RH. Can we confirm whether this patch is in the kernel package at this time?

Comment 13 Fedora End Of Life 2013-07-04 05:59:11 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Fedora End Of Life 2013-08-01 17:19:22 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 15 Peter Maloney 2015-02-16 11:55:59 UTC
FYI since this bug appears high in the results when searching for this issue, and still happens on various systems despite this bug being closed, here is what I did to fix this problem:

    sudo ethtool -K com gso off gro off tso off

Which I got from:

http://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
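
For anyone applying this workaround to several interfaces, here is a minimal Python sketch that builds the same `ethtool -K` command. It assumes `com` in the command above is the commenter's interface name; the function names are hypothetical, and the interface passed in must be replaced with your own. Settings changed with `ethtool -K` do not persist across a reboot, so the command needs to be rerun or wired into the network configuration.

```python
import subprocess

# Offloads disabled by the workaround quoted above.
DEFAULT_FEATURES = ("gso", "gro", "tso")


def offload_off_command(iface, features=DEFAULT_FEATURES):
    """Build the argv list for 'ethtool -K <iface> <feature> off ...'."""
    if not iface:
        raise ValueError("interface name must not be empty")
    command = ["ethtool", "-K", iface]
    for feature in features:
        command += [feature, "off"]
    return command


def disable_offloads(iface, dry_run=True):
    """Print the command, and run it only when dry_run is False.

    Running it requires root privileges and the ethtool binary.
    """
    command = offload_off_command(iface)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # "eth0" is a placeholder interface name; dry_run only prints the command.
    disable_offloads("eth0", dry_run=True)
```

Keeping the command construction separate from execution makes it possible to review what would run before changing a live interface.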

Comment 16 Dominik Holler 2020-03-02 07:15:39 UTC
Created attachment 1666919 [details]
sos report on CentOS 8

This bug occurs on CentOS 8.

Comment 17 Steven Ellis 2020-04-16 03:08:35 UTC
This is still an ongoing issue - I can see it today with RHEL 8.1

Upstream Ref
 - https://bugzilla.kernel.org/show_bug.cgi?id=205047

Comment 18 Marcelo Ricardo Leitner 2020-07-06 22:54:57 UTC
Please report a new bug against RHEL8 then. This bz was about a bug on F16!
It is very likely that the resulting effect is the same, but the root cause is very different.

