Bug 1487844 - igb: Detected Tx Unit Hang, transmit queue timed out, Reset adapter
Summary: igb: Detected Tx Unit Hang, transmit queue timed out, Reset adapter
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-09-02 14:23 UTC by Michael Cronenworth
Modified: 2018-08-10 18:18 UTC
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 15:39:20 UTC



Description Michael Cronenworth 2017-09-02 14:23:58 UTC
This issue has occurred for as long as I've owned the adapter, but it typically happened only once every one to three months, so I would just reboot the server and forget about it.

Since the kernel 4.12 update, the issue appears within 1 day to 1 week, so it is now becoming a real problem.

When the issue starts, the network adapters begin resetting in a loop, which causes brief but annoying network interruptions. The only workaround is to reboot.

$ lspci -nn | grep -i network
02:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
02:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)

[16851.131490] NETDEV WATCHDOG: p4p2 (igb): transmit queue 5 timed out
[16851.131535] ------------[ cut here ]------------
[16851.131547] WARNING: CPU: 7 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x215/0x220
[16851.131549] Modules linked in: sch_htb act_mirred cls_u32 sch_ingress ifb sit tunnel4 ip_tunnel cfg80211 rfkill xt_recent ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_ipv6header nf_nat_ipv4 nf_nat xt_conntrack xt_mark nf_conntrack iptable_mangle ip6table_filter ip6_tables w83627ehf hwmon_vid intel_powerclamp coretemp kvm_intel kvm raid456 irqbypass async_raid6_recov async_memcpy async_pq async_xor async_tx xor intel_cstate raid6_pq intel_uncore libcrc32c joydev gpio_ich iTCO_wdt iTCO_vendor_support cdc_acm tpm_infineon shpchp i7core_edac lpc_ich i2c_i801 acpi_cpufreq tpm_tis tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt raid1 mgag200 drm_kms_helper ttm igb drm crc32c_intel
[16851.131616]  ptp serio_raw pps_core dca i2c_algo_bit
[16851.131626] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.12.8-200.fc25.x86_64 #1
[16851.131629] Hardware name: Supermicro X8SIL/X8SIL, BIOS 1.2a       06/27/2012
[16851.131632] task: ffff89e2f6264900 task.stack: ffffb52300cf8000
[16851.131637] RIP: 0010:dev_watchdog+0x215/0x220
[16851.131640] RSP: 0018:ffff89e2ffdc3e60 EFLAGS: 00010286
[16851.131644] RAX: 0000000000000037 RBX: 0000000000000005 RCX: 0000000000000000
[16851.131646] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 0000000000000300
[16851.131649] RBP: ffff89e2ffdc3e80 R08: 0000000000000001 R09: 00000000000003a3
[16851.131652] R10: ffff89e2ffdd0430 R11: 00000000000003a3 R12: ffff89e2ef580000
[16851.131655] R13: 0000000000000007 R14: 0000000000000008 R15: ffff89e2ef580000
[16851.131659] FS:  0000000000000000(0000) GS:ffff89e2ffdc0000(0000) knlGS:0000000000000000
[16851.131662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16851.131665] CR2: 0000004a120daeb8 CR3: 000000003fe09000 CR4: 00000000000006e0
[16851.131668] Call Trace:
[16851.131672]  <IRQ>
[16851.131678]  ? qdisc_rcu_free+0x50/0x50
[16851.131685]  call_timer_fn+0x35/0x130
[16851.131691]  run_timer_softirq+0x1d1/0x420
[16851.131698]  ? sched_clock+0x9/0x10
[16851.131703]  ? sched_clock+0x9/0x10
[16851.131711]  ? sched_clock_cpu+0x11/0xb0
[16851.131718]  __do_softirq+0x10c/0x2a5
[16851.131728]  irq_exit+0xff/0x110
[16851.131734]  smp_apic_timer_interrupt+0x3d/0x50
[16851.131739]  apic_timer_interrupt+0x93/0xa0
[16851.131749] RIP: 0010:cpuidle_enter_state+0x11e/0x2c0
[16851.131753] RSP: 0018:ffffb52300cfbe60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[16851.131759] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 000000000000001f
[16851.131763] RDX: 00000f5375b4a4d6 RSI: ffff89e2ffdd7a98 RDI: 0000000000000000
[16851.131767] RBP: ffffb52300cfbe98 R08: 0000000000000101 R09: 0000000000000018
[16851.131771] R10: ffffb52300cfbe30 R11: 00000000000007ef R12: ffff89e2ffde2e00
[16851.131776] R13: ffffffffa0f81a58 R14: 00000f5375b4a4d6 R15: ffffffffa0f81a40
[16851.131779]  </IRQ>
[16851.131789]  cpuidle_enter+0x17/0x20
[16851.131797]  call_cpuidle+0x23/0x40
[16851.131803]  do_idle+0x189/0x1e0
[16851.131810]  cpu_startup_entry+0x71/0x80
[16851.131819]  start_secondary+0x154/0x190
[16851.131826]  secondary_startup_64+0x9f/0x9f
[16851.131831] Code: 8c 24 64 04 00 00 eb 8f 4c 89 e7 c6 05 7f 74 89 00 01 e8 bf 6a fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 b0 11 d2 a0 e8 92 43 a5 ff <0f> ff eb c1 0f 1f 80 00 00 00 00 66 66 66 66 90 48 c7 47 08 00 
[16851.131907] ---[ end trace 45d6353762f8cb8e ]---
[16851.135755] igb 0000:02:00.1 p4p2: Reset adapter
[16851.160112] igb 0000:02:00.0 p4p1: Reset adapter
[16851.163531] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[16853.283967] igb 0000:02:00.0: exceed max 2 second
[16853.284227] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <3>
                 TDH                  <8>
                 TDT                  <c>
                 next_to_use          <c>
                 next_to_clean        <8>
               buffer_info[next_to_clean]
                 time_stamp           <100fc8985>
                 next_to_watch        <ffff89e2f12f3090>
                 jiffies              <100fc9068>
                 desc.status          <17a8200>
[16853.284331] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <1>
                 TDH                  <2>
                 TDT                  <4>
                 next_to_use          <4>
                 next_to_clean        <2>
               buffer_info[next_to_clean]
                 time_stamp           <100fc8987>
                 next_to_watch        <ffff89e2e1e2d030>
                 jiffies              <100fc9068>
                 desc.status          <1758200>
[16853.284767] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Down
[16853.284783] igb 0000:02:00.0 p4p1: Reset adapter
[16855.029771] igb 0000:02:00.1 p4p2: igb: p4p2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[16856.920001] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[16915.137059] igb 0000:02:00.1 p4p2: Reset adapter
[16915.161571] igb 0000:02:00.0 p4p1: Reset adapter
[16915.164810] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[16917.281476] igb 0000:02:00.0: exceed max 2 second
[16917.281710] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <1>
                 TDH                  <2>
                 TDT                  <4>
                 next_to_use          <4>
                 next_to_clean        <2>
               buffer_info[next_to_clean]
                 time_stamp           <100fd8440>
                 next_to_watch        <ffff89e2e1e2d030>
                 jiffies              <100fd8a60>
                 desc.status          <1758200>
[16917.281716] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <5>
                 TDH                  <4>
                 TDT                  <8>
                 next_to_use          <8>
                 next_to_clean        <4>
               buffer_info[next_to_clean]
                 time_stamp           <100fd8421>
                 next_to_watch        <ffff89e2e1dd1050>
                 jiffies              <100fd8a60>
                 desc.status          <17a8200>
[16917.281775] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <0>
                 TDH                  <c>
                 TDT                  <10>
                 next_to_use          <10>
                 next_to_clean        <c>
               buffer_info[next_to_clean]
                 time_stamp           <100fd8420>
                 next_to_watch        <ffff89e2f12750d0>
                 jiffies              <100fd8a60>
                 desc.status          <17a8200>
[16917.282279] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Down
[16917.282300] igb 0000:02:00.0 p4p1: Reset adapter
[16917.536247] CE: hpet2 increased min_delta_ns to 45258 nsec
[16919.040270] igb 0000:02:00.1 p4p2: igb: p4p2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[16920.919430] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[17172.200716] igb 0000:02:00.1 p4p2: Reset adapter
[17172.260263] igb 0000:02:00.0 p4p1: Reset adapter
[17172.262718] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[17174.431286] igb 0000:02:00.0: exceed max 2 second
[17174.432218] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <2>
                 TDH                  <4>
                 TDT                  <6>
                 next_to_use          <6>
                 next_to_clean        <4>
               buffer_info[next_to_clean]
                 time_stamp           <101016f0a>
                 next_to_watch        <ffff89e2e1efa050>
                 jiffies              <1010176c9>
                 desc.status          <17a8200>
[17174.432225] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <7>
                 TDH                  <2>
                 TDT                  <6>
                 next_to_use          <6>
                 next_to_clean        <2>
               buffer_info[next_to_clean]
                 time_stamp           <101016f06>
                 next_to_watch        <ffff89e2f4195030>
                 jiffies              <1010176c9>
                 desc.status          <17a8200>
[17174.432231] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <1>
                 TDH                  <6>
                 TDT                  <a>
                 next_to_use          <a>
                 next_to_clean        <6>
               buffer_info[next_to_clean]
                 time_stamp           <101016f07>
                 next_to_watch        <ffff89e2e1e2d070>
                 jiffies              <1010176c9>
                 desc.status          <1788200>
[17174.432280] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <4>
                 TDH                  <2>
                 TDT                  <6>
                 next_to_use          <6>
                 next_to_clean        <2>
               buffer_info[next_to_clean]
                 time_stamp           <101016f08>
                 next_to_watch        <ffff89e2f4181030>
                 jiffies              <1010176c9>
                 desc.status          <17a8200>
[17174.432914] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Down
[17174.432930] igb 0000:02:00.0 p4p1: Reset adapter
[17176.585167] igb 0000:02:00.1 p4p2: igb: p4p2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[17178.015141] igb 0000:02:00.0 p4p1: igb: p4p1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
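
The "Detected Tx Unit Hang" dumps above can be summarized with a small script. The following is a minimal sketch, not part of the original report: the function name `parse_tx_hangs`, the ring size of 256 descriptors, and the tick rate of 1000 Hz are assumptions for illustration and should be checked against the actual system (for example via `ethtool -g` and the kernel config). It assumes the queue index is printed in decimal and every other field in hexadecimal; for the queue values in this report (0 to 7) the reading is the same in either base.

```python
import re

# Field lines look like "  TDH                  <8>"; hex values have no "0x" prefix.
FIELD_RE = re.compile(
    r"^\s*(Tx Queue|TDH|TDT|time_stamp|jiffies)\s+<([0-9a-fA-F]+)>\s*$"
)
REQUIRED = {"Tx Queue", "TDH", "TDT", "time_stamp", "jiffies"}


def parse_tx_hangs(log_text, ring_size=256, hz=1000):
    """Summarize each 'Detected Tx Unit Hang' block found in log_text.

    ring_size and hz are assumptions (hypothetical defaults), not values
    taken from the report.
    """
    blocks = []
    current = None
    for line in log_text.splitlines():
        if "Detected Tx Unit Hang" in line:
            current = {}
            blocks.append(current)
            continue
        match = FIELD_RE.match(line)
        if current is None or match is None:
            continue
        name, value = match.groups()
        # Assumption: queue index is decimal, all other fields are hex.
        base = 10 if name == "Tx Queue" else 16
        current[name] = int(value, base)

    results = []
    for block in blocks:
        if not REQUIRED <= block.keys():
            continue  # skip truncated dumps
        age = block["jiffies"] - block["time_stamp"]
        results.append({
            "queue": block["Tx Queue"],
            # Descriptors handed to the NIC but not yet fetched (head..tail).
            "pending": (block["TDT"] - block["TDH"]) % ring_size,
            "age_jiffies": age,
            "age_seconds": age / hz,
        })
    return results


# First hang block from the log above, copied verbatim.
SAMPLE = """\
[16853.284227] igb 0000:02:00.0: Detected Tx Unit Hang
                 Tx Queue             <3>
                 TDH                  <8>
                 TDT                  <c>
                 next_to_use          <c>
                 next_to_clean        <8>
               buffer_info[next_to_clean]
                 time_stamp           <100fc8985>
                 next_to_watch        <ffff89e2f12f3090>
                 jiffies              <100fc9068>
                 desc.status          <17a8200>
"""

for hang in parse_tx_hangs(SAMPLE):
    print(hang)
```

For the first dump this reports queue 3 with 4 descriptors between TDH and TDT, and a gap of 0x6e3 = 1763 jiffies between `time_stamp` and `jiffies`, which would be about 1.76 seconds if the kernel runs at 1000 Hz.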

Comment 1 Michael Cronenworth 2017-09-02 14:36:32 UTC
Also filed upstream: https://sourceforge.net/p/e1000/bugs/574/

Comment 2 Michael Cronenworth 2018-04-10 15:39:20 UTC
Closing this. I RMA'd the original card, and the replacement has the same issue. The cause is either a driver bug or a hardware incompatibility with the Supermicro motherboard I'm using. I've pulled the card out of the server and gone back to the onboard chips. No issues now. Anyone want an I350 network card?

Comment 3 Bobby Hakimi 2018-07-25 21:22:18 UTC
I'm having the exact same issue. I even upgraded to kernel 4.14.57 and the problem persists; the 4.9 branch also has this issue. I've replaced the card multiple times with the same result. It seems to come up more quickly the more network traffic the box has.

Comment 4 Major Hayden 2018-08-09 13:10:55 UTC
I'm seeing the same issue on 4.17.11-200 right now with Fedora 28.

Comment 5 Bobby Hakimi 2018-08-10 06:20:00 UTC
It looks like this may be caused by a CPU spike from our application: when the CPU hits 100%, the driver breaks.
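
One way to check a hypothesis like this is to record CPU utilization over time and compare it with the timestamps of the "Reset adapter" messages. The sketch below is only an illustration under stated assumptions: the function name `cpu_busy_fraction` is hypothetical, and the two input strings are made-up samples in the format of the aggregate `cpu` line of `/proc/stat`, not measurements from this bug. It would show correlation at best, not causation.

```python
def cpu_busy_fraction(before, after):
    """Return the busy fraction between two aggregate 'cpu' lines.

    Each argument is the first line of /proc/stat, e.g.
    "cpu  user nice system idle iowait irq softirq steal guest guest_nice".
    Idle time is taken as idle + iowait; guest time is already counted
    inside user/nice, so only the first eight counters are summed.
    """
    def totals(line):
        fields = [int(x) for x in line.split()[1:9]]
        idle = fields[3] + fields[4]
        return sum(fields), idle

    total0, idle0 = totals(before)
    total1, idle1 = totals(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0  # no ticks elapsed between the two samples
    return 1.0 - (idle1 - idle0) / elapsed


# Made-up samples for illustration only (hypothetical numbers).
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  150 0 50 850 50 0 0 0 0 0"
print(cpu_busy_fraction(sample_before, sample_after))
```

On a live system the two lines could be read from `/proc/stat` a fixed interval apart and logged with a wall-clock timestamp, so that spikes can be lined up against the kernel log.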

Comment 6 Michael Cronenworth 2018-08-10 18:18:25 UTC
I recommend you guys post to netdev, the Linux kernel mailing list for network devices. There is a more active community there. Kernel developers never reply to these Bugzilla bugs.

