Bug 1442638 - igb ... PCIe link lost, device now detached, but reloading the igb module fixes things
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-04-16 21:59 UTC by Richard W.M. Jones
Modified: 2024-02-11 16:05 UTC (History)

Type: Bug


Attachments (Terms of Use)
dmesg (91.89 KB, text/plain)
2019-11-05 18:16 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2017-04-16 21:59:06 UTC
Description of problem:

At boot, the Intel igb card fails with:

[   35.883590] igb 0000:04:00.0 enp4s0: PCIe link lost, device now detached
[   35.891333] br0: port 1(enp4s0) entered blocking state
[   35.891338] br0: port 1(enp4s0) entered disabled state
[   35.891645] device enp4s0 entered promiscuous mode
[   35.904155] igb 0000:04:00.0 enp4s0: failed to initialize vlan filtering on this port
[   35.915012] br0: port 1(enp4s0) entered blocking state
[   35.915017] br0: port 1(enp4s0) entered disabled state
[   35.931059] igb 0000:04:00.0 enp4s0: failed to initialize vlan filtering on this port

It was suggested to me that this indicates a hardware failure.
However this is unlikely, as simply reloading the igb module
fixes the problem.  I now have a script which does this after boot:

modprobe -r igb
sleep 1
modprobe igb
sleep 1
systemctl restart network

So it looks much more likely that the driver is just broken.
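
The workaround script above reloads the module unconditionally. Below is a minimal sketch of a variant that reloads only when the kernel log actually shows the failure. It assumes the message text stays exactly as quoted in this report. The function names link_lost and reload_driver are hypothetical, and "systemctl restart network" is copied from the script above, so it may need to be a different service on other setups.

```shell
#!/bin/sh
# Sketch only: link_lost and reload_driver are hypothetical helper names.

# link_lost DRIVER: read kernel log text on stdin and succeed if DRIVER
# logged "PCIe link lost" for any device. The "(^|] )" prefix accepts
# lines with or without a dmesg timestamp such as "[   35.883590] ".
link_lost() {
    grep -Eq "(^|] )$1 .*PCIe link lost"
}

# reload_driver DRIVER: same steps as the manual workaround; needs root.
reload_driver() {
    modprobe -r "$1" &&
    sleep 1 &&
    modprobe "$1" &&
    sleep 1 &&
    systemctl restart network
}

# Intended use after boot (not executed here):
#   dmesg | link_lost igb && reload_driver igb
```

The check is kept separate from the reload so that it can be run against a saved log without touching the driver.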

Version-Release number of selected component (if applicable):

Currently 4.11.0-0.rc4.git1.1.fc27.x86_64, but this has
been happening since I bought the machine a year ago.

How reproducible:

100%

Steps to Reproduce:
1. Boot.

Comment 1 Andrea Perotti 2019-11-05 15:44:43 UTC
Hi Richard, is that the only output you got, or do you also have a splat like:

[  471.537833] ------------[ cut here ]------------
[  471.537849] igb: Failed to read reg 0x8!
[  471.537904] WARNING: CPU: 1 PID: 9497 at drivers/net/ethernet/intel/igb/igb_main.c:756 igb_rd32.cold+0x30/0x3b [igb]
[...]
[  471.538638] Call Trace:
[  471.538654]  igb_get_link_ksettings+0x20/0x200 [igb]
[  471.538674]  duplex_show+0x6e/0xc0
[  471.538689]  dev_attr_show+0x19/0x40
[  471.538704]  sysfs_kf_seq_show+0x9b/0xf0
[  471.538720]  seq_read+0xcd/0x400
[  471.538734]  vfs_read+0x9d/0x150
[  471.538746]  ksys_read+0x5f/0xe0
[  471.538761]  do_syscall_64+0x5f/0x1a0
[  471.538776]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  471.538795] RIP: 0033:0x7ff5a09383c2
[  471.538808] Code: c0 e9 c2 fe ff ff 50 48 8d 3d c2 0d 0a 00 e8 b5 f1 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[  471.538862] RSP: 002b:00007ffe3e6fd9d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  471.538887] RAX: ffffffffffffffda RBX: 00000000021442e0 RCX: 00007ff5a09383c2
[  471.538910] RDX: 0000000000001000 RSI: 000000000215a350 RDI: 0000000000000004
[  471.538932] RBP: 00007ff5a0a0a300 R08: 0000000000000004 R09: 0000000000000070
[  471.538955] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000021442e0
[  471.538977] R13: 00007ff5a0a09700 R14: 0000000000000d68 R15: 0000000000000d68
[  471.539000] ---[ end trace 0aea06ceef9e275e ]---

Have you already had the opportunity to try kernel 5.3.7-301.fc31 without your workaround?
I found this commit, which touched that part of the code: 94bc1e522b32c866d85b5af0ede55026b585ae73
It may be relevant for you as well.
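
To check a saved log for this kind of warning without reading the whole splat, a small filter along these lines may help. It is only a sketch: the function name failed_reg_reads is hypothetical, and it assumes the warning keeps the "<driver>: Failed to read reg 0x...!" wording shown above.

```shell
#!/bin/sh
# Sketch only: failed_reg_reads is a hypothetical helper name.

# failed_reg_reads: read kernel log text on stdin and print
# "<driver> <register>" for every "Failed to read reg" line.
# The optional leading group skips a dmesg timestamp if present.
failed_reg_reads() {
    sed -n 's/^\(\[[^]]*\] \)\{0,1\}\([a-z0-9_]*\): Failed to read reg \(0x[0-9a-fA-F]*\)!.*$/\2 \3/p'
}

# Intended use (not executed here):
#   dmesg | failed_reg_reads
```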

Comment 2 Richard W.M. Jones 2019-11-05 18:15:32 UTC
It still happens on this same hardware with every kernel I've tried since around 2016.
This machine is using the Rawhide kernel.  I don't know if there's something
particular about 5.3.7-301.fc31, but there is no improvement with the latest Rawhide
(5.4.0-0.rc6.git0.1.fc32.x86_64).  In case I missed something I will attach the
complete log.

Comment 3 Richard W.M. Jones 2019-11-05 18:16:13 UTC
Created attachment 1633038 [details]
dmesg

Comment 4 adrian14 2024-01-12 08:55:42 UTC
Relevant dmesg output:

[31370.350858] ------------[ cut here ]------------
[31370.350859] igc: Failed to read reg 0xc030!
[31370.350888] WARNING: CPU: 1 PID: 76852 at drivers/net/ethernet/intel/igc/igc_main.c:6641 igc_rd32+0x8d/0xa0 [igc]
[31370.350897] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc vfat fat iwlmvm snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg mac80211 snd_usb_audio snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_usbmidi_lib snd_ump libarc4 edac_mce_amd snd_rawmidi snd_hda_core btusb mc xfs btrtl snd_hwdep iwlwifi kvm_amd btintel snd_seq btbcm asus_nb_wmi eeepc_wmi snd_seq_device asus_wmi btmtk ledtrig_audio snd_pcm kvm uas cfg80211 sparse_keymap bluetooth irqbypass snd_timer platform_profile usb_storage pcspkr rapl wmi_bmof joydev snd i2c_piix4 k10temp rfkill soundcore gpio_amdpt gpio_generic loop zram amdgpu i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni drm_exec polyval_generic
[31370.350949]  drm_suballoc_helper amdxcp nvme drm_buddy ghash_clmulni_intel gpu_sched sha512_ssse3 sha256_ssse3 sha1_ssse3 ccp nvme_core drm_display_helper sp5100_tco igc cec nvme_common video wmi ip6_tables ip_tables fuse
[31370.350964] CPU: 1 PID: 76852 Comm: kworker/1:0 Not tainted 6.6.9-200.fc39.x86_64 #1
[31370.350966] Hardware name: ASUS System Product Name/ROG STRIX X670E-E GAMING WIFI, BIOS 1709 09/28/2023
[31370.350968] Workqueue: events igc_watchdog_task [igc]
[31370.350974] RIP: 0010:igc_rd32+0x8d/0xa0 [igc]
[31370.350979] Code: 48 c7 c6 58 29 3c c0 e8 f1 5b 9c c4 48 8b bb 28 ff ff ff e8 b5 52 55 c4 84 c0 74 bc 89 ee 48 c7 c7 80 29 3c c0 e8 63 08 d7 c3 <0f> 0b eb aa b8 ff ff ff ff e9 15 83 c5 c4 0f 1f 44 00 00 90 90 90
[31370.350981] RSP: 0018:ffffc90021affdc8 EFLAGS: 00010286
[31370.350983] RAX: 0000000000000000 RBX: ffff88810fc3ccb8 RCX: 0000000000000027
[31370.350984] RDX: ffff88883e461588 RSI: 0000000000000001 RDI: ffff88883e461580
[31370.350985] RBP: 000000000000c030 R08: 0000000000000000 R09: ffffc90021affc50
[31370.350986] R10: 0000000000000003 R11: ffffffff86346508 R12: ffff88810fc3c000
[31370.350988] R13: 0000000000000000 R14: ffff88810bdb8d40 R15: 000000000000c030
[31370.350989] FS:  0000000000000000(0000) GS:ffff88883e440000(0000) knlGS:0000000000000000
[31370.350990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31370.350992] CR2: 00001ad003e86000 CR3: 0000000375222000 CR4: 0000000000f50ee0
[31370.350993] PKRU: 55555554
[31370.350994] Call Trace:
[31370.350996]  <TASK>
[31370.350997]  ? igc_rd32+0x8d/0xa0 [igc]
[31370.351003]  ? __warn+0x81/0x130
[31370.351008]  ? igc_rd32+0x8d/0xa0 [igc]
[31370.351015]  ? report_bug+0x171/0x1a0
[31370.351018]  ? prb_read_valid+0x1b/0x30
[31370.351021]  ? srso_alias_return_thunk+0x5/0x7f
[31370.351025]  ? handle_bug+0x3c/0x80
[31370.351027]  ? exc_invalid_op+0x17/0x70
[31370.351029]  ? asm_exc_invalid_op+0x1a/0x20
[31370.351034]  ? igc_rd32+0x8d/0xa0 [igc]
[31370.351039]  ? igc_rd32+0x8d/0xa0 [igc]
[31370.351044]  igc_update_stats+0x8a/0x6d0 [igc]
[31370.351050]  igc_watchdog_task+0x9d/0x4a0 [igc]
[31370.351056]  process_one_work+0x171/0x340
[31370.351060]  worker_thread+0x27b/0x3a0
[31370.351063]  ? __pfx_worker_thread+0x10/0x10
[31370.351064]  kthread+0xe5/0x120
[31370.351068]  ? __pfx_kthread+0x10/0x10
[31370.351070]  ret_from_fork+0x31/0x50
[31370.351074]  ? __pfx_kthread+0x10/0x10
[31370.351076]  ret_from_fork_asm+0x1b/0x30
[31370.351081]  </TASK>
[31370.351082] ---[ end trace 0000000000000000 ]---

On AMD Ryzen 7 7700 8-Core Processor running Fedora 39 (6.6.9-200.fc39.x86_64)

As said previously, reloading the driver (echo 1 > /sys/bus/pci/devices/<deviceId>/remove && sleep 1 && echo 1 > /sys/bus/pci/rescan) "fixes" the problem, and it does not seem to happen again afterwards (although it is seemingly random, so I cannot be 100% sure that it does not recur).
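
The remove-and-rescan sequence can be wrapped in a small helper, sketched below. The names pci_reset and SYSFS_PCI are hypothetical; the per-device "remove" file and the bus-wide "rescan" file are standard PCI sysfs entries. The sketch writes to the bus-wide /sys/bus/pci/rescan because the device's own sysfs directory disappears once the device has been removed. The address 0000:04:00.0 in the comment is only an example taken from the original report.

```shell
#!/bin/sh
# Sketch only: pci_reset and SYSFS_PCI are hypothetical names.
# SYSFS_PCI can be overridden, e.g. to point at a test directory.
SYSFS_PCI=${SYSFS_PCI:-/sys/bus/pci}

# pci_reset ADDR: detach the PCI device at ADDR, wait a second, then ask
# the PCI core to rescan the bus so the device is probed again.
pci_reset() {
    dev="$SYSFS_PCI/devices/$1"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $1" >&2
        return 1
    fi
    echo 1 > "$dev/remove" || return 1
    sleep 1
    echo 1 > "$SYSFS_PCI/rescan"
}

# Intended use as root (not executed here):
#   pci_reset 0000:04:00.0
```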

Comment 5 adrian14 2024-02-11 16:05:42 UTC
The vendor (ASUS in my case) has published new firmware, which seems to have resolved the issue.

