Bug 2226912

Summary: Solarflare SFN8522 adapter loses physical link if tuned is started with profile powersave
Product: [Fedora] Fedora
Reporter: Trevor Hemsley <trevor.hemsley>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: 38
CC: jskarvad, jzerdik, olysonek-foss
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   

Description Trevor Hemsley 2023-07-27 00:21:46 UTC
I recently swapped network cards, from a dual-port Solarflare SFN6122F (which uses the sfc_siena driver) to a Solarflare SFN8522 (which uses sfc). I shut down, swapped the cards, rebooted, and have just spent the last 3 hours working out why my network card no longer has a link detected for more than a few seconds after power-on. All cables are the same between the two adapter cards, and I have also tried different cables.

kernel-6.4.4-200.fc38.x86_64
tuned-2.20.0-1.fc38.noarch

This is possibly a tuned bug, or maybe a kernel bug. I thought I'd start with tuned, as not running it fixes the immediate problem.

Reproducible: Always

Steps to Reproduce:
1. Install Solarflare SFN8522 and connect using Direct Attach cable to a switch or to another machine (I do both)
2. Install tuned and set to use profile 'powersave'
3. Reboot if needed to activate tuned

Actual Results:
After a cold boot (power on) the system comes up as normal: link is detected and the connection is established, but after anywhere between 1 second and about 1 minute the link goes away. Running `ethtool enp9s0f0np0` then shows "Link detected: no". Syslog shows:

Jul 26 21:48:29 trevor4 kernel: [    5.263140] sfc 0000:09:00.0: Solarflare NIC detected
Jul 26 21:48:29 trevor4 kernel: [    5.269260] sfc 0000:09:00.0: Part Number : SFN8522
Jul 26 21:48:29 trevor4 kernel: [    5.498217] sfc 0000:09:00.1: Solarflare NIC detected
Jul 26 21:48:29 trevor4 kernel: [    5.501776] sfc 0000:09:00.1: Part Number : SFN8522
Jul 26 21:48:29 trevor4 kernel: [    5.522425] sfc 0000:09:00.0 enp9s0f0np0: renamed from eth0
Jul 26 21:48:29 trevor4 kernel: [    5.617192] sfc 0000:09:00.1 enp9s0f1np1: renamed from eth1
Jul 26 21:48:27 trevor4 kernel: sfc 0000:09:00.0: Solarflare NIC detected
Jul 26 21:48:27 trevor4 kernel: sfc 0000:09:00.0: Part Number : SFN8522
Jul 26 21:48:27 trevor4 kernel: sfc 0000:09:00.1: Solarflare NIC detected
Jul 26 21:48:27 trevor4 kernel: sfc 0000:09:00.1: Part Number : SFN8522
Jul 26 21:48:27 trevor4 kernel: sfc 0000:09:00.0 enp9s0f0np0: renamed from eth0
Jul 26 21:48:28 trevor4 kernel: sfc 0000:09:00.1 enp9s0f1np1: renamed from eth1
Jul 26 21:48:29 trevor4 kernel: [    7.167012] sfc 0000:09:00.0 enp9s0f0np0: link up at 10000Mbps full-duplex (MTU 1500)
Jul 26 21:48:29 trevor4 kernel: sfc 0000:09:00.0 enp9s0f0np0: link up at 10000Mbps full-duplex (MTU 1500)
Jul 26 21:48:29 trevor4 kernel: [    7.340488] sfc 0000:09:00.1 enp9s0f1np1: link up at 10000Mbps full-duplex (MTU 1500)
Jul 26 21:48:29 trevor4 kernel: sfc 0000:09:00.1 enp9s0f1np1: link up at 10000Mbps full-duplex (MTU 1500)
Jul 26 21:49:20 trevor4 kernel: [   57.446270] sfc 0000:09:00.0 enp9s0f0np0: link down
Jul 26 21:49:20 trevor4 kernel: [   57.446440] sfc 0000:09:00.0 enp9s0f0np0: link down
Jul 26 21:49:20 trevor4 kernel: [   57.488974] sfc 0000:09:00.1 enp9s0f1np1: link down
Jul 26 21:49:20 trevor4 kernel: [   57.489024] sfc 0000:09:00.1 enp9s0f1np1: link down
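
How long each port stayed up before dropping can be read off the kernel timestamps in the log above. A minimal Python sketch, assuming the bracketed `[seconds]` uptime prefix is present on the lines of interest (lines without it are ignored); the function name `seconds_link_stayed_up` is made up for illustration:

```python
import re

# Matches e.g.:
#   ... kernel: [   57.446270] sfc 0000:09:00.0 enp9s0f0np0: link down
# Only lines carrying the bracketed kernel uptime are considered.
LINK_RE = re.compile(
    r"kernel:\s+\[\s*(?P<uptime>\d+\.\d+)\]\s+sfc\s+\S+\s+"
    r"(?P<iface>\S+):\s+link\s+(?P<state>up|down)\b"
)

def seconds_link_stayed_up(lines):
    """Map each interface to the seconds between its first 'link up'
    and the first 'link down' that follows it."""
    first_up = {}
    stayed_up = {}
    for line in lines:
        m = LINK_RE.search(line)
        if m is None:
            continue
        iface, uptime = m["iface"], float(m["uptime"])
        if m["state"] == "up":
            # Remember only the first time the link came up.
            first_up.setdefault(iface, uptime)
        elif iface in first_up and iface not in stayed_up:
            # First 'link down' after the link had come up.
            stayed_up[iface] = uptime - first_up[iface]
    return stayed_up
```

Fed the lines above, this gives roughly 50.3 s for enp9s0f0np0 and 50.1 s for enp9s0f1np1, so both ports drop at essentially the same moment.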


# sfctool enp9s0f0np0
Settings for enp9s0f0np0:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseT/Full 
	                        1000baseX/Full 
	                        10000baseCR/Full 
	                        10000baseSR/Full 
	                        10000baseLR/Full 
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  1000baseT/Full 
	                        1000baseX/Full 
	                        10000baseCR/Full 
	                        10000baseSR/Full 
	                        10000baseLR/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: No
	Link partner advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 255
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x000020f7 (8439)
			       drv probe link ifdown ifup rx_err tx_err hw
	Link detected: no

At this point I found that the only reliable way to get "Link detected: yes" back was to cold boot the machine using the power button. Ctrl-Alt-Del sometimes seemed to work, but not reliably. This was repeatable on every boot: the link would come up and things would work for some time, never more than about one minute, and sometimes the link went away before I could even log in to ping things.

I have two of these cards, each installed in a different machine. The machine with tuned set to profile powersave exhibits the problem; the other, set to profile virtual-host, does not. I have swapped the SFN8522 cards between the two systems: both work in the virtual-host system and both fail in the powersave one.

After many, many reboots into single-user, multi-user, and emergency targets, activating the network manually with `ip` and then bringing up services one by one, I found that `systemctl mask tuned` stops this. I haven't experimented with tuned settings to see whether I can stop it doing whatever it is that breaks the link. I'm just thankful to have a working network connection again!
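
When starting services one by one, it helps to timestamp the exact moment the link drops. A minimal Python sketch, assuming the kernel's standard `/sys/class/net/<iface>/carrier` file (reads `1` when the link is up, `0` when down); the helper names `read_carrier` and `watch_carrier` are hypothetical and this was not part of the original debugging:

```python
import time
from pathlib import Path

def read_carrier(iface, sysfs_root="/sys/class/net"):
    """Return True/False for link up/down, or None if the state cannot be read."""
    try:
        text = (Path(sysfs_root) / iface / "carrier").read_text()
    except OSError:
        # The read fails if the interface does not exist or is
        # administratively down.
        return None
    return text.strip() == "1"

def watch_carrier(iface, interval=1.0, max_polls=None, sysfs_root="/sys/class/net"):
    """Yield (unix_time, state) whenever the carrier state changes.

    The first reading is always reported. Polls forever unless
    max_polls is given.
    """
    last = object()  # sentinel that never equals a real state
    polls = 0
    while max_polls is None or polls < max_polls:
        state = read_carrier(iface, sysfs_root)
        if state != last:
            yield time.time(), state
            last = state
        polls += 1
        time.sleep(interval)
```

Running `for when, up in watch_carrier("enp9s0f0np0"): print(when, up)` in one terminal while starting services in another would show which service start coincides with the link going down.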

Expected Results:  
Network connection works reliably without needing power off/on!

	Link detected: yes


While debugging this problem I used sfboot to reset the dual-port adapter to all default settings and upgraded to the latest Solarflare firmware:

    Firmware version:   v8.5.2
    Controller type:    Solarflare SFC9200 family
    Controller version: v8.5.0.1002
    Boot ROM version:   v5.2.2.1006
    UEFI ROM version:   v2.9.6.3

I actually use a profile called powersave-nodisk, which is defined as follows:

# cat /etc/tuned/powersave-nodisk/tuned.conf 
[main]
summary=Optimize for low power consumption but leave storage alone
include=powersave

[disk]
devices=sdz

I have no sdz; I just wanted tuned to stop powering down the two spinning-rust devices in my mdadm array.
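
To experiment with tuned settings, a first step would be listing which plugin sections a custom profile overrides on top of the profile it includes. A rough Python sketch, assuming the file is plain INI that the standard `configparser` module can read (tuned's own loader may behave differently); `profile_overrides` is a made-up name:

```python
import configparser

def profile_overrides(conf_text):
    """Return (included_profile, {plugin_section: {option: value}})
    for the text of a tuned.conf file."""
    # Interpolation is disabled so '%' or '$' in values is kept verbatim.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    included = parser.get("main", "include", fallback=None)
    plugins = {
        name: dict(parser[name])
        for name in parser.sections()
        if name != "main"
    }
    return included, plugins
```

For the powersave-nodisk file above this returns `powersave` as the included profile and a single override, `disk` with `devices` set to `sdz`, so every other plugin setting still comes from the stock powersave profile.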