Bug 1476266

Summary: team device and vanishing eth interface(s) on Dell r815
Product: Red Hat Enterprise Linux 7
Component: NetworkManager
Version: 7.3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Target Milestone: rc
Type: Bug
Reporter: lejeczek <peljasz>
Assignee: sushil kulkarni <sukulkar>
QA Contact: Desktop QE <desktop-qa-list>
CC: atragler, bgalvani, fgiudici, lrintel, peljasz, rkhan, sukulkar, thaller
Last Closed: 2017-09-06 09:51:51 UTC

Description lejeczek 2017-07-28 12:57:19 UTC
Description of problem:

First and foremost: I'm seeing this with elrepo's kernel-lt 4.4.7x.
If this is not the right place for the bug, please push it upstream or let me know how I should do it.

I have a network team device, a fairly typical one I think, with two slaves.

team.config:                            {"runner": {"name": "lacp", "active": true, "fast_rate": true, "tx_hash": ["eth", "ipv4", "ipv6"]}, "link_watch": {"name": "ethtool"}}

After a reboot one of the slaves vanishes, although it is there at boot time:

[    4.021684] tg3 0000:06:00.0 p2p1: renamed from eth8

but soon after:

$ nmcli c u 172.24.154.202-slave-p2p1-3g9
Error: Connection activation failed: No suitable device found for this connection.

and there actually is no p2p1; it is gone. The other three eths (out of the four on the Broadcom Limited NetXtreme BCM5719) are still there.
The eth is also still there in kernel space:

$ lspci -vv -s 06:00.0
06:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
	Subsystem: Broadcom Limited Device 1904
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 33
	NUMA node: 0
	Region 0: Memory at e5840000 (64-bit, prefetchable) [size=64K]
	Region 2: Memory at e5850000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at e5860000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at ef2c0000 [disabled] [size=256K]
	Capabilities: [48] Power Management version 3

To fix this problem I have to delete all three connections (the net team device and both slaves) and reboot. The interface is then there again, and I create the connections anew. Then it works until the next reboot.

Comment 2 Beniamino Galvani 2017-07-28 13:15:20 UTC
If the interface is not there in the 'ip l' output, it's a driver/kernel problem.

Can you please attach the 'dmesg' output?
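
As a rough sketch of the check described above (not something posted in this report), the following helpers test whether the kernel still exposes a network interface by looking at sysfs. The function names and the SYSFS_NET / SYSFS_PCI override variables are hypothetical and exist only so the helpers can be exercised against a fake directory tree; on a real system they default to the standard sysfs paths.

```shell
# Hypothetical helpers, assuming a Linux sysfs mounted at /sys.

# Succeeds if the kernel currently has a network interface with this name.
iface_present() {
    [ -e "${SYSFS_NET:-/sys/class/net}/$1" ]
}

# Lists the network interfaces bound to a PCI function such as 0000:06:00.0.
# Prints nothing if the PCI device has no network interface registered.
pci_netdevs() {
    ls "${SYSFS_PCI:-/sys/bus/pci/devices}/$1/net" 2>/dev/null
}
```

For example, `iface_present p2p1 || echo "p2p1 missing"` followed by `pci_netdevs 0000:06:00.0` would show whether the PCI function seen by lspci still has a netdev attached.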

Comment 3 lejeczek 2017-07-28 16:04:12 UTC
I did not try 'ip l' (I cannot easily provoke the situation again as it's a production box), but ethtool did not see the device.
I already put a snippet of the dmesg there:

[    4.021684] tg3 0000:06:00.0 p2p1: renamed from eth8

I could not see anything unusual or alarming in the dmesg output.

So the device was there at boot and seems to have vanished at some point soon after.

When it happens again I'll include more output.

Comment 4 lejeczek 2017-08-01 13:15:29 UTC
Why is it always the same net team connection? I have three in total, and they use the same physical network card.

Like I said earlier, when it has vanished it does not appear as a net iface, but it is still there in kernel space:

$ lspci -vvv -s 06:00.0
06:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
	Subsystem: Broadcom Limited Device 1904
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 33
	NUMA node: 0
	Region 0: Memory at e5840000 (64-bit, prefetchable) [size=64K]
	Region 2: Memory at e5850000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at e5860000 (64-bit, prefetchable) [size=64K]
	Expansion ROM at ef2c0000 [disabled] [size=256K]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: Broadcom NetXtreme Gigabit Ethernet
		Read-only fields:



$ dmesg | grep -B5 -A5 -i eth 
[    2.947091] usb 3-2: Manufacturer: Avocent
[    2.947093] usb 3-2: SerialNumber: 20120430
[    2.966479] FUJITSU Extended Socket Network Device Driver - version 1.0 - Copyright (c) 2015 FUJITSU LIMITED
[    2.971038] usb 2-2.3: new high-speed USB device number 3 using ehci-pci
[    2.971738] input: Avocent USB Composite Device-0 as /devices/pci0000:00/0000:00:12.0/usb3/3-2/3-2:1.0/0003:0624:0248.0001/input/input4
[    2.977631] bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
[    2.977765] ACPI: PCI Interrupt Link [LN24] enabled at IRQ 24
[    2.979595] bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem e6000000, IRQ 44, node addr f0:4d:a2:40:c1:d2
[    2.980117] ACPI: PCI Interrupt Link [LN25] enabled at IRQ 25
[    2.982690] bnx2 0000:01:00.1 eth1: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem e8000000, IRQ 45, node addr f0:4d:a2:40:c1:d4
[    2.983019] ACPI: PCI Interrupt Link [LN28] enabled at IRQ 28
[    2.983206] pps_core: LinuxPPS API ver. 1 registered
[    2.983211] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti>
[    2.985117] bnx2 0000:02:00.0 eth2: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem ea000000, IRQ 46, node addr f0:4d:a2:40:c1:d6
[    2.985409] ACPI: PCI Interrupt Link [LN29] enabled at IRQ 29
[    2.988210] bnx2 0000:02:00.1 eth3: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem ec000000, IRQ 47, node addr f0:4d:a2:40:c1:d8
[    2.988731] ACPI: PCI Interrupt Link [LN58] enabled at IRQ 58
[    2.989786] megasas: 06.808.16.00-rc1
[    2.989848] PTP clock support registered
[    2.990427] bnx2 0000:23:00.0 eth4: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 48, node addr 00:10:18:96:4f:50
[    2.990829] ACPI: PCI Interrupt Link [LN59] enabled at IRQ 59
[    2.991260] megaraid_sas 0000:08:00.0: FW now in Ready state
[    2.991366] megaraid_sas 0000:08:00.0: firmware supports msix	: (0)
[    2.991369] megaraid_sas 0000:08:00.0: current msix/online cpus	: (1/64)
[    2.992257] bnx2 0000:23:00.1 eth5: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem de000000, IRQ 49, node addr 00:10:18:96:4f:52
[    2.992386] ACPI: PCI Interrupt Link [LN56] enabled at IRQ 56
[    2.993313] bnx2 0000:24:00.0 eth6: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem e0000000, IRQ 51, node addr 00:10:18:96:4f:54
[    2.993389] mpt3sas version 09.102.00.00 loaded
[    2.993414] ACPI: PCI Interrupt Link [LN57] enabled at IRQ 57
[    2.993827] mpt3sas 0000:05:00.0: can't disable ASPM; OS doesn't have ASPM control
[    2.994319] mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (132002732 kB)
[    2.994800] bnx2 0000:24:00.1 eth7: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem e2000000, IRQ 52, node addr 00:10:18:96:4f:56
[    2.998489] tg3.c:v3.137 (May 11, 2014)
[    2.999826] mpt2sas_cm0: MSI-X vectors supported: 1, no of cores: 64, max_msix_vectors: -1
[    2.999914] mpt2sas0-msix0: PCI-MSI-X enabled: IRQ 53
[    2.999916] mpt2sas_cm0: iomem(0x00000000ef1f0000), mapped(0xffffc9000d340000), size(65536)
[    2.999918] mpt2sas_cm0: ioport(0x000000000000fc00), size(256)
--
[    3.023813] ata2: SATA max UDMA/133 abar m1024@0xef6ff800 port 0xef6ff980 irq 22
[    3.023863] ata3: SATA max UDMA/133 abar m1024@0xef6ff800 port 0xef6ffa00 irq 22
[    3.023908] ata4: SATA max UDMA/133 abar m1024@0xef6ff800 port 0xef6ffa80 irq 22
[    3.030388] input: Avocent USB Composite Device-0 as /devices/pci0000:00/0000:00:12.0/usb3/3-2/3-2:1.1/0003:0624:0248.0002/input/input5
[    3.030614] hid-generic 0003:0624:0248.0002: input,hidraw1: USB HID v1.00 Mouse [Avocent USB Composite Device-0] on usb-0000:00:12.0-2/input1
[    3.038776] tg3 0000:06:00.0 eth8: Tigon3 [partno(BCM95719) rev 5719001] (PCI Express) MAC address 00:0a:f7:7d:6b:58
[    3.038780] tg3 0000:06:00.0 eth8: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    3.038782] tg3 0000:06:00.0 eth8: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    3.038785] tg3 0000:06:00.0 eth8: dma_rwctrl[00000001] dma_mask[64-bit]
[    3.039141] ACPI: PCI Interrupt Link [LN46] enabled at IRQ 46
[    3.041988] [TTM] Zone  kernel: Available graphics memory: 66001366 kiB
[    3.041992] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    3.041994] [TTM] Initializing pool allocator
[    3.042003] [TTM] Initializing DMA pool allocator
[    3.051019] tg3 0000:06:00.1 eth9: Tigon3 [partno(BCM95719) rev 5719001] (PCI Express) MAC address 00:0a:f7:7d:6b:59
[    3.051024] tg3 0000:06:00.1 eth9: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    3.051026] tg3 0000:06:00.1 eth9: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    3.051029] tg3 0000:06:00.1 eth9: dma_rwctrl[00000001] dma_mask[64-bit]
[    3.062588] tg3 0000:06:00.2 eth10: Tigon3 [partno(BCM95719) rev 5719001] (PCI Express) MAC address 00:0a:f7:7d:6b:5a
[    3.062593] tg3 0000:06:00.2 eth10: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    3.062596] tg3 0000:06:00.2 eth10: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    3.062598] tg3 0000:06:00.2 eth10: dma_rwctrl[00000001] dma_mask[64-bit]
[    3.065140] usb 2-2.3: New USB device found, idVendor=090c, idProduct=1000
[    3.065143] usb 2-2.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    3.065146] usb 2-2.3: Product: Flash Drive FIT
[    3.065148] usb 2-2.3: Manufacturer: Samsung
[    3.065150] usb 2-2.3: SerialNumber: 0305115120014813
--
[    3.077869] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR)	: Enabled
[    3.077870] megaraid_sas 0000:08:00.0: Secure JBOD support	: No
[    3.077875] megaraid_sas 0000:08:00.0: megasas_init_mfi: fw_support_ieee=67108864
[    3.077881] megaraid_sas 0000:08:00.0: INIT adapter done
[    3.077883] megaraid_sas 0000:08:00.0: Jbod map is not supported megasas_setup_jbod_map 4610
[    3.083993] tg3 0000:06:00.3 eth11: Tigon3 [partno(BCM95719) rev 5719001] (PCI Express) MAC address 00:0a:f7:7d:6b:5b
[    3.083997] tg3 0000:06:00.3 eth11: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    3.084000] tg3 0000:06:00.3 eth11: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    3.084002] tg3 0000:06:00.3 eth11: dma_rwctrl[00000001] dma_mask[64-bit]
[    3.098024] tsc: Refined TSC clocksource calibration: 2300.026 MHz
[    3.098032] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x21274dd24f4, max_idle_ns: 440795262072 ns
[    3.108086] usb-storage 2-2.3:1.0: USB Mass Storage device detected
[    3.108560] scsi host7: usb-storage 2-2.3:1.0
[    3.109637] usbcore: registered new interface driver usb-storage
--
[    3.548243] sd 2:0:0:0: [sda] Write Protect is off
[    3.548245] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    3.548256] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.557127]  sda: sda1 sda2
[    3.574159] sd 2:0:0:0: [sda] Attached SCSI disk
[    3.608195] bnx2 0000:01:00.1 em2: renamed from eth1
[    3.654158] [drm] Initialized mgag200 1.0.0 20110418 for 0000:0a:03.0 on minor 0
[    3.676930] usb 4-1: new low-speed USB device number 2 using ohci-pci
[    3.703245] scsi 0:0:32:0: Enclosure         DP       BACKPLANE        1.07 PQ: 0 ANSI: 5
[    3.712889] bnx2 0000:24:00.1 p3p4: renamed from eth7
[    3.736399] bnx2 0000:24:00.0 p3p3: renamed from eth6
[    3.751462] tg3 0000:06:00.1 p2p2: renamed from eth9
[    3.772445] tg3 0000:06:00.0 p2p1: renamed from eth8
[    3.784392] tg3 0000:06:00.3 p2p4: renamed from eth11
[    3.808399] bnx2 0000:02:00.1 em4: renamed from eth3
[    3.829333] usb 4-1: New USB device found, idVendor=0624, idProduct=0294
[    3.829337] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    3.829340] usb 4-1: Product: Dell 03R874
[    3.829342] usb 4-1: Manufacturer: Avocent
[    3.832433] bnx2 0000:01:00.0 em1: renamed from eth0
[    3.843644] input: Avocent Dell 03R874 as /devices/pci0000:00/0000:00:12.1/usb4/4-1/4-1:1.0/0003:0624:0294.0003/input/input6
[    3.850470] bnx2 0000:23:00.1 p3p2: renamed from eth5
[    3.868495] bnx2 0000:23:00.0 p3p1: renamed from eth4
[    3.894384] hid-generic 0003:0624:0294.0003: input,hidraw2: USB HID v1.10 Keyboard [Avocent Dell 03R874] on usb-0000:00:12.1-1/input0
[    3.903779] input: Avocent Dell 03R874 as /devices/pci0000:00/0000:00:12.1/usb4/4-1/4-1:1.1/0003:0624:0294.0004/input/input7
[    3.954465] hid-generic 0003:0624:0294.0004: input,hidraw3: USB HID v1.10 Mouse [Avocent Dell 03R874] on usb-0000:00:12.1-1/input1
[    3.977508] scsi 0:2:0:0: Direct-Access     DELL     PERC H700        2.10 PQ: 0 ANSI: 5
[    3.978001] sd 0:2:0:0: [sdb] 19529728000 512-byte logical blocks: (10.00 TB/9.09 TiB)
[    3.978254] sd 0:2:0:0: [sdb] Write Protect is off
[    3.978258] sd 0:2:0:0: [sdb] Mode Sense: 1f 00 00 08
[    3.978426] sd 0:2:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.980129] sd 0:2:0:0: [sdb] Attached SCSI disk
[    4.014492] bnx2 0000:02:00.0 em3: renamed from eth2
[    4.038969] tg3 0000:06:00.2 p2p3: renamed from eth10
[    4.098921] clocksource: Switched to clocksource tsc
[    4.305370] mpt2sas_cm0: diag reset: SUCCESS
[    4.343463] mpt2sas_cm0: Allocated physical memory: size(5649 kB)
[    4.343465] mpt2sas_cm0: Current Controller Queue Depth(2508),Max Controller Queue Depth(2607)
[    4.343467] mpt2sas_cm0: Scatter Gather Elements per IO(128)
--
[    4.772522] sd 7:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.779158]  sdc: sdc1
[    4.783521] sd 7:0:0:0: [sdc] Attached SCSI removable disk
[    5.346945] mlx4_core 0000:09:00.0: PCIe link speed is 2.5GT/s, device supports 2.5GT/s
[    5.346948] mlx4_core 0000:09:00.0: PCIe link width is x8, device supports x8
[    5.523927] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
[    5.524039] mlx4_en 0000:09:00.0: UDP RSS is not supported on this device
[    5.987476] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x5782bcb015ef5300), phys(8)
[    8.367741] qla2xxx [0000:25:00.0]-00fb:5: QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA.
[    8.367755] qla2xxx [0000:25:00.0]-00fc:5: ISP2432: PCIe (2.5GT/s x4) @ 0000:25:00.0 hdma+ host#=5 fw=8.03.00 (9496).
[    8.367964] ACPI: PCI Interrupt Link [LN61] enabled at IRQ 61
--
[   16.880670] device-mapper: multipath: version 1.10.0 loaded
[   17.075991] acpi-cpufreq: overriding BIOS provided _PSD data
[   17.088554] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[   17.110269] ACPI Error: No handler for Region [IPMI] (ffff881c1f802120) [IPMI] (20150930/evregion-163)
[   17.110278] ACPI Error: Region IPMI (ID=7) has no handler (20150930/exfldio-297)
[   17.110283] ACPI Error: Method parse/execution failed [\_SB.PMI0._GHL] (Node ffff88141f803168), AE_NOT_EXIST (20150930/psparse-542)
[   17.110294] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMC] (Node ffff88141f8030c8), AE_NOT_EXIST (20150930/psparse-542)
[   17.110303] ACPI Exception: AE_NOT_EXIST, Evaluating _PMC (20150930/power_meter-755)
[   17.136724] ipmi message handler version 39.2
[   17.142904] IPMI System Interface driver.
[   17.143038] ipmi_si: probing via SMBIOS
[   17.143426] ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 0

Comment 5 lejeczek 2017-08-01 13:20:18 UTC
to be more specific:
One network card: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

$ nmcli c s
10.5.6.100                     db71a057-dc0b-4057-900b-9c1f6c6f5e5e  team            nm-team1 
10.5.6.100-slave-p3p1-2g1      724ea470-f735-4bc8-ad18-6b46d64db95b  802-3-ethernet  p3p1     
10.5.6.100-slave-p3p2-3g2      1fb939db-134f-4461-a6d4-a25ae02fa7d8  802-3-ethernet  p3p2  
...
xx.202                 387ed09d-f2de-4d68-8d23-03525adc15ba  team            --       
xx.202-slave-p2p1-3g9  15f8e42b-0087-44de-b92a-3eef7308a0b1  802-3-ethernet  -- p2p1 ->  vanished
xx.202-slave-p2p2-2g9  108333d6-1477-42dd-a8ff-49788c7e3401  802-3-ethernet  -- p2p2 -> exists

This is one card. No connection overlaps/shares devices with another.

Comment 6 lejeczek 2017-08-01 13:21:29 UTC
Oops, not 10.5.6.100.x but:

10.5.7.5                       17ba6305-a4bc-4eee-9c55-b812d9fafa9c  802-3-ethernet  p2p3     
10.5.7.6                       a9550bbe-f022-4496-9fdf-43db8e010b63  802-3-ethernet  p2p4

Comment 7 lejeczek 2017-08-01 13:52:51 UTC
Probably unrelated, but I thought it would not hurt to ask. Why this:

<info>  [1501595118.6202] device (p3p1): state change: disconnected -> prepare (reason 'none') [30 40 0]
<info>  [1501595118.6213] device (p3p2): state change: disconnected -> prepare (reason 'none') [30 40 0]
<info>  [1501595118.6229] device (p3p1): state change: prepare -> config (reason 'none') [40 50 0]
<info>  [1501595118.6254] device (p3p2): state change: prepare -> config (reason 'none') [40 50 0]
<info>  [1501595118.7853] device (p3p1): state change: config -> ip-config (reason 'none') [50 7
<info>  [1501595118.7868] dhcp4 (p3p1): activation: beginning transaction (timeout in 45 seconds)
<info>  [1501595118.8195] dhcp4 (p3p1): dhclient started with pid 2210676

It is dhclient working on those ifaces, right?
The above is a snippet from: $ sudo journalctl -lf -o cat -u NetworkManager

But why is dhclient even bothering if the master connection looks like this:
$ nmcli c s 10.5.6.100
connection.id:                          10.5.6.100
connection.uuid:                        db71a057-dc0b-4057-900b-9c1f6c6f5e5e
connection.stable-id:                   --
connection.interface-name:              nm-team1
....
connection.autoconnect:                 yes
connection.autoconnect-slaves:          1 (yes)
ipv4.method:                            manual

Comment 8 lejeczek 2017-08-01 13:56:37 UTC
ok, please ignore/delete my comment #7

Comment 9 lejeczek 2017-08-01 15:03:25 UTC
I think it's reproducible; do this:

set 802-3-ethernet.mac-address on the nm-team device.

I've just removed it, after having used it all this time, and now this connection survived the first reboot.

Seems like a rather serious NM problem, no?

Comment 10 Beniamino Galvani 2017-08-03 18:38:03 UTC
Sorry, I did not understand how you determine that p2p1 is gone... From the 'ip link' output or in other ways?

If it's not present in 'ip link' output then it's a kernel issue and I expect that 'dmesg' should have traces of the interface disappearing.
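
A minimal sketch of how one might look for such traces, assuming the driver's kernel messages include the PCI address as the tg3 lines in this report do. The function name trace_device is hypothetical; it simply filters kernel log text on standard input for a fixed string.

```shell
# Hypothetical helper: print only the kernel log lines that mention the
# given PCI address (matched as a literal string, not a regex).
trace_device() {
    grep -F -- "$1"
}
```

Typical use would be `dmesg | trace_device 0000:06:00.0`, which keeps the probe and rename messages for that port and any later message the driver logs for it.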

Comment 11 lejeczek 2017-08-04 09:56:14 UTC
gee man, you have the dmesg output. From there you see that the device is present when the system boots: [    3.772445] tg3 0000:06:00.0 p2p1: renamed from eth8
and that is all you get about p2p1 from dmesg.

ifconfig, ethtool, ip: none of these see the device.

I'll try again: if the NM connection is deleted, and no NM team connection uses this device (I have not tried a regular ethernet connection type), then the device is still there after a reboot. The problem does not exist.

And again, this is one interface/port on a four-port Broadcom, so the same driver serves all four ifaces, right?

And what about my last comment? Did you try to reproduce it?

I think it's reproducible; do this:

set 802-3-ethernet.mac-address on an nm-team device.

When I removed the MAC I had put there (simply set it to ""), the nm-team connection survived the first reboot and the problem did not occur.

Comment 12 lejeczek 2017-08-07 15:37:39 UTC
I can confirm without any doubt that on my installations (kernel-lt from elrepo only) it is 802-3-ethernet.mac-address which, when set, causes the iface to vanish from the system and subsequently makes the nm-team device fail.

Comment 13 Beniamino Galvani 2017-08-07 16:42:06 UTC
(In reply to lejeczek from comment #12)
> I can confirm without any doubts, on my installations (only kernel-lt from
> elrepo) it is - 802-3-ethernet.mac-address - which when used, set, then
> causes iface to vanish from the system and subsequently fails nm-team device.

I tried to reproduce it without luck, but I'm not sure I understood the procedure correctly. Can you please paste the commands you use to create the team connection and the output of "ip a; nmcli d; nmcli c" before and after the problem? Also, if you perform other steps like restarting NM or anything else, please specify them.

Note that 802-3-ethernet.mac-address should not have any effect on a team connection, as it is only used to match physical devices. If you want to specify a MAC address to set on the team, you should use ethernet.cloned-mac-address, but this works starting from RHEL 7.4.
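
A hedged sketch of what that advice could look like on the command line. The function name use_cloned_mac is hypothetical, and, as noted above, setting ethernet.cloned-mac-address on a team is only expected to work from RHEL 7.4 onward; this has not been verified against the reporter's system.

```shell
# Hypothetical helper: stop using mac-address for device matching on a
# connection and request the desired MAC via cloned-mac-address instead.
# $1: connection name, $2: MAC address to set on the interface.
use_cloned_mac() {
    # Clear the matching property (an empty value unsets it).
    nmcli connection modify "$1" 802-3-ethernet.mac-address "" &&
    # Ask NetworkManager to set this MAC on the device at activation.
    nmcli connection modify "$1" ethernet.cloned-mac-address "$2"
}
```

It would be called as, for instance, `use_cloned_mac 10.5.6.100 "$MAC"`, followed by re-activating the connection.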

Comment 14 lejeczek 2017-08-08 08:16:06 UTC
I'll skip bits which I think are irrelevant.

$ nmcli c add type team con-name 10.5.6.100 autoconnect no
$ nmcli c add type team-slave ifname p3p3 master 10.5.6.100 con-name 10.5.6.100-p3p3 autoconnect no
$ nmcli c add type team-slave ifname p3p4 master 10.5.6.100 con-name 10.5.6.100-p3p4 autoconnect no

$ nmcli c m 10.5.6.100 team.config '{"runner": {"name": "lacp", "active": true, "fast_rate": true, "tx_hash": ["eth", "ipv4", "ipv6"]}, "link_watch": {"name": "ethtool"}}'

... then change autoconnect to yes, give it an IP address, and set ipv4.method=manual, ipv6.method=ignore.

$ nmcli c m 10.5.6.100 802-3-ethernet.mac-address $_aMAC # so this is master, and for master only.

!! So this might be where I'm doing something wrong. I read, and I'm pretty sure it was in the manual/docs, that 802-3-ethernet.mac-address is what one would use to tell which MAC the net-team connection should use when it identifies/broadcasts itself and resolves to an IP.
And it seems NM does this, because when the nm-team is configured as above, both slaves of that nm-team master have the same MAC (as I see with regular utilities from user space).

As _aMAC I'd usually use the lower address of the two interfaces, let's say
p2p1 => xx.7d:6b:58
p2p2 => xx.7d:6b:59
I would do:
nmcli c m 10.5.6.100 802-3-ethernet.mac-address xx.7d:6b:58
But I don't think the above matters, because now that 802-3-ethernet.mac-address is NOT set manually I see that NM chooses that address too.

When I said I could confirm, this is what I did: I just unset 802-3-ethernet.mac-address (without deleting the nm-team and slave connections as I was doing before) and rebooted the system, and the problem was gone.

Once again: 
system is Dell r815

$ ethtool -i p2p1
driver: tg3
version: 3.137
firmware-version: FFV7.10.64 bc 5719-v1.45
expansion-rom-version: 
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

And as mentioned earlier, I did not try the stock kernel; I see this problem with kernel-lt-4.4.7x from elrepo.

If I misunderstand/misuse 802-3-ethernet.mac-address then you have a test case where such misuse messes things up quite a bit.

But then I have a question: how does one choose the MAC for an nm-team? Which slave's MAC will NM use, and why?

Comment 15 Beniamino Galvani 2017-08-31 14:24:54 UTC
(In reply to lejeczek from comment #14)
> I'll skip bits which I think are irrelevant.
> 
> $ nmcli c add type team con-name 10.5.6.100 autoconnect no
> $ nmcli c add type team-slave ifname p3p3 master 10.5.6.100
> con-name-10.5.6.100-p3p3 autoconnect no
> $ nmcli c add type team-slave ifname p3p4 master 10.5.6.100
> con-name-10.5.6.100-p3p4 autoconnect no
> 
> $ nmcli c m 10.5.6.100 team.config '{"runner": {"name": "lacp", "active":
> true, "fast_rate": true, "tx_hash": ["eth", "ipv4", "ipv6"]}, "link_watch":
> {"name": "ethtool"}}'
> 
> .. then change autoconnect to yes, give it IP address, ipv4.method=manual,
> ipv6.method=ignore.
> 
> $ nmcli c m 10.5.6.100 802-3-ethernet.mac-address $_aMAC # so this is
> master, and for master only.
> 
> !! so here might be were I'm doing something wrong. I read, and I'm pretty
> sure that was manual/docs that: .. 802-3-ethernet.mac-address one would use
> to tell which MAC net-team connection should use when identifies/broadcasts
> itself and resolves to an IP. 
> And it seems NM does this, because when nm-team is configured as above then
> both slaves of that nm-team master have the same MAC. (as I see with regular
> utils from user space)
> 
> As _aMAC I'd usually use lower address of the two interfaces, lets say
> p2p1 => xx.7d:6b:58
> p2p2 => xx.7d:6b:59
> I would do:
> nmcli c m 10.5.6.100 802-3-ethernet.mac-address xx.7d:6b:58
> But, I don't think above matters because now when 802-3-ethernet.mac-address
> is NOT set manually I see that NM chooses that address too.
> 
> When I said I could confirm - I did: just removed, unset
> 802-3-ethernet.mac-address (without deleting nm-team & slave connections
> like as was doing before) and rebooted the system and the problem was gone.

Hi,

can you please also provide the output of command "ip a; nmcli d; nmcli c" before and after the problem as requested in comment 13? Thanks
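
One possible way to capture that output, sketched under the assumption that ip and nmcli are on the PATH. The function name snapshot and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: write the output of the three requested commands
# into one file, each section preceded by a header line.
# $1: file to write
snapshot() {
    {
        echo "### ip a";    ip a
        echo "### nmcli d"; nmcli d
        echo "### nmcli c"; nmcli c
    } > "$1" 2>&1
}
```

Running `snapshot before.txt` while the interface is present and `snapshot after.txt` once it has vanished would allow a direct comparison with `diff -u before.txt after.txt`.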

Comment 16 lejeczek 2017-09-06 09:09:57 UTC
Hi, I'm sorry, I cannot tamper with it any longer; the only systems I could do this on are now in a live production environment.

But I'm honestly surprised the Red Hat team cannot reproduce the case. It should be simple, as described earlier:

a) Network card: I use a four-port Broadcom Limited NetXtreme BCM5719, so one kernel driver serves all ports; since only one iface/eth disappears, I take that as confirmation that it's not the driver.

b) Create a net-team connection with just two ifaces (I did not test net-team with more than two ports) and configure it with:
team.config '{"runner": {"name": "lacp", "active": true, "fast_rate": true, "tx_hash": ["eth", "ipv4", "ipv6"]}, "link_watch": {"name": "ethtool"}}'

c) Then, on that net-team connection, set 802-3-ethernet.mac-address to the MAC of one of the two ifaces/eths that make up this net-team.

d) reboot

I only set 802-3-ethernet.mac-address manually because I thought this would give me the MAC of my choice on the net-team connection. I realize now, as you said, that this is not what the option does and that there will be a new option for that in newer NM versions. Great.

I still think the problem I describe should be solved, as it might leave users/admins stranded if, like me, they misinterpret the 802-3-ethernet.mac-address option. Eth devices vanishing from the OS after boot is certainly a frustrating thing.

But feel free to close this bug if you cannot reproduce this problem case.

many thanks.

Comment 17 lejeczek 2017-09-06 09:12:45 UTC
I forgot...

e) kernel-lt from elrepo; I did not try the default CentOS 3.10.x kernels

Comment 18 Thomas Haller 2017-09-06 09:51:51 UTC
Hi,

Are you aware that for NetworkManager there is an important distinction between

  - a device, that is, a networking interface that you see in `ip link` and that is
    known to the kernel. Bluetooth and WWAN modems are also a kind of device.
    Check `nmcli device`.
  - a connection, that is, a profile: a bunch of settings that are applied when
    activating the connection on a device.
    Check `nmcli connection`.


802-3-ethernet.mac-address is probably not something you want to set. If you want to configure the MAC address of the team interface, set ethernet.cloned-mac-address instead (in RHEL-7.4/nm-1-8).


>  $ nmcli c u 172.24.154.202-slave-p2p1-3g9
>  Error: Connection activation failed: No suitable device found for this 
>  connection.

Well, you don't show the details of this connection (`nmcli connection show "$NAME"`), nor the output of `ip link` as requested. Presumably the connection matches no existing device. Configure the connection accordingly, and take special note of the "connection.interface-name" and "ethernet.mac-address" properties, which affect how the device is matched.
And what's the output of `nmcli device` at that point?



It is very unclear what is going on here. I am closing this bug as of comment 16. Please reopen if the issue still exists and you can provide a better description of what is going on.


Thank you.

Comment 19 lejeczek 2017-09-06 13:11:38 UTC
gee man, we are wasting valuable time here; why you did not try to reproduce this simple case is beyond me.
Let's close this bug.
thanks.