2095172 – Node validation fails on packet loss with two connected NICs

Keywords:
Status:	NEW
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	Infrastructure Operator
Sub Component:
Version:	rhacm-2.4.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ori Amizur
QA Contact:	Chad Crum
Docs Contact:
URL:
Whiteboard:
Depends On:	2095173
Blocks:

Reported:	2022-06-09 08:07 UTC by Vitaly Grinberg
Modified:	2024-01-18 22:24 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	stolostron backlog issues 23076	None	None	None	2022-06-09 11:25:56 UTC
Red Hat Bugzilla	2095173	unspecified	CLOSED	Nmstatectl gc should not generate config for down/absent interfaces	2023-07-18 12:12:26 UTC
Red Hat Issue Tracker	MGMTBUGSM-425	None	None	None	2022-06-09 08:21:58 UTC

Description Vitaly Grinberg 2022-06-09 08:07:45 UTC

# Description of the problem:
While installing a 3-node cluster on Supermicro servers, my nodes keep bouncing between "insufficient" and "ready". The nodes go to "insufficient" because of packet loss.

# Additional symptoms:
1. If 'agent.service' is stopped on the nodes, no packet loss occurs in manual tests.
2. If the tests are done with 'arping', no packet loss occurs, but replies are sometimes delayed:

[core@cnfdf12 ~]$ arping 10.8.34.31 -I eno1
ARPING 10.8.34.31 from 10.8.34.30 eno1
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.543ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  641.860ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.549ms
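
To make delayed replies like the second one above easier to spot in longer runs, the 'arping' output can be filtered by latency. The following is only a rough sketch: the function name `slow_replies` and the 100 ms threshold are assumptions for illustration, and the regular expression is written against the reply format shown in this report.

```python
import re

# Matches reply lines in the format shown in this report, e.g.
# "Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.543ms"
REPLY_RE = re.compile(
    r"reply from (?P<ip>\S+) \[(?P<mac>[0-9A-Fa-f:]+)\]\s+(?P<ms>[0-9.]+)ms"
)


def slow_replies(arping_output, threshold_ms=100.0):
    """Return (ip, mac, latency_ms) for replies slower than threshold_ms.

    The threshold is an arbitrary illustrative value, not one taken from
    the agent's validation logic.
    """
    slow = []
    for line in arping_output.splitlines():
        match = REPLY_RE.search(line)
        if match is None:
            continue  # header line or anything that is not a reply
        latency = float(match.group("ms"))
        if latency > threshold_ms:
            slow.append((match.group("ip"), match.group("mac").lower(), latency))
    return slow


sample = """\
ARPING 10.8.34.31 from 10.8.34.30 eno1
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.543ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  641.860ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.549ms
"""

for ip, mac, latency in slow_replies(sample):
    print(f"slow ARP reply from {ip} [{mac}]: {latency} ms")
```

With the sample above, only the 641.860 ms reply is reported.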

The problem seems to be associated with garbage entries in the ARP table, for example:

[core@cnfdf12 ~]$ arp -a
cnfdf13.telco5gran.eng.rdu2.redhat.com (10.8.34.31) at 3c:ec:ef:5f:e0:d6 [ether] on eno1
hv6.telco5gran.eng.rdu2.redhat.com (10.8.34.25) at b8:ce:f6:44:19:5e [ether] on eno1
cnfdf14.telco5gran.eng.rdu2.redhat.com (10.8.34.32) at 3c:ec:ef:5f:5d:37 [ether] on eno2

(The third entry clearly points to the wrong interface.)
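
A check like this could be automated by parsing the 'arp -a' output and listing every neighbour learned on an interface other than the expected one. This is a minimal sketch under assumptions: the function name `entries_on_other_interfaces` is hypothetical, the expected interface (eno1) is the one used in this report, and the regular expression only covers resolved entries in the format shown above.

```python
import re

# Matches resolved "arp -a" lines in the format shown in this report, e.g.
# "host.example.com (10.8.34.32) at 3c:ec:ef:5f:5d:37 [ether] on eno2"
# Unresolved ("<incomplete>") entries do not match and are skipped.
ARP_RE = re.compile(
    r"\((?P<ip>[0-9.]+)\) at (?P<mac>[0-9A-Fa-f:]+) \[\w+\] on (?P<iface>\S+)"
)


def entries_on_other_interfaces(arp_output, expected_iface):
    """Return (ip, mac, iface) for entries not learned on expected_iface."""
    suspicious = []
    for line in arp_output.splitlines():
        match = ARP_RE.search(line)
        if match and match.group("iface") != expected_iface:
            suspicious.append(
                (match.group("ip"), match.group("mac").lower(), match.group("iface"))
            )
    return suspicious


sample = """\
cnfdf13.telco5gran.eng.rdu2.redhat.com (10.8.34.31) at 3c:ec:ef:5f:e0:d6 [ether] on eno1
hv6.telco5gran.eng.rdu2.redhat.com (10.8.34.25) at b8:ce:f6:44:19:5e [ether] on eno1
cnfdf14.telco5gran.eng.rdu2.redhat.com (10.8.34.32) at 3c:ec:ef:5f:5d:37 [ether] on eno2
"""

for ip, mac, iface in entries_on_other_interfaces(sample, "eno1"):
    print(f"{ip} ({mac}) was learned on {iface}, expected eno1")
```

For the table above, this reports only the 10.8.34.32 entry on eno2.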

# Workaround (on each host):
1. Manually switch off all interfaces except the relevant one (eno1):
[core@cnfdf13 ~]$ sudo ip link set eno2 down
[core@cnfdf13 ~]$ sudo ip link set ens2f0 down
[core@cnfdf13 ~]$ sudo ip link set eth3 down
[core@cnfdf13 ~]$ sudo ip link set eth4 down
[core@cnfdf13 ~]$ sudo ip link set ens2f1 down
[core@cnfdf13 ~]$ sudo ip link set ens1f1 down
[core@cnfdf13 ~]$ sudo ip link set ens1f0 down
2. Clean the ARP table:
[core@cnfdf13 ~]$ sudo ip -s -s neigh flush all
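
Since the workaround has to be repeated on every host, the command list can be generated from the interface names instead of being typed by hand. The sketch below only builds the commands and does not execute them; the function name `workaround_commands` is hypothetical, and skipping the loopback interface "lo" is an assumption that is not part of the steps above.

```python
def workaround_commands(interfaces, keep="eno1"):
    """Build the workaround commands for one host.

    Brings down every listed interface except `keep` (and "lo", which is
    skipped as an assumption), then flushes the neighbour table, using the
    same 'ip' invocations as the manual steps in this report.
    """
    commands = [
        ["ip", "link", "set", name, "down"]
        for name in interfaces
        if name not in (keep, "lo")
    ]
    commands.append(["ip", "-s", "-s", "neigh", "flush", "all"])
    return commands


# Interface names as they appear on cnfdf13 in this report.
host_interfaces = [
    "eno1", "eno2", "ens2f0", "eth3", "eth4", "ens2f1", "ens1f1", "ens1f0",
]

for command in workaround_commands(host_interfaces):
    # Printed for review only; running them requires root (e.g. via sudo).
    print("sudo " + " ".join(command))
```

Printing the commands first makes it possible to review them before running anything that takes interfaces down on a remote host.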

Versions:
Server Version: 4.9.21
Kubernetes Version: v1.22.3+fdba464
ACM version: 2.4.4
Hardware Info:
Manufacturer: Supermicro
Product Name: Super Server


Steps to reproduce:
1. Try installing a three-node cluster with the above servers. The servers must have several NICs connected to the same network.

Actual results:
Validation fails due to packet loss

Expected results:
Installation proceeds

Additional info:

Comment 1 Ori Amizur 2022-07-06 08:54:25 UTC

Depends on https://bugzilla.redhat.com/show_bug.cgi?id=2095173
