Bug 2095172 - Node validation fails on packet loss with two connected NICs
Summary: Node validation fails on packet loss with two connected NICs
Keywords:
Status: NEW
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ori Amizur
QA Contact: Chad Crum
URL:
Whiteboard:
Depends On: 2095173
Blocks:
Reported: 2022-06-09 08:07 UTC by Vitaly Grinberg
Modified: 2024-01-18 22:24 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github stolostron backlog issues 23076 0 None None None 2022-06-09 11:25:56 UTC
Red Hat Bugzilla 2095173 0 unspecified CLOSED Nmstatectl gc should not generate config for down/absent interfaces 2023-07-18 12:12:26 UTC
Red Hat Issue Tracker MGMTBUGSM-425 0 None None None 2022-06-09 08:21:58 UTC

Description Vitaly Grinberg 2022-06-09 08:07:45 UTC
# Description of the problem:
While installing a 3-node cluster on Supermicro servers, my nodes keep bouncing between "insufficient" and "ready". The nodes go to "insufficient" because of packet loss.

# Additional symptoms:
1. If 'agent.service' is stopped on the nodes, no packet loss occurs in manual tests.
2. If the tests are done with 'arping', no packet loss occurs, but replies are sometimes delayed:

[core@cnfdf12 ~]$ arping 10.8.34.31 -I eno1
ARPING 10.8.34.31 from 10.8.34.30 eno1
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.543ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  641.860ms
Unicast reply from 10.8.34.31 [3C:EC:EF:5F:E0:D6]  0.549ms
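
To make such delays easier to spot in longer runs, the arping output can be filtered for slow replies. This is only a rough sketch: the helper name slow_replies and the 100 ms threshold are assumptions for illustration, and it expects reply lines ending in a value with an "ms" suffix, as in the output above.

```shell
# Hypothetical helper: print arping reply lines slower than a threshold (ms).
# Reads arping output on stdin; assumes each reply line ends in "<time>ms".
slow_replies() {
    awk -v limit="$1" '
        /reply from/ {
            t = $NF                 # last field, e.g. "641.860ms"
            sub(/ms$/, "", t)       # strip the unit
            if (t + 0 > limit + 0) print
        }'
}

# Usage on a live host (addresses taken from this report):
#   arping -c 20 10.8.34.31 -I eno1 | slow_replies 100
```

With the three replies shown above and a threshold of 100, only the 641.860ms line would be printed.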

The problem seems to be associated with incorrect entries in the ARP table, for example:

[core@cnfdf12 ~]$ arp -a
cnfdf13.telco5gran.eng.rdu2.redhat.com (10.8.34.31) at 3c:ec:ef:5f:e0:d6 [ether] on eno1
hv6.telco5gran.eng.rdu2.redhat.com (10.8.34.25) at b8:ce:f6:44:19:5e [ether] on eno1
cnfdf14.telco5gran.eng.rdu2.redhat.com (10.8.34.32) at 3c:ec:ef:5f:5d:37 [ether] on eno2

(The third entry clearly points to the wrong interface.)
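
A quick way to check for such entries on each host is to filter the 'arp -a' output by interface. The helper below is a sketch, not part of any tool: the name arp_wrong_iface is made up, and it assumes every entry ends in "on <interface>" as in the output above.

```shell
# Hypothetical helper: print ARP entries learned on any interface other than
# the expected one. Reads `arp -a` style output on stdin.
arp_wrong_iface() {
    awk -v want="$1" '
        # Entry lines end in "on <iface>"; print IP, MAC and interface.
        $(NF - 1) == "on" && $NF != want { print $2, $4, $NF }'
}

# Usage on a live host (assumes eno1 is the intended interface):
#   arp -a | arp_wrong_iface eno1
```

For the table above this would report only the 10.8.34.32 entry on eno2.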

# Workaround (on each host):
1. Manually bring down all interfaces except the relevant one (eno1):
[core@cnfdf13 ~]$ sudo ip link set eno2 down
[core@cnfdf13 ~]$ sudo ip link set ens2f0 down
[core@cnfdf13 ~]$ sudo ip link set eth3 down
[core@cnfdf13 ~]$ sudo ip link set eth4 down
[core@cnfdf13 ~]$ sudo ip link set ens2f1 down
[core@cnfdf13 ~]$ sudo ip link set ens1f1 down
[core@cnfdf13 ~]$ sudo ip link set ens1f0 down
2. Flush the ARP table:
[core@cnfdf13 ~]$ sudo ip -s -s neigh flush all
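
Since the interface names differ between hosts, the workaround above could be generated rather than typed by hand. The following is a dry-run sketch only: the function names other_nics and print_workaround are made up, and it assumes that physical NICs are the entries under /sys/class/net that have a "device" link. It prints the commands instead of running them, so they can be reviewed first.

```shell
# Hypothetical helper: list every physical NIC except the one to keep.
# Assumes physical NICs have a "device" entry under /sys/class/net/<name>.
other_nics() {
    keep="$1"
    root="${2:-/sys/class/net}"
    for path in "$root"/*; do
        [ -e "$path/device" ] || continue   # skip lo, bridges, other virtual devices
        name=$(basename "$path")
        [ "$name" = "$keep" ] || echo "$name"
    done
}

# Dry run: print the workaround commands for review instead of executing them.
print_workaround() {
    for nic in $(other_nics "$1" "${2:-/sys/class/net}"); do
        echo "sudo ip link set $nic down"
    done
    echo "sudo ip -s -s neigh flush all"
}

# Usage (assumes eno1 is the interface to keep):
#   print_workaround eno1
```

Once the printed commands look right, they can be pasted into the shell or piped to 'sh'.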

Versions:
Server Version: 4.9.21
Kubernetes Version: v1.22.3+fdba464
ACM version: 2.4.4
Hardware Info:
Manufacturer: Supermicro
Product Name: Super Server


Steps to reproduce:
1. Try installing a three-node cluster with the above servers. The servers must have several NICs connected to the same network.

Actual results:
Validation fails due to packet loss

Expected results:
Installation proceeds

Additional info:

Comment 1 Ori Amizur 2022-07-06 08:54:25 UTC
Depends on https://bugzilla.redhat.com/show_bug.cgi?id=2095173

