
Bug 2068303

Summary: MetalLB (layer2 with OpenShift SDN) stops working with multiple NICs on external network
Product: OpenShift Container Platform
Component: Networking
Networking sub component: Metal LB
Version: 4.10
Reporter: Ian Pilcher <ipilcher>
Assignee: Periyasamy Palanisamy <pepalani>
QA Contact: Arti Sood <asood>
CC: bschmaus, ddharwar, elevin, fpaoline, pepalani, vlaad
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-08-26 15:33:06 UTC
Attachments (no flags set):
  Logs from controller container in MetalLB controller pod
  Logs from kube-rbac-proxy container in MetalLB controller pod
  Logs from frr container in MetalLB speaker pod
  Logs from frr-metrics container in MetalLB speaker pod
  Logs from kube-rbac-proxy container in MetalLB speaker pod
  Logs from kube-rbac-proxy-frr container in MetalLB speaker pod
  Logs from reloader container in MetalLB speaker pod
  Logs from speaker container in MetalLB speaker pod

Description Ian Pilcher 2022-03-24 21:13:38 UTC
* Bare metal IPI installation
* Each node has 2 NICs connected to the external/baremetal network - enp2s0 and enp3s0
* Node IPs assigned to enp2s0 on each node
* MetalLB deployed in layer 2 mode

The MetalLB load balancer works when first created: the load balancer IP can be pinged, the service can be reached, etc.  Some time later, the MetalLB IP stops working; it can no longer be pinged or connected to.  Disconnecting enp3s0 from the network (link down) restores functionality.
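
For context, the load balancer here is an ordinary Service of type LoadBalancer whose external IP MetalLB assigns from its layer 2 address pool.  Below is a minimal sketch of creating such a service with the Kubernetes Python client, assuming a kubeconfig is available and a workload with the hypothetical label app=demo exists:

#!/usr/bin/env python3
# Illustrative sketch: create a Service of type LoadBalancer and read back
# the external IP that MetalLB assigns to it.  Assumes the "kubernetes"
# Python client is installed, a kubeconfig is available, and a workload
# labelled app=demo exists (name and selector are hypothetical).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="demo-lb"),            # hypothetical name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "demo"},                            # hypothetical selector
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
v1.create_namespaced_service(namespace="default", body=svc)

# MetalLB assigns the external IP asynchronously, so a single read may
# still show no IP; re-read until status.load_balancer.ingress is set.
created = v1.read_namespaced_service("demo-lb", "default")
ingress = created.status.load_balancer.ingress
print(ingress[0].ip if ingress else "no external IP assigned yet")

In the environment described here the assigned IP is 192.168.123.23, which is the address that later becomes unreachable.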

Some details:

* Environment is KVM VMs (using vBMC for IPI).  Virtual network is a Linux bridge.

* Load balancer IP is 192.168.123.23

* Load balancer IP is assigned to ocp4-worker3
  - enp2s0 (52:54:00:00:00:06) w/ node IP 192.168.123.106
  - enp3s0 (52:54:00:00:01:06) "unused" (but connected) interface

* Provisioning node has IP 192.168.123.100 & MAC 52:54:00:00:00:00

After deleting the ARP entry for 192.168.123.23 and attempting to ping that IP from the provisioner, tcpdump (on the hypervisor) shows:

13:18:31.132696 52:54:00:00:00:00 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.123.23 tell 192.168.123.100, length 28
13:18:31.132962 52:54:00:00:00:06 > 52:54:00:00:00:00, ethertype ARP (0x0806), length 60: Reply 192.168.123.23 is-at 52:54:00:00:00:06, length 46
13:18:31.133089 52:54:00:00:00:00 > 52:54:00:00:00:06, ethertype IPv4 (0x0800), length 98: 192.168.123.100 > 192.168.123.23: ICMP echo request, id 11, seq 1, length 64
13:18:31.133092 52:54:00:00:01:06 > 52:54:00:00:00:00, ethertype ARP (0x0806), length 60: Reply 192.168.123.23 is-at 52:54:00:00:01:06, length 46
13:18:31.133348 52:54:00:00:00:06 > 52:54:00:00:01:06, ethertype IPv4 (0x0800), length 98: 192.168.123.100 > 192.168.123.23: ICMP echo request, id 11, seq 1, length 64
13:18:36.436691 52:54:00:00:00:06 > 52:54:00:00:01:06, ethertype ARP (0x0806), length 42: Request who-has 192.168.123.23 tell 192.168.123.106, length 28
13:18:36.438117 52:54:00:00:01:06 > 52:54:00:00:00:06, ethertype ARP (0x0806), length 60: Reply 192.168.123.23 is-at 52:54:00:00:01:06, length 46

This shows that an ARP reply is sent from both enp2s0 and enp3s0, each containing the MAC of the interface from which it was sent.  It also shows (oddly?) that echo requests are actually sent to *both* MACs, but no echo reply is ever sent.  Finally, the last two lines show an ARP request and reply between the two interfaces on the worker node itself.
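
The same check can be made without tcpdump.  Here is a minimal sketch using scapy that broadcasts a single ARP request for the LB IP and prints every reply received, assuming scapy is installed on the probing host, root privileges, and that br0 is the hypervisor bridge (the interface name is an assumption to adjust):

#!/usr/bin/env python3
# Diagnostic sketch: broadcast one ARP who-has for the load balancer IP and
# print every reply, so duplicate answers from different MACs are visible.
# Assumes scapy, root privileges, and that "br0" is the interface facing
# the external network (an assumption).
from scapy.all import ARP, Ether, srp

LB_IP = "192.168.123.23"   # load balancer IP from this report
IFACE = "br0"              # assumed interface name on the probing host

# multi=True keeps collecting answers until the timeout instead of stopping
# at the first reply, so a second reply from another MAC is not missed.
answered, _ = srp(
    Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=LB_IP),
    iface=IFACE, multi=True, timeout=2, verbose=False,
)

for _, reply in answered:
    print(f"{LB_IP} is-at {reply[ARP].hwsrc} (frame from {reply[Ether].src})")

With a healthy layer 2 announcement only the MAC of the announcing interface (52:54:00:00:00:06 here) should show up; two different MACs in the output corresponds to the capture above.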

It's probably also worth noting that a static ARP entry does not fix the issue.  Here is the tcpdump output (on the hypervisor) when I attempt the ping after creating a static ARP entry (192.168.123.23 --> 52:54:00:00:00:06):

17:08:58.835651 52:54:00:00:00:00 > 52:54:00:00:00:06, ethertype IPv4 (0x0800), length 98: 192.168.123.100 > 192.168.123.23: ICMP echo request, id 18, seq 1, length 64
17:08:58.835863 52:54:00:00:00:06 > 52:54:00:00:01:06, ethertype IPv4 (0x0800), length 98: 192.168.123.100 > 192.168.123.23: ICMP echo request, id 18, seq 1, length 64
17:09:04.020764 52:54:00:00:00:06 > 52:54:00:00:01:06, ethertype ARP (0x0806), length 42: Request who-has 192.168.123.23 tell 192.168.123.106, length 28
17:09:04.021144 52:54:00:00:01:06 > 52:54:00:00:00:06, ethertype ARP (0x0806), length 60: Reply 192.168.123.23 is-at 52:54:00:00:01:06, length 46
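
For completeness, the per-interface arp_ignore / arp_announce / arp_filter sysctls are what control whether a Linux interface will answer ARP for an address it does not itself carry (with the default of 0, any interface may answer for any local address).  Here is a small sketch to dump those values from a debug shell on the node, purely as a diagnostic:

#!/usr/bin/env python3
# Diagnostic sketch: print the per-interface ARP sysctls on the node.
# arp_ignore/arp_announce/arp_filter control whether an interface replies
# to ARP for addresses it does not carry.  Meant to be run from a debug
# shell on the worker (e.g. ocp4-worker3); interface names are from this
# report.
from pathlib import Path

IFACES = ["all", "default", "enp2s0", "enp3s0"]
KEYS = ["arp_ignore", "arp_announce", "arp_filter"]

for iface in IFACES:
    for key in KEYS:
        path = Path(f"/proc/sys/net/ipv4/conf/{iface}/{key}")
        value = path.read_text().strip() if path.exists() else "n/a"
        print(f"net.ipv4.conf.{iface}.{key} = {value}")

If enp3s0 shows the defaults, that would at least be consistent with it replying for 192.168.123.23 even though the address is only being announced via enp2s0; whether that is the root cause here is a separate question.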

I will attach logs from the MetalLB controller pod and the speaker pod on the worker to which the LB IP is assigned (ocp4-worker3).

Comment 1 Ian Pilcher 2022-03-24 21:15:49 UTC
Created attachment 1868207 [details]
Logs from controller container in MetalLB controller pod

Comment 2 Ian Pilcher 2022-03-24 21:16:37 UTC
Created attachment 1868208 [details]
Logs from kube-rbac-proxy container in MetalLB controller pod

Comment 3 Ian Pilcher 2022-03-24 21:17:23 UTC
Created attachment 1868209 [details]
Logs from frr container in MetalLB speaker pod

Comment 4 Ian Pilcher 2022-03-24 21:18:17 UTC
Created attachment 1868210 [details]
Logs from frr-metrics container in MetalLB speaker pod

Comment 5 Ian Pilcher 2022-03-24 21:19:18 UTC
Created attachment 1868211 [details]
Logs from kube-rbac-proxy container in MetalLB speaker pod

Comment 6 Ian Pilcher 2022-03-24 21:20:35 UTC
Created attachment 1868212 [details]
Logs from kube-rbac-proxy-frr container in MetalLB speaker pod

Comment 7 Ian Pilcher 2022-03-24 21:21:52 UTC
Created attachment 1868213 [details]
Logs from reloader container in MetalLB speaker pod

Comment 8 Ian Pilcher 2022-03-24 21:22:39 UTC
Created attachment 1868214 [details]
Logs from speaker container in MetalLB speaker pod

Comment 13 elevin 2022-05-13 20:04:46 UTC
*** Bug 2078939 has been marked as a duplicate of this bug. ***

Comment 16 Red Hat Bugzilla 2023-09-15 01:53:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days