Bug 2063720
| Summary: | Metallb EBGP neighbor stuck in Active until adding ebgp-multihop (directly connected neighbors) |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Networking |
| Networking sub component: | Metal LB |
| Reporter: | Greg Kopels <gkopels> |
| Assignee: | Mohamed Mahmoud <mmahmoud> |
| QA Contact: | Arti Sood <asood> |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| CC: | cgoncalves, fpaoline |
| Version: | 4.10 |
| Flags: | mmahmoud: needinfo- |
| Target Milestone: | --- |
| Target Release: | 4.11.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | Telco:Core |
| Doc Type: | No Doc Update |
| Type: | Bug |
| Last Closed: | 2022-08-10 10:54:04 UTC |
Would it be possible to share the before/after FRR configuration file?

I added the ebgp-multihop command directly to the FRR configuration in the Speaker container:

router bgp 64501
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor 10.46.56.75 remote-as 64500
 neighbor 10.46.56.75 password bgp-test
 neighbor 10.46.56.75 timers 30 90
 neighbor 2620:52:0:2e38::705 remote-as 64500
 neighbor 2620:52:0:2e38::705 password bgp-test
 neighbor 2620:52:0:2e38::705 ebgp-multihop 2
 neighbor 2620:52:0:2e38::705 timers 30 90

I attached an oc describe bgppeer for before and after.

Thank you for the additional information. Could you please run an IP traceroute from the FRR container to 2620:52:0:2e38::705 to confirm whether it is single-hop or multi-hop? I believe there is test coverage for single-hop IPv6 BGP neighbors in upstream MetalLB CI, and those tests are passing. It might be that there is a bug in the FRR version that RHEL 8.4 ships, while upstream runs the latest stable FRR 7.5.

Hi,
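As a sketch of the equivalent workaround through the MetalLB API (rather than hand-editing FRR inside the speaker container), the BGPPeer resource exposes an ebgpMultiHop flag. The apiVersion, resource name, namespace, and timer fields below are illustrative assumptions and may differ by MetalLB release; check the CRD shipped with your operator version:

```yaml
# Illustrative BGPPeer mirroring the hand-edited FRR config above.
# apiVersion and field availability are assumptions, not taken from this bug.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: ebgp-peer-v6        # hypothetical name
  namespace: metallb-system
spec:
  myASN: 64501
  peerASN: 64500
  peerAddress: 2620:52:0:2e38::705
  password: bgp-test
  holdTime: 90s
  keepaliveTime: 30s
  ebgpMultiHop: true        # the workaround knob discussed in this bug
```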
1. Running traceroute from the external FRR container on the master node (using a macvlan interface, not hostNetwork: true), I get the following:
frr-pod# traceroute ipv6 2620:52:0:2e38::113
traceroute to 2620:52:0:2e38::113 (2620:52:0:2e38::113), 30 hops max, 72 byte packets
1 helix13.hlxcl7.lab.eng.tlv2.redhat.com (2620:52:0:2e38::113) 1.830 ms 0.511 ms 0.141 ms
2. Unable to run traceroute from inside the speaker FRR container; I receive the following error:
helix13.lab.eng.tlv2.redhat.com# traceroute ipv6 2620:52:0:2e38::131
Can't execute traceroute6: No such file or directory
3. Running tracepath from the speaker pod
[root@helix13 ~]# tracepath -6 2620:52:0:2e38::13
1?: [LOCALHOST] 0.035ms pmtu 1500
1: helix13 3072.059ms !H
Resume: pmtu 1500
This is the expected output for a directly connected neighbor.
By default, IPv6 BGP peering will try to use the link-local address. To change this behavior and allow peering over the global IPv6 address, the configuration knob "set ipv6 next-hop prefer-global" is required. The external FRR container image doesn't allow us to set this config:

frr-pod(config)# set
% There is no matched command.

Next I will try a different FRR image for the external test container that has this config option and retry the test.

Hi all,
1. I was able to add the route-map with set ipv6 next-hop prefer-global:
route-map RMAP permit 10
set ipv6 next-hop prefer-global
exit
address-family ipv6 unicast
neighbor 2620:52:0:2e38::113 activate
neighbor 2620:52:0:2e38::113 route-map RMAP in
neighbor 2620:52:0:2e38::114 activate
neighbor 2620:52:0:2e38::114 route-map RMAP in
exit-address-family
2. I was able to clear the "no ipv6 forwarding" state by setting sysctl net.ipv6.conf.all.forwarding=1:
router-master# sh ipv6 forwarding
ipv6 forwarding is on
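For reference, the forwarding change above can be made persistent across reboots with a sysctl drop-in file; the file path and name here are illustrative:

```
# /etc/sysctl.d/99-ipv6-forwarding.conf (illustrative path)
net.ipv6.conf.all.forwarding = 1
```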
However, the IPv6 neighbors are still stuck in Connect.
router-master# sh bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.10.10.10, local AS number 64500 vrf-id 0
BGP table version 2
RIB entries 1, using 184 bytes of memory
Peers 2, using 1433 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
10.46.56.13 4 64501 21 21 0 0 0 00:08:33 1 1 N/A
10.46.56.14 4 64501 21 21 0 0 0 00:08:33 1 1 N/A
Total number of neighbors 2
IPv6 Unicast Summary (VRF default):
BGP router identifier 10.10.10.10, local AS number 64500 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 2, using 1433 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
2620:52:0:2e38::113 4 64501 0 0 0 0 0 never Connect 0 N/A
2620:52:0:2e38::114 4 64501 0 0 0 0 0 never Connect 0 N/A
Total number of neighbors 2
router-master# sh run
Building configuration...
Current configuration:
!
frr version 8.3-dev_git
frr defaults traditional
hostname router-master
log file /tmp/frr.log
log timestamp precision 3
!
debug bgp neighbor-events
debug bgp nht
debug bgp updates in
debug bgp updates out
debug bfd peer
!
router bgp 64500
bgp router-id 10.10.10.10
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 10.46.56.13 remote-as 64501
neighbor 10.46.56.13 password bgp-test
neighbor 10.46.56.14 remote-as 64501
neighbor 10.46.56.14 password bgp-test
neighbor 2620:52:0:2e38::113 remote-as 64501
neighbor 2620:52:0:2e38::113 password bgp-test
neighbor 2620:52:0:2e38::114 remote-as 64501
neighbor 2620:52:0:2e38::114 password bgp-test
!
address-family ipv4 unicast
neighbor 10.46.56.13 activate
neighbor 10.46.56.14 activate
exit-address-family
!
address-family ipv6 unicast
neighbor 2620:52:0:2e38::113 activate
neighbor 2620:52:0:2e38::113 route-map RMAP in
neighbor 2620:52:0:2e38::114 activate
neighbor 2620:52:0:2e38::114 route-map RMAP in
exit-address-family
exit
!
route-map RMAP permit 10
set ipv6 next-hop prefer-global
exit
!
ipv6 nht resolve-via-default
!
end
router-master# sh ipv6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
C>* 2620:52:0:2e38::/64 is directly connected, eth1, 00:11:46
C>* fd01:0:0:1::/64 is directly connected, eth0, 00:11:46
C * fe80::/64 is directly connected, eth1, 00:11:46
C>* fe80::/64 is directly connected, eth0, 00:11:46
helix14.lab.eng.tlv2.redhat.com# sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
K>* 0.0.0.0/0 [0/49] via 10.46.56.254, br-ex, 00:20:06
C>* 10.46.56.0/24 is directly connected, br-ex, 00:20:06
K>* 10.128.0.0/14 [0/0] via 10.131.0.1, ovn-k8s-mp0, 00:20:06
C>* 10.131.0.0/23 is directly connected, ovn-k8s-mp0, 00:20:06
K>* 169.254.169.0/30 [0/0] via 10.46.56.254, br-ex, 00:20:06
K>* 169.254.169.3/32 [0/0] via 10.131.0.1, ovn-k8s-mp0, 00:20:06
K>* 172.30.0.0/16 [0/0] via 10.46.56.254, br-ex, 00:20:06
helix14.lab.eng.tlv2.redhat.com# sh ip nht
10.46.56.131(Connected)
resolved via connected
is directly connected, br-ex
Client list: bgp(fd 15)
sh ip bgp neighbors 10.46.56.131
Connections established 1; dropped 0
Last reset 00:07:26, Waiting for peer OPEN
Local host: 10.46.56.14, Local port: 42532
Foreign host: 10.46.56.131, Foreign port: 179
Nexthop: 10.46.56.14
Nexthop global: 2620:52:0:2e38::114
Nexthop local: fe80::e4c8:7f3b:bb3f:c714
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Estimated round trip time: 2 ms
Peer Authentication Enabled
Read thread: on Write thread: on FD used: 22
Are we able to reproduce this issue using the FRR 7.5 stable image?

QE validation of the workaround fix: tested both DualStack and SingleStack IPv6. In both cases the IPv6 EBGP peer created an adjacency with the external FRR container as expected.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
Created attachment 1865830 [details]
Show commands and adding ebgp-multihop

Description of problem:
Hybrid cluster with 2 bare-metal worker nodes (dual stack). Cluster version is 4.10.3 with metallb-operator.4.10.0-202203081809. The BGP neighbors are directly connected. When creating EBGP neighbors, the IPv6 neighbors stay in Active until I add the ebgp-multihop command.

How reproducible:
Medium; requires a dual-stack cluster.

Steps to Reproduce:
1. Bring up an external FRR container configured with the worker node neighbors.
2. Configure an EBGP BGPPeer with the single external FRR container as neighbor.
3. Verify that the IPv6 neighbors are stuck in Active.
4. Add ebgp-multihop.
5. Verify that the IPv6 neighbor is Established.

Actual results:
IPv4 neighbors are Established while IPv6 neighbors are stuck in Active.

Expected results:
In a directly connected environment I expect the IPv6 neighbor to come up without the ebgp-multihop command.

Additional info: