Bug 2218291 - [OSP Tracker] Routes not recovered after frr restart for a long time
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ovn-bgp-agent
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: ---
Assignee: Eduardo Olivares
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On: 2193145
Blocks:
 
Reported: 2023-06-28 15:21 UTC by Eduardo Olivares
Modified: 2023-06-29 08:35 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
In Red Hat OpenStack Platform (RHOSP) 17.1 environments that use BGP dynamic routing, there is currently a known issue that may occur when the tripleo_frr service is restarted on any of the overcloud nodes.

If the tripleo_frr service is restarted immediately (systemctl restart tripleo_frr) on an overcloud node, the routes that this node was advertising via BGP may be lost on the peer nodes and routers, and it can take several minutes to recover them. This is caused by a known bug in the FRR package: https://bugzilla.redhat.com/show_bug.cgi?id=2193145

Workaround: this issue does not occur if the tripleo_frr service is restarted with a short pause between stop and start: systemctl stop tripleo_frr; sleep 1; systemctl start tripleo_frr
Clone Of: 2193145
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26205 0 None None None 2023-06-28 15:25:11 UTC

Description Eduardo Olivares 2023-06-28 15:21:38 UTC
This bug is created on the RHOSP product simply to track when the fix for the issue reported on RHEL becomes available in RHOSP.
In addition, we can use it to document this bug as a known issue in the RHOSP 17.1 Release Notes.



+++ This bug was initially created as a clone of Bug #2193145 +++

Description of problem:
FRR is running on two peer nodes (frr-8.3.1-5.el9.x86_64).

One of them exposes via BGP a route to an IP from its loopback interface:
[root@cmp-1-1 ~]# ip a s lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 172.30.1.3/32 brd 172.30.1.3 scope host lo
       valid_lft forever preferred_lft forever

The frr config running on this node is attached.



The peer node, leaf-0, receives a route to 172.30.1.3 via BGP (100.65.1.10 is another IP on a physical interface at cmp-1-1):
[root@leaf-0 ~]# vtysh -c 'show ip route 172.30.1.3'
Routing entry for 172.30.1.3/32
  Known via "bgp", distance 200, metric 0, best
  Last update 00:01:14 ago
  * 100.65.1.10, via eth5, weight 1


When frr is restarted on cmp-1-1, the route to 172.30.1.3 is removed from leaf-0 and it takes more than 5 minutes to recover, which is not acceptable.
When frr is stopped on cmp-1-1 and, after waiting 1 second, started again, the route to 172.30.1.3 is also removed from leaf-0, but it is recovered almost immediately, which is fine.
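The two restart sequences being compared can be written out as a small helper. This is a sketch only: the unit name defaults to "frr" here, while on RHOSP overcloud nodes the service is called tripleo_frr.

```shell
# Problematic: a plain restart, after which the peer's routes can stay
# missing for more than 5 minutes (the bug described above):
#   systemctl restart frr

# Workaround: stop the service, wait one second, then start it.
safe_frr_restart() {
    local unit="${1:-frr}"
    systemctl stop "$unit"
    sleep 1
    systemctl start "$unit"
}
```

On an overcloud node this would be invoked as `safe_frr_restart tripleo_frr`.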


The frr package on leaf-0 has been updated from frr-8.3.1-5.el9.x86_64 to frr-8.5.1-02.el9.x86_64 (obtained from [1]).
The test was repeated, and now the routes are recovered immediately after restarting frr on cmp-1-1.


We believe this is the PR that fixes this issue: https://github.com/FRRouting/frr/pull/12034
Can this patch be backported to RHEL9.2 frr version? (or can it be rebased? whatever is usually done in these cases).


[1] https://rpm.frrouting.org/


Version-Release number of selected component (if applicable):
frr-8.3.1-5.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Restart frr on one node (systemctl restart frr)
2. Check the BGP routes on its peer node
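Step 2 can be scripted on the peer. The sketch below assumes vtysh is on the PATH and keys on the 'Known via "bgp"' line of the "show ip route" output shown earlier in this report:

```shell
# Returns success if the local routing table has a BGP route to the
# given prefix, based on vtysh's "show ip route" output.
route_present() {
    vtysh -c "show ip route $1" | grep -q 'Known via "bgp"'
}

# Example (on leaf-0):
#   route_present 172.30.1.3 && echo "route recovered"
```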

Actual results:
BGP routes are not recovered for several minutes after the frr restart

Expected results:
BGP routes should be recovered after the frr restart

Additional info:

--- Additional comment from Michal Ruprich on 2023-05-04 15:02:52 UTC ---

Thanks Eduardo for the report and for trying to identify the fixing commit. Unfortunately, we already have this commit in our code:

https://pkgs.devel.redhat.com/cgit/rpms/frr/commit/?h=rhel-9-main&id=9d939262e3a1057c264bb642137d970f48730875

I added the patch 0006-graceful-restart together with the rebase. But never mind that; at least we know that the latest version works and I can go on from there. I will get back to you as soon as I know more.

Regards,
Michal

--- Additional comment from Michal Ruprich on 2023-05-15 17:42:47 UTC ---

Hi Eduardo,

I would like to ask you for the config on one of the nodes; maybe you forgot to attach it here? Are you using graceful restart? Without it, the change would not be transferred to the neighbor so quickly. frr-8.3.1-5 works for me when using graceful restart: the route disappears for a second but then it is added back. Maybe something in the config could help.

Thanks,
Michal

--- Additional comment from Eduardo Olivares on 2023-06-05 10:22:05 UTC ---

(In reply to Michal Ruprich from comment #2)
> Hi Eduardo,
> 
> I would like to ask you for the config on on of the nodes, maybe you forgot
> to add it here? Are you using graceful restart? Without it the change would
> not have been able to be transferred to the neighbor so quickly. frr-8.3.1-5
> works for me when using graceful restart. The route disappears for a second
> but than it is added back. Maybe something in the config could help.
> 
> Thanks,
> Michal



Apologies for the delayed reply, Michal.

I repeated the test with the same nodes (leaf-0 and cmp-1-1, see comment #0) and the same frr version, with identical results.
You can find frr logs and running configuration here: http://file.mad.redhat.com/~eolivare/BZ2193145/

1) First I stopped frr on cmp-1-1, waited 1 second, and started frr again. This happened between 09:41:43 and 09:41:48 (times are synchronized on both nodes).
You can see the following logs on cmp-1-1_frr.log:
2023/06/05 09:41:53.071 BGP: [ZWCSR-M7FG9] enp2s0 [FSM] BGP_Start (Idle->Connect), fd -1
2023/06/05 09:41:53.071 BGP: [ZWCSR-M7FG9] enp3s0 [FSM] BGP_Start (Idle->Connect), fd -1

After this, the route to 172.30.1.3 (an IP on cmp-1-1's loopback interface) was immediately recovered on leaf-0.

2) Later I stopped frr on cmp-1-1 and started it again immediately. This happened at 09:43:12.
You can see the following logs on cmp-1-1_frr.log:
2023/06/05 09:43:16.785 BGP: [ZWCSR-M7FG9] enp2s0 [FSM] BGP_Start (Idle->Connect), fd -1
2023/06/05 09:43:16.785 BGP: [ZWCSR-M7FG9] enp3s0 [FSM] BGP_Start (Idle->Connect), fd -1

After this, the route to 172.30.1.3 had still not been recovered on leaf-0 after ~30 minutes.
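The recovery times in the two runs above were observed by hand; a small polling loop could time this on leaf-0 instead. This is a sketch assuming vtysh is available and one-second granularity is sufficient:

```shell
# Polls the local routing table once per second and prints how many
# seconds elapsed before a BGP route to the prefix appeared.
# Returns non-zero if the route does not show up within the timeout.
wait_for_route() {
    local prefix="$1" timeout="${2:-600}" t=0
    while [ "$t" -lt "$timeout" ]; do
        if vtysh -c "show ip route $prefix" | grep -q 'Known via "bgp"'; then
            echo "$t"
            return 0
        fi
        sleep 1
        t=$((t + 1))
    done
    return 1
}

# Example (on leaf-0, right after restarting frr on cmp-1-1):
#   wait_for_route 172.30.1.3 600
```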

--- Additional comment from Eduardo Olivares on 2023-06-05 10:29:00 UTC ---

Adding some more info. BFD sessions are successfully established between leaf-0 and cmp-1-1.

[root@leaf-0 ~]# vtysh -c 'show bfd peer 100.65.1.10'
BFD Peer:
        peer 100.65.1.10 local-address 100.65.1.9 vrf default interface eth5
                ID: 3474991286
                Remote ID: 788172463
                Active mode
                Status: up
                Uptime: 43 minute(s), 0 second(s)
                Diagnostics: ok
                Remote diagnostics: ok
                Peer Type: dynamic
                Local timers:
                        Detect-multiplier: 10
                        Receive interval: 500ms
                        Transmission interval: 500ms
                        Echo receive interval: 50ms
                        Echo transmission interval: disabled
                Remote timers:
                        Detect-multiplier: 10
                        Receive interval: 500ms
                        Transmission interval: 500ms
                        Echo receive interval: 50ms


[root@cmp-1-1 /]# vtysh -c "show bfd peer 100.65.1.9"
BFD Peer:
        peer 100.65.1.9 local-address 100.65.1.10 vrf default interface enp2s0
                ID: 788172463
                Remote ID: 3474991286
                Active mode
                Status: up
                Uptime: 45 minute(s), 10 second(s)
                Diagnostics: ok
                Remote diagnostics: ok
                Peer Type: dynamic
                Local timers:
                        Detect-multiplier: 10
                        Receive interval: 500ms
                        Transmission interval: 500ms
                        Echo receive interval: 50ms
                        Echo transmission interval: disabled
                Remote timers:
                        Detect-multiplier: 10
                        Receive interval: 500ms
                        Transmission interval: 500ms
                        Echo receive interval: 50ms

