Bug 1948422

Summary: BGP incorrectly withdraws routes on graceful restart capable routers
Product: Red Hat Enterprise Linux 8
Reporter: Carlos Goncalves <cgoncalves>
Component: frr
Assignee: Michal Ruprich <mruprich>
Status: CLOSED ERRATA
QA Contact: František Hrdina <fhrdina>
Severity: high
Docs Contact:
Priority: high
Version: 8.3
CC: fhrdina, michele, mruprich, rkhan
Target Milestone: beta
Keywords: AutoVerified, Patch, Reopened, Reproducer, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: frr-7.5.1-5.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2127494
Environment:
Last Closed: 2023-05-16 08:30:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2127494

Description Carlos Goncalves 2021-04-12 07:10:55 UTC
Route prefixes advertised over BGP are sometimes (~20% of the time) removed on peer BGP routers when the local BGP router is restarted, even though BGP graceful restart is enabled. The result is route flapping, which forces all participating routers to recalculate the topology.

Version-Release number of selected component (if applicable):
- frr-7.0-10.el8.x86_64
- frr-7.5-4.el8.x86_64

How reproducible:
~20%

Steps to Reproduce:
1. Create two routers, router-1 (10.20.30.43) and router-2 (10.20.30.44), with the following FRR configuration (router-2's is shown):

	frr version 7.5
	frr defaults traditional
	hostname router-2.localdomain
	log file /var/log/frr/frr.log
	no ip forwarding
	no ipv6 forwarding
	service integrated-vtysh-config
	!
	debug bgp keepalives
	debug bgp neighbor-events
	debug bgp updates in
	debug bgp updates out
	debug bgp zebra
	!
	router bgp 64999
	 bgp log-neighbor-changes
	 no bgp suppress-duplicates
	 bgp graceful-shutdown
	 bgp graceful-restart
	 bgp graceful-restart preserve-fw-state
	 neighbor 10.20.30.43 remote-as 64999
	 !
	 address-family ipv4 unicast
	  redistribute connected
	 exit-address-family
	!
	line vty

2. On router-1, add a dummy route:
	$ sudo ip a a 10.20.50.98/32 dev lo

3. Verify that router-2 received the route prefix and installed it in the kernel routing table:
	$ sudo ip r | grep 10.20.50.98
	10.20.50.98 nhid 16 via 10.20.30.43 dev eth1 proto bgp metric 20

4. Restart FRR on router-1:
	$ sudo systemctl restart frr

5. Check /var/log/frr/frr.log on router-2 and note that route 10.20.50.98/32 was deleted as soon as FRR on router-1 was stopped ("Tx route delete VRF 0 10.20.50.98/32"):

	BGP: 10.20.30.41 [Event] BGP connection closed fd 23
	BGP: %NOTIFICATION: received from neighbor 10.20.30.41 6/3 (Cease/Peer Unconfigured) 0 bytes
	BGP: 10.20.30.41 [FSM] Receive_NOTIFICATION_message (Established->Clearing), fd 23
	BGP: %ADJCHANGE: neighbor 10.20.30.41(router-1.localdomain) in vrf default Down BGP Notification received
	BGP: 10.20.30.41 graceful restart stalepath timer stopped
	BGP: bgp_fsm_change_status : vrf default(0), Status: Clearing established_peers 0
	BGP: RID change : vrf VRF default(0), RTR ID 192.168.121.66
	BGP: 10.20.30.41 went from Established to Clearing
	BGP: 10.20.30.41 [FSM] Clearing_Completed (Clearing->Idle), fd -1
	BGP: bgp_fsm_change_status : vrf default(0), Status: Idle established_peers 0
	BGP: 10.20.30.41 went from Clearing to Idle
	BGP: Tx route delete VRF 0 10.20.50.99/32
	BGP: [Event] BGP connection from host 10.20.30.41 fd 23
	BGP: bgp_fsm_change_status : vrf default(0), Status: Active established_peers 0
	BGP: 10.20.30.41 went from Idle to Active
	BGP: 10.20.30.41 [FSM] TCP_connection_open (Active->OpenSent), fd 23
	BGP: 10.20.30.41 passive open
	BGP: 10.20.30.41 Sending hostname cap with hn = router-2.localdomain, dn = (null)
	BGP: 10.20.30.41 sending OPEN, version 4, my as 64999, holdtime 180, id 192.168.121.66
	BGP: bgp_fsm_change_status : vrf default(0), Status: OpenSent established_peers 0
	BGP: 10.20.30.41 went from Active to OpenSent
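
The check in step 5 can be scripted. The following is a minimal sketch, assuming the log path from the configuration above, the default VRF (VRF 0), and log lines in the format shown in the excerpt. The function name route_was_withdrawn is hypothetical and not part of FRR.

```shell
#!/usr/bin/env bash
# Sketch: report whether bgpd logged a route delete for a given prefix.
# route_was_withdrawn is a hypothetical helper name, not an FRR tool.
# Assumes the default VRF (VRF 0) and the log line format shown above.
# Usage: route_was_withdrawn <prefix/len> [logfile]
route_was_withdrawn() {
    local prefix="$1"
    local logfile="${2:-/var/log/frr/frr.log}"
    # -F matches a fixed string, so the dots in the prefix stay literal;
    # -q suppresses output and only sets the exit status.
    grep -qF "Tx route delete VRF 0 ${prefix}" "$logfile"
}
```

For example, after restarting FRR on router-1, running `route_was_withdrawn 10.20.50.98/32 && echo "route was withdrawn"` on router-2 prints the message only if the log has a matching delete line.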

Comment 1 Carlos Goncalves 2021-05-12 07:04:43 UTC
Is there an update on this bug? Has it been triaged by the RHEL team?
Please let me know if there is any additional information I could provide, including setting up a lab for testing and development.

I would like to highlight that this bug causes route flapping, interrupting data plane forwarding in network routers.

Comment 2 Michal Ruprich 2021-05-18 09:48:03 UTC
Sorry Carlos,

this seems to be solved in the current upstream version. I reproduced it in RHEL 8 but not in Fedora. I am looking for the fix; the upstream issue is not very specific on the details.

I'll keep you posted.

Michal

Comment 3 Carlos Goncalves 2021-05-19 06:52:56 UTC
Thanks, Michal. Were you able to find the mentioned upstream fix?
I have been following upstream commits and issues fairly closely and have not spotted one that might address this issue.
I installed FRR from source (master, 9d78be6) on Fedora and was still able to reproduce the same issue with the same reproducer steps.

Comment 4 Michal Ruprich 2021-06-10 08:54:20 UTC
Hi Carlos,

TBH I did not find any particular commit that would fix this, but for some reason I don't see the error with 7.5.1. Never mind, I will query upstream again for a possible solution.

Comment 5 Carlos Goncalves 2021-07-13 08:55:05 UTC
Any update? I have reproduced this issue with 7.5.1 as well, so all versions at least from 7.0 (up to master) seem to be impacted.
Have you queried the upstream project as suggested in comment #4? Is there a place (e.g. GitHub, mailing list) where one could follow the discussion?
Thank you.

Comment 6 Michal Ruprich 2021-08-19 05:30:19 UTC
Hi Carlos,

sorry, I missed the needinfo on this one. Thanks for filing the bug upstream. It seems there is no solution so far.

Comment 11 Carlos Goncalves 2022-04-22 12:29:07 UTC
This issue was reported as fixed upstream and my GitHub issue was closed.
Please review the commit below and consider backporting the patch to all supported RHEL 8.x versions.

https://github.com/FRRouting/frr/commit/aa24a36a2d1814c8a1844465b8ff73e54cb85b45

Comment 20 RHEL Program Management 2022-11-01 07:28:56 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 30 errata-xmlrpc 2023-05-16 08:30:22 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: frr security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2801
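
To check whether a host already carries the fix, the installed frr build can be compared with the "Fixed In Version" value (frr-7.5.1-5.el8). The following is a minimal sketch, assuming GNU sort is available. The helper name has_gr_fix is hypothetical, and sort -V only approximates RPM's own version ordering, so treat the result as a hint rather than an authoritative answer.

```shell
#!/usr/bin/env bash
# Sketch: is a given frr version-release at or above the first fixed build?
# has_gr_fix is a hypothetical helper name. sort -V approximates, but does
# not exactly reproduce, RPM's version comparison.
FIXED="7.5.1-5.el8"   # from "Fixed In Version: frr-7.5.1-5.el8"

has_gr_fix() {
    local installed="$1"
    # sort -V puts the lower version first. The fix is present when the
    # lowest of the two values is the fixed build itself.
    [ "$(printf '%s\n' "$FIXED" "$installed" | sort -V | head -n 1)" = "$FIXED" ]
}

# Example on a live system:
# has_gr_fix "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' frr)" && echo "fix present"
```

With the versions named in this report, `has_gr_fix 7.5-4.el8` and `has_gr_fix 7.0-10.el8` return non-zero, while `has_gr_fix 7.5.1-5.el8` returns zero.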