Bug 1828637 - [Telco] northd: detect and correct DB consistency and referential integrity issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.12
Version: FDP 20.B
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dumitru Ceara
QA Contact: Jianlin Shi
URL:
Whiteboard: Telco
Duplicates: 1828343
Depends On: 1828639 1836305
Blocks: 1837257
Reported: 2020-04-28 01:40 UTC by Dan Williams
Modified: 2023-10-06 19:47 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1837257
Environment:
Last Closed: 2020-05-26 14:07:18 UTC
Target Upstream Version:
Embargoed:



Links
Red Hat Product Errata RHBA-2020:2317 (last updated 2020-05-26 14:07:35 UTC)

Comment 1 Dan Williams 2020-04-28 14:40:41 UTC
Numan said:

It looks like the IDL contents and the actual DB are out of sync. The
IDL would not try to create a port binding row that causes referential
integrity issues unless there were such a mismatch.
Also, when ovn-northd tries to delete a datapath, the deletion fails because
some port binding rows of that datapath, which ovn-northd is unaware of,
still exist in the DB.

We should work on detecting this in ovn-northd
and force a complete re-sync of the DB.

Since this is the root cause of the issue, I'm repurposing this bug to track detecting and correcting IDL consistency issues in northd automatically.
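To make the proposed detect-and-resync idea concrete, here is a toy bash model (bash 4+ is required for associative arrays). A client whose cached view has missed an update picks a tunnel key that looks free locally, the DB's unique (datapath, tunnel_key) index rejects the commit, and the client recovers by discarding its cache, re-reading the authoritative rows and retrying. This is not ovn-northd's code or the upstream fix; `db`, `cache`, `db_insert`, `next_free_key`, `full_resync` and `add_port` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Toy model only: hypothetical names, not ovn-northd code. Needs bash 4+.

declare -A db=()     # authoritative rows: "datapath:tunnel_key" -> port
declare -A cache=()  # client-side copy that may have missed updates

# Commit a row; reject it, like a unique index would, if the key is taken.
db_insert() {  # $1=datapath $2=tunnel_key $3=port
    local k="$1:$2"
    if [ -n "${db[$k]+set}" ]; then
        echo "constraint violation: $k already used by ${db[$k]}" >&2
        return 1
    fi
    db[$k]=$3
}

# Lowest tunnel key that looks free according to the (possibly stale) cache.
next_free_key() {  # $1=datapath
    local key=1
    while [ -n "${cache[$1:$key]+set}" ]; do
        key=$((key + 1))
    done
    echo "$key"
}

# Forget everything cached and reload it from the authoritative copy.
full_resync() {
    local k
    cache=()
    for k in "${!db[@]}"; do
        cache[$k]=${db[$k]}
    done
}

# Add a port; if the commit fails, treat the cache as inconsistent,
# resync it and retry once with a freshly chosen key.
add_port() {  # $1=datapath $2=port
    local key
    key=$(next_free_key "$1")
    if db_insert "$1" "$key" "$2"; then
        cache[$1:$key]=$2
        return 0
    fi
    full_resync
    key=$(next_free_key "$1")
    db_insert "$1" "$key" "$2" && cache[$1:$key]=$2
}
```

For example, with `db[ls1:1]=p1` committed behind the client's back, `add_port ls1 p2` first tries key 1, fails with a constraint violation, resyncs, and then commits p2 with key 2. Retrying a single operation keeps the model short; the quote above proposes the broader rule of treating such a failure as a sign that the whole cached view may be stale.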

Comment 3 Dumitru Ceara 2020-04-29 16:29:27 UTC
A patch series that makes ovn-northd resolve such inconsistencies automatically has been sent upstream for review:

https://patchwork.ozlabs.org/project/openvswitch/list/?series=173578

Comment 8 Ben Bennett 2020-05-08 19:12:49 UTC
*** Bug 1828343 has been marked as a duplicate of this bug. ***

Comment 11 Jianlin Shi 2020-05-12 02:54:44 UTC
Reproduced the first inconsistency scenario on ovn2.13.0-21 with the following script:

/usr/share/ovn/scripts/ovn-ctl start_northd

# Create two logical switches with one port each.
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 p1
ovn-nbctl ls-add ls2
ovn-nbctl lsp-add ls2 p2
ovn-nbctl --wait=sb sync

# At this point PB for p1 has tunnel_key=1
ovn-sbctl list datapath
# At this point PB for p2 has tunnel_key=2

# Simulate the SB db going away (could be network
# issues or crash or some other event).
/usr/share/ovn/scripts/ovn-ctl stop_sb_ovsdb

# CMS decides to move p2 from ls2 to ls1 and removes
# ls2 completely.
ovn-nbctl ls-del ls2
ovn-nbctl lsp-add ls1 p2

# Simulate SB DB coming back online.
/usr/share/ovn/scripts/ovn-ctl start_sb_ovsdb

cat /var/log/ovn/ovn-northd.log

[root@kvm-04-guest09 bz1828637]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-host-2.13.0-21.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-2.13.0-21.el8fdp.x86_64
ovn2.13-2.13.0-21.el8fdp.x86_64
openvswitch2.13-2.13.0-18.el8fdp.x86_64

Error in /var/log/ovn/ovn-northd.log:

2020-05-12T02:49:51.819Z|00013|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (ad24f326-5d37-4ad0-966f-a07121e0b4bb and 1) for index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID ee1c4a5e-502f-4555-b5e3-c30240ef2719, had the following index values before the transaction: 79ce696f-6579-4aad-93f5-00a70c5140e2 and 1.  Second row, with UUID 44631d9e-ada8-4266-8bfb-60e87533d264, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}

Reproduced the second inconsistency scenario on ovn2.13.0-21 with the following script:

/usr/share/ovn/scripts/ovn-ctl start_northd

# Create a logical router with one router port.
ovn-nbctl lr-add lr
ovn-nbctl lrp-add lr p 00:00:00:00:00:01 1.1.1.1/24

# Wait for ovn-northd to create the Datapath_Binding row in the SB DB.
ovn-nbctl --wait=sb sync

# Simulate that a MAC binding was created for the router
# port.
dp=$(ovn-sbctl --bare --columns _uuid list datapath .)
ovn-sbctl create mac_binding logical_port="p" ip="1.1.1.2" datapath="$dp"
ovn-nbctl --wait=sb sync

# Simulate the SB db going away (could be network
# issues or crash or some other event).
/usr/share/ovn/scripts/ovn-ctl stop_sb_ovsdb

# CMS decides to delete lr.
ovn-nbctl lr-del lr

# CMS decides to re-add lr and the router port.
ovn-nbctl lr-add lr
ovn-nbctl lrp-add lr p 00:00:00:00:00:01 1.1.1.1/24

# Simulate SB DB coming back online.
/usr/share/ovn/scripts/ovn-ctl start_sb_ovsdb

cat /var/log/ovn/ovn-northd.log

Error in /var/log/ovn/ovn-northd.log:

2020-05-12T02:52:29.274Z|00013|ovsdb_idl|WARN|transaction error: {"details":"cannot delete Datapath_Binding row 63dfd8b4-e904-4004-a7ed-2068593c6de4 because of 1 remaining reference(s)","error":"referential integrity violation"}
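In this second scenario the remaining reference is most likely the MAC_Binding row created by hand, which still points at the old Datapath_Binding. When debugging such an error it can help to list the rows that still reference a given datapath. The following is a sketch, not a tool shipped with OVN: `dp_refs` is a hypothetical helper name, it assumes `ovn-sbctl` can reach the SB DB, and it only checks Port_Binding and MAC_Binding (other SB tables may also reference Datapath_Binding).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list SB rows that still reference a Datapath_Binding
# row, i.e. the "remaining reference(s)" that block its deletion.
# Assumes ovn-sbctl can reach the SB DB; only two referencing tables are checked.
dp_refs() {  # $1 = Datapath_Binding UUID
    local dp=$1 table
    if [ -z "$dp" ]; then
        echo "usage: dp_refs <datapath-uuid>" >&2
        return 2
    fi
    for table in Port_Binding MAC_Binding; do
        echo "== $table =="
        # Generic ovn-sbctl "find" on the datapath reference column.
        ovn-sbctl --bare --columns=_uuid,logical_port find "$table" datapath="$dp"
    done
}
```

For example, `dp_refs 63dfd8b4-e904-4004-a7ed-2068593c6de4` would print the Port_Binding and MAC_Binding rows, if any, that still reference the Datapath_Binding row named in the error above.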


Verified on ovn2.13.0-27; no errors appear in /var/log/ovn/ovn-northd.log after running the two scripts:

[root@kvm-04-guest09 bz1828637]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-2.13.0-27.el8fdp.x86_64
ovn2.13-host-2.13.0-27.el8fdp.x86_64
openvswitch2.13-2.13.0-18.el8fdp.x86_64
ovn2.13-central-2.13.0-27.el8fdp.x86_64
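The verification amounts to checking that ovn-northd logged no OVSDB transaction errors after each script. A minimal sketch of that check, assuming the log format shown in the excerpts above (`ovsdb_idl|WARN|transaction error: {...,"error":"<kind>"}`); `check_northd_log` is a hypothetical helper, not part of OVN.

```shell
#!/usr/bin/env bash
# Hypothetical check: return 1 and print the "error" kind of every OVSDB
# transaction error ovn-northd logged (e.g. "constraint violation"),
# 0 if there are none, and 2 if the log cannot be read.
check_northd_log() {  # $1 = log file (default: /var/log/ovn/ovn-northd.log)
    local log=${1:-/var/log/ovn/ovn-northd.log}
    local errors
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # Keep only transaction-error lines, then extract the "error" field.
    errors=$(grep -F 'ovsdb_idl|WARN|transaction error' "$log" |
             sed -n 's/.*"error":"\([^"]*\)".*/\1/p')
    if [ -n "$errors" ]; then
        sed 's/^/transaction error: /' <<<"$errors" >&2
        return 1
    fi
    echo "no transaction errors in $log"
}
```

Run it after each reproducer, for example `check_northd_log || echo "inconsistency reproduced"`; on the fixed build it should report no transaction errors.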

Comment 12 Jianlin Shi 2020-05-12 03:15:39 UTC
Reproduced on the RHEL 7 build:

2020-05-12T03:12:56.526Z|00013|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Port_Binding\" table to have identical values (05fa7425-20c5-49a0-8cf9-d75e189af722 and 1) for index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID eb25335d-e058-4f7b-8daf-bfe9dd8c9415, had the following index values before the transaction: 4134a573-df11-415f-bcf1-ff9f0312fbe9 and 1.  Second row, with UUID 4e82b8f7-d2e8-4cbb-9221-1d8e5a4c50e1, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
[root@hpe-dl380pgen8-02-vm-13 bz1828637]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-17.el7fdp.x86_64
ovn2.13-2.13.0-21.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-host-2.13.0-21.el7fdp.x86_64
ovn2.13-central-2.13.0-21.el7fdp.x86_64

2020-05-12T03:13:55.476Z|00013|ovsdb_idl|WARN|transaction error: {"details":"cannot delete Datapath_Binding row d2759a31-9239-43a4-9371-4a25a259d4ca because of 1 remaining reference(s)","error":"referential integrity violation"}


Verified on ovn2.13.0-27:

[root@hpe-dl380pgen8-02-vm-13 bz1828637]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch2.13-2.13.0-17.el7fdp.x86_64
ovn2.13-central-2.13.0-27.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
ovn2.13-2.13.0-27.el7fdp.x86_64
ovn2.13-host-2.13.0-27.el7fdp.x86_64

Comment 17 errata-xmlrpc 2020-05-26 14:07:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2317

