Bug 1818754 - [OVN SCALE][OVN 2.13] ovn-controller with monitor-all misses port bindings sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.13
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Dumitru Ceara
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On: 1808125
Blocks:
Reported: 2020-03-30 09:38 UTC by Dumitru Ceara
Modified: 2020-07-10 05:38 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1808125
Environment:
Last Closed: 2020-04-20 19:43:23 UTC
Target Upstream Version:
Embargoed:


Attachments
OVS conf.db (196.88 KB, text/plain), 2020-04-09 12:10 UTC, Dumitru Ceara
NB DB. (241.47 KB, text/plain), 2020-04-09 14:21 UTC, Dumitru Ceara
SB DB. (708.02 KB, text/plain), 2020-04-09 14:22 UTC, Dumitru Ceara
standalone/simplifie NB DB (13.42 KB, text/plain), 2020-04-10 08:40 UTC, Dumitru Ceara
standalone/simplifie SB DB (67.80 KB, text/plain), 2020-04-10 08:41 UTC, Dumitru Ceara


Links
System: Red Hat Product Errata
ID: RHBA-2020:1501
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2020-04-20 19:43:43 UTC

Description Dumitru Ceara 2020-03-30 09:38:52 UTC
+++ This bug was initially created as a clone of Bug #1808125 +++

It seems that ovn-controller sometimes misses port binding notifications.

We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.

--- Additional comment from Dan Williams on 2020-02-27 21:18:14 UTC ---

ovn2.12.x86_64 0:2.12.0-32.el7fdn

--- Additional comment from Dan Williams on 2020-02-27 21:18:39 UTC ---

openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

--- Additional comment from Dumitru Ceara on 2020-02-28 10:13:23 UTC ---



--- Additional comment from Dumitru Ceara on 2020-02-28 10:52:16 UTC ---



--- Additional comment from Dumitru Ceara on 2020-02-28 12:11:30 UTC ---

From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"where":
                [
                    ["type","==","patch"],
                    ["type","==","chassisredirect"],
                    ["type","==","external"]
                ],
            "columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],
        [...]

After this point, the ovn-controller facing the issue ("server13") doesn't update the monitor condition, i.e., it doesn't send any further monitor_cond_since requests to the OVSDB SB server.

From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:

2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
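
The comparison between the two logged requests can be automated. The following is only a sketch: the function name tables_with_conditions and the trimmed sample requests are made up for illustration, the params layout is taken from the log excerpts above, and treating only a missing or [true] "where" clause as "monitor everything" is an assumption.

```python
import json


def tables_with_conditions(params):
    """Return the tables whose monitor request still carries a real "where" clause."""
    # Layout assumed from the logged request: [db-name, monitor-id, requests, ...].
    requests = params[2]
    conditional = []
    for table, reqs in requests.items():
        # The log shows a list of request objects per table; accept a bare object too.
        if isinstance(reqs, dict):
            reqs = [reqs]
        for req in reqs:
            where = req.get("where")
            # Assumption: only a missing or [true] clause means "monitor everything".
            if where is not None and where != [True]:
                conditional.append(table)
                break
    return sorted(conditional)


# Trimmed-down, hypothetical copies of the two logged requests (columns shortened).
server13_params = json.loads("""
["OVN_Southbound", ["monid", "OVN_Southbound"],
 {"Port_Binding": [{"where": [["type", "==", "patch"],
                              ["type", "==", "chassisredirect"],
                              ["type", "==", "external"]],
                    "columns": ["chassis", "logical_port", "type"]}]}]
""")
server9_params = json.loads("""
["OVN_Southbound", ["monid", "OVN_Southbound"],
 {"Port_Binding": [{"columns": ["chassis", "logical_port", "type"]}]}]
""")

print(tables_with_conditions(server13_params))
print(tables_with_conditions(server9_params))
```

With ovn-monitor-all=true, a non-empty result for a chassis would point at the same symptom as seen for "server13".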

--- Additional comment from Dumitru Ceara on 2020-02-28 16:01:05 UTC ---

Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. Start OVS with the ovsdb.db database.
2. Start OVN northd and the NB/SB databases. E.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This sets ovn-monitor-all to false in the OVS database and removes the relevant chassis and port bindings from the SB DB. It then attaches gdb to the SB ovsdb-server and sets a breakpoint that fires when the server receives the monitor_cond_since request from ovn-controller for the OVN_Southbound database. At that point, before the server replies to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller has received the update from OVS, and then continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []
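
The check in step 5 can be scripted when the scenario is run repeatedly. This is a minimal sketch that only parses the text printed by the ovn-sbctl command above; the helper name port_binding_claimed is hypothetical, and it assumes the output contains a single "chassis" line in the form shown.

```python
def port_binding_claimed(sbctl_output):
    """Return True if the "chassis" column in the given output is non-empty."""
    for line in sbctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "chassis":
            # An unclaimed port binding is printed as "chassis : []".
            return value.strip() not in ("", "[]")
    raise ValueError("no chassis column found in output")


print(port_binding_claimed("chassis             : []"))
```

The text passed in would be the captured output of the ovn-sbctl call, for example via subprocess.run(..., capture_output=True, text=True).stdout.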

Root cause:
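ovn-controller shouldn't call update_sb_monitors while the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

Any condition change requested while the reply was outstanding is therefore dropped, so the updated condition (here, the unconditional one for ovn-monitor-all=true) is never sent to the server.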

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613
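
The ordering problem described in the root cause can be illustrated with a toy model. This is not the OVS IDL code and not the upstream fix: ToyIdl, its methods, and the state strings are invented for illustration, and the model only captures the single flag (cond_changed) mentioned above.

```python
class ToyIdl:
    """Toy stand-in for the IDL state discussed above; not the real ovsdb-idl API."""

    def __init__(self):
        self.state = "MONITOR_COND_SINCE_REQUESTED"
        self.cond_changed = False
        self.pending = None
        self.sent = []  # condition updates that actually reach the server

    def set_condition(self, cond):
        # Stands in for what update_sb_monitors triggers: queue a new condition.
        self.pending = cond
        self.cond_changed = True

    def handle_monitor_reply(self):
        # Stands in for parse_monitor_reply -> db_clear -> cond_changed = false.
        self.cond_changed = False
        self.state = "MONITORING"

    def run(self):
        # Only a condition still flagged as changed is sent to the server.
        if self.state == "MONITORING" and self.cond_changed:
            self.sent.append(self.pending)
            self.cond_changed = False


def racy_update():
    idl = ToyIdl()
    idl.set_condition("monitor-all")  # queued while the request is outstanding
    idl.handle_monitor_reply()        # reply arrives and wipes the flag
    idl.run()
    return idl.sent


def deferred_update():
    idl = ToyIdl()
    idl.handle_monitor_reply()        # wait for the reply first
    idl.set_condition("monitor-all")
    idl.run()
    return idl.sent


print(racy_update())
print(deferred_update())
```

In this model the racy ordering sends nothing, which matches the observation that "server13" never updates its monitor condition, while deferring the condition change until after the reply gets it sent.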

--- Additional comment from Dumitru Ceara on 2020-02-28 23:02:04 UTC ---

Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/

--- Additional comment from Dumitru Ceara on 2020-03-30 09:38:02 UTC ---

Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c

Comment 4 Jianlin Shi 2020-04-09 00:05:06 UTC
Hi Dumitru, the DB files in https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6 don't work on ovn2.13. Could you please provide DB files for 2.13? Thanks.

Comment 5 Dumitru Ceara 2020-04-09 12:10:27 UTC
Created attachment 1677528 [details]
OVS conf.db

Comment 6 Dumitru Ceara 2020-04-09 14:21:50 UTC
Created attachment 1677557 [details]
NB DB.

NB DB taken from BZ 1808125 and fixed.

Comment 7 Dumitru Ceara 2020-04-09 14:22:21 UTC
Created attachment 1677558 [details]
SB DB.

SB DB taken from BZ 1808125 and fixed.

Comment 8 Jianlin Shi 2020-04-10 00:50:50 UTC
/usr/share/ovn/scripts/ovn-ctl restart_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1 still fails with the attached NB and SB DBs:

Exiting ovn-northd (19184)                                 [  OK  ]                                   
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
Waiting for OVN_Northbound to come up 2020-04-10T00:48:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-04-10T00:48:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected                  
                                                           [  OK  ]                                   
Upgrading database OVN_Northbound from schema version 5.18.0 to 5.20.0 2020-04-10T00:48:50Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-nb.ovsschema: changed 2 columns in 'OVN_Northbound' database from ephemeral
to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns
2020-04-10T00:49:20Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)                 
/usr/share/openvswitch/scripts/ovs-lib: line 602: 19488 Alarm clock             "$@"                  
                                                           [FAILED]                                   
ovsdb-tool: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
ovsdb-server: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format

Comment 9 Dumitru Ceara 2020-04-10 08:40:47 UTC
Created attachment 1677739 [details]
standalone/simplifie NB DB

Comment 10 Dumitru Ceara 2020-04-10 08:41:13 UTC
Created attachment 1677740 [details]
standalone/simplifie SB DB

Comment 11 Dumitru Ceara 2020-04-10 08:43:39 UTC
Hi Jianlin,

I simplified the NB/SB dbs to include only the relevant configurations. Also, they're not clustered DBs anymore so the steps to replicate the issue are:

1. Start OVS with the attached conf.db database.
2. Start OVN northd and NB/SB with the attached databases. E.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This sets ovn-monitor-all to false in the OVS database and removes the relevant chassis and port bindings from the SB DB. It then attaches gdb to the SB ovsdb-server and sets a breakpoint that fires when the server receives the monitor_cond_since request from ovn-controller for the OVN_Southbound database. At that point, before the server replies to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller has received the update from OVS, and then continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

Thanks,
Dumitru

Comment 12 Jianlin Shi 2020-04-10 10:08:41 UTC
Reproduced with the following steps.

openvswitch-debuginfo and openvswitch-debugsource must be installed, along with the debuginfo packages pulled in by:
yum debuginfo-install glibc-2.28-101.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 libibverbs-26.0-8.el8.x86_64 libmnl-1.0.4-6.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 unbound-libs-1.7.3-10.el8.x86_64 zlib-1.2.11-13.el8.x86_64

[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"                                          
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-host-2.13.0-7.el8fdp.x86_64                                                                   
ovn2.13-2.13.0-7.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64                                                               
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-central-2.13.0-7.el8fdp.x86_64                                                                
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64


1. systemctl start openvswitch
cp conf.db /etc/openvswitch
systemctl restart openvswitch

2. /usr/share/ovn/scripts/ovn-ctl start_northd
cp nbdb.db /etc/ovn/ovnnb_db.db
cp sbdb.db  /etc/ovn/ovnsb_db.db
/usr/share/ovn/scripts/ovn-ctl restart_northd

3. ovs-vsctl set open . external-ids:ovn-monitor-all=false
ovn-sbctl destroy chassis .
ovn-sbctl clear port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal chassis
spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
gdb -p $spid -x gdb-commands

4. /usr/share/ovn/scripts/ovn-ctl start_controller
ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal                                                                               
chassis             : []

<=== no chassis

5. quit gdb

6. [root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

<=== still no chassis

Verified on ovn2.13.0-11:


=== the gdb output

0x00007f5fad77bee8 in __GI___poll (fds=fds@entry=0x557d37c51c00, nfds=nfds@entry=5,                   
    timeout=timeout@entry=2500) at ../sysdeps/unix/sysv/linux/poll.c:29                               
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);                                           
Breakpoint 1 at 0x557d370a1870: file ../ovsdb/jsonrpc-server.c, line 1276.                            
                                                                                                      
Breakpoint 1, ovsdb_jsonrpc_parse_monitor_request (dbmon=0x557d37caf6a0,                              
    table=table@entry=0x557d37c510f0, cond=0x557d37caf760, monitor_request=0x557d37c2dee0)            
    at ../ovsdb/jsonrpc-server.c:1276                                                                 
1276    {                                                                                             
$1 = 0                                                                                                
                                                                                                      
Program received signal SIGPIPE, Broken pipe.                                                         
0x00007f5fae31df5e in __libc_send (fd=22, buf=0x557d37cab980, len=33, flags=flags@entry=0)            
    at ../sysdeps/unix/sysv/linux/send.c:28                                                           
--Type <RET> for more, q to quit, c to continue without paging--q                                     
Quit                                                                                                  
(gdb) quit                                                                                            
A debugging session is active.                                                                        
                                                                                                      
        Inferior 1 [process 33544] will be detached.                                                  
                                                                                                      
Quit anyway? (y or n) y                                                                               
Detaching from program: /usr/sbin/ovsdb-server, process 33544                                         
[Inferior 1 (process 33544) detached] 


[root@hp-dl380pg8-12 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller                                    [  OK  ]
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c

<=== chassis is claimed

[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c
[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-2.13.0-11.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64                                                   
ovn2.13-host-2.13.0-11.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64
ovn2.13-2.13.0-11.el8fdp.x86_64

Comment 13 Jianlin Shi 2020-04-10 10:24:40 UTC
Reproduced on the RHEL 7 build of ovn2.13.0-7:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller                              
Starting ovn-controller                                    [  OK  ]                                   
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

<=== not claimed

Verified on 2.13.0-11:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller                              
Starting ovn-controller                                    [  OK  ]                                   
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : b1ff50fd-4be8-410e-af3d-1169cf913be8
[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"                                          
ovn2.13-central-2.13.0-11.el7fdp.x86_64                                                               
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch                                                 
openvswitch2.13-devel-2.13.0-9.el7fdp.x86_64
ovn2.13-2.13.0-11.el7fdp.x86_64                                                                       
ovn2.13-host-2.13.0-11.el7fdp.x86_64                                                                  
openvswitch2.13-2.13.0-9.el7fdp.x86_64                                                                
openvswitch2.13-debuginfo-2.13.0-9.el7fdp.x86_64

Comment 15 errata-xmlrpc 2020-04-20 19:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1501

