
Bug 1818754

Summary: [OVN SCALE][OVN 2.13] ovn-controller with monitor-all misses port bindings sometimes
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovn2.13
Version: RHEL 8.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Hardware: Unspecified
OS: Unspecified
Reporter: Dumitru Ceara <dceara>
Assignee: Dumitru Ceara <dceara>
QA Contact: Jianlin Shi <jishi>
CC: ctrautma, dcbw, dceara, jishi, mmichels, ralongi
Clone Of: 1808125
Bug Depends On: 1808125
Last Closed: 2020-04-20 19:43:23 UTC
Attachments (description, flags):
OVS conf.db: none
NB DB.: none
SB DB.: none
standalone/simplifie NB DB: none
standalone/simplifie SB DB: none

Description Dumitru Ceara 2020-03-30 09:38:52 UTC
+++ This bug was initially created as a clone of Bug #1808125 +++

It seems that ovn-controller sometimes misses port binding notifications.

We're interested in why ovn-controller doesn't set up any port bindings for k8s-ci-op-7pxtd-m-2.c.openshift-gce-devel-ci.internal even though that interface has been added to OVS itself with the right iface-id.

--- Additional comment from Dan Williams on 2020-02-27 21:18:14 UTC ---

ovn2.12.x86_64 0:2.12.0-32.el7fdn

--- Additional comment from Dan Williams on 2020-02-27 21:18:39 UTC ---

openvswitch2.12.x86_64 0:2.12.0-21.el7fdn

--- Additional comment from Dumitru Ceara on 2020-02-28 10:13:23 UTC ---



--- Additional comment from Dumitru Ceara on 2020-02-28 10:52:16 UTC ---



--- Additional comment from Dumitru Ceara on 2020-02-28 12:11:30 UTC ---

From the sbdb_leader.log we see that the ovn-controller where the issue is seen ("server13" in the logs) sends an incorrect "monitor_cond_since" request:

2020-02-27T21:03:38Z|02948|jsonrpc|DBG|ssl:10.0.0.3:38012: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"where":
                [
                    ["type","==","patch"],
                    ["type","==","chassisredirect"],
                    ["type","==","external"]
                ],
            "columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],
        [...]

After this point, the affected ovn-controller ("server13") never updates the monitor condition, i.e., it doesn't send any other monitor_cond_since requests to the OVSDB SB server.

From the sbdb_follower2.log where the ovn-controller that doesn't hit the issue is connected ("server9" in the logs), we see the correct "monitor_cond_since" request, specifically no "where" clause for Port_Binding:

2020-02-27T21:03:38Z|01610|jsonrpc|DBG|ssl:10.0.0.4:42758: received request, method="monitor_cond_since", params=[
    "OVN_Southbound",
    ["monid","OVN_Southbound"],
    [...]
    "Port_Binding":
        [
            {"columns":
                [
                    "chassis",
                    "datapath",
                    "encap",
                    "gateway_chassis",
                    "ha_chassis_group",
                    "logical_port",
                    "mac",
                    "nat_addresses",
                    "options",
                    "parent_port",
                    "tag",
                    "tunnel_key",
                    "type",
                    "virtual_parent"
                ]
            }
        ],

This indicates that ovn-controller sometimes fails to properly disable conditional monitoring when ovn-monitor-all=true.
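
To tell the two requests apart programmatically, it is enough to check whether the Port_Binding entry carries a "where" clause. The following is a minimal Python sketch, not part of OVN or OVS. It assumes the JSON params array has already been extracted from the log line and that its third element is the table-to-request map, as in the excerpts above. The helper name has_condition, the shortened column lists and the placeholder last-transaction ID are illustrative only.

```python
import json


def has_condition(params, table):
    """Return True if any monitor request for `table` has a non-empty "where"."""
    # Assumption: params[2] maps table names to monitor requests.
    requests = params[2].get(table, [])
    # Accept either a single request object or a list of them.
    if isinstance(requests, dict):
        requests = [requests]
    return any(bool(req.get("where")) for req in requests)


# Shortened stand-ins for the two requests quoted above.
faulty = json.loads("""
["OVN_Southbound", ["monid", "OVN_Southbound"],
 {"Port_Binding": [{"where": [["type", "==", "patch"],
                              ["type", "==", "chassisredirect"],
                              ["type", "==", "external"]],
                    "columns": ["chassis", "logical_port"]}]},
 "00000000-0000-0000-0000-000000000000"]
""")
correct = json.loads("""
["OVN_Southbound", ["monid", "OVN_Southbound"],
 {"Port_Binding": [{"columns": ["chassis", "logical_port"]}]},
 "00000000-0000-0000-0000-000000000000"]
""")

print(has_condition(faulty, "Port_Binding"))
print(has_condition(correct, "Port_Binding"))
```

With ovn-monitor-all=true, a request for which this check returns True matches the faulty "server13" case.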

--- Additional comment from Dumitru Ceara on 2020-02-28 16:01:05 UTC ---

Steps to replicate the issue using the attached OVS, OVN SB, OVN NB database files:

1. start OVS with the ovsdb.db database.
2. start OVN northd and NB/SB databases. E.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. Then it attaches to the ovsdb-server for SB and sets a breakpoint for when it would receive the monitor_cond_since command from ovn-controller for the OVN_Southbound database. At this point, before replying to ovn-controller the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 sec making sure ovn-controller received the update from OVS, and continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []
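
When this check is repeated many times, the output of step 5 can be parsed automatically. The following is a minimal Python sketch that assumes the "chassis : <value>" output format shown above. The helper name chassis_claimed is hypothetical and is not part of any OVN tooling.

```python
def chassis_claimed(output):
    """Return True if the `chassis` column in the given output is non-empty."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "chassis":
            # An unclaimed port binding is printed as an empty set, "[]".
            return value.strip() not in ("", "[]")
    # No chassis column found at all: treat as not claimed.
    return False


print(chassis_claimed("chassis             : []"))
```

The input string could come from running the ovn-sbctl command of step 5 through subprocess.run with capture_output=True and text=True. That wiring is left out here so the sketch stays runnable without an OVN installation.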

Root cause:
ovn-controller shouldn't call update_sb_monitors while the SB IDL is in state IDL_S_DATA_MONITOR_COND_SINCE_REQUESTED, because when ovsdb-server replies the IDL code will:
- call ovsdb_idl_db_parse_monitor_reply() [1]
- call ovsdb_idl_db_clear() [2]
- which sets table->cond_changed = false; for every table monitored by the IDL. [3]

As a result, a condition change requested before the reply arrives is forgotten and never sent to the server.

[1] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L767
[2] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L2153
[3] https://github.com/openvswitch/ovs/blob/master/lib/ovsdb-idl.c#L613
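
The ordering problem can be illustrated with a toy model. The following Python sketch is not the real ovsdb-idl code or API. The class names, the state strings and the simulate helper are all hypothetical, and the model keeps only the one detail described above: handling the monitor reply resets cond_changed on every table.

```python
class ToyTable:
    def __init__(self, name):
        self.name = name
        self.condition = None      # condition the client wants to monitor with
        self.cond_changed = False  # True while a change still has to be sent


class ToyIdl:
    """Toy stand-in for the IDL state machine (illustration only)."""

    def __init__(self, tables):
        self.tables = {name: ToyTable(name) for name in tables}
        self.state = "MONITOR_COND_SINCE_REQUESTED"
        self.sent = []  # condition changes actually sent to the server

    def set_condition(self, table, condition):
        t = self.tables[table]
        t.condition = condition
        t.cond_changed = True

    def receive_monitor_reply(self):
        # Models the clear step: every table loses its pending-change flag.
        for t in self.tables.values():
            t.cond_changed = False
        self.state = "MONITORING"

    def run(self):
        # Only tables still flagged as changed produce a condition update.
        if self.state != "MONITORING":
            return
        for t in self.tables.values():
            if t.cond_changed:
                self.sent.append((t.name, t.condition))
                t.cond_changed = False


def simulate(update_before_reply):
    idl = ToyIdl(["Port_Binding"])
    if update_before_reply:
        # ovn-monitor-all=true is processed while the reply is outstanding.
        idl.set_condition("Port_Binding", "true")
        idl.receive_monitor_reply()
    else:
        idl.receive_monitor_reply()
        idl.set_condition("Port_Binding", "true")
    idl.run()
    return idl.sent


print(simulate(update_before_reply=True))   # nothing is sent
print(simulate(update_before_reply=False))  # the update reaches the server
```

In the first run the update is lost because the flag is cleared before the toy IDL gets a chance to send it. This matches the symptom in the server log, where no further condition update arrives from the affected ovn-controller.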

--- Additional comment from Dumitru Ceara on 2020-02-28 23:02:04 UTC ---

Fix sent upstream for review: https://patchwork.ozlabs.org/patch/1246910/

--- Additional comment from Dumitru Ceara on 2020-03-30 09:38:02 UTC ---

Fix merged upstream: https://github.com/openvswitch/ovs/commit/2b7e536fa5e20be10e620b959e05557f88862d2c

Comment 4 Jianlin Shi 2020-04-09 00:05:06 UTC
Hi Dumitru, the DB files in https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6 don't work on ovn2.13. Could you please provide DB files for 2.13? Thanks.

Comment 5 Dumitru Ceara 2020-04-09 12:10:27 UTC
Created attachment 1677528 [details]
OVS conf.db

Comment 6 Dumitru Ceara 2020-04-09 14:21:50 UTC
Created attachment 1677557 [details]
NB DB.

NB DB taken from BZ 1808125 and fixed.

Comment 7 Dumitru Ceara 2020-04-09 14:22:21 UTC
Created attachment 1677558 [details]
SB DB.

SB DB taken from BZ 1808125 and fixed.

Comment 8 Jianlin Shi 2020-04-10 00:50:50 UTC
/usr/share/ovn/scripts/ovn-ctl restart_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1 still fails with the attached NB and SB DBs:

Exiting ovn-northd (19184)                                 [  OK  ]                                   
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
Waiting for OVN_Northbound to come up 2020-04-10T00:48:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-04-10T00:48:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected                  
                                                           [  OK  ]                                   
Upgrading database OVN_Northbound from schema version 5.18.0 to 5.20.0 2020-04-10T00:48:50Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-nb.ovsschema: changed 2 columns in 'OVN_Northbound' database from ephemeral
to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns
2020-04-10T00:49:20Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)                 
/usr/share/openvswitch/scripts/ovs-lib: line 602: 19488 Alarm clock             "$@"                  
                                                           [FAILED]                                   
ovsdb-tool: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
ovsdb-server: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format

Comment 9 Dumitru Ceara 2020-04-10 08:40:47 UTC
Created attachment 1677739 [details]
standalone/simplifie NB DB

Comment 10 Dumitru Ceara 2020-04-10 08:41:13 UTC
Created attachment 1677740 [details]
standalone/simplifie SB DB

Comment 11 Dumitru Ceara 2020-04-10 08:43:39 UTC
Hi Jianlin,

I simplified the NB/SB dbs to include only the relevant configurations. Also, they're not clustered DBs anymore so the steps to replicate the issue are:

1. start OVS with the attached conf.db database.
2. start OVN northd and NB/SB with the attached databases. E.g.:
/usr/share/ovn/scripts/ovn-ctl start_northd
3. Attach the following gdb script to ovsdb-server SB:

$ cat gdb-commands
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c

$ ovs-vsctl set open . external-ids:ovn-monitor-all=false
$ ovn-sbctl destroy chassis .
$ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
$ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
$ gdb -p $spid -x gdb-commands

This clears ovn-monitor-all from the OVS database and also the relevant chassis and port bindings in the SB DB. Then it attaches to the ovsdb-server for SB and sets a breakpoint for when it would receive the monitor_cond_since command from ovn-controller for the OVN_Southbound database. At this point, before replying to ovn-controller the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 sec making sure ovn-controller received the update from OVS, and continues.

4. Start ovn-controller:
/usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
$ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

Thanks,
Dumitru

Comment 12 Jianlin Shi 2020-04-10 10:08:41 UTC
Reproduced with the following steps.

openvswitch-debuginfo and openvswitch-debugsource must be installed, together with the debuginfo packages for the dependencies:
yum debuginfo-install glibc-2.28-101.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 libibverbs-26.0-8.el8.x86_64 libmnl-1.0.4-6.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 unbound-libs-1.7.3-10.el8.x86_64 zlib-1.2.11-13.el8.x86_64

[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"                                          
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-host-2.13.0-7.el8fdp.x86_64                                                                   
ovn2.13-2.13.0-7.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64                                                               
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-central-2.13.0-7.el8fdp.x86_64                                                                
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64


1. systemctl start openvswitch
cp conf.db /etc/openvswitch
systemctl restart openvswitch

2. /usr/share/ovn/scripts/ovn-ctl start_northd
cp nbdb.db /etc/ovn/ovnnb_db.db
cp sbdb.db  /etc/ovn/ovnsb_db.db
/usr/share/ovn/scripts/ovn-ctl restart_northd

3. ovs-vsctl set open . external-ids:ovn-monitor-all=false
ovn-sbctl destroy chassis .
ovn-sbctl clear port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal chassis
spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
gdb -p $spid -x gdb-commands

4. /usr/share/ovn/scripts/ovn-ctl start_controller
ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal                                                                               
chassis             : []

<=== no chassis

5. quit gdb

6. [root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

<=== still no chassis

Verified on ovn2.13.0-11:


=== the gdb output

0x00007f5fad77bee8 in __GI___poll (fds=fds@entry=0x557d37c51c00, nfds=nfds@entry=5,                   
    timeout=timeout@entry=2500) at ../sysdeps/unix/sysv/linux/poll.c:29                               
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);                                           
Breakpoint 1 at 0x557d370a1870: file ../ovsdb/jsonrpc-server.c, line 1276.                            
                                                                                                      
Breakpoint 1, ovsdb_jsonrpc_parse_monitor_request (dbmon=0x557d37caf6a0,                              
    table=table@entry=0x557d37c510f0, cond=0x557d37caf760, monitor_request=0x557d37c2dee0)            
    at ../ovsdb/jsonrpc-server.c:1276                                                                 
1276    {                                                                                             
$1 = 0                                                                                                
                                                                                                      
Program received signal SIGPIPE, Broken pipe.                                                         
0x00007f5fae31df5e in __libc_send (fd=22, buf=0x557d37cab980, len=33, flags=flags@entry=0)            
    at ../sysdeps/unix/sysv/linux/send.c:28                                                           
--Type <RET> for more, q to quit, c to continue without paging--q                                     
Quit                                                                                                  
(gdb) quit                                                                                            
A debugging session is active.                                                                        
                                                                                                      
        Inferior 1 [process 33544] will be detached.                                                  
                                                                                                      
Quit anyway? (y or n) y                                                                               
Detaching from program: /usr/sbin/ovsdb-server, process 33544                                         
[Inferior 1 (process 33544) detached] 


[root@hp-dl380pg8-12 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller                                    [  OK  ]
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c

<=== chassis is claimed

[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : ba234fd5-784f-4363-8c92-be28df81df6c
[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-2.13.0-11.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64                                                   
ovn2.13-host-2.13.0-11.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64
ovn2.13-2.13.0-11.el8fdp.x86_64

Comment 13 Jianlin Shi 2020-04-10 10:24:40 UTC
Reproduced on the RHEL 7 build of ovn2.13.0-7:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller                              
Starting ovn-controller                                    [  OK  ]                                   
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : []

<=== not claimed

Verified on 2.13.0-11:

[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller                              
Starting ovn-controller                                    [  OK  ]                                   
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis             : b1ff50fd-4be8-410e-af3d-1169cf913be8
[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"                                          
ovn2.13-central-2.13.0-11.el7fdp.x86_64                                                               
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch                                                 
openvswitch2.13-devel-2.13.0-9.el7fdp.x86_64
ovn2.13-2.13.0-11.el7fdp.x86_64                                                                       
ovn2.13-host-2.13.0-11.el7fdp.x86_64                                                                  
openvswitch2.13-2.13.0-9.el7fdp.x86_64                                                                
openvswitch2.13-debuginfo-2.13.0-9.el7fdp.x86_64

Comment 15 errata-xmlrpc 2020-04-20 19:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1501