Bug 1818754
| Field | Value |
|---|---|
| Summary | [OVN SCALE][OVN 2.13] ovn-controller with monitor-all misses port bindings sometimes |
| Product | Red Hat Enterprise Linux Fast Datapath |
| Component | ovn2.13 |
| Version | RHEL 8.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Dumitru Ceara <dceara> |
| Assignee | Dumitru Ceara <dceara> |
| QA Contact | Jianlin Shi <jishi> |
| CC | ctrautma, dcbw, dceara, jishi, mmichels, ralongi |
| Hardware | Unspecified |
| OS | Unspecified |
| Clone Of | 1808125 |
| Bug Depends On | 1808125 |
| Last Closed | 2020-04-20 19:43:23 UTC |
Description
Dumitru Ceara
2020-03-30 09:38:52 UTC
Hi Dumitru, the DB files in https://bugzilla.redhat.com/show_bug.cgi?id=1808125#c6 don't work on ovn2.13. Could you help provide DB files for 2.13? Thanks.

Created attachment 1677528
OVS conf.db

Created attachment 1677557
NB DB. NB DB taken from BZ 1808125 and fixed.

Created attachment 1677558
SB DB. SB DB taken from BZ 1808125 and fixed.

With the attached NB DB and SB DB, the following command still failed:
/usr/share/ovn/scripts/ovn-ctl restart_northd --db-nb-cluster-local-addr=127.0.0.1 --db-sb-cluster-local-addr=127.0.0.1
Exiting ovn-northd (19184) [ OK ]
ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
Waiting for OVN_Northbound to come up 2020-04-10T00:48:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2020-04-10T00:48:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Upgrading database OVN_Northbound from schema version 5.18.0 to 5.20.0 2020-04-10T00:48:50Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-nb.ovsschema: changed 2 columns in 'OVN_Northbound' database from ephemeral
to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns
2020-04-10T00:49:20Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/usr/share/openvswitch/scripts/ovs-lib: line 602: 19488 Alarm clock "$@"
[FAILED]
ovsdb-tool: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
ovsdb-server: ovsdb error: /etc/ovn/ovnsb_db.db: unexpected file format
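The `unexpected file format` errors above mean that ovsdb-tool and ovsdb-server could not parse /etc/ovn/ovnsb_db.db. One thing that may be worth ruling out before starting the daemons is a mismatch between a clustered and a standalone database file. The sketch below looks only at the first line of the file and rests on an assumption about the on-disk layout, namely that standalone files begin with `OVSDB JSON` and clustered files with `OVSDB CLUSTER`. The function name `ovsdb_file_kind` is hypothetical and not part of OVS or OVN, and the check does not validate anything beyond that header.

```shell
#!/bin/sh
# Hypothetical helper (not part of OVS/OVN): classify an OVSDB file by the
# magic string on its first line.
# Assumption: standalone files start with "OVSDB JSON", clustered files with
# "OVSDB CLUSTER".
ovsdb_file_kind() {
    local file="$1" header=""
    if [ ! -r "$file" ]; then
        echo "missing"
        return 1
    fi
    # Read only the first line; the rest of the file may be large.
    IFS= read -r header < "$file" || true
    case $header in
        "OVSDB JSON "*)    echo "standalone" ;;
        "OVSDB CLUSTER "*) echo "clustered" ;;
        *)                 echo "unknown"; return 1 ;;
    esac
}
```

For example, `ovsdb_file_kind /etc/ovn/ovnsb_db.db` could be run before `ovn-ctl restart_northd` to see which kind of file is actually in place.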
Created attachment 1677739
standalone/simplified NB DB

Created attachment 1677740
standalone/simplified SB DB
Hi Jianlin,

I simplified the NB/SB DBs to include only the relevant configuration. Also, they're not clustered DBs anymore, so the steps to replicate the issue are:

1. Start OVS with the attached conf.db database.

2. Start OVN northd and the NB/SB DBs with the attached databases. E.g.:
   /usr/share/ovn/scripts/ovn-ctl start_northd

3. Attach the following gdb script to the SB ovsdb-server:
   $ cat gdb-commands
   b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
   commands
   shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
   print sleep(1)
   dis
   c
   end
   c

   $ ovs-vsctl set open . external-ids:ovn-monitor-all=false
   $ ovn-sbctl destroy chassis .
   $ ovn-sbctl destroy port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   $ spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
   $ gdb -p $spid -x gdb-commands

   This clears ovn-monitor-all in the OVS database and also clears the relevant chassis and port bindings in the SB DB. It then attaches to the SB ovsdb-server and sets a breakpoint at the point where the server receives the monitor_cond_since request from ovn-controller for the OVN_Southbound database. At that point, before the server replies to ovn-controller, the breakpoint script sets ovn-monitor-all=true (simulating what ovn-k8s would do), sleeps for 1 second to make sure ovn-controller has received the update from OVS, and continues.

4. Start ovn-controller:
   /usr/share/ovn/scripts/ovn-ctl start_controller

5. Check that the port binding wasn't claimed by ovn-controller:
   $ ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
   chassis : []

Thanks,
Dumitru

Reproduced with the following steps:
openvswitch-debuginfo and openvswitch-debugsource should be installed, along with the debuginfo packages for the libraries that ovsdb-server links against:
yum debuginfo-install glibc-2.28-101.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 libibverbs-26.0-8.el8.x86_64 libmnl-1.0.4-6.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 numactl-libs-2.0.12-9.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 python3-libs-3.6.8-23.el8.x86_64 unbound-libs-1.7.3-10.el8.x86_64 zlib-1.2.11-13.el8.x86_64
[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-host-2.13.0-7.el8fdp.x86_64
ovn2.13-2.13.0-7.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-central-2.13.0-7.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64
1. systemctl start openvswitch
cp conf.db /etc/openvswitch
systemctl restart openvswitch
2. /usr/share/ovn/scripts/ovn-ctl start_northd
cp nbdb.db /etc/ovn/ovnnb_db.db
cp sbdb.db /etc/ovn/ovnsb_db.db
/usr/share/ovn/scripts/ovn-ctl restart_northd
3. ovs-vsctl set open . external-ids:ovn-monitor-all=false
ovn-sbctl destroy chassis .
ovn-sbctl clear port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal chassis
spid=$(ps aux | grep -v grep | grep ovsdb-server-sb | tr -s ' ' | cut -d ' ' -f 2)
gdb -p $spid -x gdb-commands
4. /usr/share/ovn/scripts/ovn-ctl start_controller
ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : []
<=== no chassis
5. quit gdb
6. [root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : []
<=== still no chassis
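Step 3 depends on the gdb-commands file from the earlier comment being present on the test machine. A minimal sketch of recreating it with a here-document follows. The function name `write_gdb_commands` and its default output path are illustrative only; the script body is copied from the reproduction steps as written.

```shell
#!/bin/sh
# Illustrative helper: recreate the gdb script used in the reproduction steps.
# The quoted 'EOF' delimiter stops the shell from expanding anything inside
# the script, so "$" and quotes reach gdb unchanged.
write_gdb_commands() {
    out="${1:-gdb-commands}"
    cat > "$out" <<'EOF'
b ovsdb_jsonrpc_parse_monitor_request if (strcmp(table->schema->name, "Database"))
commands
shell ovs-vsctl set open . external-ids:ovn-monitor-all=true
print sleep(1)
dis
c
end
c
EOF
}
```

Calling `write_gdb_commands gdb-commands` and then `gdb -p $spid -x gdb-commands` would match step 3 above.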
Verified on ovn2.13.0-11:
=== the gdb output
0x00007f5fad77bee8 in __GI___poll (fds=fds@entry=0x557d37c51c00, nfds=nfds@entry=5,
timeout=timeout@entry=2500) at ../sysdeps/unix/sysv/linux/poll.c:29
29 return SYSCALL_CANCEL (poll, fds, nfds, timeout);
Breakpoint 1 at 0x557d370a1870: file ../ovsdb/jsonrpc-server.c, line 1276.
Breakpoint 1, ovsdb_jsonrpc_parse_monitor_request (dbmon=0x557d37caf6a0,
table=table@entry=0x557d37c510f0, cond=0x557d37caf760, monitor_request=0x557d37c2dee0)
at ../ovsdb/jsonrpc-server.c:1276
1276 {
$1 = 0
Program received signal SIGPIPE, Broken pipe.
0x00007f5fae31df5e in __libc_send (fd=22, buf=0x557d37cab980, len=33, flags=flags@entry=0)
at ../sysdeps/unix/sysv/linux/send.c:28
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) quit
A debugging session is active.
Inferior 1 [process 33544] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/sbin/ovsdb-server, process 33544
[Inferior 1 (process 33544) detached]
[root@hp-dl380pg8-12 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller [ OK ]
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : ba234fd5-784f-4363-8c92-be28df81df6c
<=== chassis is claimed
[root@hp-dl380pg8-12 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : ba234fd5-784f-4363-8c92-be28df81df6c
[root@hp-dl380pg8-12 ~]# rpm -qa | grep -E "openvswitch|ovn"
openvswitch-selinux-extra-policy-1.0-23.el8fdp.noarch
ovn2.13-central-2.13.0-11.el8fdp.x86_64
openvswitch2.13-2.13.0-13.el8fdp.x86_64
openvswitch2.13-debugsource-2.13.0-13.el8fdp.x86_64
ovn2.13-host-2.13.0-11.el8fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-13.el8fdp.x86_64
ovn2.13-2.13.0-11.el8fdp.x86_64
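The pass/fail criterion in both runs is the `chassis` column of the port binding: `[]` with ovn2.13-2.13.0-7 and a chassis UUID with 2.13.0-11. A small sketch of turning that output into a scripted check is shown here. `binding_state` is a hypothetical name, and it reads the output of `ovn-sbctl --columns chassis list port_binding <name>` on stdin so that it can be tried without a running SB DB.

```shell
#!/bin/sh
# Hypothetical helper: classify the output of
#   ovn-sbctl --columns chassis list port_binding <name>
# (read from stdin) as "claimed", "unclaimed" or "unknown".
binding_state() {
    # Keep the first "chassis : value" line, drop the key and all whitespace.
    value=$(sed -n 's/^chassis[[:space:]]*:[[:space:]]*//p' | head -n 1 | tr -d '[:space:]')
    case $value in
        "[]") echo "unclaimed" ;;  # empty set: no chassis has claimed the port
        "")   echo "unknown" ;;    # no chassis line found in the input
        *)    echo "claimed" ;;    # anything else is taken to be a chassis UUID
    esac
}
```

A check could then look like `ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal | binding_state`, which would be expected to report `unclaimed` on the affected build.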
Reproduced on ovn2.13.0-7, RHEL 7 version:
[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller [ OK ]
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : []
<=== not claimed

Verified on 2.13.0-11:
[root@dell-per740-42 ~]# /usr/share/ovn/scripts/ovn-ctl start_controller
Starting ovn-controller [ OK ]
[root@dell-per740-42 ~]# ovn-sbctl --columns chassis list port_binding k8s-ci-op-7pxtd-m-0.c.openshift-gce-devel-ci.internal
chassis : b1ff50fd-4be8-410e-af3d-1169cf913be8
[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.13-central-2.13.0-11.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-devel-2.13.0-9.el7fdp.x86_64
ovn2.13-2.13.0-11.el7fdp.x86_64
ovn2.13-host-2.13.0-11.el7fdp.x86_64
openvswitch2.13-2.13.0-9.el7fdp.x86_64
openvswitch2.13-debuginfo-2.13.0-9.el7fdp.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1501