Admittedly this isn't a common situation, so feel free to close this as WONTFIX.

ovsdb-server was openvswitch-2.13.0-1.fc32.x86_64; ovn-nbctl was ovn-2.11.1-5.fc30.x86_64 (which might be almost 2.12?).

I *cannot* reproduce the issue with OVN 2.13 + OVS 2.13, because that combination fails to connect to the database with "ovn-nbctl: unix:/var/run/openvswitch/foo.sock: database connection failed ()". But OVN 2.11.1 does not throw that error and triggers the 100% CPU issue for as long as the ovn-nbctl command is running. If I Ctrl+C ovn-nbctl, things settle down.

ovn-nbctl (2.11.1) looks to be stuck in ovsdb_idl_txn_commit_block(). The server side is stuck looping over the following (two iterations shown; the same sequence repeats indefinitely):

recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=24964}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512) = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128]) = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128]) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25064}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512) = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128]) = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128]) = -1 EAGAIN (Resource temporarily unavailable)
[...]

gdb backtrace of the ovsdb-server main thread:

Thread 1 (Thread 0x7fe2bdefc000 (LWP 30817)):
#0  0x00007fe2bfd08a9b in _int_free () from /lib64/libc.so.6
#1  0x00007fe2c01751ea in ovsdb_txn_table_destroy (txn_table=<optimized out>) at ovsdb/transaction.c:1351
#2  for_each_txn_row (txn=txn@entry=0x55a2915f36f0, cb=cb@entry=0x7fe2c0174950 <ovsdb_txn_row_abort>) at ovsdb/transaction.c:1412
#3  0x00007fe2c0175c54 in ovsdb_txn_abort (txn=0x55a2915f36f0) at ovsdb/transaction.c:190
#4  0x00007fe2c0174252 in ovsdb_trigger_try (t=0x55a28df33520, now=<optimized out>) at ovsdb/trigger.c:302
#5  0x00007fe2c0174838 in ovsdb_trigger_run (db=0x55a28dd8dc60, now=2519641166) at ovsdb/trigger.c:152
#6  0x000055a289d83c59 in main_loop (is_backup=0x7ffe674f06ea, exiting=0x7ffe674f06eb, run_process=0x0, remotes=0x7ffe674f0740, unixctl=0x55a291631590, all_dbs=0x7ffe674f0780, jsonrpc=0x55a28b15d310, config=0x7ffe674f07e0) at ovsdb/ovsdb-server.c:230
#7  main (argc=<optimized out>, argv=<optimized out>) at ovsdb/ovsdb-server.c:469
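For context, the strace above is the classic signature of a poll loop armed with a zero-millisecond timeout: poll() returns 0 immediately, every fd check comes back EAGAIN, and the process spins without ever sleeping. A minimal standalone sketch of that pattern (illustrative code only, not taken from ovsdb-server):

#include <poll.h>

int main(void)
{
    struct pollfd fds[1] = { { .fd = 0, .events = POLLIN } };

    for (;;) {
        /* A timeout of 0 makes poll() return immediately even when no
         * fd is ready, so the loop never blocks and pins a CPU core.
         * A timeout of -1 (or a real deadline) would sleep instead. */
        int n = poll(fds, 1, 0);
        if (n <= 0) {
            continue;           /* the "= 0 (Timeout)" seen above */
        }
        /* ... service the ready fds ... */
    }
}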
Thanks Dan for the clarification.

openvswitch-2.13.0-1.fc32.x86_64 is missing the latest upstream raft fixes:

$ git log v2.13.0..origin/branch-2.13 --oneline | grep -i raft
168beb87c raft: Fix leak of the incomplete command.
25a7e5547 raft: Fix the problem of stuck in candidate role forever.
877618fc8 raft: Fix next_index in install_snapshot reply handling.
3ae90e189 raft: Avoid busy loop during leader election.
adc64ab05 raft: Fix raft_is_connected() when there is no leader yet.
f0c8b44c5 ovsdb-server: Don't disconnect clients after raft install_snapshot.
78c8011f5 raft-rpc: Fix message format.

Specifically, we see the 100% CPU usage due to not having:
3ae90e189 raft: Avoid busy loop during leader election.

It looks like the latest downstream openvswitch rpm also misses these (openvswitch2.13-2.13.0-17.el7fdp):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1175011

Changing component to openvswitch2.13. I guess it's good to have the upstream fixes downstream too instead of waiting for upstream 2.14. What do you think?
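To make the failure mode concrete: while the cluster has no elected leader, the trigger code effectively requests an immediate wakeup on every iteration (the OVS primitive for that is poll_immediate_wake()), even though nothing can change until the election completes. Below is a hedged sketch of the general idea behind 3ae90e189, not the actual upstream patch; all names in it are hypothetical stand-ins:

#include <stdbool.h>

/* Hypothetical stand-in for the relevant raft state. */
struct raft_state {
    bool has_leader;
};

/* Stand-in for registering a wait on raft state changes, so the next
 * poll() blocks until the election actually makes progress. */
static void wait_for_raft_change(struct raft_state *rs) { (void) rs; }

static void
trigger_retry_sketch(struct raft_state *rs)
{
    if (!rs->has_leader) {
        /* Before the fix (conceptually): request an immediate wakeup
         * here -- the main loop re-runs at once, finds the situation
         * unchanged, and spins at 100% CPU.
         * After the fix: sleep until the raft state changes. */
        wait_for_raft_change(rs);
        return;
    }
    /* Leader elected: actually try to commit the pending transaction. */
}

int main(void)
{
    struct raft_state rs = { .has_leader = false };
    trigger_retry_sketch(&rs);  /* would register a wait, then return */
    return 0;
}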
(In reply to Dumitru Ceara from comment #3)
> Changing component to openvswitch2.13.
>
> I guess it's good to have the upstream fixes downstream too instead of
> waiting for upstream 2.14. What do you think?

Yes, I agree.
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290)
  [b12acf45a6872dda85642cbc73dd86eb529be17e]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729)
  [bb552cffb89104c2bb19b8aff749b8b825a6db13]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-22
- raft: Fix the problem of stuck in candidate role forever. (#1828639)
  [c5937276691bb90f99fad1871b5e3ca4ac9391e7]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-21
- raft: Fix next_index in install_snapshot reply handling. (#1828639)
  [09ac3c327ec678f36cd9df451b7846acdf734c0f]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-20
- raft: Avoid busy loop during leader election. (#1828639)
  [19683b041e19a49e275a4b42f5bb5b0528de898a]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-19
- raft: Fix raft_is_connected() when there is no leader yet. (#1828639)
  [2dae730162e5e1b084ac0d1fc339d2f09bd8cddb]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-18
- ovsdb-server: Don't disconnect clients after raft install_snapshot. (#1828639)
  [da9680c6095df8d6c477aa10e29baa8f00dc2e25]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-17
- raft-rpc: Fix message format. (#1828639)
  [e9bb63d6190925db63b4cad83e57a945c4ac0629]
Hi Jianlin,

Here are the steps to reproduce the issue on RHEL8 with FDP packages:

$ rpm -q openvswitch2.13
openvswitch2.13-2.13.0-25.el8fdp.x86_64
$ rpm -q ovn2.11-central
ovn2.11-central-2.11.1-44.el8fdp.x86_64
$ rpm -q ovn2.11
ovn2.11-2.11.1-44.el8fdp.x86_64

# Use the attached ovnnb_db.db file, assuming it's located in /tmp:
$ cp /tmp/ovnnb_db.db /var/lib/openvswitch/ovnnb_db.db

# Try to start ovn-northd and the NB/SB DBs:
$ systemctl start ovn-northd

# Make sure the NB DB server is in the "candidate" role:
$ ovs-appctl -t /run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound | grep Role
Role: candidate

# Try to execute a transaction on the NB DB:
$ ovsdb-client transact unix:/var/run/openvswitch/ovnnb_db.sock '["OVN_Northbound", {"op": "update", "table": "NB_Global", "where": [], "row": {"nb_cfg": 42}}]'

# This ^^ hangs, and the NB ovsdb-server hogs 100% CPU.

# We see in the logs:
$ grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log
2020-06-15T07:58:57.374Z|00670|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)
2020-06-15T07:58:57.374Z|00671|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)

The fact that the transaction hangs is OK, because the database is not the leader, but CPU usage shouldn't be 100%.

Hope this helps,
Dumitru
Thanks Dumitru for the reproducer in comment 8.

Reproduced on openvswitch2.13-2.13.0-17.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.11-central-2.11.1-47.el7fdp.x86_64
openvswitch2.13-2.13.0-17.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log
2020-06-15T09:05:06.757Z|00095|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:12.757Z|00103|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:18.757Z|00111|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:24.757Z|00113|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)

Verified on openvswitch2.13-2.13.0-30.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.11-central-2.11.1-47.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-30.el7fdp.x86_64

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log
<=== no wakeup errors
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2944