Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 1828639

Summary: [telco] ovsdb-server for NBDB pegged at 100% when running these commands
Product: Red Hat Enterprise Linux Fast Datapath Reporter: Dan Williams <dcbw>
Component: openvswitch2.13Assignee: Dumitru Ceara <dceara>
Status: CLOSED ERRATA QA Contact: Zhiqiang Fang <zfang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: FDP 20.CCC: ctrautma, dceara, jhsiao, jishi, kfida, qding, ralongi
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard: Telco
Fixed In Version: openvswitch2.13-2.13.0-24.el7fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1836305 (view as bug list) Environment:
Last Closed: 2020-07-15 13:02:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1828637, 1836305, 1837257    

Comment 2 Dan Williams 2020-05-11 19:49:17 UTC
Admittedly this isn't a common situation and feel free to close WONTFIX.

ovsdb-server was openvswitch-2.13.0-1.fc32.x86_64
ovn-nbctl was ovn-2.11.1-5.fc30.x86_64 (which might be almost 2.12?)

I *cannot* reproduce the issue with OVN 2.13 + OVS 2.13 because it fails to connect to the database with "ovn-nbctl: unix:/var/run/openvswitch/foo.sock: database connection failed ()".

But OVN 2.11.1 does not throw that error and creates the 100% CPU issue as long as the ovn-nbctl command is running. If I Ctl+C ovn-nbctl things settle down.

ovn-nbctl (2.11.1) looks to be stuck in ovsdb_idl_txn_commit_block().

Server side is stuck in looping over this:

recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=24964}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25064}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25153}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25249}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25343}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0

Thread 1 (Thread 0x7fe2bdefc000 (LWP 30817)):
#0  0x00007fe2bfd08a9b in _int_free () from /lib64/libc.so.6
#1  0x00007fe2c01751ea in ovsdb_txn_table_destroy (txn_table=<optimized out>) at ovsdb/transaction.c:1351
#2  for_each_txn_row (txn=txn@entry=0x55a2915f36f0, cb=cb@entry=0x7fe2c0174950 <ovsdb_txn_row_abort>) at ovsdb/transaction.c:1412
#3  0x00007fe2c0175c54 in ovsdb_txn_abort (txn=0x55a2915f36f0) at ovsdb/transaction.c:190
#4  0x00007fe2c0174252 in ovsdb_trigger_try (t=0x55a28df33520, now=<optimized out>) at ovsdb/trigger.c:302
#5  0x00007fe2c0174838 in ovsdb_trigger_run (db=0x55a28dd8dc60, now=2519641166) at ovsdb/trigger.c:152
#6  0x000055a289d83c59 in main_loop (is_backup=0x7ffe674f06ea, exiting=0x7ffe674f06eb, run_process=0x0, remotes=0x7ffe674f0740, unixctl=0x55a291631590, all_dbs=0x7ffe674f0780, jsonrpc=0x55a28b15d310, config=0x7ffe674f07e0) at ovsdb/ovsdb-server.c:230
#7  main (argc=<optimized out>, argv=<optimized out>) at ovsdb/ovsdb-server.c:469

Comment 3 Dumitru Ceara 2020-05-12 10:31:29 UTC
Thanks Dan for the clarification.

openvswitch-2.13.0-1.fc32.x86_64 misses latest upstream raft fixes:

$ git log v2.13.0..origin/branch-2.13 --oneline | grep -i raft
168beb87c raft: Fix leak of the incomplete command.
25a7e5547 raft: Fix the problem of stuck in candidate role forever.
877618fc8 raft: Fix next_index in install_snapshot reply handling.
3ae90e189 raft: Avoid busy loop during leader election.
adc64ab05 raft: Fix raft_is_connected() when there is no leader yet.
f0c8b44c5 ovsdb-server: Don't disconnect clients after raft install_snapshot.
78c8011f5 raft-rpc: Fix message format.

Specifically, we see the 100% CPU usage due to not having:
3ae90e189 raft: Avoid busy loop during leader election.

Looks like latest downstream openvswitch rpm also misses these (openvswitch2.13-2.13.0-17.el7fdp):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1175011

Changing component to openvswitch2.13.

I guess it's good to have the upstream fixes downstream too instead of waiting for upstream 2.14, what do you think?

Comment 4 Dan Williams 2020-05-12 15:35:14 UTC
(In reply to Dumitru Ceara from comment #3)
> Thanks Dan for the clarification.
> 
> openvswitch-2.13.0-1.fc32.x86_64 misses latest upstream raft fixes:
> 
> $ git log v2.13.0..origin/branch-2.13 --oneline | grep -i raft
> 168beb87c raft: Fix leak of the incomplete command.
> 25a7e5547 raft: Fix the problem of stuck in candidate role forever.
> 877618fc8 raft: Fix next_index in install_snapshot reply handling.
> 3ae90e189 raft: Avoid busy loop during leader election.
> adc64ab05 raft: Fix raft_is_connected() when there is no leader yet.
> f0c8b44c5 ovsdb-server: Don't disconnect clients after raft install_snapshot.
> 78c8011f5 raft-rpc: Fix message format.
> 
> Specifically, we see the 100% CPU usage due to not having:
> 3ae90e189 raft: Avoid busy loop during leader election.
> 
> Looks like latest downstream openvswitch rpm also misses these
> (openvswitch2.13-2.13.0-17.el7fdp):
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1175011
> 
> Changing component to openvswitch2.13.
> 
> I guess it's good to have the upstream fixes downstream too instead of
> waiting for upstream 2.14, what do you think?

Yes I agree.

Comment 5 OvS team 2020-05-15 00:58:09 UTC
* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290)
  [b12acf45a6872dda85642cbc73dd86eb529be17e]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729)
  [bb552cffb89104c2bb19b8aff749b8b825a6db13]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-22
- raft: Fix the problem of stuck in candidate role forever. (#1828639)
  [c5937276691bb90f99fad1871b5e3ca4ac9391e7]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-21
- raft: Fix next_index in install_snapshot reply handling. (#1828639)
  [09ac3c327ec678f36cd9df451b7846acdf734c0f]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-20
- raft: Avoid busy loop during leader election. (#1828639)
  [19683b041e19a49e275a4b42f5bb5b0528de898a]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-19
- raft: Fix raft_is_connected() when there is no leader yet. (#1828639)
  [2dae730162e5e1b084ac0d1fc339d2f09bd8cddb]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-18
- ovsdb-server: Don't disconnect clients after raft install_snapshot. (#1828639)
  [da9680c6095df8d6c477aa10e29baa8f00dc2e25]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-17
- raft-rpc: Fix message format. (#1828639)
  [e9bb63d6190925db63b4cad83e57a945c4ac0629]

Comment 8 Dumitru Ceara 2020-06-15 08:24:02 UTC
Hi Jianlin,

Here are the steps to reproduce the issue on RHEL8 with FDP packages:

$ rpm -q openvswitch2.13
openvswitch2.13-2.13.0-25.el8fdp.x86_64

$ rpm -q ovn2.11-central
ovn2.11-central-2.11.1-44.el8fdp.x86_64

$ rpm -q ovn2.11
ovn2.11-2.11.1-44.el8fdp.x86_64

# Use the attached ovnnb_db.db file, assuming it's located in /tmp
$ cp /tmp/ovnnb_db.db /var/lib/openvswitch/ovnnb_db.db

# Try to start ovn-northd and NB/SB DBs:
systemctl start ovn-northd

# Make sure the NB DB server is in "candidate" role:
$ ovs-appctl -t /run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound | grep Role
Role: candidate

# Try to execute a transaction on the NB DB:
$ ovsdb-client transact unix:/var/run/openvswitch/ovnnb_db.sock '["OVN_Northbound", {"op": "update", "table": "NB_Global", "where": [], "row": {"nb_cfg": 42}}]'

# This ^^ hangs and ovsdb-server for NB hogs CPU 100%.
# We see in the logs:
$  grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log
2020-06-15T07:58:57.374Z|00670|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)
2020-06-15T07:58:57.374Z|00671|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)

The fact that the transaction hangs is ok because the database is not leader but CPU usage shouldn't be 100%.

Hope this helps,
Dumitru

Comment 9 Jianlin Shi 2020-06-15 09:09:11 UTC
Thanks Dumitru for reproducer in comment8.
reproduced on openvswitch2.13.0-17.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"                      
ovn2.11-central-2.11.1-47.el7fdp.x86_64
openvswitch2.13-2.13.0-17.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64                                                               
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log                
2020-06-15T09:05:06.757Z|00095|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:12.757Z|00103|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:18.757Z|00111|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:24.757Z|00113|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)

Verified on openvswitch2.13.0-30.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.11-central-2.11.1-47.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-30.el7fdp.x86_64

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log

<=== no wakeup error

Comment 11 errata-xmlrpc 2020-07-15 13:02:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2944