1828639 – [telco] ovsdb-server for NBDB pegged at 100% when running these commands

The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 1828639 - [telco] ovsdb-server for NBDB pegged at 100% when running these commands

Summary: [telco] ovsdb-server for NBDB pegged at 100% when running these commands

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux Fast Datapath
Classification:	Red Hat
Component:	openvswitch2.13
Sub Component:
Version:	FDP 20.C
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Dumitru Ceara
QA Contact:	Zhiqiang Fang
Docs Contact:
URL:
Whiteboard:	Telco
Depends On:
Blocks:	1828637 1836305 1837257
TreeView+	depends on / blocked

Reported:	2020-04-28 01:46 UTC by Dan Williams
Modified:	2023-10-06 19:47 UTC (History)
CC List:	7 users (show)
Fixed In Version:	openvswitch2.13-2.13.0-24.el7fdp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1836305 (view as bug list)
Environment:
Last Closed:	2020-07-15 13:02:04 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2944	0	None	None	None	2020-07-15 13:02:20 UTC

Comment 2 Dan Williams 2020-05-11 19:49:17 UTC

Admittedly this isn't a common situation and feel free to close WONTFIX.

ovsdb-server was openvswitch-2.13.0-1.fc32.x86_64
ovn-nbctl was ovn-2.11.1-5.fc30.x86_64 (which might be almost 2.12?)

I *cannot* reproduce the issue with OVN 2.13 + OVS 2.13 because it fails to connect to the database with "ovn-nbctl: unix:/var/run/openvswitch/foo.sock: database connection failed ()".

But OVN 2.11.1 does not throw that error and creates the 100% CPU issue as long as the ovn-nbctl command is running. If I Ctl+C ovn-nbctl things settle down.

ovn-nbctl (2.11.1) looks to be stuck in ovsdb_idl_txn_commit_block().

Server side is stuck in looping over this:

recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=24964}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25064}, ru_stime={tv_sec=47, tv_usec=489259}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25153}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25249}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0
read(16, 0x7ffe674f03b0, 512)           = -1 EAGAIN (Resource temporarily unavailable)
accept(13, 0x7ffe674f04b0, [128])       = -1 EAGAIN (Resource temporarily unavailable)
accept(10, 0x7ffe674f0440, [128])       = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(19, 0x55a28b1e7434, 1164, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}, {fd=19, events=POLLIN}], 4, 0) = 0 (Timeout)
getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=153, tv_usec=25343}, ru_stime={tv_sec=47, tv_usec=489267}, ...}) = 0

Thread 1 (Thread 0x7fe2bdefc000 (LWP 30817)):
#0  0x00007fe2bfd08a9b in _int_free () from /lib64/libc.so.6
#1  0x00007fe2c01751ea in ovsdb_txn_table_destroy (txn_table=<optimized out>) at ovsdb/transaction.c:1351
#2  for_each_txn_row (txn=txn@entry=0x55a2915f36f0, cb=cb@entry=0x7fe2c0174950 <ovsdb_txn_row_abort>) at ovsdb/transaction.c:1412
#3  0x00007fe2c0175c54 in ovsdb_txn_abort (txn=0x55a2915f36f0) at ovsdb/transaction.c:190
#4  0x00007fe2c0174252 in ovsdb_trigger_try (t=0x55a28df33520, now=<optimized out>) at ovsdb/trigger.c:302
#5  0x00007fe2c0174838 in ovsdb_trigger_run (db=0x55a28dd8dc60, now=2519641166) at ovsdb/trigger.c:152
#6  0x000055a289d83c59 in main_loop (is_backup=0x7ffe674f06ea, exiting=0x7ffe674f06eb, run_process=0x0, remotes=0x7ffe674f0740, unixctl=0x55a291631590, all_dbs=0x7ffe674f0780, jsonrpc=0x55a28b15d310, config=0x7ffe674f07e0) at ovsdb/ovsdb-server.c:230
#7  main (argc=<optimized out>, argv=<optimized out>) at ovsdb/ovsdb-server.c:469

Comment 3 Dumitru Ceara 2020-05-12 10:31:29 UTC

Thanks Dan for the clarification.

openvswitch-2.13.0-1.fc32.x86_64 misses latest upstream raft fixes:

$ git log v2.13.0..origin/branch-2.13 --oneline | grep -i raft
168beb87c raft: Fix leak of the incomplete command.
25a7e5547 raft: Fix the problem of stuck in candidate role forever.
877618fc8 raft: Fix next_index in install_snapshot reply handling.
3ae90e189 raft: Avoid busy loop during leader election.
adc64ab05 raft: Fix raft_is_connected() when there is no leader yet.
f0c8b44c5 ovsdb-server: Don't disconnect clients after raft install_snapshot.
78c8011f5 raft-rpc: Fix message format.

Specifically, we see the 100% CPU usage due to not having:
3ae90e189 raft: Avoid busy loop during leader election.

Looks like latest downstream openvswitch rpm also misses these (openvswitch2.13-2.13.0-17.el7fdp):
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1175011

Changing component to openvswitch2.13.

I guess it's good to have the upstream fixes downstream too instead of waiting for upstream 2.14, what do you think?

Comment 4 Dan Williams 2020-05-12 15:35:14 UTC

(In reply to Dumitru Ceara from comment #3)
> Thanks Dan for the clarification.
> 
> openvswitch-2.13.0-1.fc32.x86_64 misses latest upstream raft fixes:
> 
> $ git log v2.13.0..origin/branch-2.13 --oneline | grep -i raft
> 168beb87c raft: Fix leak of the incomplete command.
> 25a7e5547 raft: Fix the problem of stuck in candidate role forever.
> 877618fc8 raft: Fix next_index in install_snapshot reply handling.
> 3ae90e189 raft: Avoid busy loop during leader election.
> adc64ab05 raft: Fix raft_is_connected() when there is no leader yet.
> f0c8b44c5 ovsdb-server: Don't disconnect clients after raft install_snapshot.
> 78c8011f5 raft-rpc: Fix message format.
> 
> Specifically, we see the 100% CPU usage due to not having:
> 3ae90e189 raft: Avoid busy loop during leader election.
> 
> Looks like latest downstream openvswitch rpm also misses these
> (openvswitch2.13-2.13.0-17.el7fdp):
> https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1175011
> 
> Changing component to openvswitch2.13.
> 
> I guess it's good to have the upstream fixes downstream too instead of
> waiting for upstream 2.14, what do you think?

Yes I agree.

Comment 5 OvS team 2020-05-15 00:58:09 UTC

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-24
- raft: Disable RAFT jsonrpc inactivity probe. (#1822290)
  [b12acf45a6872dda85642cbc73dd86eb529be17e]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-23
- raft: Fix leak of the incomplete command. (#1835729)
  [bb552cffb89104c2bb19b8aff749b8b825a6db13]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-22
- raft: Fix the problem of stuck in candidate role forever. (#1828639)
  [c5937276691bb90f99fad1871b5e3ca4ac9391e7]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-21
- raft: Fix next_index in install_snapshot reply handling. (#1828639)
  [09ac3c327ec678f36cd9df451b7846acdf734c0f]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-20
- raft: Avoid busy loop during leader election. (#1828639)
  [19683b041e19a49e275a4b42f5bb5b0528de898a]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-19
- raft: Fix raft_is_connected() when there is no leader yet. (#1828639)
  [2dae730162e5e1b084ac0d1fc339d2f09bd8cddb]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-18
- ovsdb-server: Don't disconnect clients after raft install_snapshot. (#1828639)
  [da9680c6095df8d6c477aa10e29baa8f00dc2e25]

* Thu May 14 2020 Dumitru Ceara <dceara> - 2.13.0-17
- raft-rpc: Fix message format. (#1828639)
  [e9bb63d6190925db63b4cad83e57a945c4ac0629]

Comment 8 Dumitru Ceara 2020-06-15 08:24:02 UTC

Hi Jianlin,

Here are the steps to reproduce the issue on RHEL8 with FDP packages:

$ rpm -q openvswitch2.13
openvswitch2.13-2.13.0-25.el8fdp.x86_64

$ rpm -q ovn2.11-central
ovn2.11-central-2.11.1-44.el8fdp.x86_64

$ rpm -q ovn2.11
ovn2.11-2.11.1-44.el8fdp.x86_64

# Use the attached ovnnb_db.db file, assuming it's located in /tmp
$ cp /tmp/ovnnb_db.db /var/lib/openvswitch/ovnnb_db.db

# Try to start ovn-northd and NB/SB DBs:
systemctl start ovn-northd

# Make sure the NB DB server is in "candidate" role:
$ ovs-appctl -t /run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound | grep Role
Role: candidate

# Try to execute a transaction on the NB DB:
$ ovsdb-client transact unix:/var/run/openvswitch/ovnnb_db.sock '["OVN_Northbound", {"op": "update", "table": "NB_Global", "where": [], "row": {"nb_cfg": 42}}]'

# This ^^ hangs and ovsdb-server for NB hogs CPU 100%.
# We see in the logs:
$  grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log
2020-06-15T07:58:57.374Z|00670|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)
2020-06-15T07:58:57.374Z|00671|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (98% CPU usage)

The fact that the transaction hangs is ok because the database is not leader but CPU usage shouldn't be 100%.

Hope this helps,
Dumitru

Comment 9 Jianlin Shi 2020-06-15 09:09:11 UTC

Thanks Dumitru for reproducer in comment8.
reproduced on openvswitch2.13.0-17.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"                      
ovn2.11-central-2.11.1-47.el7fdp.x86_64
openvswitch2.13-2.13.0-17.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64                                                               
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log                
2020-06-15T09:05:06.757Z|00095|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:12.757Z|00103|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:18.757Z|00111|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)
2020-06-15T09:05:24.757Z|00113|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164 (99% CPU usage)

Verified on openvswitch2.13.0-30.el7:

[root@dell-per740-42 ~]# rpm -qa | grep -E "openvswitch|ovn"
ovn2.11-central-2.11.1-47.el7fdp.x86_64
ovn2.11-host-2.11.1-47.el7fdp.x86_64
ovn2.11-2.11.1-47.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-30.el7fdp.x86_64

[root@dell-per740-42 ~]# grep "wakeup due to" /var/log/openvswitch/ovsdb-server-nb.log

<=== no wakeup error

Comment 11 errata-xmlrpc 2020-07-15 13:02:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2944

Note You need to log in before you can comment on or make changes to this bug.