Bug 1834473 - ovnkube: set NB/SB database inactivity probes to 60 seconds
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Dan Williams
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks: 1834474
 
Reported: 2020-05-11 19:06 UTC by Dan Williams
Modified: 2020-07-15 19:37 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1834474 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:37:35 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-network-operator pull 631 (closed): Bug 1834473: ovnkube: set NB/SB database inactivity probes to 60 seconds (last updated 2020-10-01 22:10:19 UTC)
Github openshift cluster-network-operator pull 643 (closed): Bug 1834473: ovnkube: really set NB/SB database inactivity probes to 60 seconds (last updated 2020-10-01 22:10:18 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:37:48 UTC)

Description Dan Williams 2020-05-11 19:06:24 UTC
Multiple northds run for HA in active/passive mode, where the active
northd holds a lock on the southbound database. If that northd loses
connectivity to the database or is killed without releasing the lock,
ovsdb-server will clear the lock after twice the inactivity probe
interval. But if that probe is set to 0 (disabled), the lock is never
cleared, and no standby northd will ever grab it and continue
reconciling NB->SB.

Set the DB inactivity probe to something greater than 0 to ensure that
a northd will always eventually become active. The value of 60 seconds
was chosen as a reasonable middle ground: it keeps the time until the
lock is cleared and another northd grabs it at roughly 120 seconds,
while still allowing for a heavily loaded ovsdb-server (one with many
ovn-controller clients) that might need more than 30-40 seconds to send
and answer all of its clients' inactivity probes.
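
For reference, a minimal sketch of how the probe could be set on the NB/SB
Connection rows (the column is in milliseconds, so 60 seconds is 60000;
ports 9641/9642 are the NB/SB pssl ports used by ovn-kubernetes, and
--no-leader-only only matters for clustered databases):

  ovn-nbctl --no-leader-only --inactivity-probe=60000 set-connection pssl:9641
  ovn-sbctl --no-leader-only --inactivity-probe=60000 set-connection pssl:9642

The exact invocation used by cluster-network-operator may differ; see the
linked pull requests.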

Comment 3 Ross Brattain 2020-05-13 04:48:21 UTC
The documentation seems to suggest that inactivity_probe is in milliseconds:

https://github.com/ovn-org/ovn/blob/master/ovn-ic-nb.xml#L243

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>

Is this a documentation issue?
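
A quick way to confirm the stored value and its units is to read the
Connection row back; a sketch (assuming a single Connection row per database,
and --no-leader-only for clustered DBs):

  ovn-nbctl --no-leader-only get Connection . inactivity_probe
  ovn-sbctl --no-leader-only list Connection

The column holds milliseconds, so a 60-second probe reads back as 60000.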

Comment 4 Ross Brattain 2020-05-14 15:55:56 UTC
I don't see any lock changes in the logs, so I'm not sure this is doing anything, even after changing the value to 60000 milliseconds.

I see the tests use "--inactivity-probe="; is that also required?


https://github.com/ovn-org/ovn/blob/master/tests/ovn-nbctl.at#L1720

AT_CHECK([ovn-nbctl --inactivity-probe=30000 set-connection ptcp:6641:127.0.0.1 punix:$OVS_RUNDIR/ovnnb_db.sock])

https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ic-nbctl.8.xml#L101



I tried with

while ! ovn-nbctl --inactivity-probe=60000 --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000; do

but I didn't see the lock change in northd; I just see the initial lock acquired:

2020-05-14T14:48:23Z|00050|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
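
To actually exercise the lock path, one sketch (not necessarily the formal
test procedure) is to pause the active northd without letting it release the
lock and watch for a standby to take over after roughly twice the probe
interval (~120 seconds with a 60-second probe); the pod, container, and
namespace names below are assumed OVN-Kubernetes defaults:

  # on the node (or in the pod) running the active northd
  kill -STOP $(pidof ovn-northd)
  # wait ~2x the inactivity probe, then check a standby's northd log
  oc logs -n openshift-ovn-kubernetes ovnkube-master-c2sc7 -c northd | grep 'lock acquired'
  # resume the paused instance afterwards
  kill -CONT $(pidof ovn-northd)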

Comment 6 Ross Brattain 2020-05-21 17:46:36 UTC
Verified on 4.5.0-0.nightly-2020-05-21-072118.

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:29:54Z|73741|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:14Z|73840|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:19Z|73876|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:24Z|73892|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -STOP ovnkube-master-b8m7x-northd here

# ~120 seconds later c2sc7 takes the lock

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66663|jsonrpc|DBG|ssl:10.0.0.4:9642: received notification, method="locked", params=["ovn_northd"]
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66664|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66675|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:38:54Z|68572|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -CONT ovnkube-master-b8m7x-northd here and it goes to standby

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74080|jsonrpc|DBG|ssl:10.0.0.3:9642: send request, method="lock", params=["ovn_northd"], id=5529
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74087|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74102|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:14Z|68667|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74194|jsonrpc|DBG|ssl:10.0.0.6:9642: send request, method="lock", params=["ovn_northd"], id=5534
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74196|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74205|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74306|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="lock", params=["ovn_northd"], id=5539
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74313|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:20Z|68702|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:40:54Z|69167|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
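
For the record, the lock transitions above can be pulled straight from the
northd container logs; a sketch (namespace and container name assumed to be
the OVN-Kubernetes defaults):

  oc logs -n openshift-ovn-kubernetes ovnkube-master-b8m7x -c northd | grep -E 'ovn-northd lock (acquired|lost)'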

Comment 7 errata-xmlrpc 2020-07-13 17:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

