Bug 1834473

Summary: ovnkube: set NB/SB database inactivity probes to 60 seconds
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Reporter: Dan Williams <dcbw>
Assignee: Dan Williams <dcbw>
QA Contact: Ross Brattain <rbrattai>
CC: aconstan, anusaxen
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Version: 4.3.z
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-07-13 17:37:35 UTC
Bug Blocks: 1834474

Description Dan Williams 2020-05-11 19:06:24 UTC
Multiple ovn-northd instances run for HA in active/passive mode, where
the active northd holds a lock. If that northd loses connectivity to the
database or is killed without releasing the lock, ovsdb-server will clear
the lock after twice the inactivity probe interval. But if that probe is
set to 0 (disabled), that never happens, and no other northd will ever
grab the lock and continue reconciling NB->SB.

Set the DB inactivity probe to something greater than 0 to ensure
that a northd will always eventually become active. The value of 60 was
chosen as a reasonable middle ground between two concerns: the delay
before the lock is cleared and another northd grabs it (~120s), and the
possibility that a loaded ovsdb-server (many ovn-controller clients)
would take more than 30-40 seconds to send/reply to all inactivity
probes from clients.
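
As a rough illustration of the arithmetic above, the sketch below computes how long a dead northd's lock would linger for a given probe setting. It assumes the probe value is expressed in milliseconds (as the Connection table documentation quoted later in this bug suggests) and that the lock is cleared after twice the probe interval, as described above. The function name lock_release_seconds is hypothetical and not part of OVN.

```shell
#!/bin/sh
# Hypothetical helper (not part of OVN): approximate number of seconds
# before ovsdb-server clears a dead client's lock, assuming the lock is
# released after twice the inactivity probe interval.
# Argument: inactivity probe in milliseconds (assumed unit).
lock_release_seconds() {
    probe_ms=$1
    if [ "$probe_ms" -eq 0 ]; then
        # A probe of 0 disables probing, so the lock is never cleared.
        echo "never"
        return 0
    fi
    echo $(( 2 * probe_ms / 1000 ))
}

lock_release_seconds 60000   # prints 120
lock_release_seconds 0       # prints never
```

With a 60000 ms probe this gives the ~120s figure mentioned above.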

Comment 3 Ross Brattain 2020-05-13 04:48:21 UTC
The documentation seems to suggest that inactivity_probe is in milliseconds:

https://github.com/ovn-org/ovn/blob/master/ovn-ic-nb.xml#L243

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>

Is this a documentation issue?

Comment 4 Ross Brattain 2020-05-14 15:55:56 UTC
I don't see any lock changes in the logs, so I'm not sure this is doing anything even when I change it to 60000 milliseconds.

I see the tests use "--inactivity-probe="; is that also required?


https://github.com/ovn-org/ovn/blob/master/tests/ovn-nbctl.at#L1720

AT_CHECK([ovn-nbctl --inactivity-probe=30000 set-connection ptcp:6641:127.0.0.1 punix:$OVS_RUNDIR/ovnnb_db.sock])

https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ic-nbctl.8.xml#L101



I tried with

while ! ovn-nbctl --inactivity-probe=60000 --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000; do

but didn't see the lock change in northd

I just see the initial lock acquired.

2020-05-14T14:48:23Z|00050|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
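
One way to confirm that the setting actually landed in the databases is to read the column back with the generic database commands. This is only a sketch: it assumes each database has exactly one Connection row (which the "." record shorthand requires) and that the tools can reach the databases from wherever it is run; the wrapper function name show_inactivity_probes is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: print the inactivity_probe currently stored in the
# NB and SB Connection tables. Assumes a single Connection row per database
# and that ovn-nbctl/ovn-sbctl are already configured to reach the DBs.
show_inactivity_probes() {
    echo "NB: $(ovn-nbctl --no-leader-only get connection . inactivity_probe)"
    echo "SB: $(ovn-sbctl --no-leader-only get connection . inactivity_probe)"
}
```

Calling show_inactivity_probes on a master should print the stored values; a result of 0 would mean probes are still disabled.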

Comment 6 Ross Brattain 2020-05-21 17:46:36 UTC
Verified on 4.5.0-0.nightly-2020-05-21-072118

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:29:54Z|73741|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:14Z|73840|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:19Z|73876|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:24Z|73892|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -STOP ovnkube-master-b8m7x-northd here

# ~120 seconds later c2sc7 takes the lock

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66663|jsonrpc|DBG|ssl:10.0.0.4:9642: received notification, method="locked", params=["ovn_northd"]
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66664|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66675|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:38:54Z|68572|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -CONT ovnkube-master-b8m7x-northd here and it goes to standby

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74080|jsonrpc|DBG|ssl:10.0.0.3:9642: send request, method="lock", params=["ovn_northd"], id=5529
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74087|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74102|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:14Z|68667|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74194|jsonrpc|DBG|ssl:10.0.0.6:9642: send request, method="lock", params=["ovn_northd"], id=5534
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74196|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74205|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74306|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="lock", params=["ovn_northd"], id=5539
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74313|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:20Z|68702|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:40:54Z|69167|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
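
To sanity-check the "~120 seconds" figure against the timestamps in the logs above, the sketch below subtracts two same-day HH:MM:SS times. The helper names to_seconds and elapsed_seconds are hypothetical, and the code assumes both times are zero-padded and fall on the same day. Note that the first timestamp is the last lock assert logged by b8m7x, not the moment of the kill -STOP, so the result is an upper bound on the failover delay.

```shell
#!/bin/sh
# Hypothetical helper: seconds since midnight for a zero-padded HH:MM:SS time.
to_seconds() {
    IFS=: read -r h m s <<EOF
$1
EOF
    # Strip one leading zero so values such as "08" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Hypothetical helper: elapsed seconds between two times on the same day.
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Last lock assert from b8m7x vs. "lock acquired" on c2sc7.
elapsed_seconds 17:30:24 17:32:34   # prints 130
```

The 130 seconds between the last assert and the new lock holder is consistent with the lock being cleared roughly 120 seconds after the active northd was stopped.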

Comment 7 errata-xmlrpc 2020-07-13 17:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409