Multiple northds run for HA in active/passive mode, where the active northd holds a lock. If that northd loses connectivity to the database or is killed without releasing the lock, ovsdb-server clears the lock after twice the inactivity probe interval. But if the probe is set to 0 (disabled), that never happens, so no other northd can ever grab the lock and continue reconciling NB->SB. Set the DB inactivity probe to something greater than 0 to ensure that a northd will always eventually become active. The value of 60 was chosen as a reasonable middle ground between the time it takes for the lock to be cleared and another northd to grab it (~120s) and the possibility that a loaded ovsdb-server (many ovn-controller clients) would take more than 30-40 seconds to send/reply to all inactivity probes from clients.
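The timing rule in the description can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and it assumes the probe value is expressed in milliseconds, which the description itself does not state.

```python
from typing import Optional


def lock_release_delay_s(inactivity_probe_ms: int) -> Optional[float]:
    """Estimate how long ovsdb-server waits before clearing a stale lock.

    Assumption: the lock is cleared after twice the inactivity probe
    interval, as stated in the description, and the interval is given in
    milliseconds. Returns None when probes are disabled (0), meaning the
    lock would never be cleared.
    """
    if inactivity_probe_ms < 0:
        raise ValueError("inactivity probe must be >= 0")
    if inactivity_probe_ms == 0:
        return None  # probes disabled: a stale lock is never released
    return 2 * inactivity_probe_ms / 1000.0


print(lock_release_delay_s(60000))  # 60 s probe interval
print(lock_release_delay_s(0))      # probes disabled
```

With a 60-second probe this gives the ~120s takeover delay mentioned above; with 0 there is no upper bound at all.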
The documentation seems to suggest that inactivity_probe is in milliseconds:
https://github.com/ovn-org/ovn/blob/master/ovn-ic-nb.xml#L243
<column name="inactivity_probe">
Maximum number of milliseconds of idle time on connection to the client before sending an inactivity probe message. If Open vSwitch does not communicate with the client for the specified number of seconds, it will send a probe. If a response is not received for the same additional amount of time, Open vSwitch assumes the connection has been broken and attempts to reconnect. Default is implementation-specific. A value of 0 disables inactivity probes.
</column>
Is this a documentation issue?
I don't see any lock changes in the logs, so I'm not sure this is doing anything even when I change it to 60000 milliseconds. I see the tests use "--inactivity-probe="; is that also required?
https://github.com/ovn-org/ovn/blob/master/tests/ovn-nbctl.at#L1720
AT_CHECK([ovn-nbctl --inactivity-probe=30000 set-connection ptcp:6641:127.0.0.1 punix:$OVS_RUNDIR/ovnnb_db.sock])
https://github.com/ovn-org/ovn/blob/master/utilities/ovn-ic-nbctl.8.xml#L101
I tried with
while ! ovn-nbctl --inactivity-probe=60000 --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000; do
but didn't see the lock change in northd. I just see the initial lock acquired:
2020-05-14T14:48:23Z|00050|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
Verified on 4.5.0-0.nightly-2020-05-21-072118

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:29:54Z|73741|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:14Z|73840|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:19Z|73876|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:30:24Z|73892|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -STOP ovnkube-master-b8m7x-northd here
# ~120 seconds later c2sc7 takes the lock

northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66663|jsonrpc|DBG|ssl:10.0.0.4:9642: received notification, method="locked", params=["ovn_northd"]
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66664|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:32:34Z|66675|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:38:54Z|68572|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"

# kill -CONT ovnkube-master-b8m7x-northd here and it goes to standby

northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74080|jsonrpc|DBG|ssl:10.0.0.3:9642: send request, method="lock", params=["ovn_northd"], id=5529
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74087|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:12Z|74102|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:14Z|68667|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74194|jsonrpc|DBG|ssl:10.0.0.6:9642: send request, method="lock", params=["ovn_northd"], id=5534
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74196|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:14Z|74205|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74306|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="lock", params=["ovn_northd"], id=5539
northd-ovnkube-master-b8m7x-northd:2020-05-21T17:39:18Z|74313|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:39:20Z|68702|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
northd-ovnkube-master-c2sc7-northd:2020-05-21T17:40:54Z|69167|jsonrpc|DBG|ssl:10.0.0.4:9642: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"
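As a rough cross-check of the "~120 seconds" takeover, the gap between two timestamps copied from the log above can be computed. The helper name `parse_ts` is hypothetical. The last "assert" from b8m7x is used as a stand-in for the moment of the SIGSTOP, which the log does not record exactly, so the result is an upper bound on the real delay.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse the UTC timestamp format used in the log lines."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


# Last lock assert sent by b8m7x before it was stopped.
last_assert = parse_ts("2020-05-21T17:30:24Z")
# c2sc7 receives the "locked" notification and becomes active.
lock_acquired = parse_ts("2020-05-21T17:32:34Z")

takeover_gap_s = (lock_acquired - last_assert).total_seconds()
print(takeover_gap_s)  # 130.0
```

Since b8m7x was asserting the lock every few seconds before being stopped, a 130-second gap is consistent with the lock being cleared after roughly twice a 60-second probe interval.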
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409