+++ This bug was initially created as a clone of Bug #1970042 +++

Description of problem:

In a scenario where northd consistently thinks that the NB DB has stale data, it will try to reconnect indefinitely while holding the leader lock. For an example of this scenario see:
https://bugzilla.redhat.com/show_bug.cgi?id=1969445#c5

northd should give up after some timeout, release the lock, and exit, allowing another northd to try to take over.

--- Additional comment from Tim Rozet on 2021-06-09 17:04:51 UTC ---

--- Additional comment from Tim Rozet on 2021-06-11 19:38:26 UTC ---

Changing the title of this bug. We see the same problem happen with ovn-controller. The only remedy was to kill ovn-controller on all nodes; after restart they connected to the db and started functioning.

--- Additional comment from Ilya Maximets on 2021-06-11 20:05:28 UTC ---

(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim. This sounds like a different issue to me. In the case of northd, it holds the lock on one database but can't connect to the other one for whatever reason. In this case it should release the lock and allow a different northd, one that presumably has access to both databases, to become active.

The issue with other clients is specific to stale data and is limited to the connection to a single database. That issue also has a workaround: drop the database index with a special appctl command.

--- Additional comment from Tim Rozet on 2021-06-15 14:19:32 UTC ---

OK, I will change the title back and open a new bug.
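For reference, a rough sketch of how one can check which northd instance currently holds the lock and force a failover by hand today. The `status` appctl command is part of ovn-northd, but the service name and restart mechanism below are assumptions that depend on the deployment:

    # Ask this northd instance whether it is the active one (i.e. holds the lock).
    # It replies with "Status: active" or "Status: standby".
    ovn-appctl -t ovn-northd status

    # Until northd learns to release the lock on its own, the manual remedy is
    # to restart the stuck active instance so that a standby can take over:
    systemctl restart ovn-northd    # or the equivalent for your deployment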
ovn-controller constantly reports stale cache data from the db and refuses to consume any db info. This causes networking to fail on the node, but ovn-controller never exits, so the node stays in a bad state. ovn-controller should exit after retrieving stale data a certain number of times, or after a timeout.
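A rough way to spot a node stuck in this state. The log location is the packaged default and the exact message wording is an assumption that may vary between versions:

    # Count the stale-data complaints in ovn-controller's log; a steadily
    # growing count with no recovery is the symptom described above.
    grep -c 'stale data' /var/log/ovn/ovn-controller.log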
This sounds similar to the situation described in https://bugzilla.redhat.com/show_bug.cgi?id=1829109 . In that issue, the southbound database was forcibly deleted, which resulted in both northd and ovn-controller constantly rejecting DB updates because the data was "stale". This happened because the index on the server was reset to 0, but the clients still had a higher local index cached from before the DB was deleted. To get around this problem, we added commands to northd and ovn-controller to reset their cached index to 0 so that they would accept updates from the database.

Do you know what condition resulted in ovn-northd seeing the database data as stale? Was it due to the database being forcibly deleted and/or restarted? Or was it due to unknown circumstances?

As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises? If it does, then there's at least a workaround. And if ovn-northd reached this state because of a manual deletion of the DB, then this could be a viable solution for the issue: when the NB DB is deleted, be sure to reset the cluster state in ovn-northd and you're golden.

If the cause of the scenario is unknown, then the ideal fix would be to ensure this situation does not happen unexpectedly. Crafting a policy for how to automatically deal with this situation in ovn-controller is tough. My main concern is whether such a policy could be exploited or whether it might be overreaching.
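For reference, applying that workaround on a northd host looks roughly like this. The NB ctl socket path is the packaged default and may differ in a given deployment:

    # Inspect the NB database's raft state. A freshly recreated database will
    # report a much lower index than the one the clients last cached.
    ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

    # Tell ovn-northd to drop its cached index so it accepts the "older" data
    # from the recreated database.
    ovn-appctl -t ovn-northd nb-cluster-state-reset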
> As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises?

Just as a quick note, this suggestion was made specifically for the situation where ovn-northd is getting stale data from the NB DB. You can also issue:

    ovn-appctl -t ovn-northd sb-cluster-state-reset

for the case when ovn-northd is getting stale data from the SB DB, and

    ovn-appctl -t ovn-controller sb-cluster-state-reset

for the case when ovn-controller is getting stale data from the SB DB.
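After issuing the appropriate reset, a rough way to confirm the client has started consuming updates again. The `connection-status` command is assumed to be present in recent ovn-controller builds, and the log location is the packaged default:

    # ovn-controller should report a healthy southbound session once it stops
    # rejecting updates as stale.
    ovn-appctl -t ovn-controller connection-status

    # The stale-data complaints should also stop appearing in its log.
    tail -n 20 /var/log/ovn/ovn-controller.log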