Bug 1972270

Summary: ovn-controller infinitely reports stale cache data from sbdb without exiting
Product: Red Hat Enterprise Linux Fast Datapath
Component: OVN
Version: RHEL 8.0
Status: CLOSED WONTFIX
Severity: high
Priority: urgent
Reporter: Tim Rozet <trozet>
Assignee: Mark Michelson <mmichels>
QA Contact: Jianlin Shi <jishi>
CC: ctrautma, i.maximets, jiji, jishi, mmichels, ovnteam, rkhan
Hardware: Unspecified
OS: Unspecified
Clone Of: Bug 1970042
Bug Blocks: Bug 1969445
Last Closed: 2021-09-29 15:49:47 UTC

Description Tim Rozet 2021-06-15 14:35:29 UTC
+++ This bug was initially created as a clone of Bug #1970042 +++

Description of problem:
In a scenario where northd consistently thinks the NB DB has stale data, it will retry the connection indefinitely while continuing to hold the leader lock. For an example of this scenario, see:

https://bugzilla.redhat.com/show_bug.cgi?id=1969445#c5

northd should give up after some timeout, release the lock, and exit, allowing another northd instance to take over.
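
For reference, each northd instance reports whether it is currently active (holding the SB lock) or standby via its "status" appctl, so a rough way to spot and clear the stuck instance looks like this (the systemd unit name is an assumption for a typical RPM-based deployment):

# Check whether this ovn-northd instance holds the lock (i.e. is active):
ovn-appctl -t ovn-northd status
# Expected output: "Status: active" or "Status: standby".
# Until northd learns to give up on its own, the stuck active instance has to
# be restarted by hand so another instance can take over:
systemctl restart ovn-northd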

--- Additional comment from Tim Rozet on 2021-06-09 17:04:51 UTC ---



--- Additional comment from Tim Rozet on 2021-06-11 19:38:26 UTC ---

Changing the title of this bug. We see the same problem happen with ovn-controller. The only remedy was to kill ovn-controller on all nodes; after restart they connected to the db and started functioning.
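
For the record, the node-side check and the manual remedy look roughly like this (that systemd manages ovn-controller, and the unit name, are assumptions about the deployment):

# ovn-controller reports whether it currently has a working SB DB connection:
ovn-appctl -t ovn-controller connection-status
# Manual remedy: restart ovn-controller on every affected node, after which it
# reconnects to the DB and starts functioning again:
systemctl restart ovn-controller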

--- Additional comment from Ilya Maximets on 2021-06-11 20:05:28 UTC ---

(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim.  This sounds like a different issue to me.  In the case of northd,
it holds the lock on one database but can't connect to the other one
for whatever reason.  In that case it should release the lock and allow
a different northd that presumably has access to both databases to become
active.

The issue with other clients is specific to stale data and is limited to the
connection to a single database.  It also has a workaround: drop the cached
database index with a special appctl command.

--- Additional comment from Tim Rozet on 2021-06-15 14:19:32 UTC ---

ok I will change the title back and open a new bug

Comment 1 Tim Rozet 2021-06-15 14:37:58 UTC
ovn-controller constantly reports stale cache data from the DB and refuses to consume any DB updates. This causes networking to fail on the node, yet ovn-controller never exits, so the node stays in a bad state. ovn-controller should exit after retrieving stale data a certain number of times or after a timeout.

Comment 2 Mark Michelson 2021-09-23 18:55:05 UTC
This sounds similar to the situation described in https://bugzilla.redhat.com/show_bug.cgi?id=1829109 .

In that issue, the southbound database was forcibly deleted, and the result was that both northd and ovn-controller constantly rejected DB updates because the data was "stale". This happens because the index on the server was reset to 0, but the clients had still cached a higher local index from before the DB was deleted. To get around this problem, we added commands to northd and ovn-controller to reset their cached index to 0 so that they would accept updates from the database.
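
One way to confirm that this is the failure mode is to check whether the server's raft index really went backwards (e.g. dropped to a low value after the DB was recreated); a rough sketch, where the control-socket path is an assumption for a typical RHEL install:

# On a SB DB node, dump the cluster status, which includes the current index/term:
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound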

Do you know what condition resulted in ovn-northd seeing the database data as stale? Was it due to the database being forcibly deleted and/or restarted? Or was it due to unknown circumstances?

As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises?

If it does, then there's at least a workaround. And if ovn-northd reached this state because of a manual deletion of the DB, then this could be a viable solution for the issue. When the NB DB is deleted, be sure to reset the cluster state in ovn-northd and you're golden.
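
In other words, the recovery flow after an intentional NB DB wipe would look roughly like this (assumes ovn-appctl is run on the node where ovn-northd runs, using its default control socket):

# After the NB DB has been deleted/recreated (index back at 0), tell northd to
# drop its cached index so it accepts the "older" data again:
ovn-appctl -t ovn-northd nb-cluster-state-reset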

If the cause of the scenario is unknown, then the ideal fix would be to ensure this situation does not happen unexpectedly. Crafting a policy about how to automatically deal with this situation in ovn-controller is tough. My main concern is whether such a policy could be exploited or whether it might be overreaching.

Comment 3 Mark Michelson 2021-09-23 18:58:26 UTC
> As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises?

Just as a quick note, this suggestion was made specifically for the situation where ovn-northd is getting stale data from the NB DB. You can also issue:

ovn-appctl -t ovn-northd sb-cluster-state-reset

for the case when ovn-northd is getting stale data from the SB DB, and

ovn-appctl -t ovn-controller sb-cluster-state-reset

for the case when ovn-controller is getting stale data from the SB DB.
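
Putting it together for the ovn-controller case reported in this bug, the workaround on an affected node would look roughly like this (the log path and the exact log wording are assumptions):

# Confirm the symptom in the ovn-controller log:
grep -i "stale data" /var/log/ovn/ovn-controller.log
# Tell ovn-controller to drop its cached SB index so it starts accepting
# updates from the SB DB again:
ovn-appctl -t ovn-controller sb-cluster-state-reset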