+++ This bug was initially created as a clone of Bug #1970042 +++

Description of problem:

In a scenario where northd consistently thinks that the NB DB has stale data, it will try to reconnect indefinitely while holding the leader lock. For an example of this scenario see:
https://bugzilla.redhat.com/show_bug.cgi?id=1969445#c5

northd should give up after some timeout, release the lock, and exit, allowing another northd to try to take over.

--- Additional comment from Tim Rozet on 2021-06-09 17:04:51 UTC ---

--- Additional comment from Tim Rozet on 2021-06-11 19:38:26 UTC ---

Changing the title of this bug. We see the same problem happen with ovn-controller. The only remedy was to kill ovn-controller on all nodes; after restart they connected to the db and started functioning.

--- Additional comment from Ilya Maximets on 2021-06-11 20:05:28 UTC ---

(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim. This sounds like a different issue to me. In the case of northd, it holds the lock on one database but can't connect to the other one for whatever reason. In this case it should release the lock and allow a different northd, one that presumably has access to both databases, to become active.

The issue with other clients is specific to stale data and is limited to the connection to a single database. That issue also has a workaround: drop the database index with a special appctl command.

--- Additional comment from Tim Rozet on 2021-06-15 14:19:32 UTC ---

OK, I will change the title back and open a new bug.
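For reference, a rough sketch of how one can check which northd instance currently holds the lock and force a failover by hand today. The `status` appctl command is part of ovn-northd, but the service name and restart mechanism below are assumptions that depend on the deployment:

    # Ask this northd instance whether it is the active one (i.e. holds the lock).
    # It replies with "Status: active" or "Status: standby".
    ovn-appctl -t ovn-northd status

    # Until northd learns to release the lock on its own, the manual remedy is
    # to restart the stuck active instance so that a standby can take over:
    systemctl restart ovn-northd    # or the equivalent for your deployment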
ovn-controller constantly reports stale cache data from the db and refuses to consume any db info. This causes networking to fail on the node, but ovn-controller never exits, so the node stays in a bad state. ovn-controller should exit after retrieving stale data a certain number of times, or after a timeout.
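A rough way to spot a node stuck in this state. The log location is the packaged default and the exact message wording is an assumption that may vary between versions:

    # Count the stale-data complaints in ovn-controller's log; a steadily
    # growing count with no recovery is the symptom described above.
    grep -c 'stale data' /var/log/ovn/ovn-controller.log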
This sounds similar to the situation described in https://bugzilla.redhat.com/show_bug.cgi?id=1829109 . In that issue, the southbound database was forcibly deleted, which resulted in both northd and ovn-controller constantly rejecting DB updates because the data was "stale". This happened because the index on the server was reset to 0, but the clients still had a higher local index cached from before the DB was deleted. To get around this problem, we added commands to northd and ovn-controller to reset their cached index to 0 so that they would accept updates from the database.

Do you know what condition resulted in ovn-northd seeing the database data as stale? Was it due to the database being forcibly deleted and/or restarted? Or was it due to unknown circumstances?

As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises? If it does, then there's at least a workaround. And if ovn-northd reached this state because of a manual deletion of the DB, then this could be a viable solution for the issue: when the NB DB is deleted, be sure to reset the cluster state in ovn-northd and you're golden.

If the cause of the scenario is unknown, then the ideal fix would be to ensure this situation does not happen unexpectedly. Crafting a policy for how to automatically deal with this situation in ovn-controller is tough. My main concern is whether such a policy could be exploited or whether it might be overreaching.
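For reference, applying that workaround on a northd host looks roughly like this. The NB ctl socket path is the packaged default and may differ in a given deployment:

    # Inspect the NB database's raft state. A freshly recreated database will
    # report a much lower index than the one the clients last cached.
    ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

    # Tell ovn-northd to drop its cached index so it accepts the "older" data
    # from the recreated database.
    ovn-appctl -t ovn-northd nb-cluster-state-reset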
> As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises?

Just as a quick note, this suggestion was made specifically for the situation where ovn-northd is getting stale data from the NB DB. You can also issue:

    ovn-appctl -t ovn-northd sb-cluster-state-reset

for the case when ovn-northd is getting stale data from the SB DB, and

    ovn-appctl -t ovn-controller sb-cluster-state-reset

for the case when ovn-controller is getting stale data from the SB DB.
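After issuing the appropriate reset, a rough way to confirm the client has started consuming updates again. The `connection-status` command is assumed to be present in recent ovn-controller builds, and the log location is the packaged default:

    # ovn-controller should report a healthy southbound session once it stops
    # rejecting updates as stale.
    ovn-appctl -t ovn-controller connection-status

    # The stale-data complaints should also stop appearing in its log.
    tail -n 20 /var/log/ovn/ovn-controller.log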