Bug 1970042

Summary: northd should release leader lock if unable to connect to nbdb after a timeout
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Tim Rozet <trozet>
Component: OVN
Assignee: Mark Michelson <mmichels>
Status: CLOSED WONTFIX
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: urgent
Version: RHEL 8.0
CC: bhershbe, ctrautma, ealcaniz, i.maximets, jiji, mark.d.gray, mmichels, rkhan, yjoseph
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1972270
Environment:
Last Closed: 2021-09-24 15:18:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1969445
Attachments:
Description: relevant northd and nbdb logs (Flags: none)

Description Tim Rozet 2021-06-09 17:04:11 UTC
Description of problem:
In a scenario where northd consistently thinks that an nbdb has stale data, it will try to reconnect indefinitely while holding the leader lock. For an example of this scenario see:

https://bugzilla.redhat.com/show_bug.cgi?id=1969445#c5

northd should give up after some timeout, release the lock, and exit, allowing another northd to try to take over.
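
The requested behavior can be sketched roughly as follows. This is only an illustration of the idea, not northd code: `connect_or_release`, `try_connect`, `release_lock`, and the timeout parameters are all hypothetical names, and the real daemon is written in C with its own reconnect logic.

```python
import time
from typing import Callable


def connect_or_release(try_connect: Callable[[], bool],
                       release_lock: Callable[[], None],
                       timeout_s: float,
                       retry_interval_s: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry try_connect() until it succeeds or timeout_s has elapsed.

    Returns True on success. On timeout, calls release_lock() once and
    returns False so the caller can exit and let another instance take over.
    All names here are hypothetical; this is not an OVN API.
    """
    deadline = clock() + timeout_s
    while True:
        if try_connect():
            return True
        if clock() >= deadline:
            # Give up: drop the leader lock instead of holding it forever.
            release_lock()
            return False
        sleep(retry_interval_s)
```

A caller would exit when the function returns False, so that a standby instance with working connections to both databases can acquire the lock.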

Comment 1 Tim Rozet 2021-06-09 17:04:51 UTC
Created attachment 1789632 [details]
relevant northd and nbdb logs

Comment 2 Tim Rozet 2021-06-11 19:38:26 UTC
Changing the title of this bug. We see the same problem happen with ovn-controller. The only remedy was to kill ovn-controller on all nodes; after restart they connected to the db and started functioning.

Comment 3 Ilya Maximets 2021-06-11 20:05:28 UTC
(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim.  This sounds like a different issue to me.  In the case of northd,
it holds the lock on one database, but can't connect to the other one
for whatever reason.  In this case it should release the lock and allow
a different northd that presumably has access to both databases to become
active.

The issue with other clients is specific to stale data and limited to the
connection to a single database.  This issue also has a workaround: drop the
database index with a special appctl command.

Comment 4 Tim Rozet 2021-06-15 14:19:32 UTC
OK, I will change the title back and open a new bug.

Comment 6 Mark Gray 2021-09-24 14:58:58 UTC
Looking through the logs, you can see that northd is cycling through the NBDBs in the cluster. I think that all of them have stale data (we wouldn't see this in the logs as we are also specifying leader-only), perhaps from some event which caused the databases to reset their stored indices (as described in https://github.com/openvswitch/ovs/commit/89b522aee379f7ebd21ec67ffb622118af7e9db1). Can we confirm or deny this?

In this case, the proposed fix may not address the issue. The correct remedy would be to issue `ovn-appctl -t ovn-northd nb-cluster-state-reset`.
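
To make the suspected failure mode concrete, here is a deliberately simplified model of a client that remembers the highest database index it has seen and rejects any server reporting a lower one. It is a sketch under assumptions, not the OVS IDL code: `Client`, `accepts`, `reset`, and `pick_server` are hypothetical names, and the index values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Client:
    # Highest index this client has seen from any server so far.
    min_index: int = 0

    def accepts(self, server_index: int) -> bool:
        """Accept a server only if its data is not older than what was seen."""
        if server_index < self.min_index:
            return False  # looks stale from the client's point of view
        self.min_index = server_index
        return True

    def reset(self) -> None:
        """Forget the remembered index (models the reset remedy)."""
        self.min_index = 0


def pick_server(client: Client, server_indices: Iterable[int]) -> Optional[int]:
    """Return the position of the first acceptable server, or None."""
    for pos, idx in enumerate(server_indices):
        if client.accepts(idx):
            return pos
    return None


# Every server restarted with low indices, so a client that remembers
# index 100 cycles through all of them and rejects each one.
client = Client(min_index=100)
stuck = pick_server(client, [5, 7, 6])

# After the remembered index is dropped, the first server is accepted.
client.reset()
recovered = pick_server(client, [5, 7, 6])
```

In this model, releasing the leader lock would not help if every standby client were in the same state, which is why resetting the remembered index is the remedy rather than a timeout.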

Comment 7 Mark Michelson 2021-09-24 15:18:51 UTC
Mark Gray came to the same conclusion as I did. Essentially, this is a problem that has a solution (or at least a workaround) in OVN. There is a similar issue open at https://bugzilla.redhat.com/show_bug.cgi?id=1972270. Since the customer issue is closed and we have a similar issue we're tracking, I'm closing this issue in favor of tracking the previously linked issue.