Bug 1970042

Summary: northd should release leader lock if unable to connect to nbdb after a timeout
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Tim Rozet <trozet>
Component: OVN
Assignee: Mark Michelson <mmichels>
Status: CLOSED WONTFIX
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: urgent
Version: RHEL 8.0
CC: bhershbe, ctrautma, ealcaniz, i.maximets, jiji, mark.d.gray, mmichels, rkhan, yjoseph
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1972270
Environment:
Last Closed: 2021-09-24 15:18:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1969445

Attachments:

Description	Flags
relevant northd and nbdb logs	none

Description Tim Rozet 2021-06-09 17:04:11 UTC

Description of problem:
In a scenario where northd consistently thinks that an nbdb has stale data, it will try to reconnect indefinitely while holding the leader lock. For an example of this scenario see:

https://bugzilla.redhat.com/show_bug.cgi?id=1969445#c5

northd should give up after some timeout, release the lock, and exit, allowing another northd to try to take over.
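
To make the requested behaviour concrete, here is a minimal sketch of the timeout logic in Python. It is not ovn-northd code: `LockWatchdog`, `should_release`, `timeout_s`, and the two boolean inputs are hypothetical names chosen for illustration, and the timeout value is left to the caller.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LockWatchdog:
    """Hypothetical helper: tracks how long a lock holder has been
    unable to reach the NB database."""

    timeout_s: float
    clock: Callable[[], float] = time.monotonic
    _down_since: Optional[float] = field(default=None, init=False)

    def should_release(self, has_lock: bool, nb_connected: bool) -> bool:
        # Only a lock holder that cannot reach the NB DB is a problem;
        # in every other state the outage timer is cleared.
        if not has_lock or nb_connected:
            self._down_since = None
            return False
        now = self.clock()
        if self._down_since is None:
            # First iteration of a new outage: start the timer.
            self._down_since = now
        return now - self._down_since >= self.timeout_s
```

A main loop would call `should_release(...)` once per iteration and, when it returns `True`, release the lock and exit so that another instance can try to take over.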

Comment 1 Tim Rozet 2021-06-09 17:04:51 UTC

Created attachment 1789632
relevant northd and nbdb logs

Comment 2 Tim Rozet 2021-06-11 19:38:26 UTC

changing the title of this bug. We see the same problem happen with ovn-controller. Only remedy was to kill ovn-controller on all nodes and after restart they connected to db and started functioning.

Comment 3 Ilya Maximets 2021-06-11 20:05:28 UTC

(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim.  This sounds like a different issue to me.  In case of northd,
it holds the lock on one database, but can't connect to the other one
for whatever reason.  In this case it should release the lock and allow
a different northd that presumably has access to both databases to become
active.

Issue with other clients is specific to stale data and limited to connection
to a single database.  This issue also has a workaround: drop database
index with a special appctl command.

Comment 4 Tim Rozet 2021-06-15 14:19:32 UTC

ok I will change the title back and open a new bug

Comment 6 Mark Gray 2021-09-24 14:58:58 UTC

Looking through the logs, you can see that northd is cycling through the NBDBs in the cluster. I think that all of them have stale data (we wouldn't see this in the logs as we are also specifying leader-only), perhaps from some event that caused the databases to reset their stored indices (as described in https://github.com/openvswitch/ovs/commit/89b522aee379f7ebd21ec67ffb622118af7e9db1). Can we confirm or deny this?

In this case, the proposed fix may not address the issue. The correct remedy would be to issue `ovn-appctl -t ovn-northd nb-cluster-state-reset`.
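
For anyone scripting this workaround, a small Python sketch follows. It assumes the command meant here is `nb-cluster-state-reset` (with `sb-cluster-state-reset` as its southbound counterpart), as documented in ovn-northd(8); the function names are hypothetical, and `ovn-appctl` must be on the PATH of the host running ovn-northd.

```python
import subprocess
from typing import List


def cluster_state_reset_argv(db: str, target: str = "ovn-northd") -> List[str]:
    """Build the ovn-appctl argv for resetting cached cluster state.

    db is "nb" or "sb"; the command names are assumed from ovn-northd(8).
    """
    if db not in ("nb", "sb"):
        raise ValueError("db must be 'nb' or 'sb'")
    return ["ovn-appctl", "-t", target, f"{db}-cluster-state-reset"]


def reset_cluster_state(db: str) -> str:
    # Runs the command and raises CalledProcessError on a non-zero exit.
    result = subprocess.run(
        cluster_state_reset_argv(db),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `reset_cluster_state("nb")` on the affected host would issue the northbound reset; building the argv separately keeps the command easy to log or review before it runs.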

Comment 7 Mark Michelson 2021-09-24 15:18:51 UTC

Mark Gray came to the same conclusion as I did. Essentially, this is a problem that has a solution (or at least a workaround) in OVN. There is a similar issue open at https://bugzilla.redhat.com/show_bug.cgi?id=1972270 . Since the customer issue is closed and we have a similar issue we're tracking, I'm closing this issue in favor of tracking the previously-linked issue.