Bug 1970042
| Summary: | northd should release leader lock if unable to connect to nbdb after a timeout | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | OVN | Assignee: | Mark Michelson <mmichels> |
| Status: | CLOSED WONTFIX | QA Contact: | Jianlin Shi <jishi> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | RHEL 8.0 | CC: | bhershbe, ctrautma, ealcaniz, i.maximets, jiji, mark.d.gray, mmichels, rkhan, yjoseph |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1972270 | Environment: | |
| Last Closed: | 2021-09-24 15:18:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1969445 | | |
| Attachments: | | | |
Description
Tim Rozet
2021-06-09 17:04:11 UTC
Created attachment 1789632
relevant northd and nbdb logs
Changing the title of this bug. We see the same problem happen with ovn-controller. The only remedy was to kill ovn-controller on all nodes; after the restart they connected to the DB and started functioning.

(In reply to Tim Rozet from comment #2)
> changing the title of this bug. We see the same problem happen with
> ovn-controller. Only remedy was to kill ovn-controller on all nodes and
> after restart they connected to db and started functioning.

Hi, Tim. This sounds like a different issue to me. In the case of northd, it holds the lock on one database but can't connect to the other one for whatever reason. In this case it should release the lock and allow a different northd, one that presumably has access to both databases, to become active. The issue with other clients is specific to stale data and limited to the connection to a single database. That issue also has a workaround: drop the database index with a special appctl command.

OK, I will change the title back and open a new bug.

Looking through the logs, you can see that northd is cycling through the NBDBs in the cluster. I think that all of them have stale data (we wouldn't see this in the logs, as we are also specifying leader-only), perhaps from some event that caused the databases to reset their stored indices (as described in https://github.com/openvswitch/ovs/commit/89b522aee379f7ebd21ec67ffb622118af7e9db1). Can we confirm or deny this? If that is what happened, the proposed fix may not address the issue. The correct remedy would be to issue `ovn-appctl -t ovn-northd nb-cluster-state-reset`.

Mark Gray came to the same conclusion as I did. Essentially, this is a problem that has a solution (or at least a workaround) in OVN. There is a similar issue open at https://bugzilla.redhat.com/show_bug.cgi?id=1972270. Since the customer issue is closed and we have a similar issue we're tracking, I'm closing this issue in favor of the linked one.
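
For anyone landing here later, the stale-data behavior discussed in this bug can be illustrated with a toy model. This is only a sketch under assumptions, not the OVS/OVN client code: the `Server` and `Client` classes, the `index` and `min_index` fields, and the `cluster_state_reset` method are all hypothetical names chosen for illustration. The idea is that a client remembers the highest database index it has seen and refuses any server reporting a lower one, so if every database in the cluster is rebuilt, the client cycles through all of them without settling until its remembered index is cleared.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for one database in the cluster."""
    name: str
    index: int  # assumed: the latest index this server reports


class Client:
    """Toy model of a client that rejects servers with stale data."""

    def __init__(self):
        # Highest index seen so far; servers below it are treated as stale.
        self.min_index = 0

    def try_connect(self, server):
        if server.index < self.min_index:
            return False  # looks stale, move on to the next server
        self.min_index = server.index
        return True

    def pick_server(self, servers):
        # Cycle through the cluster once, returning the first usable server.
        for server in servers:
            if self.try_connect(server):
                return server.name
        return None

    def cluster_state_reset(self):
        # Models the effect of the appctl workaround: forget the stored index.
        self.min_index = 0


client = Client()
healthy = [Server("nb1", 500), Server("nb2", 500), Server("nb3", 500)]
print(client.pick_server(healthy))

# Assumed scenario: all databases were destroyed and rebuilt, so every
# server now reports a small index and the client rejects all of them.
rebuilt = [Server("nb1", 3), Server("nb2", 3), Server("nb3", 3)]
print(client.pick_server(rebuilt))

client.cluster_state_reset()
print(client.pick_server(rebuilt))
```

In this model, releasing the leader lock would not help, because a second client with the same remembered index would reject the same servers; only clearing the index lets the client connect again.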