Bug 1972270
| Summary: | ovn-controller infinitely reports stale cache data from sbdb without exiting | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | OVN | Assignee: | Mark Michelson <mmichels> |
| Status: | CLOSED WONTFIX | QA Contact: | Jianlin Shi <jishi> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | RHEL 8.0 | CC: | ctrautma, i.maximets, jiji, jishi, mmichels, ovnteam, rkhan |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1970042 | Environment: | |
| Last Closed: | 2021-09-29 15:49:47 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1969445 | ||
Description
Tim Rozet
2021-06-15 14:35:29 UTC
ovn-controller constantly reports stale cache data from db and refuses to consume any db info. This causes networking to fail on the node, except ovn-controller never exits, so the node stays in a bad state. ovn-controller should exit after a certain number of times retrieving stale data or a timeout.

This sounds similar to the situation described in https://bugzilla.redhat.com/show_bug.cgi?id=1829109 . In that issue, the southbound database was forcibly deleted, and it resulted in both northd and ovn-controller constantly rejecting DB updates because the data was "stale". This is because the index on the server was reset to 0, but the clients still had cached a higher local index from before the DB was deleted. To get around this problem, we added commands to northd and ovn-controller to reset their cached index to 0 so that it would accept updates from the database.

This sounds similar. Do you know what condition resulted in ovn-northd seeing the database data as stale? Was it due to the database being forcibly deleted and/or restarted? Or was it due to unknown circumstances?

As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises? If it does, then there's at least a workaround. And if ovn-northd reached this state because of a manual deletion of the DB, then this could be a viable solution for the issue. When the NB DB is deleted, be sure to reset the cluster state in ovn-northd and you're golden.

If the cause of the scenario is unknown, then the ideal fix would be to ensure this situation does not happen unexpectedly.

Crafting a policy about how to automatically deal with this situation in ovn-controller is tough. My main concern is if such a policy could be exploited or if it might be overreaching.

> As a quick check, can you issue `ovn-appctl -t ovn-northd nb-cluster-state-reset` and see if that fixes the issue when this situation arises?
Just as a quick note, this suggestion was made specifically for the situation where ovn-northd is getting stale data from the NB DB. You can also issue:
ovn-appctl -t ovn-northd sb-cluster-state-reset
for the case when ovn-northd is getting stale data from the SB DB, and
ovn-appctl -t ovn-controller sb-cluster-state-reset
for the case when ovn-controller is getting stale data from the SB DB.
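
To make the stale-index condition and the effect of the reset commands concrete, here is a rough Python sketch. It is a simplified model under stated assumptions, not OVN's actual implementation: the `CachedClient` class, its fields, and the rule "reject any update whose index is lower than the cached one" are hypothetical stand-ins for the behavior described in this bug. `RESET_COMMANDS` only collects the three `ovn-appctl` invocations quoted above; the sketch never executes them.

```python
# Simplified model of the stale-index problem. All class and method names
# are hypothetical and are not taken from the OVN source code.

# The three reset commands quoted in this bug, keyed by (daemon, database).
RESET_COMMANDS = {
    ("ovn-northd", "nb"): ["ovn-appctl", "-t", "ovn-northd", "nb-cluster-state-reset"],
    ("ovn-northd", "sb"): ["ovn-appctl", "-t", "ovn-northd", "sb-cluster-state-reset"],
    ("ovn-controller", "sb"): ["ovn-appctl", "-t", "ovn-controller", "sb-cluster-state-reset"],
}


class CachedClient:
    """Toy client that remembers the highest server index it has seen."""

    def __init__(self):
        self.cached_index = 0

    def handle_update(self, server_index):
        """Accept an update unless it looks older than what was already seen."""
        if server_index < self.cached_index:
            return False  # treated as stale and ignored
        self.cached_index = server_index
        return True

    def cluster_state_reset(self):
        """Model of the reset command: forget the cached index."""
        self.cached_index = 0


client = CachedClient()
client.handle_update(42)  # normal operation before the database is deleted

# The database is deleted and recreated, so the server index starts over.
accepted_before = [client.handle_update(i) for i in (1, 2, 3)]

# After the reset, the client accepts updates from the new database again.
client.cluster_state_reset()
accepted_after = client.handle_update(4)

print(accepted_before, accepted_after)
```

In this model every update is rejected until the cached index is cleared, which matches the symptom reported here: the daemon keeps running but never consumes new data. To apply the workaround on a real node, the argument list for the affected daemon and database could be passed to `subprocess.run`.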