Bug 2052398
Summary: 4.9 to 4.10 upgrade fails for ovnkube-masters
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 4.10
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Reporter: Ravi Trivedi <travi>
Assignee: Andreas Karis <akaris>
QA Contact: Anurag saxena <anusaxen>
CC: anbhat, bpickard, cblecker, pmannidi, rbryant, sdodson, trozet, wking
Keywords: FastFix, ServiceDeliveryBlocker, ServiceDeliveryImpact
Doc Type: Bug Fix
Doc Text:
Cause:
A goroutine in libovsdb responsible for handling cache updates from ovsdb could stall while writing to an unbuffered channel, and it held the cache mutex while stalled.
Consequence:
This could deadlock the OVN-Kubernetes master processes.
Fix:
The way libovsdb handles concurrency was improved.
Result:
With this update, these race conditions are resolved.
Story Points: ---
Last Closed: 2022-08-10 10:48:33 UTC
Type: Bug
Regression: ---
Bug Blocks: 2058729, 2058762
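For illustration only, here is a minimal Go sketch of the failure mode described in the Doc Text above; it is not the libovsdb code and the names are made up. A goroutine sends on an unbuffered channel while holding a mutex, the reader of that channel is gone, so the send blocks forever and every other goroutine that needs the mutex hangs behind it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var cacheMu sync.Mutex       // stands in for the libovsdb cache mutex
	updates := make(chan string) // unbuffered: a send blocks until someone receives

	// Cache-update goroutine: takes the mutex, then forwards an update over
	// the unbuffered channel. With no receiver, the send never completes and
	// the mutex is never released.
	go func() {
		cacheMu.Lock()
		defer cacheMu.Unlock()
		updates <- "row update" // blocks forever: nobody is receiving
	}()

	// Any other caller that needs the cache mutex now hangs behind the
	// stalled sender, e.g. an ovnkube-master handler reading the cache.
	go func() {
		time.Sleep(100 * time.Millisecond)
		cacheMu.Lock() // never acquired
		fmt.Println("unreachable")
		cacheMu.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are stuck: cache mutex held by a blocked sender")
}
```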
Description
Ravi Trivedi
2022-02-09 07:55:54 UTC
Great analysis Andreas! I think after staring at this for hours we know what is happening; we just don't know the exact conditions under which it happened. We know that the goroutine responsible for handling cache updates from ovsdb is stalled because it is blocked writing to an unbuffered channel. This goroutine holds the cache mutex while it is stalled, which causes a deadlock across the entire ovnkube-master. We don't know exactly why the channel is not being read from, but Andreas and I inspected the stack and we think the goroutine that reads the channel may have been respawned and is now reading from a new channel. Either way, I think we have enough information to come up with a proper fix.

I can see there is a potential race condition: when we shut down the client due to a disconnect from the OVSDB server, we stop the goroutine that receives on this channel. At that point the cache-update goroutine could still be executing, which would cause this hang. Additionally, the handleCacheErrors goroutine was not functioning correctly. It is supposed to detect cache errors and then disconnect/reconnect the client, but a bug there prevented this.

I pushed what I think will fix this: https://github.com/ovn-org/libovsdb/pull/297

I think with ^ plus Andreas' fix to regenerate the models we should be covered.

Ravi, is this bug reproducible? If this is the race we think it is, I would expect it to be more likely on upgrade, when things are disconnecting/reconnecting as the OVSDB servers are rolled.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
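As a rough illustration of the race described in the comment above and one possible mitigation, here is a hedged Go sketch; the type, fields, and function names are hypothetical and are not the libovsdb API. If the client tears down the receiver on disconnect, a sender that selects on a stop channel cannot block forever while holding the mutex.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct {
	mu      sync.Mutex
	updates chan string   // the receiver goroutine reads from this; replaced on reconnect
	stop    chan struct{} // closed when the old receiver is torn down
}

// deliver forwards one update without risking a permanent block: if the
// receiver has been stopped, the send is abandoned instead of hanging
// while the mutex is held.
func (c *client) deliver(u string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	select {
	case c.updates <- u:
	case <-c.stop:
		fmt.Println("receiver gone, dropping update:", u)
	}
}

func main() {
	c := &client{updates: make(chan string), stop: make(chan struct{})}

	// Simulated disconnect: the receiver goroutine exits and its channel is
	// abandoned before the sender runs.
	close(c.stop)

	done := make(chan struct{})
	go func() {
		c.deliver("row update") // returns instead of deadlocking
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("sender finished; cache mutex released")
	case <-time.After(time.Second):
		fmt.Println("sender stuck (this would be the deadlock)")
	}
}
```

A buffered channel can reduce the window as well, but a stop/done channel makes the shutdown path explicit, which is closer in spirit to the fix direction discussed above.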