Bug 1988440
| Field | Value |
|---|---|
| Summary | Network operator changes ovnkube-config too early, causing ovnkube-master pods to crashloop during cluster upgrade |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Version | 4.6 |
| Target Milestone | --- |
| Target Release | 4.10.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Neil Girard <ngirard> |
| Assignee | Christoph Stäbler <cstabler> |
| QA Contact | Mehul Modi <memodi> |
| CC | anbhat, astoycos, bpickard, chdeshpa, memodi, surya, zzhao |
| Doc Type | Bug Fix |
Doc Text:

Cause: ovnkube-node and ovnkube-master pods fail to start when the config file contains an unknown field or section.

Consequence: This can cause failures during ovn-kubernetes updates when a new config field or section is introduced. Consider the following scenario:

1. The ConfigMap is updated.
2. The ovnkube-node rollout starts.
3. An ovnkube-master pod needs to be (re-)started for some reason (for example, through eviction from a node).
4. The newly started ovnkube-master pod, still running the old version, is not aware of the new config structure and fails to parse the config, resulting in a crashloop of the newly started ovnkube-master. This can leave the rollout stuck.

Fix: Make ovn-kubernetes resilient to unknown fields in config files: log a warning instead of exiting when such a field is found.

Result: ovn-kubernetes updates no longer fail if the config file contains an unknown field or section.
| Field | Value |
|---|---|
| Story Points | --- |
| : | 2027983 (view as bug list) |
| Last Closed | 2022-03-12 04:36:27 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 2027983 |
Description

Neil Girard, 2021-07-30 14:42:10 UTC
(In reply to Aniket Bhat from comment #1)

> @ngirard great analysis on the bug. I think we haven't seen this in our upgrade jobs, but I do understand the problem. If the master pods don't get restarted between the time the CNO updates the config map and when the ovnkube-master daemonset rolls out with the new image, we should be covered.
>
> I will try to figure out updating the config map closer to the daemonset roll out of the masters to narrow down the window of failure.

Yeah, actually we can't move the ConfigMap update closer to the master rollouts. The ovnkube-node pods also use the same ConfigMap, so if the nodes roll out before CNO picks up the ConfigMap (and since CNO is level-triggered for reconciliation, that ordering is hard to achieve), then once CNO applies the ConfigMap the nodes will restart a second time, which we don't want.

(In reply to cstabler from comment #4)

> What do you think about making ovnkube-(node & master) more resilient against unknown fields in the configmap?

If this ConfigMap cannot be manipulated by users, i.e. it is not user-facing (which I think it isn't, since CNO would reconcile any manual changes), we should be good. The same goes for the upstream scenario. If we don't expose the knobs to users and it's an internal detail, I don't mind silencing/ignoring unknown fields. But I just want to call out that it's a bad API experience in general (again, since it's not user-facing we can get away with it here); I don't want folks to supply values and be surprised that their changes take no effect. Apart from the gateway-mode-config overrides that we allow users to make, which are not passed directly to ovn-kubernetes but are parsed into the exec commands, I don't think we allow changing the ConfigMap values. So we should be good from the OCP perspective to make this change; let's make sure upstream is fine with it as well.
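The resilience approach discussed above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual ovn-kubernetes config parser: `knownFields` and `parseConfig` are invented names, and the real config is richer than flat `key=value` lines. The point it shows is the fix's behavior: an unknown field produces a warning and is skipped, instead of making the process exit, so an older ovnkube-master restarted against a newer ConfigMap keeps running.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// knownFields lists the config keys this (hypothetical) binary version
// understands. A newer ConfigMap may contain keys an older binary has
// never heard of; those must not be fatal.
var knownFields = map[string]bool{
	"mtu":          true,
	"gateway-mode": true,
}

// parseConfig reads simple "key=value" lines. Unknown keys are collected
// as warnings instead of aborting the process.
func parseConfig(raw string) (map[string]string, []string) {
	cfg := map[string]string{}
	var warnings []string
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blanks, comments, and section headers in this sketch.
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, "[") {
			continue
		}
		k, v, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		k, v = strings.TrimSpace(k), strings.TrimSpace(v)
		if !knownFields[k] {
			// The fix: warn and continue rather than exit.
			warnings = append(warnings, fmt.Sprintf("ignoring unknown config field %q", k))
			continue
		}
		cfg[k] = v
	}
	return cfg, warnings
}

func main() {
	// "shiny-new-field" stands in for a field added by a newer release.
	raw := "[default]\nmtu=1400\nshiny-new-field=true\n"
	cfg, warnings := parseConfig(raw)
	fmt.Println(cfg["mtu"]) // prints "1400"
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

With this behavior, step 4 of the scenario in the Doc Text no longer crashloops: the old master merely logs a warning for the field it doesn't recognize.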
Added QE test coverage: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-46654

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056