Bug 2036113
Summary: | cluster scaling new nodes ovs-configuration fails on all new nodes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alvaro Soto <asoto> |
Component: | Networking | Assignee: | Jaime Caamaño Ruiz <jcaamano> |
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | ajuarez, anusaxen, bhershbe, bpickard, bzvonar, dacarpen, ealcaniz, fbaudin, gdiotte, jcaamano, openshift-bugs-escalate, pibanezr, rbrattai, rpittau, trozet, vpickard, yjoseph, yprokule |
Version: | 4.7 | ||
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Last Closed: | 2022-03-12 04:40:32 UTC | Type: | Bug |
Bug Blocks: | 2038249 |
Description
Alvaro Soto
2021-12-29 18:51:57 UTC
Following a troubleshooting call on 03114000, we isolated the issue to a stale uuid within ovs. Running ovs-vsctl del-br br-ex allowed the ovs-configuration to succeed. What we're wondering about at this point is which mechanism exists to ensure the ovs UUID matches the NetworkManager UUID. It seems that rebooting allows those two to misalign.

As a more persistent workaround, this is what we're doing, line 123 of the script.

    if ! nmcli connection show br-ex &> /dev/null; then
      nmcli c add type ovs-bridge \
        con-name br-ex \
        conn.interface br-ex \
        802-3-ethernet.mtu ${iface_mtu} \
        802-3-ethernet.cloned-mac-address ${iface_mac} \
        ipv4.route-metric 100 \
        ipv6.route-metric 100 \
        ${extra_brex_args}
    fi

becomes

    if ! nmcli connection show br-ex &> /dev/null; then
      ovs-vsctl --if-exists del-br br-ex
      nmcli c add type ovs-bridge \
        con-name br-ex \
        conn.interface br-ex \
        802-3-ethernet.mtu ${iface_mtu} \
        802-3-ethernet.cloned-mac-address ${iface_mac} \
        ipv4.route-metric 100 \
        ipv6.route-metric 100 \
        ${extra_brex_args}
    fi

Notice that we add the bridge deletion in ovs before creating it again through nmcli. This workaround has so far shown to address the uuid mismatch by validating the bridge doesn't exist in ovs before recreating it through networkmanager. This is, effectively, the previous workaround but automated.

Is there a long term solution to this that would be safer?

Hi Jcaamano,

When they scale out/in, the nodes are re-labeled and get rebooted which applies the ovs-configuration that gets stuck due to lack of persistence. It would seem that an error occurs that causes these files to either not get cleaned up on start, or they don't clean up because I assume the naming convention changes.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
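
For anyone who wants to try the workaround from the description outside the full script, it can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the function name ensure_bridge and the NMCLI / OVS_VSCTL override variables are hypothetical and do not appear in the ovs-configuration script. The only commands it relies on are the ones quoted in the report (nmcli connection show, nmcli connection add type ovs-bridge, ovs-vsctl --if-exists del-br).

```shell
#!/bin/sh
# Sketch only. ensure_bridge, NMCLI and OVS_VSCTL are hypothetical names
# introduced for this example; they are not part of the shipped script.
NMCLI="${NMCLI:-nmcli}"
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

# ensure_bridge BRIDGE [extra nmcli settings...]
# If the NetworkManager profile already exists, do nothing. Otherwise remove
# any bridge of the same name still recorded in the OVS database, then create
# the profile again through nmcli.
ensure_bridge() {
  bridge="$1"
  shift
  if "$NMCLI" connection show "$bridge" >/dev/null 2>&1; then
    return 0
  fi
  # --if-exists makes the deletion a no-op when OVS has no such bridge.
  "$OVS_VSCTL" --if-exists del-br "$bridge" || return 1
  "$NMCLI" connection add type ovs-bridge \
    con-name "$bridge" \
    conn.interface "$bridge" \
    "$@"
}
```

A call such as `ensure_bridge br-ex ipv4.route-metric 100 ipv6.route-metric 100` would then mirror the modified block. The deletion runs only when the NetworkManager profile is missing, so a healthy br-ex is not touched on every boot.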