Bug 1882667

Summary: [ovn] br-ex Link not found when scale up RHEL worker

Product: OpenShift Container Platform
Reporter: zhaozhanqi <zzhao>
Component: Networking
Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aconstan, anbhat, anusaxen, bbennett, dcbw, huirwang, jialiu, mifiedle, trozet
Version: 4.6
Keywords: TestBlocker
Target Milestone: ---
Flags: trozet: needinfo-
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The NetworkManager OVS packages were not being installed for RHEL worker nodes, and there were additional deployment-time systemd issues on RHEL nodes.
Consequence: OVN-Kubernetes could not be used with RHEL worker nodes; the missing NetworkManager packages caused the deployment to fail.
Fix: The missing packages are now installed correctly by openshift-ansible at deploy time, and the systemd dependencies are fixed.
Result: RHEL 7.9z workers now work with OVN-Kubernetes deployments.
Story Points: ---
Clone Of:
Clones: 1884323 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:21:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1871935, 1884095, 1885365
Bug Blocks: 1884323
Description (zhaozhanqi, 2020-09-25 09:56:42 UTC)
The ovs-configuration service is inactive on the RHEL worker:

```
sh-4.2# journalctl -u ovs-configuration
-- No entries --
sh-4.2# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Fri 2020-09-25 09:09:22 UTC; 55min ago
           ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json was not met
```

---

(Dan Williams, comment 2) Is this a RHEL7 or RHEL8 node? Also, this implies that /etc/ignition-machine-config-encapsulated.json is present on the machine. That file should be removed by the machine-config-daemon when it restarts the node after configuring it. Perhaps the node hasn't been restarted after the MCD configuration has been done?

---

(In reply to Dan Williams from comment #2)
> Is this a RHEL7 or RHEL8 node?

A cluster was brought up on RHCOS nodes and later a RHEL7 node was scaled up on the same cluster. I am trying to simulate the same cluster right now and can share it if required.

---

This issue is still NOT fixed on CI build 4.6.0-0.ci-2020-09-30-031307, and ignition-machine-config-encapsulated.json does not exist on the RHEL node.

```
# oc debug node/ip-10-0-52-71.us-east-2.compute.internal
Starting pod/ip-10-0-52-71us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# ls /etc/*json
/etc/mcs-machine-config-content.json

# oc get pod -n openshift-ovn-kubernetes -o wide | grep ip-10-0-52-71.us-east-2.compute.internal
ovnkube-node-kvwz8           1/2   CrashLoopBackOff   7   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovnkube-node-metrics-sjzf8   1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovs-node-sc6bg               1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>

# oc logs ovnkube-node-kvwz8 --tail=10 -n openshift-ovn-kubernetes -c ovnkube-node
I0930 06:16:22.796841   21149 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 32479
I0930 06:16:22.797097   21149 ovs.go:164] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0930 06:16:22.804275   21149 ovs.go:167] exec(5): stdout: ""
I0930 06:16:22.804297   21149 ovs.go:168] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0930 06:16:22.804305   21149 ovs.go:170] exec(5): err: exit status 1
I0930 06:16:22.804319   21149 ovs.go:164] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0930 06:16:22.811339   21149 ovs.go:167] exec(6): stdout: ""
I0930 06:16:22.811360   21149 ovs.go:168] exec(6): stderr: ""
I0930 06:16:22.811366   21149 ovs.go:170] exec(6): err: exit status 2
F0930 06:16:22.811427   21149 ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found
```

---

Can you please check the ovs-configuration service as done in https://bugzilla.redhat.com/show_bug.cgi?id=1882667#c1? Moved this back to networking to investigate further, since removing the encapsulated json did not fix it.

---

Zhanqi or Anurag, can you please get the system journal logs and the systemctl status for ovs-configuration?

---

(Tim Rozet, comment 13) Thanks to Ross, today we were able to reproduce. It looks like there are a couple of issues here.

First, to add RHEL nodes we need to have the fixed NetworkManager packages in RHEL 7.9z: https://bugzilla.redhat.com/show_bug.cgi?id=1871935

Second, whatever installs or configures (openshift-ansible?) these RHEL nodes will need to ensure the NetworkManager OVS package is installed.
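(Editor's note, not part of the original comment: as a rough illustration of that requirement, a check-and-install step on a RHEL 7 worker might look like the sketch below. The package name NetworkManager-ovs and the use of yum are assumptions here; the real fix is the openshift-ansible change referenced later in this bug.)

```bash
# Hypothetical sketch: ensure the NetworkManager OVS plugin is present on a
# RHEL 7 worker. Package name and repo availability are assumptions.
if ! rpm -q NetworkManager-ovs >/dev/null 2>&1; then
    yum -y install NetworkManager-ovs
    # The plugin is only loaded after NetworkManager restarts (or the node reboots),
    # so nmcli cannot manage OVS bridges/ports until then.
    systemctl restart NetworkManager
fi
```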
I see it is missing on this node:

```
sh-4.2# rpm -qa | grep NetworkMana
NetworkManager-libnm-1.18.8-1.el7.x86_64
NetworkManager-config-server-1.18.4-3.el7.noarch
NetworkManager-1.18.8-1.el7.x86_64
NetworkManager-tui-1.18.8-1.el7.x86_64
NetworkManager-team-1.18.8-1.el7.x86_64
```

Third, because this is RHEL, network.service and NetworkManager both exist on this machine. ovs-configuration waits until NetworkManager is done, but it does not wait for network.service, which may interfere with ovs-configuration. I can see during the ovs-configuration run:

```
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1184]: Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1185]: Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Device 'eth0' successfully disconnected.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal nm-dispatcher[829]: req:7 'down' [eth0]: start running ordered scripts...
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli c add type 802-3-ethernet conn.interface eth0 master ovs-port-phys0 con-name ovs-if-phys0 connection.autoconnect-priority 100 802-3-ethernet.mtu 9001
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 108.61.73.244 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 169.254.169.123 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 12.71.198.242 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 65.19.142.137 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 54.236.224.171 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info>  [1601498874.9700] ifcfg-rh: add connection /etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0 (affd7856-7ae6-40af-9cf0-4e63fe5598c2,"ovs-if-phys0")
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info>  [1601498874.9707] audit: op="connection-add" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1208 uid=0 result="success"
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Connection 'ovs-if-phys0' (affd7856-7ae6-40af-9cf0-4e63fe5598c2) successfully added.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli conn up ovs-if-phys0
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: [  OK  ]
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info>  [1601498874.9952] agent-manager: req[0x55cd48ea2690, :1.24/nmcli-connect/0]: agent registered
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info>  [1601498874.9962] audit: op="connection-activate" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1227 uid=0 result="fail" reason="Master connection 'ovs
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Error: Connection activation failed: Master connection 'ovs-if-phys0' can't be activated: No device available
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: ovs-configuration.service: main process exited, code=exited, status=4/NOPERMISSION
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Unit ovs-configuration.service entered failed state.
```

network.service is coming up at the same time ovs-configuration is running. In ovs-configuration we disconnect the eth0 device, create our new connection, and try to bring it up. However, during that time something has taken eth0 and brought it back up, presumably network.service. So we should add After=network.service in the MCO.

---

(In reply to Tim Rozet from comment #13)
> Second, whatever installs or configures (openshift-ansible?) these rhel
> nodes will need to ensure the NM ovs package is installed. I see it is
> missing on this node:
> sh-4.2# rpm -qa | grep NetworkMana
> NetworkManager-libnm-1.18.8-1.el7.x86_64
> NetworkManager-config-server-1.18.4-3.el7.noarch
> NetworkManager-1.18.8-1.el7.x86_64
> NetworkManager-tui-1.18.8-1.el7.x86_64
> NetworkManager-team-1.18.8-1.el7.x86_64

This was fixed by https://github.com/openshift/openshift-ansible/pull/12242

---

All OpenShift PRs have merged. We are waiting for RHEL 7 worker nodes to get https://bugzilla.redhat.com/show_bug.cgi?id=1871935. This will happen after the OpenShift 4.6 release date, but we can take this in a z-stream. Setting the target to 4.7, and I will clone this to track the backport to 4.6.z.

---

RHEL 7.9 was released on Nov 10th. Can you retest with RHEL 7.9 please? Thanks!

---

Thanks Anurag for providing a setup. NetworkManager is working fine. The issue now is that OVS is running in container mode, because our check here is failing: https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/006-ovs-node.yaml#L68

By simply doing:

```
sh-4.2# systemctl disable ovs-configuration
Removed symlink /etc/systemd/system/multi-user.target.wants/ovs-configuration.service.
sh-4.2# systemctl enabel ovs-configuration
Unknown operation 'enabel'.
sh-4.2# systemctl enable ovs-configuration
Created symlink from /etc/systemd/system/network-online.target.wants/ovs-configuration.service to /etc/systemd/system/ovs-configuration.service.
```

we can see it gets symlinked to the right place. I believe we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1885365 now.

One could argue that we could simply remove the CNO check in 4.7, since we never need to run containerized OVS, but that won't be a complete fix because we also need this to work in 4.6.
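(Editor's note: to illustrate the ordering issue described a few comments above, where adding After=network.service in the MCO is suggested, a minimal sketch of such an ordering override follows. The drop-in path and the approach are assumptions for illustration, not the actual MachineConfig change.)

```bash
# Hypothetical sketch only: make ovs-configuration wait for the legacy
# network.service initscript so it cannot re-activate eth0 while
# configure-ovs.sh is moving the interface under the br-ex OVS bridge.
mkdir -p /etc/systemd/system/ovs-configuration.service.d
cat > /etc/systemd/system/ovs-configuration.service.d/10-after-network.conf <<'EOF'
[Unit]
After=network.service
EOF
systemctl daemon-reload
```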
Now that https://bugzilla.redhat.com/show_bug.cgi?id=1885365 is fixed and verified, moving this back to MODIFIED. Anurag, can you please try to verify again?

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

---

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
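(Editor's note: for reference, a hedged sketch of how the fix might be spot-checked on a RHEL 7.9 worker; the commands and package/service names below are assumptions, not the QA team's actual verification steps.)

```bash
# Rough verification sketch on the RHEL 7.9 worker after the fixes land.
rpm -q NetworkManager NetworkManager-ovs         # NM 1.18.8+ plus the OVS plugin (name assumed)
systemctl status ovs-configuration --no-pager    # should have run and exited cleanly
ovs-vsctl br-exists br-ex && echo "br-ex exists" # bridge created by configure-ovs.sh
oc get pods -n openshift-ovn-kubernetes -o wide  # ovnkube-node on the RHEL worker should be Running
```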