Description of problem:
Scaling up a RHEL worker shows `ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found`

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-25-054214

How reproducible:
always

Steps to Reproduce:
1. Set up an OVN cluster
2. Scale up a RHEL worker
3. The ovnkube-node pod crashes with the error below

Actual results:
oc logs ovnkube-node-p84sr -n openshift-ovn-kubernetes -c ovnkube-node
+ [[ -f /env/ip-10-0-55-3.us-east-2.compute.internal ]]
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I0925 09:46:46.157434183 - waiting for db_ip addresses'
I0925 09:46:46.157434183 - waiting for db_ip addresses
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
++ date '+%m%d %H:%M:%S.%N'
I0925 09:46:46.239129050 - disable conntrack on geneve port
+ echo 'I0925 09:46:46.239129050 - disable conntrack on geneve port'
+ iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
+ iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=10.0.55.198
+ [[ -n 10.0.55.198 ]]
+ break
++ date '+%m%d %H:%M:%S.%N'
I0925 09:46:46.601231991 - starting ovnkube-node db_ip 10.0.55.198
+ echo 'I0925 09:46:46.601231991 - starting ovnkube-node db_ip 10.0.55.198'
+ gateway_mode_flags=
+ grep -q OVNKubernetes /etc/systemd/system/ovs-configuration.service
+ gateway_mode_flags='--gateway-mode local --gateway-interface br-ex'
+ exec /usr/bin/ovnkube --init-node ip-10-0-55-3.us-east-2.compute.internal --nb-address ssl:10.0.55.198:9641,ssl:10.0.60.130:9641,ssl:10.0.78.94:9641 --sb-address ssl:10.0.55.198:9642,ssl:10.0.60.130:9642,ssl:10.0.78.94:9642 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --inactivity-probe=30000 --gateway-mode local --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103
I0925 09:46:46.616581 3396 config.go:1286] Parsed config file /run/ovnkube-config/ovnkube.conf
I0925 09:46:46.616644 3396 config.go:1287] Parsed config: {Default:{MTU:8901 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.zzhao252.qe.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
I0925 09:46:46.621096 3396 reflector.go:175] Starting reflector *v1beta1.CustomResourceDefinition (0s) from k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117
I0925 09:46:46.621124 3396 reflector.go:211] Listing and watching *v1beta1.CustomResourceDefinition from k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117
I0925 09:46:46.821043 3396 shared_informer.go:253] caches populated
I0925 09:46:46.821272 3396 reflector.go:175] Starting reflector *v1.Endpoints (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821300 3396 reflector.go:211] Listing and watching *v1.Endpoints from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821415 3396 reflector.go:175] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821434 3396 reflector.go:211] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821415 3396 reflector.go:175] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821452 3396 reflector.go:211] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821476 3396 reflector.go:175] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821491 3396 reflector.go:211] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821758 3396 reflector.go:175] Starting reflector *v1.NetworkPolicy (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821773 3396 reflector.go:211] Listing and watching *v1.NetworkPolicy from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821796 3396 reflector.go:175] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821811 3396 reflector.go:211] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.921255 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921281 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921290 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921297 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921304 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921311 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921513 3396 reflector.go:175] Starting reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117
I0925 09:46:46.921534 3396 reflector.go:211] Listing and watching *v1.EgressIP from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117
I0925 09:46:47.021537 3396 shared_informer.go:253] caches populated
I0925 09:46:47.021632 3396 ovnkube.go:333] Watching config file /run/ovnkube-config/ovnkube.conf for changes
I0925 09:46:47.021698 3396 ovnkube.go:333] Watching config file /run/ovnkube-config/..2020_09_25_09_09_30.120247287/ovnkube.conf for changes
I0925 09:46:47.022001 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-nb="ssl:10.0.55.198:9641,ssl:10.0.60.130:9641,ssl:10.0.78.94:9641"
I0925 09:46:47.030755 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 del-ssl
I0925 09:46:47.039166 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set-ssl /ovn-cert/tls.key /ovn-cert/tls.crt /ovn-ca/ca-bundle.crt
I0925 09:46:47.047327 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-remote="ssl:10.0.55.198:9642,ssl:10.0.60.130:9642,ssl:10.0.78.94:9642"
I0925 09:46:47.059265 3396 ovs.go:162] exec(1): /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.0.55.3 external_ids:ovn-remote-probe-interval=30000 external_ids:ovn-openflow-probe-interval=180 external_ids:hostname="ip-10-0-55-3.us-east-2.compute.internal" external_ids:ovn-monitor-all=true
I0925 09:46:47.066830 3396 ovs.go:165] exec(1): stdout: ""
I0925 09:46:47.066853 3396 ovs.go:166] exec(1): stderr: ""
I0925 09:46:47.070762 3396 node.go:201] Node ip-10-0-55-3.us-east-2.compute.internal ready for ovn initialization with subnet 10.130.2.0/23
I0925 09:46:47.070907 3396 ovs.go:162] exec(2): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.10108.ctl connection-status
I0925 09:46:47.076141 3396 ovs.go:165] exec(2): stdout: "connected\n"
I0925 09:46:47.076165 3396 ovs.go:166] exec(2): stderr: ""
I0925 09:46:47.076186 3396 node.go:115] Node ip-10-0-55-3.us-east-2.compute.internal connection status = connected
I0925 09:46:47.076203 3396 ovs.go:162] exec(3): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-int
I0925 09:46:47.083364 3396 ovs.go:165] exec(3): stdout: ""
I0925 09:46:47.083383 3396 ovs.go:166] exec(3): stderr: ""
I0925 09:46:47.083393 3396 ovs.go:162] exec(4): /usr/bin/ovs-ofctl dump-aggregate br-int
I0925 09:46:47.088715 3396 ovs.go:165] exec(4): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=0 byte_count=0 flow_count=11\n"
I0925 09:46:47.088739 3396 ovs.go:166] exec(4): stderr: ""
I0925 09:46:47.088802 3396 healthcheck.go:142] Opening healthcheck "openshift-ingress/router-default" on port 32260
I0925 09:46:47.089436 3396 factory.go:660] Added *v1.Service event handler 1
I0925 09:46:47.089485 3396 healthcheck.go:222] Reporting 0 endpoints for healthcheck "openshift-ingress/router-default"
I0925 09:46:47.089507 3396 factory.go:660] Added *v1.Endpoints event handler 2
I0925 09:46:47.089540 3396 port_claim.go:124] Opening socket for service: openshift-ingress/router-default and port: 31858
I0925 09:46:47.089585 3396 port_claim.go:124] Opening socket for service: openshift-ingress/router-default and port: 30872
I0925 09:46:47.089640 3396 factory.go:660] Added *v1.Service event handler 3
I0925 09:46:47.089654 3396 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 32260
I0925 09:46:47.090045 3396 ovs.go:162] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0925 09:46:47.097513 3396 ovs.go:165] exec(5): stdout: ""
I0925 09:46:47.097536 3396 ovs.go:166] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0925 09:46:47.097544 3396 ovs.go:168] exec(5): err: exit status 1
I0925 09:46:47.097557 3396 ovs.go:162] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0925 09:46:47.104589 3396 ovs.go:165] exec(6): stdout: ""
I0925 09:46:47.104612 3396 ovs.go:166] exec(6): stderr: ""
I0925 09:46:47.104621 3396 ovs.go:168] exec(6): err: exit status 2
F0925 09:46:47.104714 3396 ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found
Expected results:
Scale-up should work.

Additional info:
The ovs-configuration service is inactive on the RHEL worker:

sh-4.2# journalctl -u ovs-configuration
-- No entries --
sh-4.2# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Fri 2020-09-25 09:09:22 UTC; 55min ago
           ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json was not met
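For reference, the missing bridge can be confirmed directly on the worker with standard tools (a sketch; run from a debug shell after chroot /host):

# the external bridge that ovnkube-node expects was never created
ip link show br-ex                  # expected to report that the device does not exist
ovs-vsctl br-exists br-ex; echo $?  # exit status 2 means no such bridge, matching exec(6) in the ovnkube-node log above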
Is this a RHEL7 or RHEL8 node?
Also, this implies that /etc/ignition-machine-config-encapsulated.json is present on the machine. That file should be removed by the machine-config-daemon when it restarts the node after configuring it. Perhaps the node hasn't been restarted after the MCD configuration has been done?
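If it helps, that state is easy to check on the node (a sketch; nothing here is specific to this cluster):

ls -l /etc/ignition-machine-config-encapsulated.json  # if present, ovs-configuration's ConditionPathExists=! check fails and the unit is skipped
last reboot | head -n 3                                # rough check of whether the node was actually restarted after the MCD configured it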
(In reply to Dan Williams from comment #2)
> Is this a RHEL7 or RHEL8 node?

The cluster was brought up on RHCOS nodes and a RHEL7 node was later scaled up onto it. I am trying to recreate the same cluster right now and can share it if required.
This issue is still NOT fixed on CI build 4.6.0-0.ci-2020-09-30-031307. The ignition-machine-config-encapsulated.json file does not exist on the RHEL node.

#oc debug node/ip-10-0-52-71.us-east-2.compute.internal
Starting pod/ip-10-0-52-71us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# ls /etc/*json
/etc/mcs-machine-config-content.json

oc get pod -n openshift-ovn-kubernetes -o wide | grep ip-10-0-52-71.us-east-2.compute.internal
ovnkube-node-kvwz8           1/2   CrashLoopBackOff   7   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovnkube-node-metrics-sjzf8   1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovs-node-sc6bg               1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>

oc logs ovnkube-node-kvwz8 --tail=10 -n openshift-ovn-kubernetes -c ovnkube-node
I0930 06:16:22.796841 21149 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 32479
I0930 06:16:22.797097 21149 ovs.go:164] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0930 06:16:22.804275 21149 ovs.go:167] exec(5): stdout: ""
I0930 06:16:22.804297 21149 ovs.go:168] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0930 06:16:22.804305 21149 ovs.go:170] exec(5): err: exit status 1
I0930 06:16:22.804319 21149 ovs.go:164] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0930 06:16:22.811339 21149 ovs.go:167] exec(6): stdout: ""
I0930 06:16:22.811360 21149 ovs.go:168] exec(6): stderr: ""
I0930 06:16:22.811366 21149 ovs.go:170] exec(6): err: exit status 2
F0930 06:16:22.811427 21149 ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found
Can you please check the ovs-configuration service as done in https://bugzilla.redhat.com/show_bug.cgi?id=1882667#c1? Moving this back to networking to investigate further, since removing the encapsulated json didn't work.
Zhanqi or Anurag, can you please get the system journal logs and the systemctl status for ovs-configuration?
Thanks to Ross, today we were able to reproduce. It looks like there are a couple of issues here.

First, to add RHEL nodes we need to have the fixed NM packages in RHEL 7.9.z: https://bugzilla.redhat.com/show_bug.cgi?id=1871935

Second, whatever installs or configures (openshift-ansible?) these RHEL nodes will need to ensure the NM ovs package is installed. I see it is missing on this node:

sh-4.2# rpm -qa | grep NetworkMana
NetworkManager-libnm-1.18.8-1.el7.x86_64
NetworkManager-config-server-1.18.4-3.el7.noarch
NetworkManager-1.18.8-1.el7.x86_64
NetworkManager-tui-1.18.8-1.el7.x86_64
NetworkManager-team-1.18.8-1.el7.x86_64

Third, because this is RHEL, network.service and NetworkManager both exist on this machine. ovs-configuration waits until NetworkManager is done, but it doesn't wait for network.service. I think this may interfere with ovs-configuration. I can see during the ovs-configuration run:

Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1184]: Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1185]: Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Device 'eth0' successfully disconnected.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal nm-dispatcher[829]: req:7 'down' [eth0]: start running ordered scripts...
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli c add type 802-3-ethernet conn.interface eth0 master ovs-port-phys0 con-name ovs-if-phys0 connection.autoconnect-priority 100 802-3-ethernet.mtu 9001
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 108.61.73.244 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 169.254.169.123 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 12.71.198.242 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 65.19.142.137 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 54.236.224.171 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9700] ifcfg-rh: add connection /etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0 (affd7856-7ae6-40af-9cf0-4e63fe5598c2,"ovs-if-phys0")
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9707] audit: op="connection-add" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1208 uid=0 result="success"
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Connection 'ovs-if-phys0' (affd7856-7ae6-40af-9cf0-4e63fe5598c2) successfully added.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli conn up ovs-if-phys0
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: [ OK ]
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9952] agent-manager: req[0x55cd48ea2690, :1.24/nmcli-connect/0]: agent registered
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9962] audit: op="connection-activate" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1227 uid=0 result="fail" reason="Master connection 'ovs
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Error: Connection activation failed: Master connection 'ovs-if-phys0' can't be activated: No device available
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: ovs-configuration.service: main process exited, code=exited, status=4/NOPERMISSION
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Unit ovs-configuration.service entered failed state.

network.service is coming up at the same time ovs-configuration is running. In ovs-configuration we disconnect the eth0 device, create our new connection, and try to bring it up. However, during that window something, presumably network.service, has taken eth0 and brought it back up. So we should add After=network.service to the unit in the MCO.
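For illustration only (the actual fix would go into the MCO-rendered unit, so the drop-in path below is hypothetical), the proposed ordering amounts to:

# hypothetical drop-in expressing After=network.service for ovs-configuration
mkdir -p /etc/systemd/system/ovs-configuration.service.d
cat <<'EOF' > /etc/systemd/system/ovs-configuration.service.d/10-after-network.conf
[Unit]
# do not start configure-ovs.sh until the legacy initscripts network.service is done,
# so the two stop racing for eth0
After=network.service
EOF
systemctl daemon-reload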
(In reply to Tim Rozet from comment #13)
> Second, whatever installs or configures (openshift-ansible?) these RHEL
> nodes will need to ensure the NM ovs package is installed. I see it is
> missing on this node:
> sh-4.2# rpm -qa | grep NetworkMana
> NetworkManager-libnm-1.18.8-1.el7.x86_64
> NetworkManager-config-server-1.18.4-3.el7.noarch
> NetworkManager-1.18.8-1.el7.x86_64
> NetworkManager-tui-1.18.8-1.el7.x86_64
> NetworkManager-team-1.18.8-1.el7.x86_64

This was fixed by https://github.com/openshift/openshift-ansible/pull/12242
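As a quick sanity check after scaling up with the updated playbooks (a sketch; assuming the plugin is packaged as NetworkManager-ovs):

rpm -q NetworkManager NetworkManager-ovs   # the ovs plugin package should now be installed alongside NetworkManager
systemctl restart NetworkManager           # NM loads plugins at startup, so restart it before ovs-configuration runs again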
All OpenShift PRs have merged. We are waiting for RHEL 7 worker nodes to pick up https://bugzilla.redhat.com/show_bug.cgi?id=1871935. That will happen after the OpenShift 4.6 release date, but we can take this in a z-stream. Setting the target to 4.7, and I will clone this bug to track the backport to 4.6.z.
RHEL 7.9 was released on Nov 10th. Can you retest with RHEL 7.9, please? Thanks!
Thanks Anurag for providing a setup. NetworkManager is working fine. The issue now is that OVS is running in container mode, because our check here is failing:
https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/006-ovs-node.yaml#L68

By simply doing:

sh-4.2# systemctl disable ovs-configuration
Removed symlink /etc/systemd/system/multi-user.target.wants/ovs-configuration.service.
sh-4.2# systemctl enabel ovs-configuration
Unknown operation 'enabel'.
sh-4.2# systemctl enable ovs-configuration
Created symlink from /etc/systemd/system/network-online.target.wants/ovs-configuration.service to /etc/systemd/system/ovs-configuration.service.

we can see it gets symlinked to the right place. I believe we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1885365 now.

One could argue that we could simply remove the CNO check in 4.7, since we never need to run containerized OVS, but that won't be a complete fix because we also need this to work in 4.6.
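For anyone re-verifying on a node, a quick way to see how the unit is wired up and whether it actually ran (a sketch using only the unit and target names shown above):

systemctl is-enabled ovs-configuration
ls -l /etc/systemd/system/network-online.target.wants/ovs-configuration.service   # should be the symlink created by 'systemctl enable' above
journalctl -u ovs-configuration --no-pager | tail -n 20                           # shows whether configure-ovs.sh ran and created br-ex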
Now that https://bugzilla.redhat.com/show_bug.cgi?id=1885365 is fixed and verified, moving this back to MODIFIED. Anurag can you please try to verify again?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days