Description of problem:
Scaling up a RHEL worker shows `ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found`

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-25-054214

How reproducible:
always

Steps to Reproduce:
1. Set up an OVN cluster
2. Scale up a RHEL worker
3. The ovnkube-node pod crashes with the error below

Actual results:
oc logs ovnkube-node-p84sr -n openshift-ovn-kubernetes -c ovnkube-node
+ [[ -f /env/ip-10-0-55-3.us-east-2.compute.internal ]]
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I0925 09:46:46.157434183 - waiting for db_ip addresses'
I0925 09:46:46.157434183 - waiting for db_ip addresses
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
++ date '+%m%d %H:%M:%S.%N'
I0925 09:46:46.239129050 - disable conntrack on geneve port
+ echo 'I0925 09:46:46.239129050 - disable conntrack on geneve port'
+ iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
+ iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=10.0.55.198
+ [[ -n 10.0.55.198 ]]
+ break
++ date '+%m%d %H:%M:%S.%N'
I0925 09:46:46.601231991 - starting ovnkube-node db_ip 10.0.55.198
+ echo 'I0925 09:46:46.601231991 - starting ovnkube-node db_ip 10.0.55.198'
+ gateway_mode_flags=
+ grep -q OVNKubernetes /etc/systemd/system/ovs-configuration.service
+ gateway_mode_flags='--gateway-mode local --gateway-interface br-ex'
+ exec /usr/bin/ovnkube --init-node ip-10-0-55-3.us-east-2.compute.internal --nb-address ssl:10.0.55.198:9641,ssl:10.0.60.130:9641,ssl:10.0.78.94:9641 --sb-address ssl:10.0.55.198:9642,ssl:10.0.60.130:9642,ssl:10.0.78.94:9642 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nb-cert-common-name ovn --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --config-file=/run/ovnkube-config/ovnkube.conf --loglevel 4 --inactivity-probe=30000 --gateway-mode local --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103
I0925 09:46:46.616581 3396 config.go:1286] Parsed config file /run/ovnkube-config/ovnkube.conf
I0925 09:46:46.616644 3396 config.go:1287] Parsed config: {Default:{MTU:8901 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.zzhao252.qe.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false externalID: exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789}}
I0925 09:46:46.621096 3396 reflector.go:175] Starting reflector *v1beta1.CustomResourceDefinition (0s) from k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117
I0925 09:46:46.621124 3396 reflector.go:211] Listing and watching *v1beta1.CustomResourceDefinition from k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117
I0925 09:46:46.821043 3396 shared_informer.go:253] caches populated
I0925 09:46:46.821272 3396 reflector.go:175] Starting reflector *v1.Endpoints (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821300 3396 reflector.go:211] Listing and watching *v1.Endpoints from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821415 3396 reflector.go:175] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821434 3396 reflector.go:211] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821415 3396 reflector.go:175] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821452 3396 reflector.go:211] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821476 3396 reflector.go:175] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821491 3396 reflector.go:211] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821758 3396 reflector.go:175] Starting reflector *v1.NetworkPolicy (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821773 3396 reflector.go:211] Listing and watching *v1.NetworkPolicy from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821796 3396 reflector.go:175] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.821811 3396 reflector.go:211] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:135
I0925 09:46:46.921255 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921281 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921290 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921297 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921304 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921311 3396 shared_informer.go:253] caches populated
I0925 09:46:46.921513 3396 reflector.go:175] Starting reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117
I0925 09:46:46.921534 3396 reflector.go:211] Listing and watching *v1.EgressIP from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:117
I0925 09:46:47.021537 3396 shared_informer.go:253] caches populated
I0925 09:46:47.021632 3396 ovnkube.go:333] Watching config file /run/ovnkube-config/ovnkube.conf for changes
I0925 09:46:47.021698 3396 ovnkube.go:333] Watching config file /run/ovnkube-config/..2020_09_25_09_09_30.120247287/ovnkube.conf for changes
I0925 09:46:47.022001 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-nb="ssl:10.0.55.198:9641,ssl:10.0.60.130:9641,ssl:10.0.78.94:9641"
I0925 09:46:47.030755 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 del-ssl
I0925 09:46:47.039166 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set-ssl /ovn-cert/tls.key /ovn-cert/tls.crt /ovn-ca/ca-bundle.crt
I0925 09:46:47.047327 3396 config.go:916] exec: /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-remote="ssl:10.0.55.198:9642,ssl:10.0.60.130:9642,ssl:10.0.78.94:9642"
I0925 09:46:47.059265 3396 ovs.go:162] exec(1): /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.0.55.3 external_ids:ovn-remote-probe-interval=30000 external_ids:ovn-openflow-probe-interval=180 external_ids:hostname="ip-10-0-55-3.us-east-2.compute.internal" external_ids:ovn-monitor-all=true
I0925 09:46:47.066830 3396 ovs.go:165] exec(1): stdout: ""
I0925 09:46:47.066853 3396 ovs.go:166] exec(1): stderr: ""
I0925 09:46:47.070762 3396 node.go:201] Node ip-10-0-55-3.us-east-2.compute.internal ready for ovn initialization with subnet 10.130.2.0/23
I0925 09:46:47.070907 3396 ovs.go:162] exec(2): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.10108.ctl connection-status
I0925 09:46:47.076141 3396 ovs.go:165] exec(2): stdout: "connected\n"
I0925 09:46:47.076165 3396 ovs.go:166] exec(2): stderr: ""
I0925 09:46:47.076186 3396 node.go:115] Node ip-10-0-55-3.us-east-2.compute.internal connection status = connected
I0925 09:46:47.076203 3396 ovs.go:162] exec(3): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-int
I0925 09:46:47.083364 3396 ovs.go:165] exec(3): stdout: ""
I0925 09:46:47.083383 3396 ovs.go:166] exec(3): stderr: ""
I0925 09:46:47.083393 3396 ovs.go:162] exec(4): /usr/bin/ovs-ofctl dump-aggregate br-int
I0925 09:46:47.088715 3396 ovs.go:165] exec(4): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=0 byte_count=0 flow_count=11\n"
I0925 09:46:47.088739 3396 ovs.go:166] exec(4): stderr: ""
I0925 09:46:47.088802 3396 healthcheck.go:142] Opening healthcheck "openshift-ingress/router-default" on port 32260
I0925 09:46:47.089436 3396 factory.go:660] Added *v1.Service event handler 1
I0925 09:46:47.089485 3396 healthcheck.go:222] Reporting 0 endpoints for healthcheck "openshift-ingress/router-default"
I0925 09:46:47.089507 3396 factory.go:660] Added *v1.Endpoints event handler 2
I0925 09:46:47.089540 3396 port_claim.go:124] Opening socket for service: openshift-ingress/router-default and port: 31858
I0925 09:46:47.089585 3396 port_claim.go:124] Opening socket for service: openshift-ingress/router-default and port: 30872
I0925 09:46:47.089640 3396 factory.go:660] Added *v1.Service event handler 3
I0925 09:46:47.089654 3396 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 32260
I0925 09:46:47.090045 3396 ovs.go:162] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0925 09:46:47.097513 3396 ovs.go:165] exec(5): stdout: ""
I0925 09:46:47.097536 3396 ovs.go:166] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0925 09:46:47.097544 3396 ovs.go:168] exec(5): err: exit status 1
I0925 09:46:47.097557 3396 ovs.go:162] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0925 09:46:47.104589 3396 ovs.go:165] exec(6): stdout: ""
I0925 09:46:47.104612 3396 ovs.go:166] exec(6): stderr: ""
I0925 09:46:47.104621 3396 ovs.go:168] exec(6): err: exit status 2
F0925 09:46:47.104714 3396 ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found
Expected results:
Scale-up should work.

Additional info:
The ovs-configuration service is inactive on the RHEL worker:

sh-4.2# journalctl -u ovs-configuration
-- No entries --
sh-4.2# systemctl status ovs-configuration
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Fri 2020-09-25 09:09:22 UTC; 55min ago
           ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json was not met
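For reference, the missing bridge can be confirmed directly on the worker with standard tools (a sketch; run from a debug shell after chroot /host):

# the external bridge that ovnkube-node expects was never created
ip link show br-ex                  # expected to report that the device does not exist
ovs-vsctl br-exists br-ex; echo $?  # exit status 2 means no such bridge, matching exec(6) in the ovnkube-node log above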
Is this a RHEL7 or RHEL8 node?
Also, this implies that /etc/ignition-machine-config-encapsulated.json is present on the machine. That file should be removed by the machine-config-daemon when it restarts the node after configuring it. Perhaps the node hasn't been restarted after the MCD configuration has been done?
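If it helps, that state is easy to check on the node (a sketch; nothing here is specific to this cluster):

ls -l /etc/ignition-machine-config-encapsulated.json  # if present, ovs-configuration's ConditionPathExists=! check fails and the unit is skipped
last reboot | head -n 3                                # rough check of whether the node was actually restarted after the MCD configured it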
(In reply to Dan Williams from comment #2)
> Is this a RHEL7 or RHEL8 node?

The cluster was brought up on RHCOS nodes and a RHEL7 node was later scaled up onto it. I am trying to recreate the same cluster right now and can share it if required.
This issue is still NOT fixed on CI build 4.6.0-0.ci-2020-09-30-031307. The ignition-machine-config-encapsulated.json file does not exist on the RHEL node.

#oc debug node/ip-10-0-52-71.us-east-2.compute.internal
Starting pod/ip-10-0-52-71us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# ls /etc/*json
/etc/mcs-machine-config-content.json

oc get pod -n openshift-ovn-kubernetes -o wide | grep ip-10-0-52-71.us-east-2.compute.internal
ovnkube-node-kvwz8           1/2   CrashLoopBackOff   7   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovnkube-node-metrics-sjzf8   1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>
ovs-node-sc6bg               1/1   Running            0   16m   10.0.52.71   ip-10-0-52-71.us-east-2.compute.internal   <none>   <none>

oc logs ovnkube-node-kvwz8 --tail=10 -n openshift-ovn-kubernetes -c ovnkube-node
I0930 06:16:22.796841 21149 healthcheck.go:167] Starting goroutine for healthcheck "openshift-ingress/router-default" on port 32479
I0930 06:16:22.797097 21149 ovs.go:164] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- port-to-br br-ex
I0930 06:16:22.804275 21149 ovs.go:167] exec(5): stdout: ""
I0930 06:16:22.804297 21149 ovs.go:168] exec(5): stderr: "ovs-vsctl: no port named br-ex\n"
I0930 06:16:22.804305 21149 ovs.go:170] exec(5): err: exit status 1
I0930 06:16:22.804319 21149 ovs.go:164] exec(6): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-ex
I0930 06:16:22.811339 21149 ovs.go:167] exec(6): stdout: ""
I0930 06:16:22.811360 21149 ovs.go:168] exec(6): stderr: ""
I0930 06:16:22.811366 21149 ovs.go:170] exec(6): err: exit status 2
F0930 06:16:22.811427 21149 ovnkube.go:130] failed to convert br-ex to OVS bridge: Link not found
Can you please check the ovs-configuration service as done in https://bugzilla.redhat.com/show_bug.cgi?id=1882667#c1? Moving this back to networking to investigate further, since removing the encapsulated json didn't work.
Zhanqi or Anurag, can you please get the system journal logs and the systemctl status for ovs-configuration?
Thanks to Ross, today we were able to reproduce. It looks like there are a couple of issues here.

First, to add RHEL nodes we need to have the fixed NM packages in RHEL 7.9.z: https://bugzilla.redhat.com/show_bug.cgi?id=1871935

Second, whatever installs or configures (openshift-ansible?) these RHEL nodes will need to ensure the NM ovs package is installed. I see it is missing on this node:

sh-4.2# rpm -qa | grep NetworkMana
NetworkManager-libnm-1.18.8-1.el7.x86_64
NetworkManager-config-server-1.18.4-3.el7.noarch
NetworkManager-1.18.8-1.el7.x86_64
NetworkManager-tui-1.18.8-1.el7.x86_64
NetworkManager-team-1.18.8-1.el7.x86_64

Third, because this is RHEL, network.service and NetworkManager both exist on this machine. ovs-configuration waits until NetworkManager is done, but it doesn't wait for network.service. I think this may interfere with ovs-configuration. I can see during the ovs-configuration run:

Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1184]: Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal /etc/sysconfig/network-scripts/ifup-ipv6[1185]: Please restart network with '/sbin/service network restart'
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Device 'eth0' successfully disconnected.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal nm-dispatcher[829]: req:7 'down' [eth0]: start running ordered scripts...
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli c add type 802-3-ethernet conn.interface eth0 master ovs-port-phys0 con-name ovs-if-phys0 connection.autoconnect-priority 100 802-3-ethernet.mtu 9001
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 108.61.73.244 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 169.254.169.123 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 12.71.198.242 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 65.19.142.137 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal chronyd[703]: Source 54.236.224.171 offline
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9700] ifcfg-rh: add connection /etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0 (affd7856-7ae6-40af-9cf0-4e63fe5598c2,"ovs-if-phys0")
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9707] audit: op="connection-add" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1208 uid=0 result="success"
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Connection 'ovs-if-phys0' (affd7856-7ae6-40af-9cf0-4e63fe5598c2) successfully added.
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: + nmcli conn up ovs-if-phys0
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal network[1031]: [ OK ]
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9952] agent-manager: req[0x55cd48ea2690, :1.24/nmcli-connect/0]: agent registered
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal NetworkManager[765]: <info> [1601498874.9962] audit: op="connection-activate" uuid="affd7856-7ae6-40af-9cf0-4e63fe5598c2" name="ovs-if-phys0" pid=1227 uid=0 result="fail" reason="Master connection 'ovs
Sep 30 20:47:54 ip-10-0-53-121.us-east-2.compute.internal configure-ovs.sh[1040]: Error: Connection activation failed: Master connection 'ovs-if-phys0' can't be activated: No device available
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: ovs-configuration.service: main process exited, code=exited, status=4/NOPERMISSION
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Sep 30 20:47:55 ip-10-0-53-121.us-east-2.compute.internal systemd[1]: Unit ovs-configuration.service entered failed state.

network.service is coming up at the same time ovs-configuration is running. In ovs-configuration we disconnect the eth0 device, create our new connection, and try to bring it up. However, during that window something, presumably network.service, has taken eth0 and brought it back up. So we should add After=network.service to the unit in the MCO.
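For illustration only (the actual fix would go into the MCO-rendered unit, so the drop-in path below is hypothetical), the proposed ordering amounts to:

# hypothetical drop-in expressing After=network.service for ovs-configuration
mkdir -p /etc/systemd/system/ovs-configuration.service.d
cat <<'EOF' > /etc/systemd/system/ovs-configuration.service.d/10-after-network.conf
[Unit]
# do not start configure-ovs.sh until the legacy initscripts network.service is done,
# so the two stop racing for eth0
After=network.service
EOF
systemctl daemon-reload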
(In reply to Tim Rozet from comment #13)
> Second, whatever installs or configures (openshift-ansible?) these RHEL
> nodes will need to ensure the NM ovs package is installed. I see it is
> missing on this node:
> sh-4.2# rpm -qa | grep NetworkMana
> NetworkManager-libnm-1.18.8-1.el7.x86_64
> NetworkManager-config-server-1.18.4-3.el7.noarch
> NetworkManager-1.18.8-1.el7.x86_64
> NetworkManager-tui-1.18.8-1.el7.x86_64
> NetworkManager-team-1.18.8-1.el7.x86_64

This was fixed by https://github.com/openshift/openshift-ansible/pull/12242
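As a quick sanity check after scaling up with the updated playbooks (a sketch; assuming the plugin is packaged as NetworkManager-ovs):

rpm -q NetworkManager NetworkManager-ovs   # the ovs plugin package should now be installed alongside NetworkManager
systemctl restart NetworkManager           # NM loads plugins at startup, so restart it before ovs-configuration runs again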
All OpenShift PRs have merged. We are waiting for RHEL 7 worker nodes to pick up https://bugzilla.redhat.com/show_bug.cgi?id=1871935. That will happen after the OpenShift 4.6 release date, but we can take this in a z-stream. Setting the target to 4.7, and I will clone this bug to track the backport to 4.6.z.
RHEL 7.9 was released on Nov 10th. Can you retest with RHEL 7.9, please? Thanks!
Thanks Anurag for providing a setup. NetworkManager is working fine. The issue now is that OVS is running in container mode, because our check here is failing:
https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/006-ovs-node.yaml#L68

By simply doing:

sh-4.2# systemctl disable ovs-configuration
Removed symlink /etc/systemd/system/multi-user.target.wants/ovs-configuration.service.
sh-4.2# systemctl enabel ovs-configuration
Unknown operation 'enabel'.
sh-4.2# systemctl enable ovs-configuration
Created symlink from /etc/systemd/system/network-online.target.wants/ovs-configuration.service to /etc/systemd/system/ovs-configuration.service.

we can see it gets symlinked to the right place. I believe we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1885365 now.

One could argue that we could simply remove the CNO check in 4.7, since we never need to run containerized OVS, but that won't be a complete fix because we also need this to work in 4.6.
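For anyone re-verifying on a node, a quick way to see how the unit is wired up and whether it actually ran (a sketch using only the unit and target names shown above):

systemctl is-enabled ovs-configuration
ls -l /etc/systemd/system/network-online.target.wants/ovs-configuration.service   # should be the symlink created by 'systemctl enable' above
journalctl -u ovs-configuration --no-pager | tail -n 20                           # shows whether configure-ovs.sh ran and created br-ex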
Now that https://bugzilla.redhat.com/show_bug.cgi?id=1885365 is fixed and verified, moving this back to MODIFIED. Anurag can you please try to verify again?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days