Description of problem:
-----------------------
OVN migration to second interface failed due to ovnkube-node pod failing:

$ oc get po -n openshift-ovn-kubernetes -o wide | awk /openshift-master-1/
ovnkube-master-5sw2m   6/6   Running   12               109m   10.1.208.21   openshift-master-1   <none>   <none>
ovnkube-node-b6lmg     3/4   Running   13 (3m57s ago)   109m   10.1.208.21   openshift-master-1   <none>   <none>

[kni ~]$ oc describe po ovnkube-node-b6lmg -n openshift-ovn-kubernetes
Name:                 ovnkube-node-b6lmg
Namespace:            openshift-ovn-kubernetes
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 openshift-master-1/10.1.208.21
Start Time:           Thu, 02 Dec 2021 06:46:47 -0500
Labels:               app=ovnkube-node
                      component=network
                      controller-revision-hash=5895968979
                      kubernetes.io/os=linux
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          networkoperator.openshift.io/ip-family-mode: single-stack
Status:               Running
IP:                   10.1.208.21
IPs:
  IP:  10.1.208.21
Controlled By:  DaemonSet/ovnkube-node
Containers:
  ovn-controller:
    Container ID:  cri-o://68889fc9266ce191823569e745c5001e37d9c7929726d6723422532764bf2c31
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo "$(date -Iseconds) - starting ovn-controller"
      exec ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off \
        --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid \
        --syslog-method="null" \
        --log-file=/var/log/ovn/acl-audit-log.log \
        -vFACILITY:"local0" \
        -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
        -vconsole:"${OVN_LOG_LEVEL}" -vconsole:"acl_log:off" \
        -vPATTERN:console:"%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" \
        -vsyslog:"acl_log:info" \
        -vfile:"acl_log:info"
    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  300Mi
    Environment:
      OVN_LOG_LEVEL:  info
      K8S_NODE:       (v1:spec.nodeName)
    Mounts:
      /dev/log from log-socket (rw)
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn/ from run-ovn (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  ovn-acl-logging:
    Container ID:  cri-o://16092f57da4eb0329c0978a6a8ec905243e38d0622aae9c397ba1b2d026b16f7
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -euo pipefail
      # Rotate audit log files when they get to max size (in bytes)
      MAXFILESIZE=$(( "50"*1000000 ))
      LOGFILE=/var/log/ovn/acl-audit-log.log
      CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)
      # Redirect err to null so no messages are shown upon rotation
      tail -F ${LOGFILE} 2> /dev/null &
      while true
      do
        # Make sure ovn-controller's logfile exists, and get current size in bytes
        if [ -f "$LOGFILE" ]; then
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        else
          ovs-appctl -t /var/run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        fi
        if [ $file_size -gt $MAXFILESIZE ];then
          echo "Rotating OVN ACL Log File"
          timestamp=`date '+%Y-%m-%dT%H-%M-%S'`
          mv ${LOGFILE} /var/log/ovn/acl-audit-log.$timestamp.log
          ovs-appctl -t /run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)
        fi
        # sleep for 30 seconds to avoid wasting CPU
        sleep 30
      done
    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:  <none>
    Mounts:
      /run/ovn/ from run-ovn (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://df3da787138c488d1dfa74c998d1102a62da28d25a4de93a55246dc3cf4e27a7
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc
    Port:          9103/TCP
    Host Port:     9103/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-cert/tls.key
      TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      retries=0
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
      HAS_LOGGED_INFO=0
      log_missing_certs(){
        CUR_TS=$(date +%s)
        if [[ "${CUR_TS}" -gt "WARN_TS" ]]; then
          echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
        elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
          echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
          HAS_LOGGED_INFO=1
        fi
      }
      while [[ ! -f "${TLS_PK}" || ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done
      echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9103 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29103/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}
    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  ovnkube-node:
    Container ID:  cri-o://32cbd581af2f60dabb7cad4f8b46b26f3bd31c7ad1200ea9ddeb44b400f9d613
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          29103/TCP
    Host Port:     29103/TCP
    Command:
      /bin/bash
      -c
      set -xe
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db_ip addresses"
      cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
      ovn_config_namespace=openshift-ovn-kubernetes
      echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
      iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      retries=0
      while true; do
        # TODO: change to use '--request-timeout=30s', if https://github.com/kubernetes/kubernetes/issues/49343 is fixed.
        db_ip=$(timeout 30 kubectl get ep -n ${ovn_config_namespace} ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}')
        if [[ -n "${db_ip}" ]]; then
          break
        fi
        (( retries += 1 ))
        if [[ "${retries}" -gt 40 ]]; then
          echo "E$(date "+%m%d %H:%M:%S.%N") - db endpoint never came up"
          exit 1
        fi
        echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db endpoint"
        sleep 5
      done
      echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node db_ip ${db_ip}"
      if [ "shared" == "shared" ]; then
        gateway_mode_flags="--gateway-mode shared --gateway-interface br-ex"
      elif [ "shared" == "local" ]; then
        gateway_mode_flags="--gateway-mode local --gateway-interface br-ex"
      else
        echo "Invalid OVN_GATEWAY_MODE: \"shared\". Must be \"local\" or \"shared\"."
        exit 1
      fi
      export_network_flows_flags=
      if [[ -n "${NETFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="--netflow-targets ${NETFLOW_COLLECTORS}"
      fi
      if [[ -n "${SFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --sflow-targets ${SFLOW_COLLECTORS}"
      fi
      if [[ -n "${IPFIX_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-targets ${IPFIX_COLLECTORS}"
      fi
      gw_interface_flag=
      # if br-ex1 is configured on the node, we want to use it for external gateway traffic
      if [ -d /sys/class/net/br-ex1 ]; then
        gw_interface_flag="--exgw-interface=br-ex1"
      fi
      exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
        --nb-address "ssl:10.1.208.20:9641,ssl:10.1.208.21:9641,ssl:10.1.208.22:9641" \
        --sb-address "ssl:10.1.208.20:9642,ssl:10.1.208.21:9642,ssl:10.1.208.22:9642" \
        --nb-client-privkey /ovn-cert/tls.key \
        --nb-client-cert /ovn-cert/tls.crt \
        --nb-client-cacert /ovn-ca/ca-bundle.crt \
        --nb-cert-common-name "ovn" \
        --sb-client-privkey /ovn-cert/tls.key \
        --sb-client-cert /ovn-cert/tls.crt \
        --sb-client-cacert /ovn-ca/ca-bundle.crt \
        --sb-cert-common-name "ovn" \
        --config-file=/run/ovnkube-config/ovnkube.conf \
        --loglevel "${OVN_KUBE_LOG_LEVEL}" \
        --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
        ${gateway_mode_flags} \
        --metrics-bind-address "127.0.0.1:29103" \
        --metrics-enable-pprof \
        ${export_network_flows_flags} \
        ${gw_interface_flag}
    State:          Running
      Started:      Thu, 02 Dec 2021 08:32:53 -0500
    Last State:     Terminated
      Reason:       Error
      Message:      stderr: ""
I1202 13:32:30.332571  162643 ovs.go:206] exec(3036): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex1_openshift-master-1-to-br-int ofport
I1202 13:32:30.335837  162643 ovs.go:209] exec(3036): stdout: "2\n"
I1202 13:32:30.335844  162643 ovs.go:210] exec(3036): stderr: ""
I1202 13:32:30.335848  162643 gateway.go:264] Gateway is ready
I1202 13:32:30.335866  162643 ovs.go:206] exec(3037): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.9999.ctl connection-status
I1202 13:32:30.338144  162643 ovs.go:209] exec(3037): stdout: "connected\n"
I1202 13:32:30.338151  162643 ovs.go:210] exec(3037): stderr: ""
I1202 13:32:30.338154  162643 node.go:239] Node connection status = connected
I1202 13:32:30.338158  162643 ovs.go:206] exec(3038): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-int
I1202 13:32:30.341647  162643 ovs.go:209] exec(3038): stdout: ""
I1202 13:32:30.341654  162643 ovs.go:210] exec(3038): stderr: ""
I1202 13:32:30.341657  162643 ovs.go:206] exec(3039): /usr/bin/ovs-ofctl dump-aggregate br-int
I1202 13:32:30.344256  162643 ovs.go:209] exec(3039): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=3056957 byte_count=815706824 flow_count=1929\n"
I1202 13:32:30.344263  162643 ovs.go:210] exec(3039): stderr: ""
I1202 13:32:30.344272  162643 ovs.go:206] exec(3040): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex_openshift-master-1-to-br-int ofport
I1202 13:32:30.347453  162643 ovs.go:209] exec(3040): stdout: ""
I1202 13:32:30.347459  162643 ovs.go:210] exec(3040): stderr: ""
I1202 13:32:30.347464  162643 ovs.go:206] exec(3041): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex1_openshift-master-1-to-br-int ofport
I1202 13:32:30.350651  162643 ovs.go:209] exec(3041): stdout: "2\n"
I1202 13:32:30.350659  162643 ovs.go:210] exec(3041): stderr: ""
I1202 13:32:30.350661  162643 gateway.go:264] Gateway is ready
F1202 13:32:30.350680  162643 ovnkube.go:132] error waiting for node readiness: timed out waiting for the condition
      Exit Code:    1
      Started:      Thu, 02 Dec 2021 08:27:29 -0500
      Finished:     Thu, 02 Dec 2021 08:32:30 -0500
    Ready:          False
    Restart Count:  10
    Requests:
      cpu:     10m
      memory:  300Mi
    Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:          6443
      KUBERNETES_SERVICE_HOST:          api-int.kni-qe-4.lab.eng.rdu2.redhat.com
      OVN_CONTROLLER_INACTIVITY_PROBE:  180000
      OVN_KUBE_LOG_LEVEL:               4
      K8S_NODE:                         (v1:spec.nodeName)
    Mounts:
      /cni-bin-dir from host-cni-bin (rw)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-netd (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /etc/systemd/system from systemd-units (ro)
      /host from host-slash (ro)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/netns from host-run-netns (ro)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
      /run/ovn/ from run-ovn (rw)
      /run/ovnkube-config/ from ovnkube-config (rw)
      /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  systemd-units:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/systemd/system
    HostPathType:
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:
  var-lib-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch/data
    HostPathType:
  etc-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/openvswitch
    HostPathType:
  run-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  run-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/ovn
    HostPathType:
  node-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/ovn
    HostPathType:
  log-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-run-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /run/ovn-kubernetes
    HostPathType:
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:
  host-cni-netd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:
  host-var-lib-cni-networks-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/ovn-k8s-cni-overlay
    HostPathType:
  ovnkube-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovnkube-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  ovn-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovn-ca
    Optional:  false
  ovn-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-cert
    Optional:    false
  ovn-node-metrics-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-node-metrics-cert
    Optional:    true
  kube-api-access-vk2gn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     op=Exists
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    109m  default-scheduler  Successfully assigned openshift-ovn-kubernetes/ovnkube-node-b6lmg to openshift-master-1
  Warning  FailedMount  109m  kubelet            MountVolume.SetUp failed for volume "ovn-cert" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  109m  kubelet            MountVolume.SetUp failed for volume "ovn-ca" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  109m  kubelet            MountVolume.SetUp failed for volume "env-overrides" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling      109m  kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e"
  Normal   Pulled       109m  kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" in 7.482031219s
  Normal   Pulled       109m  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Started      109m  kubelet            Started container ovn-controller
  Normal   Created      109m  kubelet            Created container ovn-controller
  Normal   Created      109m  kubelet            Created container ovn-acl-logging
  Normal   Started      109m  kubelet            Started container ovn-acl-logging
  Normal   Pulling      109m  kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc"
  Normal   Pulled       109m  kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc" in 519.80815ms
  Normal   Created      109m  kubelet            Created container kube-rbac-proxy
  Normal   Started      109m  kubelet            Started container kube-rbac-proxy
  Normal   Pulled       109m  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Created      109m  kubelet            Created container ovnkube-node
  Normal   Started      109m  kubelet            Started container ovnkube-node
  Warning  Unhealthy     108m (x8 over 109m)  kubelet          Readiness probe failed:
  Warning  NodeNotReady  64m                  node-controller  Node is not ready
  Normal   Pulled        60m                  kubelet          Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Started       60m                  kubelet          Started container kube-rbac-proxy
  Normal   Created       60m                  kubelet          Created container ovn-controller
  Normal   Started       60m                  kubelet          Started container ovn-controller
  Normal   Pulled        60m                  kubelet          Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Created       60m                  kubelet          Created container ovn-acl-logging
  Normal   Started       60m                  kubelet          Started container ovn-acl-logging
  Normal   Created       60m                  kubelet          Created container kube-rbac-proxy
  Normal   Pulled        60m                  kubelet          Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc" already present on machine
  Normal   Created       60m                  kubelet          Created container ovnkube-node
  Normal   Started      60m                   kubelet          Started container ovnkube-node
  Normal   Pulled       50m (x3 over 60m)     kubelet          Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Warning  Unhealthy    45m                   kubelet          Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of fa319ba7878442f588697436d695599a6435e231f3213623611f99a1b9200e22 is running failed: container process not found
  Warning  BackOff      15m (x47 over 50m)    kubelet          Back-off restarting failed container
  Warning  Unhealthy    1s (x616 over 60m)    kubelet          Readiness probe failed:

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.10.0-0.nightly-2021-12-02-033910

Steps to Reproduce:
-------------------
1. Deploy BM IPI cluster
2. Start OVN migration to 2nd interface
3. Observe status of ovnkube pods in openshift-ovn-kubernetes

Actual results:
---------------
OVN migration fails. From machine-config-daemon pods on machine:

I1202 12:36:19.033534   10130 daemon.go:872] systemctl --failed: NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
I1202 12:36:19.033546   10130 daemon.go:608] Starting MachineConfigDaemon
I1202 12:36:19.033624   10130 daemon.go:615] Enabling Kubelet Healthz Monitor
I1202 12:36:48.954555   10130 trace.go:205] Trace[1002495192]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (02-Dec-2021 12:36:18.953) (total time: 30000ms):
Trace[1002495192]: [30.000891916s] [30.000891916s] END
E1202 12:36:48.954596   10130 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp>
I1202 12:36:48.954846   10130 trace.go:205] Trace[2118766229]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (02-Dec-2021 12:36:18.953) >
Trace[2118766229]: [30.001136739s] [30.001136739s] END

Expected results:
-----------------
OVN migration succeeds

Additional info:
----------------
Baremetal cluster: 3 masters + 4 workers
- baremetal network IPv4
- provisioning network IPv6
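Aside: the size-based rotation performed by the ovn-acl-logging container above can be sketched stand-alone. This is an illustration only, not the pod's script: the 1 KB threshold and mktemp paths are stand-ins, and truncation replaces the real script's "ask ovn-controller to reopen its log" step.

```shell
#!/bin/bash
# Minimal sketch of the ovn-acl-logging rotation check.
# Threshold and file paths are illustrative stand-ins.
set -euo pipefail

MAXFILESIZE=1024          # the pod uses 50*1000000 bytes
LOGFILE=$(mktemp)

rotate_if_needed() {
  local file_size timestamp
  file_size=$(du -b "${LOGFILE}" | cut -f1)   # apparent size in bytes
  if [ "${file_size}" -gt "${MAXFILESIZE}" ]; then
    timestamp=$(date '+%Y-%m-%dT%H-%M-%S')
    mv "${LOGFILE}" "${LOGFILE}.${timestamp}.log"
    : > "${LOGFILE}"      # the real pod instead has ovn-controller reopen the file
    echo rotated
  else
    echo kept
  fi
}

head -c 2048 /dev/zero > "${LOGFILE}"   # exceed the threshold
rotate_if_needed                        # prints "rotated"
head -c 10 /dev/zero > "${LOGFILE}"
rotate_if_needed                        # prints "kept"
```

The same check-then-rotate shape appears in the pod spec; the production version must notify ovn-controller via ovs-appctl vlog/reopen so the daemon switches to the new file descriptor.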
I think there's something odd in ovs. Here are my findings:

ovnkube-node complains about the missing patch port for br-ex:

I1202 15:04:22.319087  434538 ovs.go:206] exec(2095): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex_openshift-master-1-to-br-int ofport
I1202 15:04:22.322518  434538 ovs.go:209] exec(2095): stdout: ""
I1202 15:04:22.322524  434538 ovs.go:210] exec(2095): stderr: ""

There's a check that verifies the result being different from "":
https://github.com/openshift/ovn-kubernetes/blob/29b2532df414f4af3519a9da205146380a8dac55/go-controller/pkg/node/gateway.go#L258

The patch port is indeed not there:

[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list-ports br-ex1
bond0.113
patch-br-ex1_openshift-master-1-to-br-int
[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list-ports br-ex
bond0

The physnet option is set on the port:

_uuid               : 9201aac4-c512-42dd-9e46-4746c34296e1
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex_openshift-master-1
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : 0
type                : localnet
up                  : false

The mappings are in place:

[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list Open_vSwitch .
_uuid               : 0a6ea54b-e583-42eb-8d63-a9df6ab56eb0
bridges             : [0beb018b-4939-4a26-964d-530f2e433570, 45c2d5cd-3945-4250-9297-7f51a50e91dd, 7ac2fb5e-dd11-4d71-bbc7-a79b3e28b919, b3cc26fc-e209-486d-a959-1de6d33575f7]
cur_cfg             : 266
datapath_types      : [netdev, system]
datapaths           : {system=b272b3d3-2e73-4ca8-9257-b446c03cf558}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 20.11.1"
external_ids        : {hostname=openshift-master-1, ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-enable-lflow-cache="true", ovn-encap-ip="10.1.208.21", ovn-encap-type=geneve, ovn-limit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-openflow-probe-interval="180", ovn-remote="ssl:10.1.208.20:9642,ssl:10.1.208.21:9642,ssl:10.1.208.22:9642", ovn-remote-probe-interval="180000", rundir="/var/run/openvswitch", system-id="0d2b1da5-a235-463a-8055-c71272cacbf6"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 266
other_config        : {vlan-limit="0"}
ovs_version         : "2.15.2"
ssl                 : e37ab694-b1d6-4367-992b-e9e0b8f714b5
statistics          : {}
system_type         : rhcos
system_version      : "4.10"

This should trigger some magic in ovs to create the patch port, AFAIK.
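For reference, the lookup ovn-controller performs here is simple: a localnet port's options:network_name is matched against ovn-bridge-mappings to decide which bridge gets the patch port to br-int. A stand-alone sketch of that matching, using the mapping string from the node above (the function itself is an illustration, not ovn-controller code):

```shell
#!/bin/bash
# Sketch: resolve a localnet network_name to a bridge via ovn-bridge-mappings.
# The mapping value is copied from the Open_vSwitch record above.
set -euo pipefail

BRIDGE_MAPPINGS="physnet:br-ex,exgwphysnet:br-ex1"

bridge_for_network() {
  local network=$1 pair
  local IFS=','                       # split the mapping list on commas
  for pair in ${BRIDGE_MAPPINGS}; do
    if [ "${pair%%:*}" = "${network}" ]; then
      echo "${pair#*:}"               # bridge name after the colon
      return 0
    fi
  done
  return 1                            # no mapping: no patch port is created
}

bridge_for_network physnet       # prints br-ex, so patch-br-ex_<node>-to-br-int is expected
bridge_for_network exgwphysnet   # prints br-ex1
```

Since both mappings resolve here and network_name=physnet is set on the port, the missing patch-br-ex_openshift-master-1-to-br-int port points at something other than the mapping configuration.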
OVN SB chassis:

_uuid               : 6143a1d8-e278-41af-a60e-d0d2f697906b
encaps              : [9a7af96b-db0a-4fe1-9c23-4e70767b259d]
external_ids        : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="true", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname            : openshift-master-1
name                : "0d2b1da5-a235-463a-8055-c71272cacbf6"
nb_cfg              : 0
other_config        : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="true", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones     : []
vtep_logical_switches: []

NB switch ports:

[root@fedora ~]# ovn-nbctl list logical_switch_port br-ex1_openshift-master-1
_uuid               : 12ff7c78-5607-4f26-85a1-be2ddf3b4ee8
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex1_openshift-master-1
options             : {network_name=exgwphysnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : []
type                : localnet
up                  : false

[root@fedora ~]# ovn-nbctl list logical_switch_port 9201aac4-c512-42dd-9e46-4746c34296e1
_uuid               : 9201aac4-c512-42dd-9e46-4746c34296e1
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex_openshift-master-1
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : 0
type                : localnet
up                  : false
Per offline discussion, this is a bug on the ovn-kubernetes side, incorrectly assigning the same LSP to multiple logical switches.
[root@fedora ~]# ovn-nbctl show exgw-ext_openshift-master-1
switch 4825e419-f614-4ff2-b1e7-3dd506fa386a (exgw-ext_openshift-master-1)
    port br-ex1_openshift-master-1
        type: localnet
        addresses: ["unknown"]
    port etor-GR_openshift-master-1
        type: router
        addresses: ["0c:42:a1:ee:7f:12"]
        router-port: rtoe-GR_openshift-master-1
[root@fedora ~]# ovn-nbctl show ext_openshift-master-1
switch f361acea-b980-416b-b518-5fe24c6f3413 (ext_openshift-master-1)
    port etor-GR_openshift-master-1
        type: router
        addresses: ["0c:42:a1:ee:7f:12"]
        router-port: rtoe-GR_openshift-master-1
    port br-ex_openshift-master-1
        type: localnet
        addresses: ["unknown"]

The same router port is attached to multiple switches; this is a regression due to the libovsdb changes.
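The duplication is easy to flag mechanically: any port name that appears under more than one switch in ovn-nbctl show output is suspect. A stand-alone sketch (the here-doc reproduces a simplified version of the broken topology above; the parsing is illustrative, not an OVN tool):

```shell
#!/bin/bash
# Sketch: detect logical switch ports attached to more than one switch.
# Sample input mirrors the broken topology shown in this comment.
set -euo pipefail

nb_show=$(cat <<'EOF'
switch 4825e419 (exgw-ext_openshift-master-1)
  port br-ex1_openshift-master-1
  port etor-GR_openshift-master-1
switch f361acea (ext_openshift-master-1)
  port etor-GR_openshift-master-1
  port br-ex_openshift-master-1
EOF
)

# Port names are the second field of "  port <name>" lines; any name that
# sorts into adjacent duplicates is attached to multiple switches.
duplicates=$(awk '/^ *port /{print $2}' <<<"${nb_show}" | sort | uniq -d)
echo "${duplicates}"   # prints etor-GR_openshift-master-1
```

In a healthy topology (as in the verification comment below, on the fixed build) this pipeline prints nothing, because each gateway router gets distinct etor-GR_*/exgw-etor-GR_* ports per switch.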
Verified on 4.11.0-0.nightly-2022-05-20-213928, BM IPI dual-stack.

I see different routers on different switches:

exgw-rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com

sh-4.4# for f in $(ovn-nbctl show | awk '/ext_/ {gsub("[()]", "", $3) ; print $3}' | sort ) ; do ovn-nbctl show $f ; done
switch 131438be-352a-46f5-a4ce-c3e00ef26ed6 (exgw-ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port br-ex1_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
    port exgw-etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:12:c1:38"]
        router-port: exgw-rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
switch 904e96f7-9362-4737-a490-25b8997c647f (exgw-ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port exgw-etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:64:68:ed"]
        router-port: exgw-rtoe-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex1_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch ccc43ce6-2a4a-49d8-8fa1-37fee2c32a2d (exgw-ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port exgw-etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:3a:b8:e2"]
        router-port: exgw-rtoe-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex1_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch 4d4789d1-c0b5-44c9-aa2e-6e4c90dee061 (ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:12:c1:38"]
        router-port: rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch 68fa87b5-0a8b-4e4a-83d1-3aa1ac3c7be7 (ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port br-ex_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
    port etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:64:68:ed"]
        router-port: rtoe-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
switch d2b74a48-bdbf-4ba6-b632-4d6222cdd6ab (ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:3a:b8:e2"]
        router-port: rtoe-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]

sh-4.4# for f in $(ovn-nbctl show | awk '/ext_/ {gsub("[()]", "", $3) ; print $3}' | sort ) ; do echo -e "\n$f" ; ovn-nbctl lsp-list $f ; echo ; done

exgw-ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
42b0d986-d7c8-4b73-8af5-800b747172ea (br-ex1_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
ebea449e-a3d9-4780-88a9-f6830488c22b (exgw-etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)

exgw-ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
94d6b35e-ce13-4874-b7e9-1e94f0543b25 (br-ex1_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
7cfd0622-06f6-4d57-982a-460e6e861175 (exgw-etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)

exgw-ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
aa1368a6-3267-4620-9dca-16d71bbcd594 (br-ex1_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
7c0d1f38-9b14-4e49-8afc-46da64e9385d (exgw-etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)

ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
8f0e1d40-d1fe-408c-8dce-ac71532cb7b7 (br-ex_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
31e0c30a-bdb5-4ba4-ab7e-cfabb7e83bd5 (etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)

ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
103d0e52-f3a7-4f41-a843-98e4db199d2a (br-ex_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
62e3f17a-e051-46b4-a723-67d7605f3bae (etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)

ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
8c6441c6-b8a9-452f-aabf-448d65d78001 (br-ex_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
887e2721-5399-49b7-a704-37e151e863ab (etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)

I don't have F5, so I can't test that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069