Bug 2078866 - [BM][IPI] Installation with bonds fail - DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Release: 4.11.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
Duplicates: 2077052, 2077900
Blocks: 2089757
 
Reported: 2022-04-26 11:47 UTC by Yurii Prokulevych
Modified: 2023-09-15 01:54 UTC
CC: 19 users

Clones: 2089757
Last Closed: 2022-08-10 11:08:24 UTC




Links:
  GitHub: openshift/machine-config-operator pull 3120 (Merged) - Bug 2078866: configure-ovs: avoid restarting NetworkManager (updated 2022-09-11)
  Red Hat Product Errata: RHSA-2022:5069 (updated 2022-08-10)

Description Yurii Prokulevych 2022-04-26 11:47:37 UTC
Description of problem:
-----------------------
Installation of OCP-4.11 with bonded interfaces on the "baremetal" network fails:

...
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network ManagementStateDegraded is False with : "
time="2022-04-26T04:02:46-04:00" level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-8m45c is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-b7zz2 is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-kn9g2 is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - last change 2022-04-26T07:49:55Z"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"/openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready\nDaemonSet \"/openshift-multus/multus-admission-controller\" is waiting for other operators to become ready\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" is not available (awaiting 3 nodes)\nDaemonSet \"/openshift-network-diagnostics/network-check-target\" is waiting for other operators to become ready\nDeployment \"/openshift-network-diagnostics/network-check-source\" is waiting for other operators to become ready"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network  is  with : "
time="2022-04-26T04:02:46-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2022-04-26T04:02:46-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2022-04-26T04:02:46-04:00" level=info msg="Bootstrap gather logs captured here \"/home/kni/clusterconfigs/log-bundle-20220426040203.tar.gz\""


Version-Release number of selected component (if applicable):
=============================================================
4.11.0-0.nightly-2022-04-25-220649

How reproducible:
-----------------
So far 100%

Steps to Reproduce:
-------------------
1. Deploy baremetal OCP with bonded interfaces on "baremetal" network


Actual results:
===============
Deployment fails


Expected results:
=================
Deployment succeeds


Additional info:
================
Virtual setup: 3 masters + 2 workers; baremetal network IPv4; provisioning network IPv6

Comment 1 Yurii Prokulevych 2022-04-26 11:55:26 UTC
oc get po -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS             RESTARTS         AGE   IP                NODE         NOMINATED NODE   READINESS GATES
ovnkube-master-l5q26   6/6     Running            6 (3h59m ago)    4h    192.168.123.77    master-0-0   <none>           <none>
ovnkube-master-vdg2b   6/6     Running            0                4h    192.168.123.103   master-0-1   <none>           <none>
ovnkube-master-wnblg   6/6     Running            6 (3h59m ago)    4h    192.168.123.133   master-0-2   <none>           <none>
ovnkube-node-8m45c     4/5     CrashLoopBackOff   51 (2m55s ago)   4h    192.168.123.103   master-0-1   <none>           <none>
ovnkube-node-b7zz2     4/5     CrashLoopBackOff   51 (3m22s ago)   4h    192.168.123.133   master-0-2   <none>           <none>
ovnkube-node-kn9g2     4/5     CrashLoopBackOff   54 (2m39s ago)   4h    192.168.123.77    master-0-0   <none>           <none>

oc describe po -n openshift-ovn-kubernetes ovnkube-node-kn9g2
=============================================================
Name:                 ovnkube-node-kn9g2
Namespace:            openshift-ovn-kubernetes
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master-0-0/192.168.123.77
Start Time:           Tue, 26 Apr 2022 03:50:09 -0400
Labels:               app=ovnkube-node
                      component=network
                      controller-revision-hash=767d97d469
                      kubernetes.io/os=linux
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          networkoperator.openshift.io/ip-family-mode: single-stack
Status:               Running
IP:                   192.168.123.77
IPs:
  IP:           192.168.123.77
Controlled By:  DaemonSet/ovnkube-node
Containers:
  ovn-controller:
    Container ID:  cri-o://efeb78bca3e686a24b957bdac1285cec1ee5bf250e8d86c05e4c0dc147f96b87
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi

      echo "$(date -Iseconds) - starting ovn-controller"
      exec ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off \
        --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid \
        --syslog-method="null" \
        --log-file=/var/log/ovn/acl-audit-log.log \
        -vFACILITY:"local0" \
        -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
        -vconsole:"${OVN_LOG_LEVEL}" -vconsole:"acl_log:off" \
        -vPATTERN:console:"%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" \
        -vsyslog:"acl_log:info" \
        -vfile:"acl_log:info"
    State:          Running
      Started:      Tue, 26 Apr 2022 03:50:25 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  300Mi
    Environment:
      OVN_LOG_LEVEL:  info
      K8S_NODE:        (v1:spec.nodeName)
    Mounts:
      /dev/log from log-socket (rw)
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn/ from run-ovn (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)

  ovn-acl-logging:
    Container ID:  cri-o://3e9a34eb270e86663ef252046788cd5f4ae8ffca71c67832f98b20799960c1cf
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -euo pipefail

# Rotate audit log files when they get to max size (in bytes)
      MAXFILESIZE=$(( "50"*1000000 ))
      LOGFILE=/var/log/ovn/acl-audit-log.log
      CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)

      # Redirect err to null so no messages are shown upon rotation
      tail -F ${LOGFILE} 2> /dev/null &

      while true
      do
        # Make sure ovn-controller's logfile exists, and get current size in bytes
        if [ -f "$LOGFILE" ]; then
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        else
          ovs-appctl -t /var/run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        fi

        if [ $file_size -gt $MAXFILESIZE ];then
          echo "Rotating OVN ACL Log File"
          timestamp=`date '+%Y-%m-%dT%H-%M-%S'`
          mv ${LOGFILE} /var/log/ovn/acl-audit-log.$timestamp.log
          ovs-appctl -t /run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)
        fi

        # sleep for 30 seconds to avoid wasting CPU
        sleep 30
      done

    State:          Running
      Started:      Tue, 26 Apr 2022 03:50:25 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /run/ovn/ from run-ovn (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://4c5d506d60ab72a6830c70d9e10a4d293f41c433614a5361166dcb17f68b288b
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd
    Port:          9103/TCP
    Host Port:     9103/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-cert/tls.key
      TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      retries=0
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
      HAS_LOGGED_INFO=0
      log_missing_certs(){
          CUR_TS=$(date +%s)
          if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
            echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
          elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
            echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
            HAS_LOGGED_INFO=1
          fi
      }
      while [[ ! -f "${TLS_PK}" ||  ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done

      echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9103 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29103/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}

    State:          Running
      Started:      Tue, 26 Apr 2022 03:50:27 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)
  kube-rbac-proxy-ovn-metrics:
    Container ID:  cri-o://d28b57b17fc7c505153788007a5e850fce95c5a148c52ded163098b77374dd2d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd
    Port:          9105/TCP
    Host Port:     9105/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-cert/tls.key
      TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      retries=0
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
      HAS_LOGGED_INFO=0

      log_missing_certs(){
          CUR_TS=$(date +%s)
          if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
            echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
          elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
            echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
            HAS_LOGGED_INFO=1
          fi
      }
      while [[ ! -f "${TLS_PK}" ||  ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done

      echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9105 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29105/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}
    State:          Running
      Started:      Tue, 26 Apr 2022 03:50:27 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)
  ovnkube-node:
    Container ID:  cri-o://f8e370dd2f46e3760225de60fecb2472bab8b24c1ce978601de14acbe9cfae7f
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Port:          29103/TCP
    Host Port:     29103/TCP
    Command:
      /bin/bash
      -c
      set -xe
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
      ovn_config_namespace=openshift-ovn-kubernetes
      echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
      iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node"

      if [ "shared" == "shared" ]; then
        gateway_mode_flags="--gateway-mode shared --gateway-interface br-ex"
      elif [ "shared" == "local" ]; then
        gateway_mode_flags="--gateway-mode local --gateway-interface br-ex"
      else
        echo "Invalid OVN_GATEWAY_MODE: \"shared\". Must be \"local\" or \"shared\"."
        exit 1
      fi

      export_network_flows_flags=
      if [[ -n "${NETFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="--netflow-targets ${NETFLOW_COLLECTORS}"
      fi
      if [[ -n "${SFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --sflow-targets ${SFLOW_COLLECTORS}"
      fi
      if [[ -n "${IPFIX_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-targets ${IPFIX_COLLECTORS}"
      fi
      if [[ -n "${IPFIX_CACHE_MAX_FLOWS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-cache-max-flows ${IPFIX_CACHE_MAX_FLOWS}"
      fi
      if [[ -n "${IPFIX_CACHE_ACTIVE_TIMEOUT}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-cache-active-timeout ${IPFIX_CACHE_ACTIVE_TIMEOUT}"
      fi
      if [[ -n "${IPFIX_SAMPLING}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-sampling ${IPFIX_SAMPLING}"
      fi
      gw_interface_flag=
      # if br-ex1 is configured on the node, we want to use it for external gateway traffic
      if [ -d /sys/class/net/br-ex1 ]; then
        gw_interface_flag="--exgw-interface=br-ex1"
      fi

      node_mgmt_port_netdev_flags=
      if [[ -n "${OVNKUBE_NODE_MGMT_PORT_NETDEV}" ]] ; then
        node_mgmt_port_netdev_flags="--ovnkube-node-mgmt-port-netdev ${OVNKUBE_NODE_MGMT_PORT_NETDEV}"
      fi

      exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
        --nb-address "ssl:192.168.123.103:9641,ssl:192.168.123.133:9641,ssl:192.168.123.77:9641" \
        --sb-address "ssl:192.168.123.103:9642,ssl:192.168.123.133:9642,ssl:192.168.123.77:9642" \
        --nb-client-privkey /ovn-cert/tls.key \
        --nb-client-cert /ovn-cert/tls.crt \
        --nb-client-cacert /ovn-ca/ca-bundle.crt \
        --nb-cert-common-name "ovn" \
        --sb-client-privkey /ovn-cert/tls.key \
        --sb-client-cert /ovn-cert/tls.crt \
        --sb-client-cacert /ovn-ca/ca-bundle.crt \
        --sb-cert-common-name "ovn" \
        --config-file=/run/ovnkube-config/ovnkube.conf \
        --loglevel "${OVN_KUBE_LOG_LEVEL}" \
        --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
        ${gateway_mode_flags} \
        --metrics-bind-address "127.0.0.1:29103" \
        --ovn-metrics-bind-address "127.0.0.1:29105" \
        --metrics-enable-pprof \
        --disable-snat-multiple-gws \
        ${export_network_flows_flags} \
        ${gw_interface_flag}

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   133:9642,ssl:192.168.123.77:9642 --timeout=15 --columns=up list Port_Binding
I0426 11:48:05.579980  451912 ovs.go:205] Exec(3): stdout: "up                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup                  : false\n\nup    >
I0426 11:48:05.580014  451912 ovs.go:206] Exec(3): stderr: ""
I0426 11:48:05.580040  451912 node.go:312] Detected support for port binding with external IDs
I0426 11:48:05.580120  451912 ovs.go:202] Exec(4): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-master-0-0 -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_re>
I0426 11:48:05.588085  451912 ovs.go:205] Exec(4): stdout: ""
I0426 11:48:05.588103  451912 ovs.go:206] Exec(4): stderr: ""
I0426 11:48:05.588114  451912 ovs.go:202] Exec(5): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use
I0426 11:48:05.595051  451912 ovs.go:205] Exec(5): stdout: "\"6a:98:ef:ab:97:c5\"\n"
I0426 11:48:05.595069  451912 ovs.go:206] Exec(5): stderr: ""
I0426 11:48:05.595098  451912 ovs.go:202] Exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=6a\:98\:ef\:ab\:97\:c5
I0426 11:48:05.602049  451912 ovs.go:205] Exec(6): stdout: ""
I0426 11:48:05.602067  451912 ovs.go:206] Exec(6): stderr: ""
I0426 11:48:05.648189  451912 gateway_init.go:261] Initializing Gateway Functionality
I0426 11:48:05.648608  451912 gateway_localnet.go:163] Node local addresses initialized to: map[10.128.0.2:{10.128.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 192.168.123.77:{192.168.123.0 ffffff00} ::1:{::1 ffffffffffff>
F0426 11:48:05.648778  451912 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found

      Exit Code:    1
      Started:      Tue, 26 Apr 2022 07:48:05 -0400
      Finished:     Tue, 26 Apr 2022 07:48:05 -0400
    Ready:          False
    Restart Count:  54
    Requests:
      cpu:      10m
      memory:   300Mi
    Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:          6443
      KUBERNETES_SERVICE_HOST:          api-int.edge-0.qe.lab.redhat.com
      OVN_CONTROLLER_INACTIVITY_PROBE:  180000
      OVN_KUBE_LOG_LEVEL:               4
      K8S_NODE:                          (v1:spec.nodeName)
    Mounts:
      /cni-bin-dir from host-cni-bin (rw)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-netd (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /etc/systemd/system from systemd-units (ro)
      /host from host-slash (ro)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/netns from host-run-netns (ro)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
      /run/ovn/ from run-ovn (rw)
      /run/ovnkube-config/ from ovnkube-config (rw)
      /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  systemd-units:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/systemd/system
    HostPathType:
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:
  var-lib-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch/data
    HostPathType:
  etc-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/openvswitch
    HostPathType:
  run-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  run-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/ovn
    HostPathType:
  node-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/ovn
    HostPathType:
  log-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-run-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /run/ovn-kubernetes
    HostPathType:
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:
  host-cni-netd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:
  host-var-lib-cni-networks-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/ovn-k8s-cni-overlay
    HostPathType:
  ovnkube-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovnkube-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  ovn-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovn-ca
    Optional:  false
  ovn-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-cert
    Optional:    false
  ovn-node-metrics-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-node-metrics-cert
    Optional:    true
  kube-api-access-kghkz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   126m (x31 over 4h)      kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0" already present on machine
  Warning  BackOff  70s (x1117 over 3h59m)  kubelet  Back-off restarting failed container
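
The fatal ovnkube error above ("error looking up gw interface: \"br-ex\", error: Link not found") indicates configure-ovs never created the br-ex bridge. A hedged sketch of on-node checks to confirm that (all standard commands; the unit name matches comment 3 below):

# Confirm br-ex is missing on an affected node and inspect why configure-ovs failed.
ip link show br-ex    # expected to report "does not exist" on affected nodes
journalctl -b -u ovs-configuration.service --no-pager | tail -n 50
nmcli device status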

Comment 3 Ben Nemec 2022-04-26 15:46:03 UTC
How is the bond being configured? I don't see a networkConfig section in install-config. If the base image is being modified outside of the deployment process, there may be a permissions issue with the nmconnection files, since configure-ovs is failing with "ovs-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION".

The error from earlier seems to be related to bringing up br-ex: "Error: Connection activation failed: IP configuration could not be reserved (no available address, timeout, etc.)" I haven't seen that before so I can't say what would cause it, but the permission error return code is suspicious.
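
A hedged check for the NOPERMISSION theory (NetworkManager ignores keyfiles that are not root-owned with mode 0600):

# Inspect and, if needed, tighten permissions on the custom keyfiles, then reload.
ls -l /etc/NetworkManager/system-connections/
chown root:root /etc/NetworkManager/system-connections/*.nmconnection
chmod 600 /etc/NetworkManager/system-connections/*.nmconnection
nmcli connection reload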

Comment 9 Andreas Karis 2022-05-03 13:54:04 UTC
*** Bug 2077052 has been marked as a duplicate of this bug. ***

Comment 10 Mohamed Mahmoud 2022-05-06 15:53:13 UTC
*** Bug 2077900 has been marked as a duplicate of this bug. ***

Comment 12 Lalatendu Mohanty 2022-05-11 18:54:27 UTC
We're asking the following questions to evaluate whether this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update that introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

    reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
    example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
    example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?

    reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
    example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
    example: Up to 90 seconds of API downtime. Check with curl ....
    example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?

    reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
    example: Issue resolves itself after five minutes.
    example: Admin can run a single: oc ....
    example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?

    reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
    example: No, it has always been like this; we just never noticed.
    example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1.

Comment 13 Stephen Benjamin 2022-05-12 12:01:42 UTC
+@akaris marked https://bugzilla.redhat.com/show_bug.cgi?id=2077052 as a dupe of this. If they're both caused by NM 1.36.0, that team has provided a scratch build in https://bugzilla.redhat.com/show_bug.cgi?id=2077605#c19. Can we get an RHCOS version with that scratch-build RPM, and an OpenShift payload that uses it, to test?

Comment 18 W. Trevor King 2022-05-18 22:54:35 UTC
This is a 4.11 bug which doesn't seem to impact 4.10 code.  It's blocker+, so we'll have it fixed before we ship 4.11 with supported 4.10-to-4.11 updates.  That means we don't expect to make any graph-data changes based on this bug [1], so I'm dropping the keyword.  Feel free to add it back if I'm misunderstanding something.

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/update/update-blocker-lifecycle/README.md#summary

Comment 19 Ross Brattain 2022-05-24 23:00:58 UTC
Verified on 4.11.0-0.nightly-2022-05-20-213928 with IPI baremetal dual-stack bonding.


Verified using `autoconnect-priority=99`, as per bug 2055433, comment 1:

> The solution is to deploy the custom NM profiles with higher autoconnect-priority than the default (which is 0) so that the intended interfaces are always activated through those custom profiles instead of the default one.


/etc/NetworkManager/system-connections/enp6s0.nmconnection:
[connection]
id=enp6s0
type=ethernet
interface-name=enp6s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/enp5s0.nmconnection:
[connection]
id=enp5s0
type=ethernet
interface-name=enp5s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99




/etc/NetworkManager/system-connections/bond0.nmconnection:
[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99

[bond]
mode=802.3ad
miimon=100

[ipv4]
method=auto
dhcp-timeout=2147483647

[ipv6]
method=auto
dhcp-timeout=2147483647
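
For reference, the same priorities could be set with nmcli instead of editing the keyfiles directly (a sketch, assuming the connection ids from the keyfiles above):

# Raise the custom profiles above the default profile priority of 0.
nmcli connection modify bond0 connection.autoconnect-priority 99 connection.autoconnect-slaves 1
nmcli connection modify enp5s0 connection.autoconnect-priority 99
nmcli connection modify enp6s0 connection.autoconnect-priority 99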



May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local conn=enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Error: Connection activation failed: Unknown reason
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Hint: use 'journalctl -xe NM_CONNECTION=a9b3f226-7435-47bd-a431-d106c8fd8f39 + NM_DEVICE=enp5s0' to get more details.
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + s=4
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + sleep 5
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 2 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: Attempt 2 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/20)
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection enp5s0-slave-ovs-clone successfully'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Brought up connection enp5s0-slave-ovs-clone successfully
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp5s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection enp6s0-slave-ovs-clone already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection enp6s0-slave-ovs-clone already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp6s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection ovs-if-phys0 already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection ovs-if-phys0 already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-phys0 connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection ovs-if-br-ex'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli conn up ovs-if-br-ex
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/21)
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection ovs-if-br-ex successfully'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Brought up connection ovs-if-br-ex successfully
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' -f /etc/ovnk/extra_bridge ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + handle_exit
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + e=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'


May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via 192.168.123.1 dev br-ex proto dhcp metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: 192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.97 metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip -6 route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: ::1 dev lo proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::4d dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::/64 dev br-ex proto ra metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via fe80::5054:ff:fe97:2978 dev br-ex proto ra metric 48 pref medium
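
A short verification sketch mirroring the trace above (standard commands; output will differ per cluster):

# br-ex should hold the node IP and both default routes once configure-ovs succeeds.
ip route show default
ip -6 route show default
systemctl is-active ovs-configuration.service
nmcli -g GENERAL.STATE connection show ovs-if-br-ex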

Comment 21 errata-xmlrpc 2022-08-10 11:08:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 22 Red Hat Bugzilla 2023-09-15 01:54:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.

