Description of problem:
-----------------------
Installation of OCP 4.11 with bonded interfaces on the "baremetal" network fails:

...
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network ManagementStateDegraded is False with : "
time="2022-04-26T04:02:46-04:00" level=error msg="Cluster operator network Degraded is True with RolloutHung: DaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-8m45c is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-b7zz2 is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - pod ovnkube-node-kn9g2 is in CrashLoopBackOff State\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" rollout is not making progress - last change 2022-04-26T07:49:55Z"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"/openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready\nDaemonSet \"/openshift-multus/multus-admission-controller\" is waiting for other operators to become ready\nDaemonSet \"/openshift-ovn-kubernetes/ovnkube-node\" is not available (awaiting 3 nodes)\nDaemonSet \"/openshift-network-diagnostics/network-check-target\" is waiting for other operators to become ready\nDeployment \"/openshift-network-diagnostics/network-check-source\" is waiting for other operators to become ready"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network Available is False with Startup: The network is starting up"
time="2022-04-26T04:02:46-04:00" level=info msg="Cluster operator network is with : "
time="2022-04-26T04:02:46-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2022-04-26T04:02:46-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2022-04-26T04:02:46-04:00" level=info msg="Bootstrap gather logs captured here \"/home/kni/clusterconfigs/log-bundle-20220426040203.tar.gz\""

Version-Release number of selected component (if applicable):
=============================================================
4.11.0-0.nightly-2022-04-25-220649

How reproducible:
-----------------
So far 100%

Steps to Reproduce:
-------------------
1. Deploy baremetal OCP with bonded interfaces on the "baremetal" network

Actual results:
===============
Deployment fails

Expected results:
=================
Deployment succeeds

Additional info:
================
Virtual setup: 3 masters + 2 workers; baremetal network IPv4; provisioning network IPv6
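For anyone hitting the same failure, the evidence pasted in the next comments was gathered roughly as follows. This is only a sketch; the pod and node names are examples from this cluster, so adjust them to your environment:

# Check the network cluster operator and the OVN-Kubernetes pods
oc get clusteroperator network
oc -n openshift-ovn-kubernetes get pods -o wide

# Inspect the crashing ovnkube-node container on one of the masters
oc -n openshift-ovn-kubernetes describe pod ovnkube-node-kn9g2
oc -n openshift-ovn-kubernetes logs ovnkube-node-kn9g2 -c ovnkube-node --previous

# On the node itself, check why br-ex was not created
ssh core@master-0-0 'sudo journalctl -u ovs-configuration.service -b --no-pager'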
oc get po -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS             RESTARTS         AGE   IP                NODE         NOMINATED NODE   READINESS GATES
ovnkube-master-l5q26   6/6     Running            6 (3h59m ago)    4h    192.168.123.77    master-0-0   <none>           <none>
ovnkube-master-vdg2b   6/6     Running            0                4h    192.168.123.103   master-0-1   <none>           <none>
ovnkube-master-wnblg   6/6     Running            6 (3h59m ago)    4h    192.168.123.133   master-0-2   <none>           <none>
ovnkube-node-8m45c     4/5     CrashLoopBackOff   51 (2m55s ago)   4h    192.168.123.103   master-0-1   <none>           <none>
ovnkube-node-b7zz2     4/5     CrashLoopBackOff   51 (3m22s ago)   4h    192.168.123.133   master-0-2   <none>           <none>
ovnkube-node-kn9g2     4/5     CrashLoopBackOff   54 (2m39s ago)   4h    192.168.123.77    master-0-0   <none>           <none>

oc describe po -n openshift-ovn-kubernetes ovnkube-node-kn9g2
=============================================================
Name:                 ovnkube-node-kn9g2
Namespace:            openshift-ovn-kubernetes
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master-0-0/192.168.123.77
Start Time:           Tue, 26 Apr 2022 03:50:09 -0400
Labels:               app=ovnkube-node
                      component=network
                      controller-revision-hash=767d97d469
                      kubernetes.io/os=linux
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          networkoperator.openshift.io/ip-family-mode: single-stack
Status:               Running
IP:                   192.168.123.77
IPs:
  IP:           192.168.123.77
Controlled By:  DaemonSet/ovnkube-node
Containers:
  ovn-controller:
    Container ID:  cri-o://efeb78bca3e686a24b957bdac1285cec1ee5bf250e8d86c05e4c0dc147f96b87
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo "$(date -Iseconds) - starting ovn-controller"
      exec ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off \
        --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid \
        --syslog-method="null" \
        --log-file=/var/log/ovn/acl-audit-log.log \
        -vFACILITY:"local0" \
        -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
        -vconsole:"${OVN_LOG_LEVEL}" -vconsole:"acl_log:off" \
        -vPATTERN:console:"%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" \
        -vsyslog:"acl_log:info" \
        -vfile:"acl_log:info"
    State:          Running
      Started:      Tue, 26 Apr 2022 03:50:25 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  300Mi
    Environment:
      OVN_LOG_LEVEL:  info
      K8S_NODE:       (v1:spec.nodeName)
    Mounts:
      /dev/log from log-socket (rw)
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn/ from run-ovn (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro)
  ovn-acl-logging:
    Container ID:  cri-o://3e9a34eb270e86663ef252046788cd5f4ae8ffca71c67832f98b20799960c1cf
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -euo pipefail
      # Rotate audit log files when then get to max size (in bytes)
      MAXFILESIZE=$(( "50"*1000000 ))
      LOGFILE=/var/log/ovn/acl-audit-log.log
CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid) # Redirect err to null so no messages are shown upon rotation tail -F ${LOGFILE} 2> /dev/null & while true do # Make sure ovn-controller's logfile exists, and get current size in bytes if [ -f "$LOGFILE" ]; then file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1` else ovs-appctl -t /var/run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1` fi if [ $file_size -gt $MAXFILESIZE ];then echo "Rotating OVN ACL Log File" timestamp=`date '+%Y-%m-%dT%H-%M-%S'` mv ${LOGFILE} /var/log/ovn/acl-audit-log.$timestamp.log ovs-appctl -t /run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid) fi # sleep for 30 seconds to avoid wasting CPU sleep 30 done State: Running Started: Tue, 26 Apr 2022 03:50:25 -0400 Ready: True Restart Count: 0 Requests: cpu: 10m memory: 20Mi Environment: <none> Mounts: /run/ovn/ from run-ovn (rw) /var/log/ovn from node-log (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro) kube-rbac-proxy: Container ID: cri-o://4c5d506d60ab72a6830c70d9e10a4d293f41c433614a5361166dcb17f68b288b Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd Port: 9103/TCP Host Port: 9103/TCP Command: /bin/bash -c #!/bin/bash set -euo pipefail TLS_PK=/etc/pki/tls/metrics-cert/tls.key TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt # As the secret mount is optional we must wait for the files to be present. # The service is created in monitor.yaml and this is created in sdn.yaml. # If it isn't created there is probably an issue so we want to crashloop. retries=0 TS=$(date +%s) WARN_TS=$(( ${TS} + $(( 20 * 60)) )) HAS_LOGGED_INFO=0 log_missing_certs(){ CUR_TS=$(date +%s) if [[ "${CUR_TS}" -gt "WARN_TS" ]]; then echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes. elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour. HAS_LOGGED_INFO=1 fi } while [[ ! -f "${TLS_PK}" || ! 
-f "${TLS_CERT}" ]] ; do log_missing_certs sleep 5 done echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy exec /usr/bin/kube-rbac-proxy \ --logtostderr \ --secure-listen-address=:9103 \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \ --upstream=http://127.0.0.1:29103/ \ --tls-private-key-file=${TLS_PK} \ --tls-cert-file=${TLS_CERT} State: Running Started: Tue, 26 Apr 2022 03:50:27 -0400 Ready: True Restart Count: 0 Requests: cpu: 10m memory: 20Mi Environment: <none> Mounts: /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro) kube-rbac-proxy-ovn-metrics: Container ID: cri-o://d28b57b17fc7c505153788007a5e850fce95c5a148c52ded163098b77374dd2d Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:352b9150d76987c29c7ec9a634b848553c363de7b6ab7b5587b3e93aeed858cd Port: 9105/TCP Host Port: 9105/TCP Command: /bin/bash -c #!/bin/bash set -euo pipefail TLS_PK=/etc/pki/tls/metrics-cert/tls.key TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt # As the secret mount is optional we must wait for the files to be present. # The service is created in monitor.yaml and this is created in sdn.yaml. # If it isn't created there is probably an issue so we want to crashloop. retries=0 TS=$(date +%s) WARN_TS=$(( ${TS} + $(( 20 * 60)) )) HAS_LOGGED_INFO=0 log_missing_certs(){ CUR_TS=$(date +%s) if [[ "${CUR_TS}" -gt "WARN_TS" ]]; then echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes. elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour. HAS_LOGGED_INFO=1 fi } while [[ ! -f "${TLS_PK}" || ! 
-f "${TLS_CERT}" ]] ; do log_missing_certs sleep 5 done echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy exec /usr/bin/kube-rbac-proxy \ --logtostderr \ --secure-listen-address=:9105 \ --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \ --upstream=http://127.0.0.1:29105/ \ --tls-private-key-file=${TLS_PK} \ --tls-cert-file=${TLS_CERT} State: Running Started: Tue, 26 Apr 2022 03:50:27 -0400 Ready: True Restart Count: 0 Requests: cpu: 10m memory: 20Mi Environment: <none> Mounts: /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro) ovnkube-node: Container ID: cri-o://f8e370dd2f46e3760225de60fecb2472bab8b24c1ce978601de14acbe9cfae7f Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0 Port: 29103/TCP Host Port: 29103/TCP Command: /bin/bash -c set -xe if [[ -f "/env/${K8S_NODE}" ]]; then set -o allexport source "/env/${K8S_NODE}" set +o allexport fi cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/ ovn_config_namespace=openshift-ovn-kubernetes echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port" iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK ip6tables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK ip6tables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node" if [ "shared" == "shared" ]; then gateway_mode_flags="--gateway-mode shared --gateway-interface br-ex" elif [ "shared" == "local" ]; then gateway_mode_flags="--gateway-mode local --gateway-interface br-ex" else echo "Invalid OVN_GATEWAY_MODE: \"shared\". Must be \"local\" or \"shared\"." 
exit 1 fi export_network_flows_flags= if [[ -n "${NETFLOW_COLLECTORS}" ]] ; then export_network_flows_flags="--netflow-targets ${NETFLOW_COLLECTORS}" fi if [[ -n "${SFLOW_COLLECTORS}" ]] ; then export_network_flows_flags="$export_network_flows_flags --sflow-targets ${SFLOW_COLLECTORS}" fi if [[ -n "${IPFIX_COLLECTORS}" ]] ; then export_network_flows_flags="$export_network_flows_flags --ipfix-targets ${IPFIX_COLLECTORS}" fi if [[ -n "${IPFIX_CACHE_MAX_FLOWS}" ]] ; then export_network_flows_flags="$export_network_flows_flags --ipfix-cache-max-flows ${IPFIX_CACHE_MAX_FLOWS}" fi if [[ -n "${IPFIX_CACHE_ACTIVE_TIMEOUT}" ]] ; then export_network_flows_flags="$export_network_flows_flags --ipfix-cache-active-timeout ${IPFIX_CACHE_ACTIVE_TIMEOUT}" fi if [[ -n "${IPFIX_SAMPLING}" ]] ; then export_network_flows_flags="$export_network_flows_flags --ipfix-sampling ${IPFIX_SAMPLING}" fi gw_interface_flag= # if br-ex1 is configured on the node, we want to use it for external gateway traffic if [ -d /sys/class/net/br-ex1 ]; then gw_interface_flag="--exgw-interface=br-ex1" fi node_mgmt_port_netdev_flags= if [[ -n "${OVNKUBE_NODE_MGMT_PORT_NETDEV}" ]] ; then node_mgmt_port_netdev_flags="--ovnkube-node-mgmt-port-netdev ${OVNKUBE_NODE_MGMT_PORT_NETDEV}" fi exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \ --nb-address "ssl:192.168.123.103:9641,ssl:192.168.123.133:9641,ssl:192.168.123.77:9641" \ --sb-address "ssl:192.168.123.103:9642,ssl:192.168.123.133:9642,ssl:192.168.123.77:9642" \ --nb-client-privkey /ovn-cert/tls.key \ --nb-client-cert /ovn-cert/tls.crt \ --nb-client-cacert /ovn-ca/ca-bundle.crt \ --nb-cert-common-name "ovn" \ --sb-client-privkey /ovn-cert/tls.key \ --sb-client-cert /ovn-cert/tls.crt \ --sb-client-cacert /ovn-ca/ca-bundle.crt \ --sb-cert-common-name "ovn" \ --config-file=/run/ovnkube-config/ovnkube.conf \ --loglevel "${OVN_KUBE_LOG_LEVEL}" \ --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \ ${gateway_mode_flags} \ --metrics-bind-address "127.0.0.1:29103" \ --ovn-metrics-bind-address "127.0.0.1:29105" \ --metrics-enable-pprof \ --disable-snat-multiple-gws \ ${export_network_flows_flags} \ ${gw_interface_flag} State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: 133:9642,ssl:192.168.123.77:9642 --timeout=15 --columns=up list Port_Binding I0426 11:48:05.579980 451912 ovs.go:205] Exec(3): stdout: "up : false\n\nup : false\n\nup : false\n\nup : false\n\nup : false\n\nup > I0426 11:48:05.580014 451912 ovs.go:206] Exec(3): stderr: "" I0426 11:48:05.580040 451912 node.go:312] Detected support for port binding with external IDs I0426 11:48:05.580120 451912 ovs.go:202] Exec(4): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-master-0-0 -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_re> I0426 11:48:05.588085 451912 ovs.go:205] Exec(4): stdout: "" I0426 11:48:05.588103 451912 ovs.go:206] Exec(4): stderr: "" I0426 11:48:05.588114 451912 ovs.go:202] Exec(5): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use I0426 11:48:05.595051 451912 ovs.go:205] Exec(5): stdout: "\"6a:98:ef:ab:97:c5\"\n" I0426 11:48:05.595069 451912 ovs.go:206] Exec(5): stderr: "" I0426 11:48:05.595098 451912 ovs.go:202] Exec(6): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=6a\:98\:ef\:ab\:97\:c5 I0426 11:48:05.602049 451912 ovs.go:205] Exec(6): stdout: "" I0426 11:48:05.602067 451912 ovs.go:206] Exec(6): stderr: "" I0426 11:48:05.648189 451912 gateway_init.go:261] Initializing Gateway 
Functionality I0426 11:48:05.648608 451912 gateway_localnet.go:163] Node local addresses initialized to: map[10.128.0.2:{10.128.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 192.168.123.77:{192.168.123.0 ffffff00} ::1:{::1 ffffffffffff> F0426 11:48:05.648778 451912 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found Exit Code: 1 Started: Tue, 26 Apr 2022 07:48:05 -0400 Finished: Tue, 26 Apr 2022 07:48:05 -0400 Ready: False Restart Count: 54 Requests: cpu: 10m memory: 300Mi Readiness: exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=5s #success=1 #failure=3 Environment: KUBERNETES_SERVICE_PORT: 6443 KUBERNETES_SERVICE_HOST: api-int.edge-0.qe.lab.redhat.com OVN_CONTROLLER_INACTIVITY_PROBE: 180000 OVN_KUBE_LOG_LEVEL: 4 K8S_NODE: (v1:spec.nodeName) Mounts: /cni-bin-dir from host-cni-bin (rw) /env from env-overrides (rw) /etc/cni/net.d from host-cni-netd (rw) /etc/openvswitch from etc-openvswitch (rw) /etc/ovn/ from etc-openvswitch (rw) /etc/systemd/system from systemd-units (ro) /host from host-slash (ro) /ovn-ca from ovn-ca (rw) /ovn-cert from ovn-cert (rw) /run/netns from host-run-netns (ro) /run/openvswitch from run-openvswitch (rw) /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw) /run/ovn/ from run-ovn (rw) /run/ovnkube-config/ from ovnkube-config (rw) /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw) /var/lib/openvswitch from var-lib-openvswitch (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kghkz (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: systemd-units: Type: HostPath (bare host directory volume) Path: /etc/systemd/system HostPathType: host-slash: Type: HostPath (bare host directory volume) Path: / HostPathType: host-run-netns: Type: HostPath (bare host directory volume) Path: /run/netns HostPathType: var-lib-openvswitch: Type: HostPath (bare host directory volume) Path: /var/lib/openvswitch/data HostPathType: etc-openvswitch: Type: HostPath (bare host directory volume) Path: /etc/openvswitch HostPathType: run-openvswitch: Type: HostPath (bare host directory volume) Path: /var/run/openvswitch HostPathType: run-ovn: Type: HostPath (bare host directory volume) Path: /var/run/ovn HostPathType: node-log: Type: HostPath (bare host directory volume) Path: /var/log/ovn HostPathType: log-socket: Type: HostPath (bare host directory volume) Path: /dev/log HostPathType: host-run-ovn-kubernetes: Type: HostPath (bare host directory volume) Path: /run/ovn-kubernetes HostPathType: host-cni-bin: Type: HostPath (bare host directory volume) Path: /var/lib/cni/bin HostPathType: host-cni-netd: Type: HostPath (bare host directory volume) Path: /var/run/multus/cni/net.d HostPathType: host-var-lib-cni-networks-ovn-kubernetes: Type: HostPath (bare host directory volume) Path: /var/lib/cni/networks/ovn-k8s-cni-overlay HostPathType: ovnkube-config: Type: ConfigMap (a volume populated by a ConfigMap) Name: ovnkube-config Optional: false env-overrides: Type: ConfigMap (a volume populated by a ConfigMap) Name: env-overrides Optional: true ovn-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: ovn-ca Optional: false ovn-cert: Type: Secret (a volume populated by a Secret) SecretName: ovn-cert Optional: false ovn-node-metrics-cert: Type: Secret (a volume populated by a Secret) SecretName: ovn-node-metrics-cert Optional: true kube-api-access-kghkz: Type: Projected (a volume that contains injected data from multiple 
sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: beta.kubernetes.io/os=linux Tolerations: op=Exists Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Pulled 126m (x31 over 4h) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f1efa98ed579d61701e870942c2e0ea90dc4eed346d1da190ecab5f1366e5ff0" already present on machine Warning BackOff 70s (x1117 over 3h59m) kubelet Back-off restarting failed container
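The fatal message above ("error looking up gw interface: \"br-ex\", error: Link not found") points at configure-ovs never having created the br-ex bridge on the nodes. A rough way to confirm that from one of the affected masters (node name is an example, adjust to your cluster):

ssh core@master-0-0
sudo systemctl status ovs-configuration.service
sudo journalctl -u ovs-configuration.service -b --no-pager | tail -n 100
ip link show br-ex          # reports "does not exist" when configure-ovs never completed
sudo ovs-vsctl show         # br-ex and the bond should appear here once configure-ovs succeeds
nmcli device status
nmcli connection show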
How is the bond being configured? I don't see a networkConfig section in install-config. If the base image is being modified outside of the deployment process, it's possible there is an issue with permissions on the nmconnection files because configure-ovs is failing with "ovs-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION". The error from earlier seems to be related to bringing up br-ex: "Error: Connection activation failed: IP configuration could not be reserved (no available address, timeout, etc.)" I haven't seen that before so I can't say what would cause it, but the permission error return code is suspicious.
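If the permission theory is worth checking, a quick sanity check on an affected host might look like the following. This is only a sketch: the status=4/NOPERMISSION code is the systemd unit's exit status, and my understanding is that NetworkManager refuses to load keyfiles that are not root-owned with restrictive permissions, so the chown/chmod lines are an assumption about what "bad permissions" would mean here:

# On an affected master, inspect ownership/permissions of the custom profiles
ls -lZ /etc/NetworkManager/system-connections/

# If they are not root:root / 0600, fix them and reload
sudo chown root:root /etc/NetworkManager/system-connections/*.nmconnection
sudo chmod 600 /etc/NetworkManager/system-connections/*.nmconnection
sudo nmcli connection reload

# Then check whether ovs-configuration still fails the same way
sudo journalctl -u ovs-configuration.service -b | grep -iE 'NOPERMISSION|activation failed'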
*** Bug 2077052 has been marked as a duplicate of this bug. ***
*** Bug 2077900 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
  reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
  example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?
  reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
  example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
  example: Up to 90 seconds of API downtime. Check with curl ....
  example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?
  reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
  example: Issue resolves itself after five minutes.
  example: Admin can run a single: oc ....
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?
  reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
  example: No, it has always been like this we just never noticed.
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1.
+@akaris marked https://bugzilla.redhat.com/show_bug.cgi?id=2077052 as a duplicate of this. If both are caused by NetworkManager 1.36.0, that team has provided a scratch build in https://bugzilla.redhat.com/show_bug.cgi?id=2077605#c19. Can we get an RHCOS version with that scratch-build RPM, and an OpenShift payload that uses it, to test?
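For whoever picks up the test payload: one way to confirm which NetworkManager build actually ended up on the nodes (before and after swapping in the scratch build), assuming cluster access with oc:

for node in $(oc get nodes -o name); do
  oc debug "$node" -- chroot /host rpm -q NetworkManager
done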
This is a 4.11 bug which doesn't seem to impact 4.10 code. It's blocker+, so we'll have it fixed before we ship 4.11 with supported 4.10-to-4.11 updates. That means we don't expect to make any graph-data changes based on this bug [1], so I'm dropping the keyword. Feel free to add it back if I'm misunderstanding something. [1]: https://github.com/openshift/enhancements/blob/master/enhancements/update/update-blocker-lifecycle/README.md#summary
Verified on 4.11.0-0.nightly-2022-05-20-213928
IPI baremetal dual-stack bonding

Verified using `autoconnect-priority=99` as per bug 2055433, comment 1:

> The solution is to deploy the custom NM profiles with higher autoconnect-priority than the default (which is 0) so that the intended interfaces are always activated through those custom profiles instead of the default one.

/etc/NetworkManager/system-connections/enp6s0.nmconnection:
[connection]
id=enp6s0
type=ethernet
interface-name=enp6s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/enp5s0.nmconnection:
[connection]
id=enp5s0
type=ethernet
interface-name=enp5s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/bond0.nmconnection:
[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99

[bond]
mode=802.3ad
miimon=100

[ipv4]
method=auto
dhcp-timeout=2147483647

[ipv6]
method=auto
dhcp-timeout=2147483647

May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local conn=enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Error: Connection activation failed: Unknown reason
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Hint: use 'journalctl -xe NM_CONNECTION=a9b3f226-7435-47bd-a431-d106c8fd8f39 + NM_DEVICE=enp5s0' to get more details.
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + s=4
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + sleep 5
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 2 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: Attempt 2 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/20)
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection enp5s0-slave-ovs-clone successfully'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Brought up connection enp5s0-slave-ovs-clone successfully
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp5s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection enp6s0-slave-ovs-clone already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection enp6s0-slave-ovs-clone already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp6s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection ovs-if-phys0 already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection ovs-if-phys0 already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-phys0 connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection ovs-if-br-ex'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli conn up ovs-if-br-ex
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/21)
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection ovs-if-br-ex successfully'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Brought up connection ovs-if-br-ex successfully
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' -f /etc/ovnk/extra_bridge ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + handle_exit
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + e=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via 192.168.123.1 dev br-ex proto dhcp metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: 192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.97 metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip -6 route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: ::1 dev lo proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::4d dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::/64 dev br-ex proto ra metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via fe80::5054:ff:fe97:2978 dev br-ex proto ra metric 48 pref medium
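For anyone re-verifying, a few node-side checks that summarize the log above (a sketch; the connection names match this setup, adjust if yours differ):

nmcli -f NAME,AUTOCONNECT-PRIORITY connection show | grep -E 'bond0|enp5s0|enp6s0'   # custom profiles should show priority 99
nmcli -g GENERAL.STATE connection show ovs-if-br-ex                                  # expect "activated"
ip addr show br-ex                                                                   # node IPv4/IPv6 addresses should sit on br-ex
ip route show default; ip -6 route show default                                      # default routes should go via br-ex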
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days