Description of problem:
Migrating from OpenShift SDN to OVN-Kubernetes succeeded on a cluster with RHEL workers, but after rolling back, the SDN pods on the RHEL worker nodes crash.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-18-000316

How reproducible:
Not sure

Steps to Reproduce:
1. Set up an IPI vSphere cluster and scale up 2 RHEL nodes.
2. Migrate from SDN to OVN successfully.
3. Perform the rollback. Note: for both the migration and the rollback, only the network type was changed, with no optional customization (a sketch of the typical commands appears at the end of this report).
4. After manually rebooting all the nodes, check the node status and the sdn pod status.

Actual results:
The RHEL worker nodes were NotReady and the sdn pods on the RHEL workers were in an error state.

oc get nodes
NAME                               STATUS     ROLES    AGE     VERSION
huirwang-vs47-wbnsq-master-0       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-master-1       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-master-2       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-rhel-0         NotReady   worker   3h5m    v1.20.0+d9c52cc
huirwang-vs47-wbnsq-rhel-1         NotReady   worker   3h5m    v1.20.0+d9c52cc
huirwang-vs47-wbnsq-worker-4986s   Ready      worker   4h42m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-worker-x8cld   Ready      worker   4h42m   v1.20.0+d9c52cc

oc get pods -n openshift-sdn -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP               NODE                               NOMINATED NODE   READINESS GATES
ovs-56hxp              1/1     Running            0          90m   172.31.249.74    huirwang-vs47-wbnsq-rhel-0         <none>           <none>
ovs-d4ln5              1/1     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
ovs-ng8jg              1/1     Running            0          90m   172.31.249.30    huirwang-vs47-wbnsq-rhel-1         <none>           <none>
ovs-nttnv              1/1     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>
ovs-pbxd6              1/1     Running            0          90m   172.31.249.41    huirwang-vs47-wbnsq-worker-x8cld   <none>           <none>
ovs-qckxb              1/1     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
ovs-rq787              1/1     Running            0          90m   172.31.249.213   huirwang-vs47-wbnsq-worker-4986s   <none>           <none>
sdn-8gt9c              1/2     Error              7          90m   172.31.249.74    huirwang-vs47-wbnsq-rhel-0         <none>           <none>
sdn-controller-75xjs   1/1     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
sdn-controller-gbmsg   1/1     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
sdn-controller-hkcws   1/1     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>
sdn-gnr2d              2/2     Running            0          90m   172.31.249.213   huirwang-vs47-wbnsq-worker-4986s   <none>           <none>
sdn-k8tf7              2/2     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
sdn-mdndp              2/2     Running            0          90m   172.31.249.41    huirwang-vs47-wbnsq-worker-x8cld   <none>           <none>
sdn-mx2cv              2/2     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
sdn-nttkp              1/2     CrashLoopBackOff   7          90m   172.31.249.30    huirwang-vs47-wbnsq-rhel-1         <none>           <none>
sdn-vb9jx              2/2     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>

oc describe pod sdn-nttkp -n openshift-sdn
Name:                 sdn-nttkp
Namespace:            openshift-sdn
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 huirwang-vs47-wbnsq-rhel-1/172.31.249.30
Start Time:           Mon, 18 Jan 2021 15:15:01 +0800
Labels:               app=sdn
                      component=network
                      controller-revision-hash=b57fcd4f8
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   172.31.249.30
IPs:
  IP:           172.31.249.30
Controlled By:  DaemonSet/sdn
Containers:
  sdn:
    Container ID:  cri-o://d075c7c5f2b32472c094c64a6567a934c815394e36fbe6a7adcda172acc108fd
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0
    Port:          10256/TCP
    Host Port:     10256/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail

      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cniserver/socket &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done

      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi

      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi

      # configmap-based overrides
      if [[ -f /env/${K8S_NODE_NAME} ]]; then
        set -o allexport
        source /env/${K8S_NODE_NAME}
        set +o allexport
      fi

      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/openshift-sdn /host/opt/cni/bin/

      # Launch the network process
      exec /usr/bin/openshift-sdn-node \
        --node-name ${K8S_NODE_NAME} --node-ip ${K8S_NODE_IP} \
        --proxy-config /config/kube-proxy-config.yaml \
        --v ${OPENSHIFT_SDN_LOG_LEVEL:-2}

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   a00, 0xc0002bf300, 0x1211, 0x1211, 0x0, 0x0, 0x0)
                   internal/poll/fd_unix.go:159 +0x1a5
                 net.(*netFD).Read(0xc00062da00, 0xc0002bf300, 0x1211, 0x1211, 0x203000, 0x7f753cd77fa0, 0x7f)
                   net/fd_posix.go:55 +0x4f
                 net.(*conn).Read(0xc000012328, 0xc0002bf300, 0x1211, 0x1211, 0x0, 0x0, 0x0)
                   net/net.go:182 +0x8e
                 crypto/tls.(*atLeastReader).Read(0xc00051efe0, 0xc0002bf300, 0x1211, 0x1211, 0x33e, 0x11c6, 0xc0006b5710)
                   crypto/tls/conn.go:779 +0x62
                 bytes.(*Buffer).ReadFrom(0xc0007fe980, 0x2127400, 0xc00051efe0, 0x411805, 0x1c32400, 0x1e0bb40)
                   bytes/buffer.go:204 +0xb1
                 crypto/tls.(*Conn).readFromUntil(0xc0007fe700, 0x2129860, 0xc000012328, 0x5, 0xc000012328, 0x32d)
                   crypto/tls/conn.go:801 +0xf3
                 crypto/tls.(*Conn).readRecordOrCCS(0xc0007fe700, 0x0, 0x0, 0xc0006b5d18)
                   crypto/tls/conn.go:608 +0x115
                 crypto/tls.(*Conn).readRecord(...)
                   crypto/tls/conn.go:576
                 crypto/tls.(*Conn).Read(0xc0007fe700, 0xc000374000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
                   crypto/tls/conn.go:1252 +0x15f
                 bufio.(*Reader).Read(0xc00052e7e0, 0xc00013f458, 0x9, 0x9, 0xc0006b5d18, 0x1fb4e00, 0x9cb3cb)
                   bufio/bufio.go:227 +0x222
                 io.ReadAtLeast(0x2127220, 0xc00052e7e0, 0xc00013f458, 0x9, 0x9, 0x9, 0xc000080060, 0x0, 0x2127600)
                   io/io.go:314 +0x87
                 io.ReadFull(...)
                   io/io.go:333
                 golang.org/x/net/http2.readFrameHeader(0xc00013f458, 0x9, 0x9, 0x2127220, 0xc00052e7e0, 0x0, 0x0, 0xc0006b5dd0, 0x473505)
                   golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:237 +0x89
                 golang.org/x/net/http2.(*Framer).ReadFrame(0xc00013f420, 0xc0001fb110, 0x0, 0x0, 0x0)
                   golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:492 +0xa5
                 golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0006b5fa8, 0x0, 0x0)
                   golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1819 +0xd8
                 golang.org/x/net/http2.(*ClientConn).readLoop(0xc0003a5080)
                   golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1741 +0x6f
                 created by golang.org/x/net/http2.(*Transport).newClientConn
                   golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:705 +0x6c5

      Exit Code:    255
      Started:      Mon, 18 Jan 2021 15:32:31 +0800
      Finished:     Mon, 18 Jan 2021 15:32:31 +0800
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:      100m
      memory:   200Mi
    Readiness:  exec [test -f /etc/cni/net.d/80-openshift-network.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  api-int.huirwang-vs47.qe.devcluster.openshift.com
      OPENSHIFT_DNS_DOMAIN:     cluster.local
      K8S_NODE_NAME:             (v1:spec.nodeName)
      K8S_NODE_IP:               (v1:status.hostIP)
    Mounts:
      /config from config (ro)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-conf (rw)
      /etc/sysconfig from etc-sysconfig (ro)
      /host from host-slash (ro)
      /host/opt/cni/bin from host-cni-bin (rw)
      /host/var/run/netns from host-var-run-netns (ro)
      /lib/modules from host-modules (ro)
      /run/netns from host-run-netns (ro)
      /var/lib/cni/networks/openshift-sdn from host-var-lib-cni-networks-openshift-sdn (rw)
      /var/run from host-var-run (rw)
      /var/run/dbus/ from host-var-run-dbus (ro)
      /var/run/kubernetes/ from host-var-run-kubernetes (ro)
      /var/run/openshift-sdn from host-var-run-openshift-sdn (rw)
      /var/run/openvswitch/ from host-var-run-ovs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-7n5zb (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://1a07bc1629416a7afeecdab73bf92509356fee9451bf8fe4325e77124118b5b2
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e
    Port:          9101/TCP
    Host Port:     9101/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-certs/tls.key
      TLS_CERT=/etc/pki/tls/metrics-certs/tls.crt

      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60 )) ))
      HAS_LOGGED_INFO=0

      log_missing_certs(){
        CUR_TS=$(date +%s)
        if [[ "${CUR_TS}" -gt "WARN_TS" ]]; then
          echo $(date -Iseconds) WARN: sdn-metrics-certs not mounted after 20 minutes.
        elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
          echo $(date -Iseconds) INFO: sdn-metrics-certs not mounted. Waiting 20 minutes.
          HAS_LOGGED_INFO=1
        fi
      }

      while [[ ! -f "${TLS_PK}" || ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done

      echo $(date -Iseconds) INFO: sdn-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9101 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29101/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}

    State:          Running
      Started:      Mon, 18 Jan 2021 15:15:02 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-certs from sdn-metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-7n5zb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      sdn-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  etc-sysconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/sysconfig
    HostPathType:
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  host-var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:
  host-var-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/netns
    HostPathType:
  host-var-run-dbus:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dbus
    HostPathType:
  host-var-run-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  host-var-run-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubernetes
    HostPathType:
  host-var-run-openshift-sdn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openshift-sdn
    HostPathType:
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:
  host-cni-conf:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:
  host-var-lib-cni-networks-openshift-sdn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/openshift-sdn
    HostPathType:
  sdn-metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sdn-metrics-certs
    Optional:    true
  sdn-token-7n5zb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sdn-token-7n5zb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:
Events:
  Type     Reason     Age                 From                                 Message
  ----     ------     ----                ----                                 -------
  Normal   Scheduled  <unknown>                                                Successfully assigned openshift-sdn/sdn-nttkp to huirwang-vs47-wbnsq-rhel-1
  Normal   Pulled     92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0" already present on machine
  Normal   Created    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Created container sdn
  Normal   Started    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Started container sdn
  Normal   Pulled     92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e" already present on machine
  Normal   Created    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Created container kube-rbac-proxy
  Normal   Started    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Started container kube-rbac-proxy
  Warning  Unhealthy  87m (x55 over 91m)  kubelet, huirwang-vs47-wbnsq-rhel-1  Readiness probe failed:
  Warning  BackOff    76m (x39 over 88m)  kubelet, huirwang-vs47-wbnsq-rhel-1  Back-off restarting failed container

Expected results:
The rollback should succeed.

Additional info:
In 4.7, only IPI clusters are supported for this migration, and RHEL workers are only supported in UPI clusters, so we will fix this issue together with UPI support in the next release.
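For reference, the report records only that "the network type was changed". The exact commands used in this test were not captured; the following is a minimal sketch assuming the standard documented 4.7 offline migration and rollback procedure:

  # Migrate from OpenShift SDN to OVN-Kubernetes (assumed standard procedure,
  # not taken verbatim from this report):
  oc patch Network.operator.openshift.io cluster --type='merge' \
    --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
  oc patch Network.config.openshift.io cluster --type='merge' \
    --patch '{"spec":{"networkType":"OVNKubernetes"}}'

  # Roll back to OpenShift SDN (same pattern in reverse):
  oc patch Network.operator.openshift.io cluster --type='merge' \
    --patch '{"spec":{"migration":{"networkType":"OpenShiftSDN"}}}'
  oc patch Network.config.openshift.io cluster --type='merge' \
    --patch '{"spec":{"networkType":"OpenShiftSDN"}}'

After the reboot, the previous logs of the crashing sdn container can be collected for triage with:

  oc logs -n openshift-sdn sdn-nttkp -c sdn --previous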
*** Bug 1975262 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438