Description of problem:
The SDN migration rollback fails if a custom vxlanPort is configured.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-05-220959

How reproducible:

Steps to Reproduce:
1. Migrate SDN to OVN successfully following the doc https://docs.google.com/document/d/1DX3OfzIXgd3y7W6Blfay-s92uC25Xx4J5qfA-FmcKBk/edit#heading=h.e96gyny2j1z1. In step 4, customize the genevePort to 9081:
   oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"genevePort": 9081}}}}'
2. Roll back OVN to SDN with the following steps:
   1) oc annotate Network.operator.openshift.io cluster \
        'networkoperator.openshift.io/network-migration'=""
   2) oc patch MachineConfigPool master --type='merge' --patch \
        '{ "spec": { "paused": true } }'
      oc patch MachineConfigPool worker --type='merge' --patch \
        '{ "spec": { "paused": true } }'
   3) oc patch Network.config.openshift.io cluster \
        --type='merge' --patch '{ "spec": { "networkType": "OpenShiftSDN" } }'
      oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"openshiftSDNConfig":{"vxlanPort": 9081}}}}'
   4) Wait for the multus pods to be recreated.
   5) Manually restart all the nodes.

Actual results:
The sdn pods are in CrashLoopBackOff status.
oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-4bc6g              1/1     Running            0          35m
ovs-7ckgf              1/1     Running            0          35m
ovs-b5plg              1/1     Running            0          35m
ovs-f8fnn              1/1     Running            0          35m
ovs-n7djs              1/1     Running            0          35m
ovs-wwbsn              1/1     Running            0          35m
sdn-2mv7v              1/2     CrashLoopBackOff   9          35m
sdn-5595p              1/2     CrashLoopBackOff   9          35m
sdn-controller-6dqzx   1/1     Running            0          35m
sdn-controller-6xf9f   1/1     Running            0          35m
sdn-controller-flqnt   1/1     Running            0          35m
sdn-lrnxc              1/2     CrashLoopBackOff   9          35m
sdn-lvght              1/2     CrashLoopBackOff   9          35m
sdn-qg82p              1/2     CrashLoopBackOff   9          35m
sdn-sjrfr              1/2     CrashLoopBackOff   9          35m

huiran-mac:script hrwang$ oc describe pod sdn-2mv7v -n openshift-sdn
Name:                 sdn-2mv7v
Namespace:            openshift-sdn
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-176-206.us-east-2.compute.internal/10.0.176.206
Start Time:           Wed, 06 Jan 2021 17:17:11 +0800
Labels:               app=sdn
                      component=network
                      controller-revision-hash=c6cbdf4cf
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   10.0.176.206
IPs:
  IP:           10.0.176.206
Controlled By:  DaemonSet/sdn
Containers:
  sdn:
    Container ID:  cri-o://6e48be67c8f0e8de2f5f3ae4de1f7efba9be26a1595b682ccf04e12ef9816443
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf28af9431cdae5a80c01a854671c22dd972b2f5f3a2d70835951d885efb12b7
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf28af9431cdae5a80c01a854671c22dd972b2f5f3a2d70835951d885efb12b7
    Port:          10256/TCP
    Host Port:     10256/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail

      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cniserver/socket &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done

      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi

      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi

      # configmap-based overrides
      if [[ -f /env/${K8S_NODE_NAME} ]]; then
        set -o allexport
        source /env/${K8S_NODE_NAME}
        set +o allexport
      fi

      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/openshift-sdn /host/opt/cni/bin/

      # Launch the network process
      exec /usr/bin/openshift-sdn-node \
        --node-name ${K8S_NODE_NAME} --node-ip ${K8S_NODE_IP} \
        --proxy-config /config/kube-proxy-config.yaml \
        --v ${OPENSHIFT_SDN_LOG_LEVEL:-2}

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:
        0, 0xc0002c6600, 0x1217, 0x1217, 0x0, 0x0, 0x0)
          internal/poll/fd_unix.go:159 +0x1a5
        net.(*netFD).Read(0xc0000a3600, 0xc0002c6600, 0x1217, 0x1217, 0x203000, 0x67973b, 0xc0008c8be0)
          net/fd_posix.go:55 +0x4f
        net.(*conn).Read(0xc0000122f0, 0xc0002c6600, 0x1217, 0x1217, 0x0, 0x0, 0x0)
          net/net.go:182 +0x8e
        crypto/tls.(*atLeastReader).Read(0xc00056a120, 0xc0002c6600, 0x1217, 0x1217, 0x30a, 0x1212, 0xc0009dd710)
          crypto/tls/conn.go:779 +0x62
        bytes.(*Buffer).ReadFrom(0xc0008c8d00, 0x2127400, 0xc00056a120, 0x411805, 0x1c32400, 0x1e0bb20)
          bytes/buffer.go:204 +0xb1
        crypto/tls.(*Conn).readFromUntil(0xc0008c8a80, 0x2129860, 0xc0000122f0, 0x5, 0xc0000122f0, 0x2f9)
          crypto/tls/conn.go:801 +0xf3
        crypto/tls.(*Conn).readRecordOrCCS(0xc0008c8a80, 0x0, 0x0, 0xc0009ddd18)
          crypto/tls/conn.go:608 +0x115
        crypto/tls.(*Conn).readRecord(...)
          crypto/tls/conn.go:576
        crypto/tls.(*Conn).Read(0xc0008c8a80, 0xc000266000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
          crypto/tls/conn.go:1252 +0x15f
        bufio.(*Reader).Read(0xc000583980, 0xc0003c8b98, 0x9, 0x9, 0xc0009ddd18, 0x1fb4e00, 0x9cb3cb)
          bufio/bufio.go:227 +0x222
        io.ReadAtLeast(0x2127220, 0xc000583980, 0xc0003c8b98, 0x9, 0x9, 0x9, 0xc000120050, 0x0, 0x2127600)
          io/io.go:314 +0x87
        io.ReadFull(...)
          io/io.go:333
        golang.org/x/net/http2.readFrameHeader(0xc0003c8b98, 0x9, 0x9, 0x2127220, 0xc000583980, 0x0, 0x0, 0xc0009dddd0, 0x473505)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:237 +0x89
        golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003c8b60, 0xc000894000, 0x0, 0x0, 0x0)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:492 +0xa5
        golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0009ddfa8, 0x0, 0x0)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1819 +0xd8
        golang.org/x/net/http2.(*ClientConn).readLoop(0xc000916000)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1741 +0x6f
        created by golang.org/x/net/http2.(*Transport).newClientConn
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:705 +0x6c5
      Exit Code:    255
      Started:      Wed, 06 Jan 2021 17:29:46 +0800
      Finished:     Wed, 06 Jan 2021 17:29:46 +0800
    Ready:          False
    Restart Count:  4
    Requests:
      cpu:      100m
      memory:   200Mi
    Readiness:  exec [test -f /etc/cni/net.d/80-openshift-network.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  api-int.huirwang-aws0106.qe.devcluster.openshift.com
      OPENSHIFT_DNS_DOMAIN:     cluster.local
      K8S_NODE_NAME:            (v1:spec.nodeName)
      K8S_NODE_IP:              (v1:status.hostIP)
    Mounts:
      /config from config (ro)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-conf (rw)
      /etc/sysconfig from etc-sysconfig (ro)
      /host from host-slash (ro)
      /host/opt/cni/bin from host-cni-bin (rw)
      /host/var/run/netns from host-var-run-netns (ro)
      /lib/modules from host-modules (ro)
      /run/netns from host-run-netns (ro)
      /var/lib/cni/networks/openshift-sdn from host-var-lib-cni-networks-openshift-sdn (rw)
      /var/run from host-var-run (rw)
      /var/run/dbus/ from host-var-run-dbus (ro)
      /var/run/kubernetes/ from host-var-run-kubernetes (ro)
      /var/run/openshift-sdn from host-var-run-openshift-sdn (rw)
      /var/run/openvswitch/ from host-var-run-ovs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-pghzp (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://e52365ea313bcdaecbdff69abdc4a201ef4c0045841ec2da584c8387ad0fb997
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:067732a29149c8cddb3d6aaea525fff75357a9a2a1bbbdd63be1b1f5bce6db1f
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:067732a29149c8cddb3d6aaea525fff75357a9a2a1bbbdd63be1b1f5bce6db1f
    Port:          9101/TCP
    Host Port:     9101/TCP
    Command:       /bin/bash -c ............

oc get network.operator -o yaml
.......
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      mode: ""
      vxlanPort: 9081
    ovnKubernetesConfig:
      genevePort: 9081
    type: OpenShiftSDN
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
...........

Expected results:
OVN rolls back to SDN successfully.

Additional info:
The root cause of this issue is that the vxlanPort is set to the same value as the genevePort. The port stays occupied until the MCO applies the new MachineConfig, which removes the OVN configuration from the OVS database and thereby releases the port. But the sdn pod cannot start while the port is occupied, and the MCO cannot make progress before sdn is up, so the rollback deadlocks. To avoid this, we should tell users not to reuse the same port during migration/rollback.
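The "do not reuse the same port" guidance could be sketched as a small pre-flight check run before applying the openshiftSDNConfig patch in rollback step 3. This is an illustrative sketch only, not part of the operator: check_rollback_port is a hypothetical helper, and 6081/4789 are simply the IANA default UDP ports for geneve and VXLAN, used here as fallbacks.

```shell
#!/bin/bash
# Hypothetical pre-rollback check: refuse to set openshiftSDNConfig.vxlanPort
# to the same UDP port the OVN geneve tunnel is still holding on the node.
# Defaults: 6081 = geneve (IANA), 4789 = VXLAN (IANA).
check_rollback_port() {
  local geneve_port="${1:-6081}" vxlan_port="${2:-4789}"
  if [[ "${vxlan_port}" -eq "${geneve_port}" ]]; then
    echo "conflict: vxlanPort ${vxlan_port} equals genevePort ${geneve_port}" >&2
    return 1
  fi
  echo "ok: vxlanPort ${vxlan_port} does not collide with genevePort ${geneve_port}"
}

# The reproduction above set both ports to 9081, which this check rejects:
check_rollback_port 9081 9081 || echo "pick a different vxlanPort before patching"
```

In the failing cluster above, spec.defaultNetwork shows vxlanPort and genevePort both set to 9081, which is exactly the collision such a check would catch before the sdn pods ever start crash-looping.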
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633