Description of problem:
The SDN migration rollback fails if a custom vxlanPort is configured.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-05-220959

How reproducible:

Steps to Reproduce:
1. Migrate SDN to OVN successfully following the doc https://docs.google.com/document/d/1DX3OfzIXgd3y7W6Blfay-s92uC25Xx4J5qfA-FmcKBk/edit#heading=h.e96gyny2j1z1. In step 4, customize the genevePort to 9081:
   oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"genevePort": 9081}}}}'
2. Roll back OVN to SDN with the following steps:
   1) oc annotate Network.operator.openshift.io cluster \
        'networkoperator.openshift.io/network-migration'=""
   2) oc patch MachineConfigPool master --type='merge' --patch \
        '{ "spec": { "paused": true } }'
      oc patch MachineConfigPool worker --type='merge' --patch \
        '{ "spec": { "paused": true } }'
   3) oc patch Network.config.openshift.io cluster \
        --type='merge' --patch '{ "spec": { "networkType": "OpenShiftSDN" } }'
      oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"openshiftSDNConfig":{"vxlanPort": 9081}}}}'
   4) Wait for the multus pods to be recreated.
   5) Manually restart all the nodes.

Actual results:
The sdn pods are in CrashLoopBackOff status.
oc get pods -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-4bc6g              1/1     Running            0          35m
ovs-7ckgf              1/1     Running            0          35m
ovs-b5plg              1/1     Running            0          35m
ovs-f8fnn              1/1     Running            0          35m
ovs-n7djs              1/1     Running            0          35m
ovs-wwbsn              1/1     Running            0          35m
sdn-2mv7v              1/2     CrashLoopBackOff   9          35m
sdn-5595p              1/2     CrashLoopBackOff   9          35m
sdn-controller-6dqzx   1/1     Running            0          35m
sdn-controller-6xf9f   1/1     Running            0          35m
sdn-controller-flqnt   1/1     Running            0          35m
sdn-lrnxc              1/2     CrashLoopBackOff   9          35m
sdn-lvght              1/2     CrashLoopBackOff   9          35m
sdn-qg82p              1/2     CrashLoopBackOff   9          35m
sdn-sjrfr              1/2     CrashLoopBackOff   9          35m

huiran-mac:script hrwang$ oc describe pod sdn-2mv7v -n openshift-sdn
Name:                 sdn-2mv7v
Namespace:            openshift-sdn
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-176-206.us-east-2.compute.internal/10.0.176.206
Start Time:           Wed, 06 Jan 2021 17:17:11 +0800
Labels:               app=sdn
                      component=network
                      controller-revision-hash=c6cbdf4cf
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   10.0.176.206
IPs:
  IP:           10.0.176.206
Controlled By:  DaemonSet/sdn
Containers:
  sdn:
    Container ID:  cri-o://6e48be67c8f0e8de2f5f3ae4de1f7efba9be26a1595b682ccf04e12ef9816443
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf28af9431cdae5a80c01a854671c22dd972b2f5f3a2d70835951d885efb12b7
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf28af9431cdae5a80c01a854671c22dd972b2f5f3a2d70835951d885efb12b7
    Port:          10256/TCP
    Host Port:     10256/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail

      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cniserver/socket &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done

      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi

      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi

      # configmap-based overrides
      if [[ -f /env/${K8S_NODE_NAME} ]]; then
        set -o allexport
        source /env/${K8S_NODE_NAME}
        set +o allexport
      fi

      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/openshift-sdn /host/opt/cni/bin/

      # Launch the network process
      exec /usr/bin/openshift-sdn-node \
        --node-name ${K8S_NODE_NAME} --node-ip ${K8S_NODE_IP} \
        --proxy-config /config/kube-proxy-config.yaml \
        --v ${OPENSHIFT_SDN_LOG_LEVEL:-2}

    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:
        0, 0xc0002c6600, 0x1217, 0x1217, 0x0, 0x0, 0x0)
          internal/poll/fd_unix.go:159 +0x1a5
        net.(*netFD).Read(0xc0000a3600, 0xc0002c6600, 0x1217, 0x1217, 0x203000, 0x67973b, 0xc0008c8be0)
          net/fd_posix.go:55 +0x4f
        net.(*conn).Read(0xc0000122f0, 0xc0002c6600, 0x1217, 0x1217, 0x0, 0x0, 0x0)
          net/net.go:182 +0x8e
        crypto/tls.(*atLeastReader).Read(0xc00056a120, 0xc0002c6600, 0x1217, 0x1217, 0x30a, 0x1212, 0xc0009dd710)
          crypto/tls/conn.go:779 +0x62
        bytes.(*Buffer).ReadFrom(0xc0008c8d00, 0x2127400, 0xc00056a120, 0x411805, 0x1c32400, 0x1e0bb20)
          bytes/buffer.go:204 +0xb1
        crypto/tls.(*Conn).readFromUntil(0xc0008c8a80, 0x2129860, 0xc0000122f0, 0x5, 0xc0000122f0, 0x2f9)
          crypto/tls/conn.go:801 +0xf3
        crypto/tls.(*Conn).readRecordOrCCS(0xc0008c8a80, 0x0, 0x0, 0xc0009ddd18)
          crypto/tls/conn.go:608 +0x115
        crypto/tls.(*Conn).readRecord(...)
          crypto/tls/conn.go:576
        crypto/tls.(*Conn).Read(0xc0008c8a80, 0xc000266000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
          crypto/tls/conn.go:1252 +0x15f
        bufio.(*Reader).Read(0xc000583980, 0xc0003c8b98, 0x9, 0x9, 0xc0009ddd18, 0x1fb4e00, 0x9cb3cb)
          bufio/bufio.go:227 +0x222
        io.ReadAtLeast(0x2127220, 0xc000583980, 0xc0003c8b98, 0x9, 0x9, 0x9, 0xc000120050, 0x0, 0x2127600)
          io/io.go:314 +0x87
        io.ReadFull(...)
          io/io.go:333
        golang.org/x/net/http2.readFrameHeader(0xc0003c8b98, 0x9, 0x9, 0x2127220, 0xc000583980, 0x0, 0x0, 0xc0009dddd0, 0x473505)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:237 +0x89
        golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003c8b60, 0xc000894000, 0x0, 0x0, 0x0)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/frame.go:492 +0xa5
        golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0009ddfa8, 0x0, 0x0)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1819 +0xd8
        golang.org/x/net/http2.(*ClientConn).readLoop(0xc000916000)
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:1741 +0x6f
        created by golang.org/x/net/http2.(*Transport).newClientConn
          golang.org/x/net@v0.0.0-20201110031124-69a78807bb2b/http2/transport.go:705 +0x6c5
      Exit Code:    255
      Started:      Wed, 06 Jan 2021 17:29:46 +0800
      Finished:     Wed, 06 Jan 2021 17:29:46 +0800
    Ready:          False
    Restart Count:  4
    Requests:
      cpu:      100m
      memory:   200Mi
    Readiness:  exec [test -f /etc/cni/net.d/80-openshift-network.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  api-int.huirwang-aws0106.qe.devcluster.openshift.com
      OPENSHIFT_DNS_DOMAIN:     cluster.local
      K8S_NODE_NAME:            (v1:spec.nodeName)
      K8S_NODE_IP:              (v1:status.hostIP)
    Mounts:
      /config from config (ro)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-conf (rw)
      /etc/sysconfig from etc-sysconfig (ro)
      /host from host-slash (ro)
      /host/opt/cni/bin from host-cni-bin (rw)
      /host/var/run/netns from host-var-run-netns (ro)
      /lib/modules from host-modules (ro)
      /run/netns from host-run-netns (ro)
      /var/lib/cni/networks/openshift-sdn from host-var-lib-cni-networks-openshift-sdn (rw)
      /var/run from host-var-run (rw)
      /var/run/dbus/ from host-var-run-dbus (ro)
      /var/run/kubernetes/ from host-var-run-kubernetes (ro)
      /var/run/openshift-sdn from host-var-run-openshift-sdn (rw)
      /var/run/openvswitch/ from host-var-run-ovs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-pghzp (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://e52365ea313bcdaecbdff69abdc4a201ef4c0045841ec2da584c8387ad0fb997
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:067732a29149c8cddb3d6aaea525fff75357a9a2a1bbbdd63be1b1f5bce6db1f
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:067732a29149c8cddb3d6aaea525fff75357a9a2a1bbbdd63be1b1f5bce6db1f
    Port:          9101/TCP
    Host Port:     9101/TCP
    Command:       /bin/bash -c ............

oc get network.operator -o yaml
.......
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      mode: ""
      vxlanPort: 9081
    ovnKubernetesConfig:
      genevePort: 9081
    type: OpenShiftSDN
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
...........

Expected results:
OVN rolls back to SDN successfully.

Additional info:
The root cause of this issue is that the vxlanPort is set to the same value as the genevePort. The port stays occupied until the MCO applies the new MachineConfig, which removes the OVN configuration from the OVS database and thereby releases the port. But the sdn pod cannot start while the port is occupied, and the MCO cannot make progress before sdn is up, so the rollback deadlocks. To avoid this, we should tell users not to reuse the same port during migration/rollback.
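The "do not reuse the same port" guidance could be sketched as a small pre-flight check run before applying the openshiftSDNConfig patch in rollback step 3. This is an illustrative sketch only, not part of the operator: check_rollback_port is a hypothetical helper, and 6081/4789 are simply the IANA default UDP ports for geneve and VXLAN, used here as fallbacks.

```shell
#!/bin/bash
# Hypothetical pre-rollback check: refuse to set openshiftSDNConfig.vxlanPort
# to the same UDP port the OVN geneve tunnel is still holding on the node.
# Defaults: 6081 = geneve (IANA), 4789 = VXLAN (IANA).
check_rollback_port() {
  local geneve_port="${1:-6081}" vxlan_port="${2:-4789}"
  if [[ "${vxlan_port}" -eq "${geneve_port}" ]]; then
    echo "conflict: vxlanPort ${vxlan_port} equals genevePort ${geneve_port}" >&2
    return 1
  fi
  echo "ok: vxlanPort ${vxlan_port} does not collide with genevePort ${geneve_port}"
}

# The reproduction above set both ports to 9081, which this check rejects:
check_rollback_port 9081 9081 || echo "pick a different vxlanPort before patching"
```

In the failing cluster above, spec.defaultNetwork shows vxlanPort and genevePort both set to 9081, which is exactly the collision such a check would catch before the sdn pods ever start crash-looping.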
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633