Bug 1917282

Summary: [Migration] MCO stuck for RHEL worker after enabling the migration prepare state
Product: OpenShift Container Platform Reporter: huirwang
Component: Networking    Assignee: Peng Liu <pliu>
Networking sub component: ovn-kubernetes QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: aconstan, dosmith, gpei, pliu, wking, yunjiang, zhsun
Version: 4.7    Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2021-07-27 22:36:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1976232    

Description huirwang 2021-01-18 09:05:12 UTC
Description of problem:
Migrating from SDN to OVN succeeded with the RHEL workers, but the SDN pods crashed on the RHEL workers after rollback.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-18-000316

How reproducible:
Not Sure

Steps to Reproduce:
1. Set up an IPI vSphere cluster and scale up 2 RHEL nodes.
2. Migrate SDN to OVN successfully.
3. Perform the rollback operation.
Note: During both migration and rollback, only the network type was changed; no optional customization was applied.

4. After manually rebooting all the nodes, check the node status and the sdn pod status.
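For context, steps 2 and 3 amount to changing only the network type on the cluster Network objects. A rough sketch of the commands involved follows; this is an assumption based on the standard SDN-to-OVN migration procedure (the documented flow also includes Machine Config Operator wait and node-reboot steps not shown here), not the reporter's exact commands:

```shell
# Step 2 (migration): request the migration on the operator object, then,
# once the machine config pools have rolled out, switch the network type.
oc patch Network.operator.openshift.io cluster --type=merge \
  --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
oc patch Network.config.openshift.io cluster --type=merge \
  --patch '{"spec":{"networkType":"OVNKubernetes"}}'

# Step 3 (rollback): the same two patches, pointing back at OpenShiftSDN.
oc patch Network.operator.openshift.io cluster --type=merge \
  --patch '{"spec":{"migration":{"networkType":"OpenShiftSDN"}}}'
oc patch Network.config.openshift.io cluster --type=merge \
  --patch '{"spec":{"networkType":"OpenShiftSDN"}}'
```

Each patch causes the Machine Config Operator to roll new configuration out to the nodes, which is the point at which the reporter then reboots all nodes manually (step 4).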

Actual results:
The RHEL worker nodes were NotReady, and the sdn pods on the RHEL workers were in an error state.

oc get nodes
NAME                               STATUS     ROLES    AGE     VERSION
huirwang-vs47-wbnsq-master-0       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-master-1       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-master-2       Ready      master   4h52m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-rhel-0         NotReady   worker   3h5m    v1.20.0+d9c52cc
huirwang-vs47-wbnsq-rhel-1         NotReady   worker   3h5m    v1.20.0+d9c52cc
huirwang-vs47-wbnsq-worker-4986s   Ready      worker   4h42m   v1.20.0+d9c52cc
huirwang-vs47-wbnsq-worker-x8cld   Ready      worker   4h42m   v1.20.0+d9c52cc


oc get pods -n openshift-sdn -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP               NODE                               NOMINATED NODE   READINESS GATES
ovs-56hxp              1/1     Running            0          90m   172.31.249.74    huirwang-vs47-wbnsq-rhel-0         <none>           <none>
ovs-d4ln5              1/1     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
ovs-ng8jg              1/1     Running            0          90m   172.31.249.30    huirwang-vs47-wbnsq-rhel-1         <none>           <none>
ovs-nttnv              1/1     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>
ovs-pbxd6              1/1     Running            0          90m   172.31.249.41    huirwang-vs47-wbnsq-worker-x8cld   <none>           <none>
ovs-qckxb              1/1     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
ovs-rq787              1/1     Running            0          90m   172.31.249.213   huirwang-vs47-wbnsq-worker-4986s   <none>           <none>
sdn-8gt9c              1/2     Error              7          90m   172.31.249.74    huirwang-vs47-wbnsq-rhel-0         <none>           <none>
sdn-controller-75xjs   1/1     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
sdn-controller-gbmsg   1/1     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
sdn-controller-hkcws   1/1     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>
sdn-gnr2d              2/2     Running            0          90m   172.31.249.213   huirwang-vs47-wbnsq-worker-4986s   <none>           <none>
sdn-k8tf7              2/2     Running            0          90m   172.31.249.66    huirwang-vs47-wbnsq-master-2       <none>           <none>
sdn-mdndp              2/2     Running            0          90m   172.31.249.41    huirwang-vs47-wbnsq-worker-x8cld   <none>           <none>
sdn-mx2cv              2/2     Running            0          90m   172.31.249.170   huirwang-vs47-wbnsq-master-0       <none>           <none>
sdn-nttkp              1/2     CrashLoopBackOff   7          90m   172.31.249.30    huirwang-vs47-wbnsq-rhel-1         <none>           <none>
sdn-vb9jx              2/2     Running            0          90m   172.31.249.199   huirwang-vs47-wbnsq-master-1       <none>           <none>


oc describe pod  sdn-nttkp -n openshift-sdn
Name:                 sdn-nttkp
Namespace:            openshift-sdn
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 huirwang-vs47-wbnsq-rhel-1/172.31.249.30
Start Time:           Mon, 18 Jan 2021 15:15:01 +0800
Labels:               app=sdn
                      component=network
                      controller-revision-hash=b57fcd4f8
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   172.31.249.30
IPs:
  IP:           172.31.249.30
Controlled By:  DaemonSet/sdn
Containers:
  sdn:
    Container ID:  cri-o://d075c7c5f2b32472c094c64a6567a934c815394e36fbe6a7adcda172acc108fd
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0
    Port:          10256/TCP
    Host Port:     10256/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      
      # if another process is listening on the cni-server socket, wait until it exits
      trap 'kill $(jobs -p); rm -f /etc/cni/net.d/80-openshift-network.conf ; exit 0' TERM
      retries=0
      while true; do
        if echo 'test' | socat - UNIX-CONNECT:/var/run/openshift-sdn/cniserver/socket &>/dev/null; then
          echo "warning: Another process is currently listening on the CNI socket, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently listening on the CNI socket, exiting" 2>&1
          exit 1
        fi
      done
      
      # local environment overrides
      if [[ -f /etc/sysconfig/openshift-sdn ]]; then
        set -o allexport
        source /etc/sysconfig/openshift-sdn
        set +o allexport
      fi
      #BUG: cdc accidentally mounted /etc/sysconfig/openshift-sdn as DirectoryOrCreate; clean it up so we can ultimately mount /etc/sysconfig/openshift-sdn as FileOrCreate
      # Once this is released, then we can mount it properly
      if [[ -d /etc/sysconfig/openshift-sdn ]]; then
        rmdir /etc/sysconfig/openshift-sdn || true
      fi
      
      # configmap-based overrides
      if [[ -f /env/${K8S_NODE_NAME} ]]; then
        set -o allexport
        source /env/${K8S_NODE_NAME}
        set +o allexport
      fi
      
      # Take over network functions on the node
      rm -f /etc/cni/net.d/80-openshift-network.conf
      cp -f /opt/cni/bin/openshift-sdn /host/opt/cni/bin/
      
      # Launch the network process
      exec /usr/bin/openshift-sdn-node \
        --node-name ${K8S_NODE_NAME} --node-ip ${K8S_NODE_IP} \
        --proxy-config /config/kube-proxy-config.yaml \
        --v ${OPENSHIFT_SDN_LOG_LEVEL:-2}
      
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   a00, 0xc0002bf300, 0x1211, 0x1211, 0x0, 0x0, 0x0)
                 internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc00062da00, 0xc0002bf300, 0x1211, 0x1211, 0x203000, 0x7f753cd77fa0, 0x7f)
  net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc000012328, 0xc0002bf300, 0x1211, 0x1211, 0x0, 0x0, 0x0)
  net/net.go:182 +0x8e
crypto/tls.(*atLeastReader).Read(0xc00051efe0, 0xc0002bf300, 0x1211, 0x1211, 0x33e, 0x11c6, 0xc0006b5710)
  crypto/tls/conn.go:779 +0x62
bytes.(*Buffer).ReadFrom(0xc0007fe980, 0x2127400, 0xc00051efe0, 0x411805, 0x1c32400, 0x1e0bb40)
  bytes/buffer.go:204 +0xb1
crypto/tls.(*Conn).readFromUntil(0xc0007fe700, 0x2129860, 0xc000012328, 0x5, 0xc000012328, 0x32d)
  crypto/tls/conn.go:801 +0xf3
crypto/tls.(*Conn).readRecordOrCCS(0xc0007fe700, 0x0, 0x0, 0xc0006b5d18)
  crypto/tls/conn.go:608 +0x115
crypto/tls.(*Conn).readRecord(...)
  crypto/tls/conn.go:576
crypto/tls.(*Conn).Read(0xc0007fe700, 0xc000374000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
  crypto/tls/conn.go:1252 +0x15f
bufio.(*Reader).Read(0xc00052e7e0, 0xc00013f458, 0x9, 0x9, 0xc0006b5d18, 0x1fb4e00, 0x9cb3cb)
  bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x2127220, 0xc00052e7e0, 0xc00013f458, 0x9, 0x9, 0x9, 0xc000080060, 0x0, 0x2127600)
  io/io.go:314 +0x87
io.ReadFull(...)
  io/io.go:333
golang.org/x/net/http2.readFrameHeader(0xc00013f458, 0x9, 0x9, 0x2127220, 0xc00052e7e0, 0x0, 0x0, 0xc0006b5dd0, 0x473505)
  golang.org/x/net.0-20201110031124-69a78807bb2b/http2/frame.go:237 +0x89
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00013f420, 0xc0001fb110, 0x0, 0x0, 0x0)
  golang.org/x/net.0-20201110031124-69a78807bb2b/http2/frame.go:492 +0xa5
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0006b5fa8, 0x0, 0x0)
  golang.org/x/net.0-20201110031124-69a78807bb2b/http2/transport.go:1819 +0xd8
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0003a5080)
  golang.org/x/net.0-20201110031124-69a78807bb2b/http2/transport.go:1741 +0x6f
created by golang.org/x/net/http2.(*Transport).newClientConn
  golang.org/x/net.0-20201110031124-69a78807bb2b/http2/transport.go:705 +0x6c5

      Exit Code:    255
      Started:      Mon, 18 Jan 2021 15:32:31 +0800
      Finished:     Mon, 18 Jan 2021 15:32:31 +0800
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:      100m
      memory:   200Mi
    Readiness:  exec [test -f /etc/cni/net.d/80-openshift-network.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  api-int.huirwang-vs47.qe.devcluster.openshift.com
      OPENSHIFT_DNS_DOMAIN:     cluster.local
      K8S_NODE_NAME:             (v1:spec.nodeName)
      K8S_NODE_IP:               (v1:status.hostIP)
    Mounts:
      /config from config (ro)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-conf (rw)
      /etc/sysconfig from etc-sysconfig (ro)
      /host from host-slash (ro)
      /host/opt/cni/bin from host-cni-bin (rw)
      /host/var/run/netns from host-var-run-netns (ro)
      /lib/modules from host-modules (ro)
      /run/netns from host-run-netns (ro)
      /var/lib/cni/networks/openshift-sdn from host-var-lib-cni-networks-openshift-sdn (rw)
      /var/run from host-var-run (rw)
      /var/run/dbus/ from host-var-run-dbus (ro)
      /var/run/kubernetes/ from host-var-run-kubernetes (ro)
      /var/run/openshift-sdn from host-var-run-openshift-sdn (rw)
      /var/run/openvswitch/ from host-var-run-ovs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-7n5zb (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://1a07bc1629416a7afeecdab73bf92509356fee9451bf8fe4325e77124118b5b2
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e
    Port:          9101/TCP
    Host Port:     9101/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-certs/tls.key
      TLS_CERT=/etc/pki/tls/metrics-certs/tls.crt
      
      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
      HAS_LOGGED_INFO=0
      
      log_missing_certs(){
          CUR_TS=$(date +%s)
          if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
            echo $(date -Iseconds) WARN: sdn-metrics-certs not mounted after 20 minutes.
          elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
            echo $(date -Iseconds) INFO: sdn-metrics-certs not mounted. Waiting 20 minutes.
            HAS_LOGGED_INFO=1
          fi
      }
      
      while [[ ! -f "${TLS_PK}" ||  ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done
      
      echo $(date -Iseconds) INFO: sdn-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9101 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29101/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}
      
    State:          Running
      Started:      Mon, 18 Jan 2021 15:15:02 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-certs from sdn-metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-7n5zb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      sdn-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  etc-sysconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/sysconfig
    HostPathType:  
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  host-var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:  
  host-var-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/netns
    HostPathType:  
  host-var-run-dbus:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dbus
    HostPathType:  
  host-var-run-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:  
  host-var-run-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubernetes
    HostPathType:  
  host-var-run-openshift-sdn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openshift-sdn
    HostPathType:  
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:  
  host-cni-conf:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:  
  host-var-lib-cni-networks-openshift-sdn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/openshift-sdn
    HostPathType:  
  sdn-metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sdn-metrics-certs
    Optional:    true
  sdn-token-7n5zb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  sdn-token-7n5zb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     
Events:
  Type     Reason     Age                 From                                 Message
  ----     ------     ----                ----                                 -------
  Normal   Scheduled  <unknown>                                                Successfully assigned openshift-sdn/sdn-nttkp to huirwang-vs47-wbnsq-rhel-1
  Normal   Pulled     92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8cb012af8124ada25dc604fcaf8184d6f6e37b018c34ce47eef1a1d527a7c0" already present on machine
  Normal   Created    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Created container sdn
  Normal   Started    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Started container sdn
  Normal   Pulled     92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:880b93bcfc4fc37715b3ccaeead0ce8de17f27da068685f876c49dd31d52930e" already present on machine
  Normal   Created    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Created container kube-rbac-proxy
  Normal   Started    92m                 kubelet, huirwang-vs47-wbnsq-rhel-1  Started container kube-rbac-proxy
  Warning  Unhealthy  87m (x55 over 91m)  kubelet, huirwang-vs47-wbnsq-rhel-1  Readiness probe failed:
  Warning  BackOff    76m (x39 over 88m)  kubelet, huirwang-vs47-wbnsq-rhel-1  Back-off restarting failed container


Expected results:
The rollback should complete successfully.

Additional info:

Comment 2 Peng Liu 2021-01-19 02:26:31 UTC
In 4.7, the migration only supports IPI clusters, and RHEL workers are only supported in UPI clusters. So we will fix this issue together with UPI support in the next release.

Comment 12 Tim Rozet 2021-06-25 14:59:34 UTC
*** Bug 1975262 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2021-07-27 22:36:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438