Bug 2019093

Summary: multiple pods in ContainerCreating state after migration from OpenshiftSDN to OVNKubernetes
Product: OpenShift Container Platform Reporter: Andre Costa <andcosta>
Component: Networking Assignee: Peng Liu <pliu>
Networking sub component: ovn-kubernetes QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aconstan, bpickard, danili, huirwang, lmcfadde, pdsilva, pliu, tkapoor, zzhao
Version: 4.7   
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: x86_64   
OS: Linux   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1937594 Environment:
Last Closed: 2021-12-16 09:34:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1937594    
Bug Blocks:    

Comment 1 Andre Costa 2021-11-01 15:51:31 UTC
I have just followed the procedure to migrate from OpenshiftSDN to OVNKubernetes on OCP 4.7.32 and I'm getting the exact same issue.
After rebooting the nodes, one of the interfaces is not created, and all ovnkube-node pods end up in CrashLoopBackOff because the ovnkube-node container is unable to start on any node (RHCOS and RHEL):

 $ oc get pods -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE     IP               NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-9s48h   6/6     Running            0          3h49m                    master-1.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-master-9vlhf   6/6     Running            0          3h49m                    master-0.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-master-b882v   6/6     Running            2          3h49m                    master-2.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-2q4wn     2/3     CrashLoopBackOff   5          173m                     master-2.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-4hwkp     2/3     CrashLoopBackOff   20         173m                     infra-1.prod-openshift4.redhatrules.local    <none>           <none>
ovnkube-node-7mnl5     2/3     CrashLoopBackOff   5          173m                     master-1.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-8fvhf     2/3     CrashLoopBackOff   5          173m                     master-0.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-gpc4z     2/3     CrashLoopBackOff   19         173m                     worker-1.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-hj89f     2/3     CrashLoopBackOff   20         173m                     infra-2.prod-openshift4.redhatrules.local    <none>           <none>
ovnkube-node-pmbtr     2/3     CrashLoopBackOff   19         173m                     worker-0.prod-openshift4.redhatrules.local   <none>           <none>
ovnkube-node-rqkwt     2/3     CrashLoopBackOff   27         173m                     infra-0.prod-openshift4.redhatrules.local    <none>           <none>

Many other pods are in ContainerCreating state because of this.
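
To pull the failing pods out of a listing like the one above, a small filter can help. This is a minimal sketch, assuming the default `oc get pods` column order (STATUS is the third whitespace-separated field) and a header row; the function name `unhealthy_pods` is hypothetical.

```shell
# Print the names of pods whose STATUS is neither "Running" nor "Completed".
# Reads `oc get pods` table output on stdin and skips the header row, so it
# should not be combined with --no-headers.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example (requires cluster access):
#   oc get pods -n openshift-ovn-kubernetes | unhealthy_pods
```
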
Looking at the container, the error is the same:

    Container ID:  cri-o://f0e9430ab6537b698178895f51712b7917a93e43980f8bff1e5ddc1c3ec52f59
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ccd8b2cae88b05b57bb03d7321c2d59585471f25fda74c773b208a124cde1dfd
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ccd8b2cae88b05b57bb03d7321c2d59585471f25fda74c773b208a124cde1dfd
    Port:          29103/TCP
    Host Port:     29103/TCP
      set -xe
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db_ip addresses"
      cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
      echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
      iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      while true; do
        # TODO: change to use '--request-timeout=30s', if https://github.com/kubernetes/kubernetes/issues/49343 is fixed.
        db_ip=$(timeout 30 kubectl get ep -n ${ovn_config_namespace} ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}')
        if [[ -n "${db_ip}" ]]; then
          break
        fi
        (( retries += 1 ))
        if [[ "${retries}" -gt 40 ]]; then
          echo "E$(date "+%m%d %H:%M:%S.%N") - db endpoint never came up"
          exit 1
        fi
        echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db endpoint"
        sleep 5
      done
      echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node db_ip ${db_ip}"
      # Check to see if ovs is provided by the node. This is only for upgrade from 4.5->4.6 or
      # openshift-sdn to ovn-kube conversion
      if grep -q OVNKubernetes /etc/systemd/system/ovs-configuration.service ; then
        gateway_mode_flags="--gateway-mode local --gateway-interface br-ex"
      else
        gateway_mode_flags="--gateway-mode local --gateway-interface none"
      fi
      exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
        --nb-address "ssl:,ssl:,ssl:" \
        --sb-address "ssl:,ssl:,ssl:" \
        --nb-client-privkey /ovn-cert/tls.key \
        --nb-client-cert /ovn-cert/tls.crt \
        --nb-client-cacert /ovn-ca/ca-bundle.crt \
        --nb-cert-common-name "ovn" \
        --sb-client-privkey /ovn-cert/tls.key \
        --sb-client-cert /ovn-cert/tls.crt \
        --sb-client-cacert /ovn-ca/ca-bundle.crt \
        --sb-cert-common-name "ovn" \
        --config-file=/run/ovnkube-config/ovnkube.conf \
        --loglevel "${OVN_KUBE_LOG_LEVEL}" \
        --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
        ${gateway_mode_flags} \
        --metrics-bind-address ""
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   18438 ovs.go:173] exec(3): stderr: ""
I1101 15:44:09.603900   18438 ovs.go:169] exec(4): /usr/bin/ovs-ofctl dump-aggregate br-int
I1101 15:44:09.624070   18438 ovs.go:172] exec(4): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=0 byte_count=0 flow_count=4160\n"
I1101 15:44:09.624139   18438 ovs.go:173] exec(4): stderr: ""
I1101 15:44:09.624280   18438 ovs.go:169] exec(5): /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-master-0.pr -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-master-0.prod-openshift4.redhatrules.local
I1101 15:44:09.645430   18438 ovs.go:172] exec(5): stdout: ""
I1101 15:44:09.645535   18438 ovs.go:173] exec(5): stderr: ""
I1101 15:44:09.645637   18438 ovs.go:169] exec(6): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface ovn-k8s-mp0 mac_in_use
I1101 15:44:09.662635   18438 ovs.go:172] exec(6): stdout: "\"ee:81:e3:04:05:dc\"\n"
I1101 15:44:09.662743   18438 ovs.go:173] exec(6): stderr: ""
I1101 15:44:09.662870   18438 ovs.go:169] exec(7): /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 mac=ee\:81\:e3\:04\:05\:dc
I1101 15:44:09.679184   18438 ovs.go:172] exec(7): stdout: ""
I1101 15:44:09.679290   18438 ovs.go:173] exec(7): stderr: ""
I1101 15:44:09.796423   18438 gateway_init.go:162] Initializing Gateway Functionality
I1101 15:44:09.797592   18438 gateway_localnet.go:173] Node local addresses initialized to: map[{ fffff000}{ ff000000}{ fffffc00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::d0e1:beff:fed1:b453:{fe80:: ffffffffffffffff0000000000000000} fe80::ec81:e3ff:fe04:5dc:{fe80:: ffffffffffffffff0000000000000000} fe80::f2e1:222:f898:9e39:{fe80:: ffffffffffffffff0000000000000000}]
I1101 15:44:09.798372   18438 helper_linux.go:73] Found default gateway interface enp1s0
F1101 15:44:09.798601   18438 ovnkube.go:130] could not find IP addresses: failed to lookup link none: Link not found
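
The fatal message names the link `none`, which matches the `--gateway-interface none` value set in the container command. That suggests, though the log alone does not prove it, that the `grep` for `OVNKubernetes` in `ovs-configuration.service` did not match on the node. The sketch below reproduces only that branch so it can be checked against a unit file by hand; the function name `select_gateway_flags` and its optional path argument are hypothetical.

```shell
# Mirror of the branch in the container command: choose the gateway flags
# depending on whether the ovs-configuration.service unit mentions
# OVNKubernetes. A missing or unreadable file falls through to the "none"
# branch, as it would in the original script.
select_gateway_flags() {
  local unit_file="${1:-/etc/systemd/system/ovs-configuration.service}"
  if grep -q OVNKubernetes "${unit_file}" 2>/dev/null; then
    echo "--gateway-mode local --gateway-interface br-ex"
  else
    echo "--gateway-mode local --gateway-interface none"
  fi
}

# Example, run on a node:
#   select_gateway_flags
```

If this prints the `none` variant on a migrated node, the unit file on that node is worth inspecting.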

Looking at a master, for example, it seems that many interfaces have not been created, and some from OpenshiftSDN appear to still be present:

[root@master-2 ~]# nmcli -p dev
  Status of devices
DEVICE          TYPE           STATE         CONNECTION         
enp1s0          ethernet       connected     Wired connection 1 
vxlan_sys_4789  vxlan          disconnected  --                 
genev_sys_6081  geneve         unmanaged     --                 
lo              loopback       unmanaged     --                 
br-int          ovs-bridge     unmanaged     --                 
br0             ovs-bridge     unmanaged     --                 
br-int          ovs-interface  unmanaged     --                 
br0             ovs-interface  unmanaged     --                 
ovn-k8s-mp0     ovs-interface  unmanaged     --                 
tun0            ovs-interface  unmanaged     --                 
br-int          ovs-port       unmanaged     --                 
br0             ovs-port       unmanaged     --                 
ovn-007ae0-0    ovs-port       unmanaged     --                 
ovn-0fdca7-0    ovs-port       unmanaged     --                 
ovn-3e439a-0    ovs-port       unmanaged     --                 
ovn-49236f-0    ovs-port       unmanaged     --                 
ovn-d4e052-0    ovs-port       unmanaged     --                 
ovn-dd748d-0    ovs-port       unmanaged     --                 
ovn-ee6264-0    ovs-port       unmanaged     --                 
ovn-k8s-mp0     ovs-port       unmanaged     --                 
tun0            ovs-port       unmanaged     --                 
veth11d1fdb8    ovs-port       unmanaged     --                 
veth74b0121f    ovs-port       unmanaged     --                 
veth896a0680    ovs-port       unmanaged     --                 
vetha87ca073    ovs-port       unmanaged     --                 
vethd54c1fa3    ovs-port       unmanaged     --                 
vethdd4c983b    ovs-port       unmanaged     --                 
vethdea8fc48    ovs-port       unmanaged     --                 
vethff72f48b    ovs-port       unmanaged     --                 
vxlan0          ovs-port       unmanaged     --
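
The listing above still shows OpenShift SDN devices (`br0`, `tun0`, `vxlan0`, `vxlan_sys_4789`) and shows no `br-ex`, the interface named in the OVNKubernetes branch of the container command. A rough way to run the same check on other nodes is sketched below. It assumes the tabular `nmcli dev` output format shown above; the function names and the device list are illustrative, not an exhaustive inventory.

```shell
# OpenShift SDN device names seen in the listing (assumed list, for illustration).
SDN_DEVICES="br0 tun0 vxlan0 vxlan_sys_4789"

# Read `nmcli dev`-style output on stdin and print each distinct SDN device
# name that appears in the first column.
leftover_sdn_devices() {
  awk -v names="${SDN_DEVICES}" '
    BEGIN { n = split(names, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    ($1 in want) && !seen[$1]++ { print $1 }
  '
}

# Succeed if the device name given as $1 appears in the first column of stdin.
has_device() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Example, run on a node:
#   nmcli dev | leftover_sdn_devices
#   nmcli dev | has_device br-ex || echo "br-ex is missing"
```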

Comment 2 Peng Liu 2021-11-02 01:47:27 UTC
When doing the migration, did you follow the instructions in the 4.7 documentation or the 4.8 documentation? The procedures differ between 4.7 and 4.8.

Comment 3 Peng Liu 2021-11-03 01:52:46 UTC
It looks like a regression introduced in 4.7.18 by https://github.com/openshift/ovn-kubernetes/commit/1c28b968f1b0fb6252e7f4d3061b107d5cc498e0. Tested with both 4.7.17 and 4.7.18. The migration works with 4.7.17 but not 4.7.18.
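
For triage, a version check along these lines may be useful. It is only a sketch: the range is an assumption pieced together from this report (regression first seen in 4.7.18, advisory for the 4.7.40 bug fix update), not an authoritative statement of affected releases. The function names are hypothetical, and `sort -V` requires GNU coreutils.

```shell
# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed range: 4.7.18 (regression observed) up to, but not including,
# 4.7.40 (the bug fix update named in the advisory).
is_possibly_affected() {
  version_ge "$1" "4.7.18" && ! version_ge "$1" "4.7.40"
}

# Example:
#   is_possibly_affected 4.7.32 && echo "check this cluster before migrating"
```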

Comment 4 Andre Costa 2021-11-03 07:25:26 UTC
I followed the documentation for OCP 4.7.
I also did the migration with OCP 4.6.46 and everything went well. After starting the MCP, the nodes were updated with the rest of the network configuration.

Comment 11 errata-xmlrpc 2021-12-16 09:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.40 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.