Bug 1788818 - Install OCP 4.4 failed with error "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress
Summary: Install OCP 4.4 failed with error "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ricardo Carrillo Cruz
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-08 07:49 UTC by gaoshang
Modified: 2020-05-04 11:23 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:23:07 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2020:0581
Last Updated: 2020-05-04 11:23:40 UTC

Description gaoshang 2020-01-08 07:49:19 UTC
Description of problem:
Installing OCP 4.4.0-0.nightly-2020-01-06-072200 with OVNKubernetes on AWS failed with the following error:

ERROR Cluster operator network Degraded is True with RolloutHung: DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3m10s   Unable to apply 4.4.0-0.nightly-2020-01-06-072200: an unknown error has occurred

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.4.0-0.nightly-2020-01-06-072200 with OVNKubernetes on AWS
2. Installation fails with the error:
ERROR Cluster operator network Degraded is True with RolloutHung: DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress
3. From the bootstrap machine, the ovnkube-node pods are seen in CrashLoopBackOff status. The logs of the ovnkube-node container show the error "kubectl: command not found", which suggests the kubectl binary is missing from image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9 (a quick check follows below).
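One way to confirm the binary really is absent from the image (a sketch; it assumes the ovs-daemons container, which runs the same image and stays up, is usable):

# oc exec -n openshift-ovn-kubernetes ovnkube-node-5qkll -c ovs-daemons -- sh -c 'command -v kubectl || echo "kubectl not found"'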

# oc get pod -o wide -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-7djdn   4/4     Running            0          2m24s   10.0.130.88    ip-10-0-130-88.us-east-2.compute.internal    <none>           <none>
ovnkube-master-fkz47   4/4     Running            0          2m24s   10.0.149.152   ip-10-0-149-152.us-east-2.compute.internal   <none>           <none>
ovnkube-master-q488f   4/4     Running            0          2m24s   10.0.162.43    ip-10-0-162-43.us-east-2.compute.internal    <none>           <none>
ovnkube-node-5qkll     2/3     CrashLoopBackOff   4          2m24s   10.0.130.88    ip-10-0-130-88.us-east-2.compute.internal    <none>           <none>
ovnkube-node-h49hs     2/3     CrashLoopBackOff   4          2m24s   10.0.162.43    ip-10-0-162-43.us-east-2.compute.internal    <none>           <none>
ovnkube-node-xp4fv     2/3     CrashLoopBackOff   4          2m24s   10.0.149.152   ip-10-0-149-152.us-east-2.compute.internal   <none>           <none>

# oc get pods ovnkube-node-5qkll -n openshift-ovn-kubernetes -o jsonpath='{.spec.containers[*].name}'
ovs-daemons ovn-controller ovnkube-node

# oc logs ovnkube-node-5qkll -c ovnkube-node -n openshift-ovn-kubernetes
+ [[ -f /env/ip-10-0-130-88.us-east-2.compute.internal ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
/bin/bash: line 10: kubectl: command not found
+ db_ip=
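For comparison, running the same query with oc from a host that has the client would show whether the ovnkube-db endpoint itself is populated, isolating the failure to the missing kubectl binary rather than the endpoint (hypothetical invocation):

# oc get ep -n openshift-ovn-kubernetes ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}'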

[root@ip-10-0-9-118 ~]# oc describe pod ovnkube-node-5qkll -c ovnkube-node -n openshift-ovn-kubernetes
Error: unknown shorthand flag: 'c' in -c
See 'oc describe --help' for usage.
[root@ip-10-0-9-118 ~]# oc describe pod ovnkube-node-5qkll -n openshift-ovn-kubernetes
Name:                 ovnkube-node-5qkll
Namespace:            openshift-ovn-kubernetes
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-130-88.us-east-2.compute.internal/10.0.130.88
Start Time:           Wed, 08 Jan 2020 02:55:42 +0000
Labels:               app=ovnkube-node
                      component=network
                      controller-revision-hash=747fb98b88
                      kubernetes.io/os=linux
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   10.0.130.88
IPs:
  IP:           10.0.130.88
Controlled By:  DaemonSet/ovnkube-node
Containers:
  ovs-daemons:
    Container ID:  cri-o://7c4b8ef9ee57640d8d800e96fa9b787f34d9c9b5f0525921cc03cd98be704959
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      if [[ -f "/old/openvswitch/conf.db" && ! -f "/etc/openvswitch/conf.db" ]]; then
        mv /old/openvswitch/conf.db /etc/openvswitch/conf.db
      fi
      chown -R openvswitch:openvswitch /run/openvswitch
      chown -R openvswitch:openvswitch /etc/openvswitch
      function quit {
          /usr/share/openvswitch/scripts/ovs-ctl stop
          exit 0
      }
      trap quit SIGTERM
      /usr/share/openvswitch/scripts/ovs-ctl start --ovs-user=openvswitch:openvswitch --system-id=random
      ovs-appctl vlog/set "file:${OVS_LOG_LEVEL}"
      /usr/share/openvswitch/scripts/ovs-ctl --protocol=udp --dport=6081 enable-protocol
      
      tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &
      tail -F --pid=$(cat /var/run/openvswitch/ovsdb-server.pid) /var/log/openvswitch/ovsdb-server.log &
      wait
      
    State:          Running
      Started:      Wed, 08 Jan 2020 02:55:55 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   300Mi
    Liveness:   exec [/usr/share/openvswitch/scripts/ovs-ctl status] delay=15s timeout=1s period=5s #success=1 #failure=3
    Readiness:  exec [/usr/share/openvswitch/scripts/ovs-ctl status] delay=15s timeout=1s period=5s #success=1 #failure=3
    Environment:
      OVS_LOG_LEVEL:  info
      K8S_NODE:        (v1:spec.nodeName)
    Mounts:
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /lib/modules from host-modules (ro)
      /old/openvswitch from old-openvswitch-database (rw)
      /run/openvswitch from run-openvswitch (rw)
      /sys from host-sys (ro)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-node-token-dwfgf (ro)
  ovn-controller:
    Container ID:  cri-o://d4d7969789fffa369f6cdec23932803f812644943e8902f154c0f81b4716dd7b
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo /ovn-cert/tls.key
      cat /ovn-cert/tls.key
      echo /ovn-cert/tls.crt
      cat /ovn-cert/tls.crt
      echo /ovn-ca/ca-bundle.crt
      cat /ovn-ca/ca-bundle.crt
      exec ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off \
        --no-chdir --pidfile=/var/run/openvswitch/ovn-controller.pid \
        -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
        -vconsole:"${OVN_LOG_LEVEL}"
      
    State:          Running
      Started:      Wed, 08 Jan 2020 02:55:55 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_LOG_LEVEL:  info
      K8S_NODE:        (v1:spec.nodeName)
    Mounts:
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/openvswitch from run-openvswitch (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-node-token-dwfgf (ro)
  ovnkube-node:
    Container ID:  cri-o://9741afe526e2f7a04ec7dd07e537e0018ffd4344ba65c8f9b56f2c1480edd6a7
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9
    Port:          9101/TCP
    Host Port:     9101/TCP
    Command:
      /bin/bash
      -c
      set -xe
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
      ovn_config_namespace=openshift-ovn-kubernetes
      retries=0
      while true; do
        db_ip=$(kubectl get ep -n ${ovn_config_namespace} ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}')
        if [[ -n "${db_ip}" ]]; then
          break
        fi
        (( retries += 1 ))
        if [[ "${retries}" -gt 40 ]]; then
          echo "db endpoint never came up"
          exit 1
        fi
        echo "waiting for db endpoint"
        sleep 5
      done
      
      hybrid_overlay_flags=
      if [[ -n "" ]]; then
        hybrid_overlay_flags="--enable-hybrid-overlay"
        if [[ -n "" ]]; then
          hybrid_overlay_flags="${hybrid_overlay_flags} --hybrid-overlay-cluster-subnets="
        fi
      fi
      
      OVN_NODES_ARRAY=(ip-10-0-130-88.us-east-2.compute.internal ip-10-0-149-152.us-east-2.compute.internal ip-10-0-162-43.us-east-2.compute.internal)
      nb_addr_list=""
      sb_addr_list=""
      for i in "${!OVN_NODES_ARRAY[@]}"; do
        if [[ $i != 0 ]]; then
          nb_addr_list="${nb_addr_list},"
          sb_addr_list="${sb_addr_list},"
        fi
        host=$(getent ahostsv4 "${OVN_NODES_ARRAY[$i]}" | grep RAW | awk '{print $1}')
        nb_addr_list="${nb_addr_list}ssl://${host}:9641"
        sb_addr_list="${sb_addr_list}ssl://${host}:9642"
      done
      echo /ovn-cert/tls.key
      cat /ovn-cert/tls.key
      echo /ovn-cert/tls.crt
      cat /ovn-cert/tls.crt
      echo /ovn-ca/ca-bundle.crt
      cat /ovn-ca/ca-bundle.crt
      
      exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
        --cluster-subnets "${OVN_NET_CIDR}" \
        --k8s-service-cidr "${OVN_SVC_CIDR}" \
        --k8s-apiserver "https://api-int.sgao-cluster.qe.devcluster.openshift.com:6443" \
        --ovn-config-namespace ${ovn_config_namespace} \
        --nb-address "${nb_addr_list}" \
        --sb-address "${sb_addr_list}" \
        --nb-client-privkey /ovn-cert/tls.key \
        --nb-client-cert /ovn-cert/tls.crt \
        --nb-client-cacert /ovn-ca/ca-bundle.crt \
        --sb-client-privkey /ovn-cert/tls.key \
        --sb-client-cert /ovn-cert/tls.crt \
        --sb-client-cacert /ovn-ca/ca-bundle.crt \
        --config-file=/run/ovnkube-config/ovnkube.conf \
        --loglevel "${OVN_KUBE_LOG_LEVEL}" \
        ${hybrid_overlay_flags} \
        --pidfile /var/run/openvswitch/ovnkube-node.pid \
        --metrics-bind-address "0.0.0.0:9101"
      
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   + [[ -f /env/ip-10-0-130-88.us-east-2.compute.internal ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
/bin/bash: line 10: kubectl: command not found
+ db_ip=

      Exit Code:    127
      Started:      Wed, 08 Jan 2020 03:27:02 +0000
      Finished:     Wed, 08 Jan 2020 03:27:02 +0000
    Ready:          False
    Restart Count:  11
    Requests:
      cpu:      100m
      memory:   300Mi
    Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      OVN_HYBRID_OVERLAY_ENABLE:    
      OVN_HYBRID_OVERLAY_NET_CIDR:  
      KUBERNETES_SERVICE_PORT:      6443
      KUBERNETES_SERVICE_HOST:      api-int.sgao-cluster.qe.devcluster.openshift.com
      OVN_KUBE_LOG_LEVEL:           4
      K8S_NODE:                      (v1:spec.nodeName)
    Mounts:
      /cni-bin-dir from host-cni-bin (rw)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-netd (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /host from host-slash (ro)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/netns from host-run-netns (ro)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
      /run/ovnkube-config/ from ovnkube-config (rw)
      /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-node-token-dwfgf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:  
  var-lib-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch/data
    HostPathType:  
  etc-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch/etc
    HostPathType:  
  run-openvswitch:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  old-openvswitch-database:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  host-run-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /run/ovn-kubernetes
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:  
  host-cni-netd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:  
  host-var-lib-cni-networks-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/ovn-k8s-cni-overlay
    HostPathType:  
  ovnkube-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovnkube-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  ovn-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovn-ca
    Optional:  false
  ovn-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-cert
    Optional:    false
  ovn-kubernetes-node-token-dwfgf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-kubernetes-node-token-dwfgf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     
Events:
  Type     Reason     Age                  From                                                Message
  ----     ------     ----                 ----                                                -------
  Normal   Scheduled  <unknown>            default-scheduler                                   Successfully assigned openshift-ovn-kubernetes/ovnkube-node-5qkll to ip-10-0-130-88.us-east-2.compute.internal
  Normal   Pulling    36m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9"
  Normal   Pulled     35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9"
  Normal   Created    35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Created container ovs-daemons
  Normal   Started    35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Started container ovs-daemons
  Normal   Pulled     35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9" already present on machine
  Normal   Created    35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Created container ovn-controller
  Normal   Started    35m                  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Started container ovn-controller
  Normal   Pulled     35m (x4 over 35m)    kubelet, ip-10-0-130-88.us-east-2.compute.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:55e9bb82599f0f3ddd65ea9b7085290f770228a163a0ca0c8b810e34ab9f38d9" already present on machine
  Normal   Created    35m (x4 over 35m)    kubelet, ip-10-0-130-88.us-east-2.compute.internal  Created container ovnkube-node
  Normal   Started    35m (x4 over 35m)    kubelet, ip-10-0-130-88.us-east-2.compute.internal  Started container ovnkube-node
  Warning  BackOff    53s (x162 over 35m)  kubelet, ip-10-0-130-88.us-east-2.compute.internal  Back-off restarting failed container

Actual results:
Installation failed

Expected results:
Installation succeeds

Additional info:
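For anyone triaging the same symptom, the Degraded message that the installer reports can be read directly from the network cluster operator (a minimal check; it assumes the API server is still reachable):

# oc get clusteroperator network -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'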

Comment 1 gaoshang 2020-01-13 03:36:01 UTC
This bug is fixed in OCP 4.4.0-0.nightly-2020-01-12-032939.

Version:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-01-12-032939   True        False         17m     Cluster version is 4.4.0-0.nightly-2020-01-12-032939

Steps:
1. Install OCP 4.4.0-0.nightly-2020-01-12-032939; installation succeeds.
2. Check that the ovnkube-node image is updated:
# oc describe pod ovnkube-node-2sm9v -n openshift-ovn-kubernetes | grep "Image ID"
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba36719f8d038c93b2ba4b8de7f12846d7b96fe812c64b2c74242d31e3061092
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba36719f8d038c93b2ba4b8de7f12846d7b96fe812c64b2c74242d31e3061092
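
A quick sanity check that the binary is now present in the image (a sketch; the pod name is taken from the step above):

# oc exec -n openshift-ovn-kubernetes ovnkube-node-2sm9v -c ovnkube-node -- sh -c 'command -v kubectl'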

Comment 2 Ricardo Carrillo Cruz 2020-01-17 09:14:14 UTC
Hey @gaoshang, so is this fixed then? Can we close the bug?

Comment 3 gaoshang 2020-01-19 02:44:22 UTC
(In reply to Ricardo Carrillo Cruz from comment #2)
> Hey @gaoshang, so is this fixed then? Can we close the bug?

According to the above comment, I have moved the bug status to VERIFIED, thanks.

Comment 5 errata-xmlrpc 2020-05-04 11:23:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

