Description of problem:
Set up a cluster with the OVN network type; the cluster does not work. Checking the ovn pods:

# oc get pod -n openshift-ovn-kubernetes
NAME                              READY   STATUS             RESTARTS   AGE
ovnkube-master-785b7b768d-mhbhq   0/4     Pending            0          33m
ovnkube-node-bq4d5                1/3     CrashLoopBackOff   9          35m
ovnkube-node-jvv5p                1/3     CrashLoopBackOff   9          35m

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-02-172410

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster on vSphere with the OVN network type.
2. Check the pods in openshift-ovn-kubernetes; the ovnkube-master pod cannot be scheduled.

# oc describe pod ovnkube-master-785b7b768d-mhbhq -n openshift-ovn-kubernetes
Name:               ovnkube-master-785b7b768d-mhbhq
Namespace:          openshift-ovn-kubernetes
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               <none>
Labels:             component=network
                    kubernetes.io/os=linux
                    name=ovnkube-master
                    openshift.io/component=network
                    pod-template-hash=785b7b768d
                    type=infra
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      ReplicaSet/ovnkube-master-785b7b768d
Containers:
  run-ovn-northd:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:
      /root/ovnkube.sh
      run-ovn-northd
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_NORTHD:            -vconsole:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                   (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  nb-ovsdb:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:
      /root/ovnkube.sh
      nb-ovsdb
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_NB:                -vconsole:info -vfile:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                   (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  sb-ovsdb:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:
      /root/ovnkube.sh
      sb-ovsdb
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_SB:                -vconsole:info -vfile:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                   (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/kubernetes/ from host-var-run-kubernetes (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  ovnkube-master:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:
      /root/ovnkube.sh
      ovn-master
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_MASTER:                true
      OVNKUBE_LOGLEVEL:          4
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                   (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/kubernetes/ from host-var-run-kubernetes (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  host-var-lib-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch
    HostPathType:
  host-var-run-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  host-var-run-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubernetes
    HostPathType:
  ovn-kubernetes-controller-token-8dm2j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-kubernetes-controller-token-8dm2j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
                 kubernetes.io/os=linux
                 node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/not-ready:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  46m (x6 over 46m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  37m (x6 over 37m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  34m (x2 over 34m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  13s (x24 over 34m)  default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.

3. oc get node --show-labels

NAME              STATUS                        ROLES    AGE   VERSION             LABELS
compute-0         NotReady,SchedulingDisabled   worker   41m   v1.14.0+2b7562925   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
control-plane-0   NotReady,SchedulingDisabled   master   41m   v1.14.0+2b7562925   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos

Actual results:
The cluster cannot be set up with OVN.

Expected results:

Additional info:
This issue can be reproduced on both vSphere and GCP clusters.
I have no idea why the nodes were marked SchedulingDisabled. When I manually make them schedulable again with 'oc adm uncordon', the ovn pods can be scheduled and run.
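The manual workaround can be scripted as a sketch: filter 'oc get node'-style output down to the cordoned nodes and uncordon each one. The sample lines below stand in for real 'oc get node --no-headers' output (node names taken from this report), and 'echo' keeps it a dry run:

```shell
# Find nodes whose STATUS contains SchedulingDisabled and uncordon
# them. The printf block simulates "oc get node --no-headers" output;
# drop "echo" to run the real command against a live cluster.
cordoned=$(printf '%s\n' \
  'compute-0        NotReady,SchedulingDisabled  worker  47m' \
  'control-plane-0  NotReady,SchedulingDisabled  master  47m' |
  awk '$2 ~ /SchedulingDisabled/ {print $1}')
for node in $cordoned; do
  echo oc adm uncordon "$node"
done
```

Note this only clears the symptom; whatever cordoned the nodes in the first place is still unexplained.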
Bringing this back into the tech preview for ovn-kubernetes in 4.2. We should support:
- IPI AWS
- IPI Azure
- UPI

If only IPI vSphere and IPI GCP do not work, that is okay as a limitation of the tech preview.
Note that this means we need to make vSphere UPI work for 4.2.
Whatever cloud is used needs to have the right ports open between machines. We have this for AWS in the installer right now. How does this work for vSphere? Also, can we get ovnkube master pod/container logs from a failed vSphere install?
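One quick way to answer the ports question from a node is to probe the OVN database ports directly. A sketch with stated assumptions: 10.0.0.6 is an illustrative master IP, 9641 is the northbound DB port that appears in logs later in this bug, and 9642 is assumed for the southbound DB:

```shell
# Probe a TCP port on a remote host using bash's /dev/tcp redirection;
# reports whether the port answered within 2 seconds.
check_port() {
  host=$1 port=$2
  if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
    echo "tcp/$port open"
  else
    echo "tcp/$port closed or filtered"
  fi
}

# Example: check the OVN NB/SB database ports on one master (assumed IP).
for p in 9641 9642; do check_port 10.0.0.6 "$p"; done
```

If these report closed/filtered between masters, the cloud's firewall rules (security groups on AWS/GCP, the vSphere network configuration) are the first thing to check.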
Hi, Dan.

When I use OVN as the network type to install a cluster on vSphere, the ovn master pod is pending because all master and worker nodes were marked 'SchedulingDisabled'. I could not find out why, or which step caused it. I tried many times and got the same result.

[root@dhcp-140-66 ~]# oc get node
NAME              STATUS                        ROLES    AGE   VERSION
compute-0         NotReady,SchedulingDisabled   worker   47m   v1.14.6+82219910a
control-plane-0   NotReady,SchedulingDisabled   master   47m   v1.14.6+82219910a

When I make the nodes schedulable again with 'oc adm uncordon control-plane-0', the OVN pods can run:

# oc get pod -n openshift-ovn-kubernetes
NAME                              READY   STATUS    RESTARTS   AGE
ovnkube-master-78c6798568-x9sbv   4/4     Running   2          56m
ovnkube-node-ggpxc                3/3     Running   13         58m
ovnkube-node-h5bt6                3/3     Running   13         58m

but it seems the other components of the cluster cannot start up:

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                           4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
dns                                        4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
insights                                   4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m
kube-apiserver                             4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
kube-controller-manager                    4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
kube-scheduler                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
machine-api                                4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
machine-config                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
network                                    4.2.0-0.nightly-2019-09-08-180038   True        False         False      7m58s
openshift-apiserver                        4.2.0-0.nightly-2019-09-08-180038   False       False         False      55m
openshift-controller-manager                                                   False       True          False      56m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-packageserver                                       False       True          False      55m
service-ca                                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          58m     Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error has occurred

[root@dhcp-140-66 ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-09-09T02:01:50Z"
    generation: 1
    name: version
    resourceVersion: "11439"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: c96109c3-d2a5-11e9-86b6-0050568b99b8
  spec:
    channel: stable-4.2
    clusterID: df5120e1-96f0-408a-b518-8af75a89aa5b
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      status: "False"
      type: Available
    - lastTransitionTime: "2019-09-09T02:58:08Z"
      message: |-
        Multiple errors are preventing progress:
        * Could not update oauthclient "console" (263 of 416): the server does not recognize this resource, check extension API servers
        * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (214 of 416): resource may have been deleted
        * Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (411 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (376 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (415 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-image-registry/image-registry" (382 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (393 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (397 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (401 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (152 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/machine-api-operator" (403 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (405 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (385 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (388 of 416): the server does not recognize this resource, check extension API servers
      reason: MultipleErrors
      status: "True"
      type: Failing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error
        has occurred'
      reason: MultipleErrors
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.2.0-0.nightly-2019-09-08-180038 not found in the "stable-4.2" channel'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      version: 4.2.0-0.nightly-2019-09-08-180038
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      startedTime: "2019-09-09T02:02:06Z"
      state: Partial
      verified: false
      version: 4.2.0-0.nightly-2019-09-08-180038
    observedGeneration: 1
    versionHash: MBWmxYuYaYQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
I've filed the PR to open the ports for GCP and AWS IPI. This is not the same bug as whatever is wrong with vSphere - vSphere doesn't have an intra-cluster firewall. Zhanqi, could you please re-test on vSphere and open a separate bug? (And keep the cluster up - I don't have a vSphere cluster handy...)
Thanks, Casey. I will file another bug for vSphere.
Assigning to Phil, who is actually working on this.
Created attachment 1616172 [details] OVN_logs_GCP
GCP is not a supported platform for OVN on 4.2. Bumping.
@zhaozhanqi comment #25: oc get all --all-namespaces. Everything that targets a master is pending. There is something preventing pods from starting on the masters. I can't access the cluster any more.

kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-0.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-1.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-2.c.openshift-qe.internal   1/1   Running   0    9h
openshift-apiserver-operator                 pod/openshift-apiserver-operator-6f45554457-b576t                    0/1   Pending   0    9h
openshift-cloud-credential-operator          pod/cloud-credential-operator-7b4c65dbd5-kmrjf                       0/1   Pending   0    9h
openshift-cluster-machine-approver           pod/machine-approver-7bf6885dff-rp9hm                                0/1   Pending   0    9h
openshift-cluster-version                    pod/cluster-version-operator-bf9c75cc4-4djcs                         0/1   Pending   0    9h
openshift-controller-manager-operator        pod/openshift-controller-manager-operator-7c474d6cfc-wcvxf           0/1   Pending   0    9h
openshift-dns-operator                       pod/dns-operator-79dbd8d86f-v67xv                                    0/1   Pending   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-0.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-1.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-2.c.openshift-qe.internal             2/2   Running   0    9h
openshift-insights                           pod/insights-operator-646489b44d-jcdp8                               0/1   Pending   0    9h
openshift-kube-apiserver-operator            pod/kube-apiserver-operator-65fc497c9-pbwp6                          0/1   Pending   0    9h
openshift-kube-controller-manager-operator   pod/kube-controller-manager-operator-7f65ffd9b9-66rth                0/1   Pending   0    9h
openshift-kube-scheduler-operator            pod/openshift-kube-scheduler-operator-75bd9d6b59-v5rcp               0/1   Pending   0    9h
openshift-machine-api                        pod/machine-api-operator-7f496594d4-mcxh2                            0/1   Pending   0    9h
openshift-machine-config-operator            pod/machine-config-operator-55f5c9d548-m9qrw                         0/1   Pending   0    9h
openshift-multus                             pod/multus-58wt5                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-snp52                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-zxlcj                                                     1/1   Running   59   9h
openshift-network-operator                   pod/network-operator-74b8d64fc5-gncfc                                1/1   Running   1    9h
openshift-operator-lifecycle-manager         pod/catalog-operator-57b6884cd6-gqvg5                                0/1   Pending   0    9h
openshift-operator-lifecycle-manager         pod/olm-operator-7554464b74-kstts                                    0/1   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-master-644d65f44-zwmpt                                   0/4   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-node-2gbxg                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-6qg4r                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-wfld8                                               2/3   Running   94   9h
openshift-service-ca-operator                pod/service-ca-operator-674ccdc57d-55cq7                             0/1   Pending   0    9h
@Phil, I am afraid it also failed on 4.3.0-0.nightly-2019-11-01-215341, for the same reason as in comment 37 and comment 38. Did it work in your local env?
*** Bug 1769136 has been marked as a duplicate of this bug. ***
Moving this to the ASSIGNED state per the ongoing comments.
*** Bug 1774594 has been marked as a duplicate of this bug. ***
https://github.com/openshift/cluster-network-operator/pull/396/ changes 'getent hosts' to 'getent ahostsv4'. The previous version was picking up IPv6 addresses and mishandling them; this version is IPv4-only.
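The behavioral difference the PR relies on can be seen directly with getent on any glibc system (localhost used here as a stand-in for the node hostname):

```shell
# "getent hosts" may resolve a name to its IPv6 (AAAA/::1) entry,
# which the ovnkube scripts then mishandled; "getent ahostsv4"
# forces IPv4-only resolution.
getent hosts localhost       # may print an IPv6 line such as "::1 localhost"
getent ahostsv4 localhost    # prints 127.0.0.1 entries only
```
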
Problem: oc commands don't work on 4.3 from my laptop, but do work after ssh-ing to the bootstrap node. openshift-install-linux-4.4.0-0.ci-2019-11-25-114444.tar.gz works. OVN still doesn't come up, but we can once again work with debug images.
ovnkube-node fails with:

time="2019-11-25T16:08:05Z" level=error msg="Error while obtaining addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal on node pcamer-qkfnm-m-0.c.openshift-gce-devel.internal - Error while obtaining dynamic addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal: OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.6:9641,ssl:10.0.0.5:9641,ssl:10.0.0.3:9641 --timeout=15 get logical_switch_port k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal dynamic_addresses' failed: exit status 1"

If I rsh into the container, the same command works. If I oc delete the pod, it comes back up correctly. This looks like a race. Continuing to debug...
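Since the same ovn-nbctl command succeeds on a manual retry, a bounded retry loop around the call would mask this kind of startup race. A generic sketch, not the actual ovnkube.sh code; the ovn-nbctl invocation in the usage comment is illustrative:

```shell
# Retry a command up to $1 times, sleeping 1s between attempts, to
# tolerate a dependency (here: the OVN NB database) that is not ready
# yet. Returns non-zero only after all attempts fail.
retry() {
  max=$1; shift
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$max" ] && return 1
    sleep 1
  done
}

# Usage (illustrative):
#   retry 5 ovn-nbctl --db=ssl:... get logical_switch_port "$port" dynamic_addresses
```
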
Thank you, Phil for the update.
Verified this bug on 4.4.0-0.nightly-2020-01-16-113546. OVN can be installed on a GCP cluster.
*** Bug 1745546 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581