Bug 1748162
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | [GCP] Failed to install cluster with OVN network type | | |
| Product: | OpenShift Container Platform | Reporter: | zhaozhanqi <zzhao> |
| Component: | Networking | Assignee: | Phil Cameron <pcameron> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag Saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | CC: | anusaxen, aos-bugs, bbennett, bsong, cdc, dcbw, pmuller, scuppett |
| Priority: | high | Keywords: | TestBlocker |
| Version: | 4.2.0 | Target Milestone: | --- |
| Target Release: | 4.4.0 | Hardware: | All |
| OS: | All | Doc Type: | Bug Fix |
| Last Closed: | 2020-05-04 11:13:32 UTC | Type: | Bug |
This issue can be reproduced on both vSphere and GCP clusters. I have no idea why the nodes changed to scheduling disabled. When I manually re-enable scheduling with `oc adm uncordon`, the OVN pod is scheduled and runs.

Bringing this back into the tech preview for ovn-kubernetes in 4.2. We should support:

- IPI AWS
- IPI Azure
- UPI

If only IPI vSphere and IPI GCP do not work, that is acceptable as a limitation of the tech preview. Note that this means we need to make vSphere UPI work for 4.2. Whatever cloud is used needs to have the right ports open between machines. We have this for AWS in the installer right now. How does this work for vSphere? Also, can we get ovnkube master pod/container logs from a failed vSphere install?

Hi Dan,
When I use OVN as the network type to install a cluster on vSphere, the ovnkube-master pod stays Pending because all master and worker nodes were marked 'SchedulingDisabled'. I could not find the reason or which step caused it; I tried many times and got the same result.
```
[root@dhcp-140-66 ~]# oc get node
NAME              STATUS                        ROLES    AGE   VERSION
compute-0         NotReady,SchedulingDisabled   worker   47m   v1.14.6+82219910a
control-plane-0   NotReady,SchedulingDisabled   master   47m   v1.14.6+82219910a
```
When I changed the nodes back to schedulable with `oc adm uncordon control-plane-0`, the OVN pods could run:
```
# oc get pod -n openshift-ovn-kubernetes
NAME                              READY   STATUS    RESTARTS   AGE
ovnkube-master-78c6798568-x9sbv   4/4     Running   2          56m
ovnkube-node-ggpxc                3/3     Running   13         58m
ovnkube-node-h5bt6                3/3     Running   13         58m
```
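The uncordon workaround above can be applied to every cordoned node in one pass. A minimal sketch, assuming `awk` is available; the saved `oc get node` output below stands in for a live cluster, where you would pipe `oc get node` directly:

```shell
# Sketch: generate `oc adm uncordon` commands for every node whose STATUS
# column contains SchedulingDisabled. Sample output replaces a live cluster.
cat > /tmp/nodes.txt <<'EOF'
NAME              STATUS                        ROLES    AGE   VERSION
compute-0         NotReady,SchedulingDisabled   worker   47m   v1.14.6+82219910a
control-plane-0   NotReady,SchedulingDisabled   master   47m   v1.14.6+82219910a
EOF

# Skip the header row, match cordoned nodes, print the uncordon command.
awk 'NR > 1 && $2 ~ /SchedulingDisabled/ {print "oc adm uncordon " $1}' /tmp/nodes.txt
```

Against a real cluster the same pipeline would be `oc get node | awk '...' | sh`, though uncordoning nodes the installer cordoned on purpose is only a debugging aid, not a fix.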
but it seems the other cluster components cannot start up:
```
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                           4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
dns                                        4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
insights                                   4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m
kube-apiserver                             4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
kube-controller-manager                    4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
kube-scheduler                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
machine-api                                4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
machine-config                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
network                                    4.2.0-0.nightly-2019-09-08-180038   True        False         False      7m58s
openshift-apiserver                        4.2.0-0.nightly-2019-09-08-180038   False       False         False      55m
openshift-controller-manager                                                   False       True          False      56m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-packageserver                                       False       True          False      55m
service-ca                                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m
```
```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          58m     Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error has occurred
```
```
[root@dhcp-140-66 ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-09-09T02:01:50Z"
    generation: 1
    name: version
    resourceVersion: "11439"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: c96109c3-d2a5-11e9-86b6-0050568b99b8
  spec:
    channel: stable-4.2
    clusterID: df5120e1-96f0-408a-b518-8af75a89aa5b
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      status: "False"
      type: Available
    - lastTransitionTime: "2019-09-09T02:58:08Z"
      message: |-
        Multiple errors are preventing progress:
        * Could not update oauthclient "console" (263 of 416): the server does not recognize this resource, check extension API servers
        * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (214 of 416): resource may have been deleted
        * Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (411 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (376 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (415 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-image-registry/image-registry" (382 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (393 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (397 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (401 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (152 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/machine-api-operator" (403 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (405 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (385 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (388 of 416): the server does not recognize this resource, check extension API servers
      reason: MultipleErrors
      status: "True"
      type: Failing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error
        has occurred'
      reason: MultipleErrors
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.2.0-0.nightly-2019-09-08-180038 not found in the "stable-4.2" channel'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      version: 4.2.0-0.nightly-2019-09-08-180038
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      startedTime: "2019-09-09T02:02:06Z"
      state: Partial
      verified: false
      version: 4.2.0-0.nightly-2019-09-08-180038
    observedGeneration: 1
    versionHash: MBWmxYuYaYQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
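As a triage aid, the condition types and statuses can be pulled out of that ClusterVersion YAML with a short awk sketch. This is an assumption-laden example: the trimmed here-doc below stands in for live `oc get clusterversion -o yaml` output, and the quotes printed are the quotes present in the YAML itself:

```shell
# Sketch: print "type status" pairs from ClusterVersion conditions.
# Trimmed sample data; on a live cluster, feed `oc get clusterversion -o yaml`.
cat > /tmp/cv.yaml <<'EOF'
  conditions:
  - lastTransitionTime: "2019-09-09T02:02:06Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2019-09-09T02:58:08Z"
    status: "True"
    type: Failing
  - lastTransitionTime: "2019-09-09T02:02:06Z"
    status: "True"
    type: Progressing
EOF

# Remember the most recent status value, emit it when the matching type appears.
awk '$1 == "status:" {s=$2} $1 == "type:" {print $2, s}' /tmp/cv.yaml
# → Available "False" / Failing "True" / Progressing "True", one per line
```

A real pipeline would more robustly use `oc get clusterversion -o jsonpath=...`, but the awk form works on a saved must-gather dump with no extra tooling.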
I've filed the PR to open the ports for GCP and AWS IPI. This is not the same bug as whatever is wrong with vsphere - it doesn't have an intra-cluster firewall. Zhanqi, could you please re-test with vsphere and open a separate bug (And keep the cluster up - I don't have a vsphere cluster handy...). Thanks. Casey. I will file another bug for vsphere Assigning to Phil, who is actually working on this. Created attachment 1616172 [details]
OVN_logs_GCP
GCP is not a supported platform for OVN on 4.2. Bumping.

@zhaozhanqi comment #25 `oc get all --all-namespaces`: everything that targets a master is Pending. Something is preventing pods from starting on the masters. I can't access the cluster any more.

```
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-0.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-1.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-2.c.openshift-qe.internal   1/1   Running   0    9h
openshift-apiserver-operator                 pod/openshift-apiserver-operator-6f45554457-b576t                    0/1   Pending   0    9h
openshift-cloud-credential-operator          pod/cloud-credential-operator-7b4c65dbd5-kmrjf                       0/1   Pending   0    9h
openshift-cluster-machine-approver           pod/machine-approver-7bf6885dff-rp9hm                                0/1   Pending   0    9h
openshift-cluster-version                    pod/cluster-version-operator-bf9c75cc4-4djcs                         0/1   Pending   0    9h
openshift-controller-manager-operator        pod/openshift-controller-manager-operator-7c474d6cfc-wcvxf           0/1   Pending   0    9h
openshift-dns-operator                       pod/dns-operator-79dbd8d86f-v67xv                                    0/1   Pending   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-0.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-1.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-2.c.openshift-qe.internal             2/2   Running   0    9h
openshift-insights                           pod/insights-operator-646489b44d-jcdp8                               0/1   Pending   0    9h
openshift-kube-apiserver-operator            pod/kube-apiserver-operator-65fc497c9-pbwp6                          0/1   Pending   0    9h
openshift-kube-controller-manager-operator   pod/kube-controller-manager-operator-7f65ffd9b9-66rth                0/1   Pending   0    9h
openshift-kube-scheduler-operator            pod/openshift-kube-scheduler-operator-75bd9d6b59-v5rcp               0/1   Pending   0    9h
openshift-machine-api                        pod/machine-api-operator-7f496594d4-mcxh2                            0/1   Pending   0    9h
openshift-machine-config-operator            pod/machine-config-operator-55f5c9d548-m9qrw                         0/1   Pending   0    9h
openshift-multus                             pod/multus-58wt5                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-snp52                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-zxlcj                                                     1/1   Running   59   9h
openshift-network-operator                   pod/network-operator-74b8d64fc5-gncfc                                1/1   Running   1    9h
openshift-operator-lifecycle-manager         pod/catalog-operator-57b6884cd6-gqvg5                                0/1   Pending   0    9h
openshift-operator-lifecycle-manager         pod/olm-operator-7554464b74-kstts                                    0/1   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-master-644d65f44-zwmpt                                   0/4   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-node-2gbxg                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-6qg4r                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-wfld8                                               2/3   Running   94   9h
openshift-service-ca-operator                pod/service-ca-operator-674ccdc57d-55cq7                             0/1   Pending   0    9h
```

@Phil, I am afraid it also failed on 4.3.0-0.nightly-2019-11-01-215341 with the same reasoning as in comment 37 and comment 38. Did it work in your local env?

*** Bug 1769136 has been marked as a duplicate of this bug. ***

Moving it to Assigned state as per the ongoing comments.

*** Bug 1774594 has been marked as a duplicate of this bug. ***

https://github.com/openshift/cluster-network-operator/pull/396/ changes `getent hosts` to `getent ahostsv4`. The previous version was grabbing IPv6 addresses and mishandling them; this version is IPv4-only.

Problem: oc commands don't work on 4.3 from my laptop, but they do work when I ssh to the bootstrap node. openshift-install-linux-4.4.0-0.ci-2019-11-25-114444.tar.gz works. OVN still doesn't come up, but we can once again work with debug images.
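The `getent` change referenced above is easy to demonstrate on any Linux host: `getent hosts` may return an IPv6 entry first (e.g. `::1` for localhost), while `getent ahostsv4` resolves IPv4 only. A sketch; exact output depends on the host's `/etc/hosts` and resolver:

```shell
# May print an IPv6 address such as "::1 localhost", which the ovnkube
# scripts were mishandling when they expected IPv4.
getent hosts localhost

# Forces IPv4-only resolution, so the caller always gets an IPv4 address.
getent ahostsv4 localhost
```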
ovnkube-node fails with:

```
time="2019-11-25T16:08:05Z" level=error msg="Error while obtaining addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal on node pcamer-qkfnm-m-0.c.openshift-gce-devel.internal - Error while obtaining dynamic addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal: OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.6:9641,ssl:10.0.0.5:9641,ssl:10.0.0.3:9641 --timeout=15 get logical_switch_port k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal dynamic_addresses' failed: exit status 1"
```

If I rsh onto the container, the same command works; if I `oc delete` the pod, it comes back up correctly. Looks like a race. Continuing to debug...

Thank you, Phil, for the update.

Verified this bug on 4.4.0-0.nightly-2020-01-16-113546. OVN can be installed on a GCP cluster.

*** Bug 1745546 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
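Given that the failing `ovn-nbctl` query succeeds when retried by hand, one plausible mitigation for the race is a bounded retry loop around the query. This is a hypothetical sketch, not the actual fix: `query_dynamic_addresses` is a stub standing in for the real `ovn-nbctl ... get logical_switch_port ... dynamic_addresses` call, rigged to fail twice before succeeding, mimicking "rsh onto the container and the command works".

```shell
# Hypothetical retry-with-backoff sketch for the race described above.
tries=0
query_dynamic_addresses() {
  # Stub for the real ovn-nbctl invocation: fails twice, then succeeds.
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

attempt=1
while ! query_dynamic_addresses; do
  if [ "$attempt" -ge 5 ]; then
    echo "giving up after $attempt attempts" >&2
    exit 1
  fi
  sleep "$attempt"              # simple linear backoff: 1s, 2s, ...
  attempt=$((attempt + 1))
done
echo "dynamic_addresses obtained on attempt $attempt"
# → dynamic_addresses obtained on attempt 3
```

Kubernetes already provides an outer retry via CrashLoopBackOff restarts, which is why deleting the pod also "fixes" it; an inner retry just avoids burning a whole container restart on a transient northbound-DB hiccup.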
Description of problem:
Set up the cluster with the OVN network type; the cluster does not work. Check the OVN pods:

```
# oc get pod -n openshift-ovn-kubernetes
NAME                              READY   STATUS             RESTARTS   AGE
ovnkube-master-785b7b768d-mhbhq   0/4     Pending            0          33m
ovnkube-node-bq4d5                1/3     CrashLoopBackOff   9          35m
ovnkube-node-jvv5p                1/3     CrashLoopBackOff   9          35m
```

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-02-172410

How reproducible:
always

Steps to Reproduce:
1. Install a cluster on vSphere with the OVN network type.
2. Check the pods in openshift-ovn-kubernetes; the ovnkube-master pod cannot be scheduled:

```
# oc describe pod ovnkube-master-785b7b768d-mhbhq -n openshift-ovn-kubernetes
Name:               ovnkube-master-785b7b768d-mhbhq
Namespace:          openshift-ovn-kubernetes
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               <none>
Labels:             component=network
                    kubernetes.io/os=linux
                    name=ovnkube-master
                    openshift.io/component=network
                    pod-template-hash=785b7b768d
                    type=infra
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      ReplicaSet/ovnkube-master-785b7b768d
Containers:
  run-ovn-northd:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:    /root/ovnkube.sh run-ovn-northd
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_NORTHD:            -vconsole:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                  (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  nb-ovsdb:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:    /root/ovnkube.sh nb-ovsdb
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_NB:                -vconsole:info -vfile:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                  (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  sb-ovsdb:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:    /root/ovnkube.sh sb-ovsdb
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_LOG_SB:                -vconsole:info -vfile:info
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                  (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/kubernetes/ from host-var-run-kubernetes (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
  ovnkube-master:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9c1d6e6c987fda3b4ba12af083c0a40f10eef92b1f2711cbf02f677fa61848f8
    Port:       <none>
    Host Port:  <none>
    Command:    /root/ovnkube.sh ovn-master
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      OVN_DAEMONSET_VERSION:     3
      OVN_MASTER:                true
      OVNKUBE_LOGLEVEL:          4
      OVN_NET_CIDR:              <set to the key 'net_cidr' of config map 'ovn-config'>  Optional: false
      OVN_SVC_CIDR:              <set to the key 'svc_cidr' of config map 'ovn-config'>  Optional: false
      K8S_NODE:                  (v1:spec.nodeName)
      K8S_APISERVER:             <set to the key 'k8s_apiserver' of config map 'ovn-config'>  Optional: false
      OVN_KUBERNETES_NAMESPACE:  openshift-ovn-kubernetes (v1:metadata.namespace)
    Mounts:
      /etc/openvswitch/ from host-var-lib-ovs (rw)
      /var/lib/openvswitch/ from host-var-lib-ovs (rw)
      /var/run/kubernetes/ from host-var-run-kubernetes (rw)
      /var/run/openvswitch/ from host-var-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ovn-kubernetes-controller-token-8dm2j (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  host-var-lib-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch
    HostPathType:
  host-var-run-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  host-var-run-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/kubernetes
    HostPathType:
  ovn-kubernetes-controller-token-8dm2j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-kubernetes-controller-token-8dm2j
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
                 kubernetes.io/os=linux
                 node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/not-ready:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  46m (x6 over 46m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  37m (x6 over 37m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  34m (x2 over 34m)   default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling  13s (x24 over 34m)  default-scheduler  0/2 nodes are available: 2 node(s) were unschedulable.
```

3. Check the node labels:

```
# oc get node --show-labels
NAME              STATUS                        ROLES    AGE   VERSION             LABELS
compute-0         NotReady,SchedulingDisabled   worker   41m   v1.14.0+2b7562925   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=compute-0,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
control-plane-0   NotReady,SchedulingDisabled   master   41m   v1.14.0+2b7562925   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=control-plane-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos
```

Actual results:
The cluster cannot be set up with OVN.

Expected results:

Additional info: