Bug 1748162
Summary: [GCP] Failed to install cluster with OVN network type

Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.2.0
Target Release: 4.4.0
Hardware: All
OS: All
Keywords: TestBlocker
Reporter: zhaozhanqi <zzhao>
Assignee: Phil Cameron <pcameron>
QA Contact: Anurag saxena <anusaxen>
CC: anusaxen, aos-bugs, bbennett, bsong, cdc, dcbw, pmuller, scuppett
Doc Type: Bug Fix
Type: Bug
Last Closed: 2020-05-04 11:13:32 UTC
Description
zhaozhanqi
2019-09-03 03:28:58 UTC
This issue can be reproduced on vSphere and GCP clusters. I have no idea why the nodes switched scheduling to disabled. When I manually re-enable scheduling with `oc adm uncordon`, the ovn pod can be scheduled and runs.

Bringing this back into the tech preview for ovn-kubernetes in 4.2. We should support:
- IPI AWS
- IPI Azure
- UPI
If only IPI vSphere and IPI GCP do not work, that is okay as a limitation of the tech preview. Note that this means we need to make vSphere UPI work for 4.2.

Whatever cloud is used needs to have the right ports open between machines. We have this for AWS in the installer right now. How does this work for vSphere? Also, can we get ovnkube master pod/container logs from a failed vSphere install?

Hi, Dan. When I use OVN as the network type to install a cluster on vSphere, the ovn master pod stays Pending because all masters and workers are marked 'SchedulingDisabled'. I could not find the reason or which step causes it. I tried many times and got the same result.
[root@dhcp-140-66 ~]# oc get node
NAME              STATUS                        ROLES    AGE   VERSION
compute-0         NotReady,SchedulingDisabled   worker   47m   v1.14.6+82219910a
control-plane-0   NotReady,SchedulingDisabled   master   47m   v1.14.6+82219910a

When I change a node back to schedulable with `oc adm uncordon control-plane-0`, the OVN pods start running:

# oc get pod -n openshift-ovn-kubernetes
NAME                              READY   STATUS    RESTARTS   AGE
ovnkube-master-78c6798568-x9sbv   4/4     Running   2          56m
ovnkube-node-ggpxc                3/3     Running   13         58m
ovnkube-node-h5bt6                3/3     Running   13         58m

but the other cluster components still cannot start up:

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential                           4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
dns                                        4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
insights                                   4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m
kube-apiserver                             4.2.0-0.nightly-2019-09-08-180038   True        False         False      55m
kube-controller-manager                    4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
kube-scheduler                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
machine-api                                4.2.0-0.nightly-2019-09-08-180038   True        False         False      56m
machine-config                             4.2.0-0.nightly-2019-09-08-180038   False       True          False      56m
network                                    4.2.0-0.nightly-2019-09-08-180038   True        False         False      7m58s
openshift-apiserver                        4.2.0-0.nightly-2019-09-08-180038   False       False         False      55m
openshift-controller-manager                                                   False       True          False      56m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-08-180038   True        True          False      55m
operator-lifecycle-manager-packageserver                                       False       True          False      55m
service-ca                                 4.2.0-0.nightly-2019-09-08-180038   True        True          False      56m

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          58m     Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error has occurred

[root@dhcp-140-66 ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2019-09-09T02:01:50Z"
    generation: 1
    name: version
    resourceVersion: "11439"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: c96109c3-d2a5-11e9-86b6-0050568b99b8
  spec:
    channel: stable-4.2
    clusterID: df5120e1-96f0-408a-b518-8af75a89aa5b
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      status: "False"
      type: Available
    - lastTransitionTime: "2019-09-09T02:58:08Z"
      message: |-
        Multiple errors are preventing progress:
        * Could not update oauthclient "console" (263 of 416): the server does not recognize this resource, check extension API servers
        * Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (214 of 416): resource may have been deleted
        * Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (411 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (376 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (415 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-image-registry/image-registry" (382 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (393 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (397 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (401 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (152 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-machine-api/machine-api-operator" (403 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (405 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (385 of 416): the server does not recognize this resource, check extension API servers
        * Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (388 of 416): the server does not recognize this resource, check extension API servers
      reason: MultipleErrors
      status: "True"
      type: Failing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to apply 4.2.0-0.nightly-2019-09-08-180038: an unknown error has occurred'
      reason: MultipleErrors
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-09-09T02:02:06Z"
      message: 'Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-09-08-180038 not found in the "stable-4.2" channel'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      force: false
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      version: 4.2.0-0.nightly-2019-09-08-180038
    history:
    - completionTime: null
      image: registry.svc.ci.openshift.org/ocp/release@sha256:7862f9777e846c23fefeac77dc58c8107616acd65707c8437d06d29d2e4990ad
      startedTime: "2019-09-09T02:02:06Z"
      state: Partial
      verified: false
      version: 4.2.0-0.nightly-2019-09-08-180038
    observedGeneration: 1
    versionHash: MBWmxYuYaYQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I've filed the PR to open the ports for GCP and AWS IPI. This is not the same bug as whatever is wrong with vSphere - it doesn't have an intra-cluster firewall. Zhanqi, could you please re-test with vSphere and open a separate bug (and keep the cluster up - I don't have a vSphere cluster handy). Thanks.

Casey, I will file another bug for vSphere.

Assigning to Phil, who is actually working on this.

Created attachment 1616172 [details]
OVN_logs_GCP
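The ports PR mentioned above is not quoted in this bug. Purely as an illustration of the kind of firewall rule the installer needs to create, here is a dry-run sketch that composes (but does not execute) a `gcloud` command. The rule name, network name, and source CIDR are placeholders, and the port list is an assumption: 6641/6642 (OVN NB/SB databases), 9641/9642 (their SSL variants, which appear in the ovn-nbctl log later in this bug), and udp/6081 (the standard Geneve encapsulation port).

```shell
# Ports OVN needs open between cluster machines (assumed list, see above).
allow="tcp:6641-6642,tcp:9641-9642,udp:6081"

# Compose the hypothetical gcloud invocation; review it, then run it with
# eval "${cmd}" against a real project/network.
cmd="gcloud compute firewall-rules create allow-ovn-internal \
--network my-cluster-network --allow ${allow} --source-ranges 10.0.0.0/16"
echo "${cmd}"
```

The actual rules live in the installer/cluster-network-operator PRs referenced in this bug; this sketch only shows their shape.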
GCP is not a supported platform for OVN on 4.2. Bumping.

@zhaozhanqi comment #25: `oc get all --all-namespaces`. Everything that targets a master is Pending. Something is preventing pods from starting on the masters.

I can't access the cluster any more.

kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-0.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-1.c.openshift-qe.internal   1/1   Running   0    9h
kube-system                                  pod/gcp-routes-controller-zhaoov-9jl59-m-2.c.openshift-qe.internal   1/1   Running   0    9h
openshift-apiserver-operator                 pod/openshift-apiserver-operator-6f45554457-b576t                    0/1   Pending   0    9h
openshift-cloud-credential-operator          pod/cloud-credential-operator-7b4c65dbd5-kmrjf                       0/1   Pending   0    9h
openshift-cluster-machine-approver           pod/machine-approver-7bf6885dff-rp9hm                                0/1   Pending   0    9h
openshift-cluster-version                    pod/cluster-version-operator-bf9c75cc4-4djcs                         0/1   Pending   0    9h
openshift-controller-manager-operator        pod/openshift-controller-manager-operator-7c474d6cfc-wcvxf           0/1   Pending   0    9h
openshift-dns-operator                       pod/dns-operator-79dbd8d86f-v67xv                                    0/1   Pending   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-0.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-1.c.openshift-qe.internal             2/2   Running   0    9h
openshift-etcd                               pod/etcd-member-zhaoov-9jl59-m-2.c.openshift-qe.internal             2/2   Running   0    9h
openshift-insights                           pod/insights-operator-646489b44d-jcdp8                               0/1   Pending   0    9h
openshift-kube-apiserver-operator            pod/kube-apiserver-operator-65fc497c9-pbwp6                          0/1   Pending   0    9h
openshift-kube-controller-manager-operator   pod/kube-controller-manager-operator-7f65ffd9b9-66rth                0/1   Pending   0    9h
openshift-kube-scheduler-operator            pod/openshift-kube-scheduler-operator-75bd9d6b59-v5rcp               0/1   Pending   0    9h
openshift-machine-api                        pod/machine-api-operator-7f496594d4-mcxh2                            0/1   Pending   0    9h
openshift-machine-config-operator            pod/machine-config-operator-55f5c9d548-m9qrw                         0/1   Pending   0    9h
openshift-multus                             pod/multus-58wt5                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-snp52                                                     1/1   Running   59   9h
openshift-multus                             pod/multus-zxlcj                                                     1/1   Running   59   9h
openshift-network-operator                   pod/network-operator-74b8d64fc5-gncfc                                1/1   Running   1    9h
openshift-operator-lifecycle-manager         pod/catalog-operator-57b6884cd6-gqvg5                                0/1   Pending   0    9h
openshift-operator-lifecycle-manager         pod/olm-operator-7554464b74-kstts                                    0/1   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-master-644d65f44-zwmpt                                   0/4   Pending   0    9h
openshift-ovn-kubernetes                     pod/ovnkube-node-2gbxg                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-6qg4r                                               2/3   Running   94   9h
openshift-ovn-kubernetes                     pod/ovnkube-node-wfld8                                               2/3   Running   94   9h
openshift-service-ca-operator                pod/service-ca-operator-674ccdc57d-55cq7                             0/1   Pending   0    9h

@Phil, I am afraid it also failed on 4.3.0-0.nightly-2019-11-01-215341 with the same symptoms as in comment 37 and comment 38. Did it work in your local env?

*** Bug 1769136 has been marked as a duplicate of this bug. ***

Moving it to ASSIGNED state as per ongoing comments.

*** Bug 1774594 has been marked as a duplicate of this bug. ***

https://github.com/openshift/cluster-network-operator/pull/396/ changes `getent hosts` to `getent ahostsv4`. The previous code was grabbing IPv6 addresses and mishandling them. This version is IPv4 only.

Problem: oc commands don't work on 4.3 from a laptop, but do work when ssh'd to the bootstrap node. openshift-install-linux-4.4.0-0.ci-2019-11-25-114444.tar.gz works. OVN still doesn't come up, but we can once again work with debug images.
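The getent change in that PR is easy to see in isolation: the `hosts` database can return an IPv6 address (often first), while `ahostsv4` restricts resolution to IPv4. A quick local illustration (actual output depends on the resolver and /etc/hosts, so none is shown):

```shell
# 'getent hosts' may resolve a name to its IPv6 address, e.g. ::1 for
# localhost on many systems; this is what the old ovnkube code tripped over.
getent hosts localhost

# 'getent ahostsv4' forces IPv4-only results, which is the behaviour the
# cluster-network-operator fix switches to.
getent ahostsv4 localhost
```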
ovnkube-node fails with:

time="2019-11-25T16:08:05Z" level=error msg="Error while obtaining addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal on node pcamer-qkfnm-m-0.c.openshift-gce-devel.internal - Error while obtaining dynamic addresses for k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal: OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.6:9641,ssl:10.0.0.5:9641,ssl:10.0.0.3:9641 --timeout=15 get logical_switch_port k8s-pcamer-qkfnm-m-0.c.openshift-gce-devel.internal dynamic_addresses' failed: exit status 1"

`oc rsh` onto the container and the same command works. `oc delete` the pod and it comes back up correctly. This looks like a race. Continuing to debug...

Thank you, Phil, for the update.

Verified this bug on 4.4.0-0.nightly-2020-01-16-113546. OVN can be installed on a GCP cluster.

*** Bug 1745546 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
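The startup race described in the debugging comments above ("the nbctl command fails during pod startup but works moments later from rsh") is the kind of transient failure a bounded retry absorbs. A minimal shell sketch, assuming one wanted to wrap such a command manually; this is not the actual ovnkube fix, and the attempt count and delay are illustrative:

```shell
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD up to ATTEMPTS times, sleeping DELAY seconds between failures.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

Usage would look like `retry 5 3 ovn-nbctl --timeout=15 get logical_switch_port <port> dynamic_addresses` (the ovn-nbctl invocation here is copied from the log above, not a new recommendation).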