Description of problem:
Upgrade from the latest 4.5 nightly to the latest 4.6 nightly failed; the upgrade got stuck on machine-config. Two nodes are in NotReady status.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-08-31-101523 -> 4.6.0-0.nightly-2020-09-01-070508

How reproducible:

Steps to Reproduce:
1. Launch a disconnected vSphere cluster with 4.5.0-0.nightly-2020-08-31-101523
2. Mirror the images:
oc adm release mirror --from=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-01-070508 --to=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp --to-release-image=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp:4.6.0-0.nightly-2020-09-01-070508
3. Upgrade to 4.6:
oc adm upgrade --to-image=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp:4.6.0-0.nightly-2020-09-01-070508 --allow-explicit-upgrade --force=true

Actual Result:
oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-01-070508   True        False         True       0s
cloud-credential                           4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
cluster-autoscaler                         4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
config-operator                            4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
console                                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      11h
dns                                        4.6.0-0.nightly-2020-09-01-070508   True        False         True       13h
etcd                                       4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
image-registry                             4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
ingress                                    4.6.0-0.nightly-2020-09-01-070508   True        False         True       13h
insights                                   4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
kube-apiserver                             4.6.0-0.nightly-2020-09-01-070508   True        True          True       17h
kube-controller-manager                    4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
kube-scheduler                             4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
machine-api                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
machine-approver                           4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
machine-config                             4.5.0-0.nightly-2020-08-31-101523   False       True          True       13h
marketplace                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
monitoring                                 4.6.0-0.nightly-2020-09-01-070508   False       True          True       12h
network                                    4.6.0-0.nightly-2020-09-01-070508   True        True          True       17h
node-tuning                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
openshift-apiserver                        4.6.0-0.nightly-2020-09-01-070508   False       False         True       2s
openshift-controller-manager               4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
openshift-samples                          4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-01-070508   True        False         False      5m48s
service-ca                                 4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
storage                                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h

oc get nodes
NAME                             STATUS                        ROLES    AGE   VERSION
huir-upg523-ppjsk-master-0       NotReady,SchedulingDisabled   master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-master-1       Ready                         master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-master-2       Ready                         master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-worker-5vwmd   NotReady,SchedulingDisabled   worker   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-worker-9czwk   Ready                         worker   18h   v1.18.3+6c42de8
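For anyone retracing the steps above, where the upgrade stalls can be watched with the usual commands; this is only a sketch (the pool names are the defaults, not anything specific to this cluster):

```
# Overall upgrade progress
oc get clusterversion

# Which operators are Degraded/Progressing (machine-config is the one stuck here)
oc get co

# Machine config pool rollout; a pool stuck mid-update with NotReady nodes
# matches the symptom in this report
oc get mcp master worker
```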
Kubelet journal logs from the NotReady node huir-upg523-ppjsk-master-0:

Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.574564 1521 kubelet.go:2286] node "huir-upg523-ppjsk-master-0" not found
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.674875 1521 kubelet.go:2286] node "huir-upg523-ppjsk-master-0" not found
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.751289 1521 webhook.go:111] Failed to make webhook authenticator request: Post https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on 10.0.0.51:53: no such host
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.751463 1521 server.go:253] Unable to authenticate the request due to an error: Post https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on 10.0.0.51:53: no such host
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: I0903 04:03:40.751617 1521 httplog.go:90] verb="GET" URI="/metrics/cadvisor" latency=3.841382ms resp=401 UserAgent="Prometheus/2.20.0" srcIP="10.0.0.132:60922":
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: goroutine 1015465 [running]:
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc0005f5dc0, 0x191)
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:225 +0xc8
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc0005f5dc0, 0x191)
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:204 +0x35
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(...)
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:201

crictl ps -a | grep coredns
da7a9c36d4922   bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0   44 seconds ago   Exited    coredns           187   d801e376a643a
5ddb50f1883ff   db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago     Running   coredns-monitor   0     d801e376a643a

crictl ps -a | grep Exited
cad54e2196d80   bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0   3 minutes ago   Exited   coredns                188   d801e376a643a
d8e79f2b7beaa   d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   4 minutes ago   Exited   kube-scheduler         187   b861a400c64c6
fa344589836e1   db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago    Exited   render-config          0     e87fef24086f5
b77302dba1235   c0e7cd8d00a12d79331047cc4df31d8fdc576c0ef1da15ae68dcae8b00be3cd8   16 hours ago    Exited   etcd-resources-copy    0     347a3a68ed79f
707c32b3fd319   d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   16 hours ago    Exited   wait-for-host-port     0     b861a400c64c6
39981efdd1abf   db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago    Exited   verify-hostname        0     e87fef24086f5
fb874d80591dc   d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   16 hours ago    Exited   setup                  0     d014ac6c6ec15
945229a717162   db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago    Exited   render-config          0     d801e376a643a
eabaace5df9f0   c0e7cd8d00a12d79331047cc4df31d8fdc576c0ef1da15ae68dcae8b00be3cd8   16 hours ago    Exited   etcd-ensure-env-vars   0     347a3a68ed79f

Expected Result:
Upgrade completes successfully.
I enabled debug log level on crio and only see the messages below in the crio logs. I see the coredns and kube-scheduler containers being created, started, and then removed. From the crio journal logs:

...
Sep 03 04:34:48 huir-upg523-ppjsk-master-0 systemd[1]: Starting MCO environment configuration...
Sep 03 04:34:48 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:48.999492236Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.000927063Z" level=info msg="Using conmon executable: /usr/libexec/crio/conmon"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.012879594Z" level=info msg="Conmon does support the --sync option"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.013009748Z" level=info msg="No seccomp profile specified, using the internal default"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.013022116Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.044833776Z" level=info msg="Found CNI network multus-cni-network (type=multus) at /etc/kubernetes/cni/net.d/00-multus.conf"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.044900752Z" level=info msg="Update default CNI network name to multus-cni-network"
Sep 03 04:34:51 huir-upg523-ppjsk-master-0 systemd[1]: Started MCO environment configuration.
Sep 03 04:35:34 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:35:34.359808792Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d6146706df8aaa55895f138cd9b23f4c7e8ab1f65753c77c71dfdc3e68f2a726" id=032f816e-d561-48c0-8352-bb3076baa169 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:35:34 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:35:34.363673182Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:739921e736ad32eb15ff01515e373a62b160d739db8f2f006cfd261988757188,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d6146706df8aaa55895f138cd9b23f4c7e8ab1f65753c77c71dfdc3e68f2a726],Size_:254910365,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=032f816e-d561-48c0-8352-bb3076baa169 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.436475536Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d" id=5ffce620-46f1-416e-8aab-1fa72d60d2cc name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.446878296Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d],Size_:758781034,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=5ffce620-46f1-416e-8aab-1fa72d60d2cc name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.448881079Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d" id=49a419cf-363e-48f2-90f2-1c2031875a4a name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.451539887Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d],Size_:758781034,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=49a419cf-363e-48f2-90f2-1c2031875a4a name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.457504475Z" level=info msg="Creating container: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=009c6798-c49c-49b0-ba22-32eac3ea6867 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.884010765Z" level=info msg="Created container bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=009c6798-c49c-49b0-ba22-32eac3ea6867 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.886506212Z" level=info msg="Starting container: bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d" id=28f06c46-4882-4378-b8a6-20ba053c4524 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.922614986Z" level=info msg="Started container bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=28f06c46-4882-4378-b8a6-20ba053c4524 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:37:12 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:12.124189874Z" level=info msg="Removing container: 90fba0fd63ce3326676a31d13b6505e81482ab75117228d88aec3c836ea1ef3f" id=2a527bc5-6e8a-4a01-862c-b20e39544ea3 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:37:12 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:12.266239536Z" level=info msg="Removed container 90fba0fd63ce3326676a31d13b6505e81482ab75117228d88aec3c836ea1ef3f: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=2a527bc5-6e8a-4a01-862c-b20e39544ea3 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.438377666Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7" id=e58c2a5f-bd37-4bcb-a12d-c936f8ce7fd5 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.442539457Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7],Size_:309609322,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=e58c2a5f-bd37-4bcb-a12d-c936f8ce7fd5 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.445691422Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7" id=1ac320cf-b969-46ef-a7a2-bdc5de6729e3 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.448559814Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7],Size_:309609322,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=1ac320cf-b969-46ef-a7a2-bdc5de6729e3 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.452526050Z" level=info msg="Creating container: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=7b38756a-80cc-444d-8eda-d2e7fd125516 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.827877037Z" level=info msg="Created container 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=7b38756a-80cc-444d-8eda-d2e7fd125516 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.837175589Z" level=info msg="Starting container: 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf" id=5e9936ad-df44-4b22-b18b-97a5bda3aa4b name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.907719696Z" level=info msg="Started container 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=5e9936ad-df44-4b22-b18b-97a5bda3aa4b name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:38:11 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:11.559654900Z" level=info msg="Removing container: d797263bf4e49785dd87a410a81631605063045144c96181c95bdb27e47a2dee" id=c786e3ea-115f-4c17-97e3-f72c3dfe593e name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:38:11 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:11.790608945Z" level=info msg="Removed container d797263bf4e49785dd87a410a81631605063045144c96181c95bdb27e47a2dee: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=c786e3ea-115f-4c17-97e3-f72c3dfe593e name=/runtime.v1alpha2.RuntimeService/RemoveContainer
...
server.go:253] Unable to authenticate the request due to an error: Post https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on 10.0.0.51:53: no such host

It looks like a disruption in DNS is leaving the kubelet unable to resolve the internal API load-balancer name.
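A quick way to confirm that hypothesis from the affected node is to resolve the api-int name against the resolver listed in /etc/resolv.conf and to check whether the node-local CoreDNS static pod is actually running; a sketch, using the hostname and resolver IP taken from the log above:

```
# Ask the upstream resolver from resolv.conf directly (this is the lookup the kubelet is failing)
nslookup api-int.huir-upg523.qe.devcluster.openshift.com 10.0.0.51

# On IPI vSphere/baremetal the api-int name is served by a CoreDNS static pod on the node;
# if that pod is not running, local resolution of api-int breaks
sudo crictl ps -a | grep coredns
```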
Need must-gather to debug further
Yeah, kubelet can't talk to the API server due to DNS failure:

Sep 04 14:56:55 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0904 14:56:55.304306 1521 kubelet_node_status.go:92] Unable to register node "huir-upg523-ppjsk-master-0" with API server: Post https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/api/v1/nodes: dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on 10.0.0.51:53: no such host

$ journalctl -u kubelet --since="1 hour ago" | grep "10.0.0.51:53: no such host" | wc -l
5944

Sending to Networking.
[core@huir-upg523-ppjsk-master-0 ~]$ ping api-int.huir-upg523.qe.devcluster.openshift.com
ping: api-int.huir-upg523.qe.devcluster.openshift.com: Name or service not known
[core@huir-upg523-ppjsk-master-0 ~]$ host api-int.huir-upg523.qe.devcluster.openshift.com
host: parse of /etc/resolv.conf failed
[core@huir-upg523-ppjsk-master-0 ~]$ cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
search
nameserver 10.0.0.162
nameserver 10.0.0.51

Note the bare "search" line with no domain, which is presumably what makes host fail to parse the file.
The issue is reproduced when upgrading from 4.5.0-0.nightly-2020-09-07-062006 to 4.6.0-0.nightly-2020-09-07-191134 on IPI on vSphere. I found that coredns could not be started; the error shown in the coredns container log is:

[root@jima4508-lh5qg-master-1 ~]# crictl ps -a
CONTAINER       IMAGE                                                              CREATED          STATE    NAME      ATTEMPT   POD ID
0690cb6d59479   229dbdee63b061c21623adb8ae30dad4ad25ddf8f48b3ffca0509a95bf9b0d13   30 seconds ago   Exited   coredns   18        d4807f530d630
...
[root@jima4508-lh5qg-master-1 ~]# crictl logs 0690cb6d59479
plugin/mdns: Invalid minSRV: strconv.Atoi: parsing "jima4508": invalid syntax
Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1872742. Could be an additional issue on top of that in mDNS?
The issue is not only reproduced in a disconnected environment; it can also be reproduced when upgrading from 4.5 to 4.6 on a regular cluster installed with IPI on vSphere. When the machine-config CO upgrades, the first two nodes (one master and one worker) are restarted, and then both of them are detected in NotReady status. Logging in to a node shows that coredns is in the Exited state and is restarted repeatedly. Checking the log gives the error below:

[root@jima4509-x95qh-worker-d9dl9 ~]# crictl logs f974663dcbc8c
plugin/mdns: Invalid minSRV: strconv.Atoi: parsing "jima4509": invalid syntax

According to the doc https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md#api-int-hostname-resolution, coredns resolves api-int, but it cannot start now.
It looks like something went wrong while rendering the coredns configuration. Can you upload the contents of /etc/coredns/Corefile here?
Okay, looks like EtcdDiscoveryDomain is empty on the failing node for some reason. I'm not sure how that could happen. I'll check with the MCO team and see if they have any ideas.
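For reference, this is roughly how the empty value can be confirmed from both sides; a sketch only, assuming the usual ControllerConfig singleton name and the Corefile path mentioned above:

```
# On the failing node: the rendered config that CoreDNS is choking on
cat /etc/coredns/Corefile

# From a working client: does the post-upgrade ControllerConfig still carry the field
# the 4.5 templates expect? (field name as used by the 4.5-era MCO; empty/absent output
# would match the symptom)
oc get controllerconfig machine-config-controller -o jsonpath='{.spec.etcdDiscoveryDomain}{"\n"}'
```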
I attempted to check this on AWS clusters via 4.5->4.6 upgrades, and they correctly populate EtcdDiscoveryDomain in the ControllerConfig post-upgrade, so I'm not sure whether this is a platform template issue or something else. Further, with so many degraded operators (dns, auth, apiserver, etc.) I would certainly expect the MCO to also be degraded during this upgrade. Has this been reproduced on other vSphere clusters? If it is reproducible, can we figure out when this stopped working?
So out of curiosity I tried this upgrade on cluster-bot as metal. The upgrade looked very similar to the original post above, with the operators going on and off degraded, but the worker pool progressed slowly and then the masters all upgraded (slowly), resulting in a successful upgrade from 4.5.10 -> 4.6.0-0.nightly-2020-09-17-152859:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-17-152859   True        False         4m2s    Cluster version is 4.6.0-0.nightly-2020-09-17-152859

$ oc get nodes
NAME                                                             STATUS   ROLES    AGE    VERSION
master-0.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   132m   v1.19.0+b4ffb45
master-1.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   131m   v1.19.0+b4ffb45
master-2.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   131m   v1.19.0+b4ffb45
worker-0.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   122m   v1.19.0+b4ffb45
worker-1.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   122m   v1.19.0+b4ffb45
worker-2.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   123m   v1.19.0+b4ffb45

It seemed slower than a normal AWS upgrade, but I'm not an expert in metal clusters. I did also notice an error in the MCD logs:

I0917 23:32:10.685683 2940 daemon.go:883] systemctl --failed: loaded failed failed Network Manager Wait On

Not sure if that's a metal bug (I don't see it in AWS runs) or due to this being cluster-bot, but it looks related to: https://github.com/openshift/machine-config-operator/commit/0704ea85bfd2ac3b6be2653b902b3d1d849658b1
As another note, when talking to @Jinyun we saw:

```
# ./oc logs -n openshift-machine-config-operator machine-config-daemon-f7frm -c machine-config-daemon
Error from server: Get "https://136.144.52.250:10250/containerLogs/openshift-machine-config-operator/machine-config-daemon-f7frm/machine-config-daemon": x509: certificate signed by unknown authority
```

# ./oc get csr
No resources found
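For completeness, when node-serving certificates look stale the usual check is for pending CSRs and approving them; none were pending here ("No resources found"), so it didn't apply, but the sketch would be:

```
# List certificate signing requests; Pending ones for the affected node would need approval
oc get csr

# Approve a pending CSR (the name is a placeholder)
oc adm certificate approve <csr-name>
```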
Not sure if it's of any value, but this is the clusterbot metal 4.5->4.6 upgrade I did which did indeed pass: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1306706449262448640/ I reached the end of my investigations and will step back to let the vsphere knowledgeable devs look further into this.
I was able to reproduce this on a baremetal cluster, so it looks like it's not specific to vsphere. I believe Kirsten's cluster was UPI baremetal, which doesn't work the same way as IPI baremetal. I don't have a solution to the problem yet, but I think we've made some progress on understanding what is going on. It looks like EtcdDiscoveryDomain is missing from the controllerconfigs after upgrade, but because MCO is one of the last things to upgrade it tries to re-render the 4.5 templates with the new controllerconfig and that leaves us with this empty domain problem. I'm out of time today, but we'll pick this up again on Monday.
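In other words, the working theory is a version skew inside the MCO during the upgrade: the 4.5 on-prem templates still reference EtcdDiscoveryDomain, but that field is missing from the upgraded controllerconfig, so the re-rendered CoreDNS config ends up with an empty argument; presumably that shift is why the mdns plugin ends up trying to parse the cluster name as the numeric minSRV value. A rough way to see what the node is actually being given, as a sketch using standard resource names:

```
# Which rendered config the pool is trying to roll out
oc get mcp master -o jsonpath='{.status.configuration.name}{"\n"}'

# The Corefile is delivered as a file inside that rendered MachineConfig; dumping it and
# searching for the coredns entries shows what the 4.5 controller rendered against the
# 4.6 controllerconfig
oc get machineconfig $(oc get mcp master -o jsonpath='{.status.configuration.name}') -o yaml | grep -n 'coredns'
```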
*** Bug 1872742 has been marked as a duplicate of this bug. ***
Upgrade on IPI on vSphere from 4.5.0-0.nightly-2020-09-20-185910 to 4.6.0-0.nightly-2020-09-24-030538 is successful, so moving the bug to VERIFIED.

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-20-185910   True        True          9s      Working towards 4.6.0-0.nightly-2020-09-24-030538: 0% complete

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-24-030538   True        False         5m44s   Cluster version is 4.6.0-0.nightly-2020-09-24-030538

# ./oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
jima4524-nnw9r-master-0       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-master-1       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-master-2       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-worker-8z5j9   Ready    worker   102m   v1.19.0+8a39924
jima4524-nnw9r-worker-dwfdl   Ready    worker   101m   v1.19.0+8a39924
jima4524-nnw9r-worker-hfr6g   Ready    worker   102m   v1.19.0+8a39924
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475