Bug 1875237 - [Upgrade][Vsphere]Upgrade from 4.5 latest nightly to 4.6 latest nightly failed, some containers exited on master node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Duplicates: 1872742
Depends On:
Blocks: 1865308
 
Reported: 2020-09-03 05:15 UTC by huirwang
Modified: 2021-04-05 17:47 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:37:26 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2113 0 None closed Bug 1875237: Add etcdDiscoveryDomain back to controllerconfig crd 2020-12-01 19:29:44 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:37:43 UTC

Description huirwang 2020-09-03 05:15:06 UTC
Description of problem:
Upgrade from the 4.5 latest nightly to the 4.6 latest nightly failed; the upgrade got stuck on machine config. Two nodes are in NotReady status.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-08-31-101523->4.6.0-0.nightly-2020-09-01-070508

How reproducible:

Steps to Reproduce:
1. Launch a disconnected vSphere cluster with 4.5.0-0.nightly-2020-08-31-101523

2. Mirror the images.
oc adm release mirror  --from=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-01-070508  --to=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp --to-release-image=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp:4.6.0-0.nightly-2020-09-01-070508

3. Upgrade to 4.6
oc adm upgrade --to-image=vcsa2-qe.mirror-registry.qe.devcluster.openshift.com:5000/huirwang/v4testing/ocp:4.6.0-0.nightly-2020-09-01-070508  --allow-explicit-upgrade --force=true

Actual Result:

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-01-070508   True        False         True       0s
cloud-credential                           4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
cluster-autoscaler                         4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
config-operator                            4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
console                                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      11h
dns                                        4.6.0-0.nightly-2020-09-01-070508   True        False         True       13h
etcd                                       4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
image-registry                             4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
ingress                                    4.6.0-0.nightly-2020-09-01-070508   True        False         True       13h
insights                                   4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
kube-apiserver                             4.6.0-0.nightly-2020-09-01-070508   True        True          True       17h
kube-controller-manager                    4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
kube-scheduler                             4.6.0-0.nightly-2020-09-01-070508   True        False         True       17h
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
machine-api                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
machine-approver                           4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
machine-config                             4.5.0-0.nightly-2020-08-31-101523   False       True          True       13h
marketplace                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
monitoring                                 4.6.0-0.nightly-2020-09-01-070508   False       True          True       12h
network                                    4.6.0-0.nightly-2020-09-01-070508   True        True          True       17h
node-tuning                                4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
openshift-apiserver                        4.6.0-0.nightly-2020-09-01-070508   False       False         True       2s
openshift-controller-manager               4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
openshift-samples                          4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-01-070508   True        False         False      5m48s
service-ca                                 4.6.0-0.nightly-2020-09-01-070508   True        False         False      17h
storage                                    4.6.0-0.nightly-2020-09-01-070508   True        False         False      13h

oc get nodes
NAME                             STATUS                        ROLES    AGE   VERSION
huir-upg523-ppjsk-master-0       NotReady,SchedulingDisabled   master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-master-1       Ready                         master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-master-2       Ready                         master   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-worker-5vwmd   NotReady,SchedulingDisabled   worker   18h   v1.18.3+6c42de8
huir-upg523-ppjsk-worker-9czwk   Ready                         worker   18h   v1.18.3+6c42de8

Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.574564     1521 kubelet.go:2286] node "huir-upg523-ppjsk-master-0" not found
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.674875     1521 kubelet.go:2286] node "huir-upg523-ppjsk-master-0" not found
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.751289     1521 webhook.go:111] Failed to make webhook authenticator request: Post   https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews:  dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on  10.0.0.51:53: no such host
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]: E0903 04:03:40.751463     1521 server.go:253] Unable to authenticate the request due to an error:  Post  https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews:  dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on  10.0.0.51:53: no such host
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]: I0903 04:03:40.751617     1521 httplog.go:90] verb="GET" URI="/metrics/cadvisor"  latency=3.841382ms resp=401 UserAgent="Prometheus/2.20.0"  srcIP="10.0.0.132:60922":
Sep 03 04:03:40 huir-upg523-ppjsk-master-0 hyperkube[1521]: goroutine 1015465 [running]:
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:  github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc0005f5dc0,  0x191)
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:          /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:225  +0xc8
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:  github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc0005f5dc0,  0x191)
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:          /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:204  +0x35
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:  github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(...)
Sep 03 04:03:40  huir-upg523-ppjsk-master-0 hyperkube[1521]:          /builddir/build/BUILD/openshift-git-0.8ee7b00/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/response.go:201

 crictl ps -a | grep coredns
da7a9c36d4922       bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0   44 seconds ago       Exited              coredns                                       187                 d801e376a643a
5ddb50f1883ff       db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago         Running             coredns-monitor                               0                   d801e376a643a

crictl ps -a | grep Exited 
cad54e2196d80       bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0   3 minutes ago       Exited              coredns                                       188                 d801e376a643a
d8e79f2b7beaa       d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   4 minutes ago       Exited              kube-scheduler                                187                 b861a400c64c6
fa344589836e1       db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago        Exited              render-config                                 0                   e87fef24086f5
b77302dba1235       c0e7cd8d00a12d79331047cc4df31d8fdc576c0ef1da15ae68dcae8b00be3cd8   16 hours ago        Exited              etcd-resources-copy                           0                   347a3a68ed79f
707c32b3fd319       d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   16 hours ago        Exited              wait-for-host-port                            0                   b861a400c64c6
39981efdd1abf       db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago        Exited              verify-hostname                               0                   e87fef24086f5
fb874d80591dc       d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a   16 hours ago        Exited              setup                                         0                   d014ac6c6ec15
945229a717162       db66f3ca0f13718bfbfcd2d57ac8f401707afa39168fde592edeae2219030ac2   16 hours ago        Exited              render-config                                 0                   d801e376a643a
eabaace5df9f0       c0e7cd8d00a12d79331047cc4df31d8fdc576c0ef1da15ae68dcae8b00be3cd8   16 hours ago        Exited              etcd-ensure-env-vars                          0                   347a3a68ed79f


Expected Result:
Upgrade succeeds.

Comment 1 Sunil Choudhary 2020-09-03 05:21:40 UTC
I enabled debug log level on CRI-O and only see the messages below in the CRI-O logs. I see coredns and kube-scheduler containers being created, started, and then removed.

From crio journal logs:
...
Sep 03 04:34:48 huir-upg523-ppjsk-master-0 systemd[1]: Starting MCO environment configuration...
Sep 03 04:34:48 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:48.999492236Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.000927063Z" level=info msg="Using conmon executable: /usr/libexec/crio/conmon"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.012879594Z" level=info msg="Conmon does support the --sync option"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.013009748Z" level=info msg="No seccomp profile specified, using the internal default"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.013022116Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.044833776Z" level=info msg="Found CNI network multus-cni-network (type=multus) at /etc/kubernetes/cni/net.d/00-multus.conf"
Sep 03 04:34:49 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:34:49.044900752Z" level=info msg="Update default CNI network name to multus-cni-network"
Sep 03 04:34:51 huir-upg523-ppjsk-master-0 systemd[1]: Started MCO environment configuration.
Sep 03 04:35:34 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:35:34.359808792Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d6146706df8aaa55895f138cd9b23f4c7e8ab1f65753c77c71dfdc3e68f2a726" id=032f816e-d561-48c0-8352-bb3076baa169 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:35:34 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:35:34.363673182Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:739921e736ad32eb15ff01515e373a62b160d739db8f2f006cfd261988757188,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d6146706df8aaa55895f138cd9b23f4c7e8ab1f65753c77c71dfdc3e68f2a726],Size_:254910365,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=032f816e-d561-48c0-8352-bb3076baa169 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.436475536Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d" id=5ffce620-46f1-416e-8aab-1fa72d60d2cc name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.446878296Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d],Size_:758781034,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=5ffce620-46f1-416e-8aab-1fa72d60d2cc name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.448881079Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d" id=49a419cf-363e-48f2-90f2-1c2031875a4a name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.451539887Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:d7fd98c644173d3d4a9175ec3e18b115400bd50ffa3ce882eac68dcb420d4b5a,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:48e2f3f64f28be7c89c278cc2b65d9efab682f5db691e6920f7be82e3ee0183d],Size_:758781034,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=49a419cf-363e-48f2-90f2-1c2031875a4a name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.457504475Z" level=info msg="Creating container: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=009c6798-c49c-49b0-ba22-32eac3ea6867 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.884010765Z" level=info msg="Created container bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=009c6798-c49c-49b0-ba22-32eac3ea6867 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.886506212Z" level=info msg="Starting container: bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d" id=28f06c46-4882-4378-b8a6-20ba053c4524 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:37:09 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:09.922614986Z" level=info msg="Started container bf9047af2c772019cd27d09d078e7de6ea8b4e49a27521fff45fc033d267877d: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=28f06c46-4882-4378-b8a6-20ba053c4524 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:37:12 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:12.124189874Z" level=info msg="Removing container: 90fba0fd63ce3326676a31d13b6505e81482ab75117228d88aec3c836ea1ef3f" id=2a527bc5-6e8a-4a01-862c-b20e39544ea3 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:37:12 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:37:12.266239536Z" level=info msg="Removed container 90fba0fd63ce3326676a31d13b6505e81482ab75117228d88aec3c836ea1ef3f: openshift-kube-scheduler/openshift-kube-scheduler-huir-upg523-ppjsk-master-0/kube-scheduler" id=2a527bc5-6e8a-4a01-862c-b20e39544ea3 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.438377666Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7" id=e58c2a5f-bd37-4bcb-a12d-c936f8ce7fd5 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.442539457Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7],Size_:309609322,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=e58c2a5f-bd37-4bcb-a12d-c936f8ce7fd5 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.445691422Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7" id=1ac320cf-b969-46ef-a7a2-bdc5de6729e3 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.448559814Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bc16542ca93086aca9716f54f5ab433c6890c7d530aafb312c01f66bbd19e7a0,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd3d0cb105c9909dccea7d01443299de00b8e2c890823c30b7e4a161f8f974f7],Size_:309609322,Uid:&Int64Value{Value:0,},Username:,},Info:map[string]string{},}" id=1ac320cf-b969-46ef-a7a2-bdc5de6729e3 name=/runtime.v1alpha2.ImageService/ImageStatus
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.452526050Z" level=info msg="Creating container: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=7b38756a-80cc-444d-8eda-d2e7fd125516 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.827877037Z" level=info msg="Created container 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=7b38756a-80cc-444d-8eda-d2e7fd125516 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.837175589Z" level=info msg="Starting container: 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf" id=5e9936ad-df44-4b22-b18b-97a5bda3aa4b name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:38:10 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:10.907719696Z" level=info msg="Started container 9f8d11d744911d562c574355320f33fe603ffe19427179378061f865a8119ddf: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=5e9936ad-df44-4b22-b18b-97a5bda3aa4b name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 03 04:38:11 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:11.559654900Z" level=info msg="Removing container: d797263bf4e49785dd87a410a81631605063045144c96181c95bdb27e47a2dee" id=c786e3ea-115f-4c17-97e3-f72c3dfe593e name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 03 04:38:11 huir-upg523-ppjsk-master-0 crio[1280851]: time="2020-09-03 04:38:11.790608945Z" level=info msg="Removed container d797263bf4e49785dd87a410a81631605063045144c96181c95bdb27e47a2dee: openshift-vsphere-infra/coredns-huir-upg523-ppjsk-master-0/coredns" id=c786e3ea-115f-4c17-97e3-f72c3dfe593e name=/runtime.v1alpha2.RuntimeService/RemoveContainer
...

Comment 2 Seth Jennings 2020-09-03 16:00:02 UTC
server.go:253] Unable to authenticate the request due to an error:  Post  https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/apis/authentication.k8s.io/v1/tokenreviews:  dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on  10.0.0.51:53: no such host

Looks like a disruption in DNS is leaving the kubelet unable to resolve the internal API load balancer name.
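
A minimal node-level spot check (hostname and nameserver IP taken from the kubelet errors above) would confirm whether the failure is at the host resolver that the kubelet relies on:

```
# Resolve the internal API name through the glibc resolver, which is roughly
# the lookup path the kubelet uses.
getent hosts api-int.huir-upg523.qe.devcluster.openshift.com

# Show the resolvers the node is currently configured with.
cat /etc/resolv.conf
```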

Comment 3 Seth Jennings 2020-09-03 16:10:33 UTC
Need must-gather to debug further

Comment 6 Seth Jennings 2020-09-04 15:00:34 UTC
Yeah, kubelet can't talk to the API server due to DNS failure

Sep 04 14:56:55 huir-upg523-ppjsk-master-0 hyperkube[1521]: E0904 14:56:55.304306    1521 kubelet_node_status.go:92] Unable to register node "huir-upg523-ppjsk-master-0" with API server: Post https://api-int.huir-upg523.qe.devcluster.openshift.com:6443/api/v1/nodes: dial tcp: lookup api-int.huir-upg523.qe.devcluster.openshift.com on 10.0.0.51:53: no such host

$ journalctl -u kubelet --since="1 hour ago" | grep "10.0.0.51:53: no such host" | wc -l
5944

Sending to Networking.

Comment 7 Seth Jennings 2020-09-04 15:08:50 UTC
[core@huir-upg523-ppjsk-master-0 ~]$ ping api-int.huir-upg523.qe.devcluster.openshift.com
ping: api-int.huir-upg523.qe.devcluster.openshift.com: Name or service not known

[core@huir-upg523-ppjsk-master-0 ~]$ host api-int.huir-upg523.qe.devcluster.openshift.com
host: parse of /etc/resolv.conf failed

[core@huir-upg523-ppjsk-master-0 ~]$ cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
search 
nameserver 10.0.0.162
nameserver 10.0.0.51
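
For reference, on the on-prem IPI platforms this file is rewritten by the "KNI resolv prepender" NetworkManager dispatcher script named in the header comment, and the empty "search" line above is likely what host refuses to parse. A rough sketch for inspecting the prepender on the node (the script name glob and grep string are assumptions based on that header comment and may differ by release):

```
# Locate and dump the dispatcher script (name glob is an assumption; adjust if needed).
ls /etc/NetworkManager/dispatcher.d/
cat /etc/NetworkManager/dispatcher.d/*resolv*prepender* 2>/dev/null

# See when resolv.conf was last rewritten (grep string guessed from the file header).
journalctl -b | grep -i "resolv prepender" | tail -n 20
```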

Comment 8 jima 2020-09-08 07:22:03 UTC
The issue is reproduced when upgrading from 4.5.0-0.nightly-2020-09-07-062006 to 4.6.0-0.nightly-2020-09-07-191134 on vSphere IPI.

I found that coredns could not be started; the error shown in the coredns container log is:
[root@jima4508-lh5qg-master-1 ~]# crictl ps -a
CONTAINER           IMAGE                                                              CREATED             STATE               NAME                                          ATTEMPT             POD ID
0690cb6d59479       229dbdee63b061c21623adb8ae30dad4ad25ddf8f48b3ffca0509a95bf9b0d13   30 seconds ago      Exited              coredns                                       18                  d4807f530d630
...
[root@jima4508-lh5qg-master-1 ~]# crictl logs 0690cb6d59479
plugin/mdns: Invalid minSRV: strconv.Atoi: parsing "jima4508": invalid syntax

Comment 9 sdasu 2020-09-08 16:30:49 UTC
Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1872742. Could this be an additional issue in mDNS on top of that one?

Comment 10 jima 2020-09-09 05:30:25 UTC
The issue is not only reproduced in a disconnected environment; it can also be reproduced when upgrading from 4.5 to 4.6 on a regular cluster installed with vSphere IPI.

When the machine-config CO is upgrading, the first two nodes (one master and one worker) are restarted, and then both of them are detected in NotReady status.
Logging in to a node, I found that coredns is in Exited state and is restarted repeatedly. Checking the log shows the error below:
[root@jima4509-x95qh-worker-d9dl9 ~]# crictl logs f974663dcbc8c
plugin/mdns: Invalid minSRV: strconv.Atoi: parsing "jima4509": invalid syntax

According to the doc https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md#api-int-hostname-resolution, coredns is responsible for resolving api-int, but it cannot start now.

Comment 11 Ben Nemec 2020-09-15 17:08:17 UTC
It looks like something went wrong while rendering the coredns configuration. Can you upload the contents of /etc/coredns/Corefile here?
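
A sketch of how to collect it, using the node names from the earlier output (oc debug only works if the kubelet can still start pods; otherwise grab it over SSH):

```
# Via a debug pod, if the node will still schedule one:
oc debug node/huir-upg523-ppjsk-master-0 -- chroot /host cat /etc/coredns/Corefile

# Or directly on the node:
sudo cat /etc/coredns/Corefile
```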

Comment 13 Ben Nemec 2020-09-16 16:32:58 UTC
Okay, looks like EtcdDiscoveryDomain is empty on the failing node for some reason. I'm not sure how that could happen. I'll check with the MCO team and see if they have any ideas.
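
For anyone else checking, a quick comparison of what the controller is rendering from versus the cluster's base domain (the object name and field paths below assume the 4.5-era ControllerConfig schema):

```
# Value the machine-config-controller feeds into the on-prem templates;
# an empty result here would explain the broken Corefile.
oc get controllerconfig machine-config-controller \
  -o jsonpath='{.spec.etcdDiscoveryDomain}{"\n"}'

# Cluster base domain, for comparison.
oc get dns.config.openshift.io cluster -o jsonpath='{.spec.baseDomain}{"\n"}'
```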

Comment 16 Kirsten Garrison 2020-09-17 21:11:36 UTC
I attempted to check this on AWS clusters via 4.5->4.6 upgrades, and they correctly populate EtcdDiscoveryDomain in the ControllerConfig post-upgrade, so I'm not sure whether this is a platform template issue or something else.

Further, with so many degraded operators (dns, auth, apiserver, etc.), I would certainly expect the MCO to also be degraded during this upgrade.

Has this been reproduced on other vsphere clusters? If reproducible, can we figure out when this stopped working?

Comment 17 Kirsten Garrison 2020-09-18 00:25:24 UTC
So out of curiosity I tried this upgrade on cluster-bot as metal. The upgrade looked very similar to the original post above, with the operators going on and off degraded, but the worker pool progressed slowly and then the masters all upgraded, slowly resulting in a successful upgrade from 4.5.10 -> 4.6.0-0.nightly-2020-09-17-152859:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-17-152859   True        False         4m2s    Cluster version is 4.6.0-0.nightly-2020-09-17-152859
$ oc get nodes
NAME                                                             STATUS   ROLES    AGE    VERSION
master-0.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   132m   v1.19.0+b4ffb45
master-1.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   131m   v1.19.0+b4ffb45
master-2.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    master   131m   v1.19.0+b4ffb45
worker-0.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   122m   v1.19.0+b4ffb45
worker-1.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   122m   v1.19.0+b4ffb45
worker-2.ci-ln-bijm4ik-86010.origin-ci-int-aws.dev.rhcloud.com   Ready    worker   123m   v1.19.0+b4ffb45


It seemed slower than a normal aws upgrade, but I'm not an expert in metal clusters.

I did also notice an error in the MCD logs: 

I0917 23:32:10.685683    2940 daemon.go:883] systemctl --failed:
 loaded failed failed Network Manager Wait On

Not sure if that's a metal bug (I don't see it in AWS runs) or due to this being cluster-bot, but it looks related to: https://github.com/openshift/machine-config-operator/commit/0704ea85bfd2ac3b6be2653b902b3d1d849658b1

Comment 18 Kirsten Garrison 2020-09-18 00:42:25 UTC
As another note, when talking to @Jinyun we saw:

```
# ./oc logs -n openshift-machine-config-operator machine-config-daemon-f7frm -c machine-config-daemon
Error from server: Get "https://136.144.52.250:10250/containerLogs/openshift-machine-config-operator/machine-config-daemon-f7frm/machine-config-daemon": x509: certificate signed by unknown authority
```

# ./oc get csr
No resources found

Comment 19 Kirsten Garrison 2020-09-18 01:17:33 UTC
Not sure if it's of any value, but this is the clusterbot metal 4.5->4.6 upgrade I did which did indeed pass: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-metal/1306706449262448640/

I've reached the end of my investigation and will step back to let the vSphere-knowledgeable devs look further into this.

Comment 20 Ben Nemec 2020-09-18 22:07:06 UTC
I was able to reproduce this on a baremetal cluster, so it looks like it's not specific to vsphere. I believe Kirsten's cluster was UPI baremetal, which doesn't work the same way as IPI baremetal.

I don't have a solution to the problem yet, but I think we've made some progress on understanding what is going on. It looks like EtcdDiscoveryDomain is missing from the controllerconfigs after the upgrade, and because the MCO is one of the last things to upgrade, it re-renders the 4.5 templates (which still reference that field) with the new controllerconfig, leaving us with this empty-domain problem.
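
That would also be consistent with the strconv.Atoi error in comments 8 and 10: if the domain renders as an empty string, the remaining arguments on the mdns plugin line in the Corefile shift left and the cluster name lands in the integer minSRV position. If the fix ends up restoring the field in the controllerconfig CRD, a rough schema-level smoke test (the exact path inside the CRD schema may vary by release) would be:

```
# Confirm the CRD schema carries etcdDiscoveryDomain again after upgrading the MCO.
oc get crd controllerconfigs.machineconfiguration.openshift.io -o yaml \
  | grep -i etcddiscoverydomain
```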

I'm out of time today, but we'll pick this up again on Monday.

Comment 21 Brad P. Crochet 2020-09-22 16:51:09 UTC
*** Bug 1872742 has been marked as a duplicate of this bug. ***

Comment 23 jima 2020-09-24 10:55:10 UTC
Upgrade on vSphere IPI from 4.5.0-0.nightly-2020-09-20-185910 to 4.6.0-0.nightly-2020-09-24-030538 is successful, so moving the bug to VERIFIED.

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-20-185910   True        True          9s      Working towards 4.6.0-0.nightly-2020-09-24-030538: 0% complete

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-24-030538   True        False         5m44s   Cluster version is 4.6.0-0.nightly-2020-09-24-030538

# ./oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
jima4524-nnw9r-master-0       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-master-1       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-master-2       Ready    master   113m   v1.19.0+8a39924
jima4524-nnw9r-worker-8z5j9   Ready    worker   102m   v1.19.0+8a39924
jima4524-nnw9r-worker-dwfdl   Ready    worker   101m   v1.19.0+8a39924
jima4524-nnw9r-worker-hfr6g   Ready    worker   102m   v1.19.0+8a39924

Comment 26 errata-xmlrpc 2020-10-27 16:37:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 27 W. Trevor King 2021-04-05 17:47:11 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

