Description of problem:
[2021-01-12T04:50:31.078Z] Message: Unable to apply 4.7.0-0.nightly-2021-01-10-070949: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-generated-kubelet expected ad33d39ebe36919d96466d426145ff7aa722574f has eab9c35dfbeb0d21be6e1db3887acbbb93592d34: all 3 nodes are at latest configuration rendered-master-49261a4c52aacdde78ed360503334d63, retrying

Version-Release number of selected component (if applicable):
4.6.10-x86_64 -> 4.7.0-fc.2-x86_64
Platform: Disconnected UPI on Azure with RHCOS & Private Cluster
4.6.10-x86_64 -> 4.7.0-0.nightly-2021-01-10-070949
Platform: IPI on Azure & fully private

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.6.10-x86_64.
2. Upgrade it to 4.7.0-0.nightly-2021-01-10-070949 or 4.7.0-fc.2-x86_64 (force upgrade)

Actual results:
Both failed. machine-config is still at version 4.6.10.
[2021-01-12T04:50:29.163Z] machine-approver 4.7.0-0.nightly-2021-01-10-070949 True False False 4h
[2021-01-12T04:50:29.163Z] machine-config 4.6.10 False True True 146m
[2021-01-12T04:50:29.163Z] marketplace 4.7.0-0.nightly-2021-01-10-070949 True False False 129m
...
[2021-01-12T04:50:31.078Z] Name: machine-config
[2021-01-12T04:50:31.078Z] Namespace:
[2021-01-12T04:50:31.078Z] Labels: <none>
[2021-01-12T04:50:31.078Z] Annotations: exclude.release.openshift.io/internal-openshift-hosted: true
[2021-01-12T04:50:31.078Z] API Version: config.openshift.io/v1
[2021-01-12T04:50:31.078Z] Kind: ClusterOperator
...
...
[2021-01-12T04:50:31.078Z] Message: Unable to apply 4.7.0-0.nightly-2021-01-10-070949: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-generated-kubelet expected ad33d39ebe36919d96466d426145ff7aa722574f has eab9c35dfbeb0d21be6e1db3887acbbb93592d34: all 3 nodes are at latest configuration rendered-master-49261a4c52aacdde78ed360503334d63, retrying
...

Expected results:
Upgrade to 4.7 successfully.

Additional info:
Related logs:
1, http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8979/console
2, http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8984/console
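For triage, the two hashes in the status message above can be pulled out mechanically. The following is only a sketch: the regular expression is inferred from the single message quoted in this report, and `parse_mismatch` is a hypothetical helper name, not part of any OpenShift tooling. Reading "expected" as the controller version the operator is waiting for and "has" as the version recorded on the generated config is likewise an interpretation of the message text.

```python
import re

# Pattern inferred from the one message quoted in this report (an assumption,
# not a documented format).
MISMATCH_RE = re.compile(
    r"controller version mismatch for (?P<config>\S+) "
    r"expected (?P<expected>[0-9a-f]+) has (?P<actual>[0-9a-f]+)"
)


def parse_mismatch(message):
    """Return (config, expected, actual), or None if no mismatch is reported."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return match.group("config"), match.group("expected"), match.group("actual")


# The status message from the description, copied verbatim.
message = (
    "Unable to apply 4.7.0-0.nightly-2021-01-10-070949: timed out waiting for "
    "the condition during syncRequiredMachineConfigPools: pool master has not "
    "progressed to latest configuration: controller version mismatch for "
    "99-master-generated-kubelet expected "
    "ad33d39ebe36919d96466d426145ff7aa722574f has "
    "eab9c35dfbeb0d21be6e1db3887acbbb93592d34: all 3 nodes are at latest "
    "configuration rendered-master-49261a4c52aacdde78ed360503334d63, retrying"
)

result = parse_mismatch(message)
if result is not None:
    config, expected, actual = result
    print(f"{config}: expected {expected[:7]}, has {actual[:7]}")
```

Run against the message in this report, the helper names 99-master-generated-kubelet as the config carrying the stale hash, which is the object to inspect first.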
Created attachment 1746838 [details] Must Gather I was able to get the must-gather from the creds in Comment #1
Have you seen this issue on other nightlies, or only on 4.7.0-0.nightly-2021-01-10-070949?
The error is percolated up from the kubelet: https://github.com/openshift/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L571-L581 Kicking this over to the node team based on the error.
Hit it again when upgrading from 4.6.0-0.nightly-2021-01-13-215839 to 4.7.0-0.nightly-2021-01-13-124141. Adding the test blocker keyword, since it's blocking the 4.6->4.7 upgrade test.
The error "CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" is due to the kubelet not being connected to the API server.

Jan 12 12:07:18.165564 Get "https://api-int.wsun124706.qe.gcp.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/wsun12-tf9jr-m-1.c.openshift-qe.internal?timeout=10s": dial tcp 34.68.182.209:6443: connect:

openshift-apiserver-operator-77c97c89bf-xrdlk/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log: lots and lots of throttled requests are taking 1.3s+

The etcd-quorum-guard being stopped looks extremely suspicious:

2:21.647859 wsun12-tf9jr-m-0.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:22:21.647784830Z" level=info msg="Stopped container 6cc17d4cb4206ef7c80357d5de0fab62e5e9938195ce76010de71cf0513b61e0: openshift-etcd/etcd-quorum-guard-67cbf954d4-kd29f/guard" id=e5860997-10ca-4f1b-9369-dcae55d701c7 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:38.983587 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:38.983498054Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=9479c621-d1af-409c-b581-6a46bd83af33 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:40.531898 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:40.531823640Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=06128c52-7742-41e0-bdf1-3c18248898bb name=/runtime.v1alpha2.RuntimeService/StopContainer
3:05.504527 wsun12-tf9jr-m-1.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:23:05.504405526Z" level=info msg="Stopped container c61268ef524c11f1ee50e93497ceea67e2b0317f3d9afd55b1a0c141564b547a: openshift-etcd/etcd-quorum-guard-67cbf954d4-9p4wf/guard" id=a0d341b7-0057-4e5e-9f59-a14a26775ba8 name=/runtime.v1alpha2.RuntimeService/StopContainer
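To illustrate why a kubelet that cannot reach the API server produces this particular error, here is a simplified Python model of the check, as I understand the linked kubelet_pods.go code. It is not the kubelet's Go implementation: `Pod`, `ServiceEnvError`, and `make_service_env` are hypothetical names, and the service address in the usage example is a placeholder. The idea is that pods coming from the API are refused service environment variables until the service list has been read at least once, while static pods are exempt.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    static: bool  # True for pods defined by on-disk manifests, not the API


class ServiceEnvError(Exception):
    """Stands in for the failure surfaced as CreateContainerConfigError."""


def make_service_env(pod, services_synced, services):
    """Build service env vars, refusing API pods until services have synced.

    services maps a service name to a (cluster_ip, port) tuple.
    """
    if not pod.static and not services_synced:
        # With no API connection the service list never syncs, so every
        # non-static pod keeps failing at this point.
        raise ServiceEnvError(
            "services have not yet been read at least once, "
            "cannot construct envvars"
        )
    env = {}
    for name, (host, port) in services.items():
        prefix = name.upper().replace("-", "_")
        env[prefix + "_SERVICE_HOST"] = host
        env[prefix + "_SERVICE_PORT"] = str(port)
    return env


# Usage: an API-sourced pod fails while the service list is unsynced.
guard = Pod(name="etcd-quorum-guard", static=False)
try:
    make_service_env(guard, services_synced=False, services={})
except ServiceEnvError as err:
    print(f"{guard.name}: {err}")

# Once synced, the same pod gets its variables (placeholder address).
print(make_service_env(guard, True, {"kubernetes": ("172.30.0.1", 443)}))
```

Under this model the error is a symptom rather than a cause: it clears on its own once the kubelet can list services again, which is why the connection failure in the log above is the thing to chase.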
I upgraded from registry.ci.openshift.org/ocp/release:4.6.0-0.ci-2021-01-20-111943 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-012810 without issue. I also did a clusterbot upgrade from 4.6.13 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-090809 without issue. Weinan: Can you try and reproduce again with later releases?
Based on Comment 10. *** This bug has been marked as a duplicate of bug 1845414 ***
Reopening. 1845414 is about API disruption on Azure, c10 is about etcd topics on gcp.
The KubeletConfiguration received some validation, and a couple of fields were missed. Added a PR to fix it. Marking as a blocker.
Blocked by 1920027
4.6.13-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
Verified to be fixed for both of the paths.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Wei added UpgradeBlocker to this back in January, but it was fixed in 4.7 before GA, so we never ended up blocking any updates on it. Replacing with Upgrades to remove this bug from our suspect queue [1]. [1]: https://github.com/openshift/enhancements/pull/475