Bug 1915235
| Summary: | Failed to upgrade to 4.7 from 4.6 due to the machine-config failure | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> | ||||
| Component: | Node | Assignee: | Ryan Phillips <rphillips> | ||||
| Node sub component: | Kubelet | QA Contact: | Weinan Liu <weinliu> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | urgent | ||||||
| Priority: | urgent | CC: | aos-bugs, behoward, kgarriso, nagrawal, sttts, weinliu, wking, wsun, xxia, yanyang | ||||
| Version: | 4.7 | Keywords: | Reopened, TestBlocker, Upgrades | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.7.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-02-24 15:52:05 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1920027, 1923874, 1933075 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
Description
Jian Zhang
2021-01-12 10:11:07 UTC
Created attachment 1746838 [details]
Must Gather

I was able to get the must-gather from the creds in Comment #1. Have you seen this issue on other nightlies, or only on 4.7.0-0.nightly-2021-01-10-070949?

The error is percolated up from the kubelet: https://github.com/openshift/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L571-L581

Kicking this over to the node team based on the error.

Hit it again when upgrading from 4.6.0-0.nightly-2021-01-13-215839 to 4.7.0-0.nightly-2021-01-13-124141. Adding the TestBlocker keyword, since this is blocking the 4.6->4.7 upgrade test.

The error "CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" is due to the kubelet not being connected to the API server.

Jan 12 12:07:18.165564 Get "https://api-int.wsun124706.qe.gcp.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/wsun12-tf9jr-m-1.c.openshift-qe.internal?timeout=10s": dial tcp 34.68.182.209:6443: connect:

openshift-apiserver-operator-77c97c89bf-xrdlk/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log: lots and lots of throttled requests are taking 1.3s+

The etcd-quorum-guard being stopped looks extremely suspicious:

2:21.647859 wsun12-tf9jr-m-0.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:22:21.647784830Z" level=info msg="Stopped container 6cc17d4cb4206ef7c80357d5de0fab62e5e9938195ce76010de71cf0513b61e0: openshift-etcd/etcd-quorum-guard-67cbf954d4-kd29f/guard" id=e5860997-10ca-4f1b-9369-dcae55d701c7 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:38.983587 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:38.983498054Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=9479c621-d1af-409c-b581-6a46bd83af33 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:40.531898 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:40.531823640Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=06128c52-7742-41e0-bdf1-3c18248898bb name=/runtime.v1alpha2.RuntimeService/StopContainer
3:05.504527 wsun12-tf9jr-m-1.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:23:05.504405526Z" level=info msg="Stopped container c61268ef524c11f1ee50e93497ceea67e2b0317f3d9afd55b1a0c141564b547a: openshift-etcd/etcd-quorum-guard-67cbf954d4-9p4wf/guard" id=a0d341b7-0057-4e5e-9f59-a14a26775ba8 name=/runtime.v1alpha2.RuntimeService/StopContainer

I upgraded from registry.ci.openshift.org/ocp/release:4.6.0-0.ci-2021-01-20-111943 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-012810 without issue. I also did a clusterbot upgrade from 4.6.13 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-090809 without issue.

Weinan: Can you try to reproduce this again with later releases?

Based on Comment 10.
*** This bug has been marked as a duplicate of bug 1845414 ***

Reopening. 1845414 is about API disruption on Azure; c10 is about etcd topics on GCP.

The KubeletConfiguration received some validation and a couple of fields were missed. Added a PR to fix it.

Marking as a blocker.

Blocked by 1920027

4.6.13-x86_64 4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64 4.7.0-0.nightly-2021-02-03-165316
Verified to be fixed for both of the paths.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633

Wei added UpgradeBlocker to this back in January, but it was fixed in 4.7 before GA, so we never ended up blocking any updates on it. Replacing with Upgrades to remove this bug from our suspect queue [1].

[1]: https://github.com/openshift/enhancements/pull/475