Bug 1915235 - Failed to upgrade to 4.7 from 4.6 due to the machine-config failure
Summary: Failed to upgrade to 4.7 from 4.6 due to the machine-config failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On: 1920027 1923874 1933075
Blocks:
Reported: 2021-01-12 10:11 UTC by Jian Zhang
Modified: 2021-03-31 04:24 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:52:05 UTC
Target Upstream Version:
Embargoed:


Attachments
Must Gather (18.51 MB, application/x-xz)
2021-01-12 23:45 UTC, Ben Howard


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2361 | 0 | None | closed | Bug 1915235: add imagefs.inodesFree to resourceFields | 2021-02-18 05:53:20 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:52:28 UTC

Description Jian Zhang 2021-01-12 10:11:07 UTC
Description of problem:

[2021-01-12T04:50:31.078Z]     Message:               Unable to apply 4.7.0-0.nightly-2021-01-10-070949: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-generated-kubelet expected ad33d39ebe36919d96466d426145ff7aa722574f has eab9c35dfbeb0d21be6e1db3887acbbb93592d34: all 3 nodes are at latest configuration rendered-master-49261a4c52aacdde78ed360503334d63, retrying

Version-Release number of selected component (if applicable):
4.6.10-x86_64 -> 4.7.0-fc.2-x86_64 Platform: Disconnected UPI on Azure with RHCOS & Private Cluster
4.6.10-x86_64 -> 4.7.0-0.nightly-2021-01-10-070949 Platform: IPI on Azure & fully private

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.6.10-x86_64.
2. Upgrade it to 4.7.0-0.nightly-2021-01-10-070949 or 4.7.0-fc.2-x86_64 (force upgrade)


Actual results:
Both upgrades failed. machine-config is still at version 4.6.10.
[2021-01-12T04:50:29.163Z] machine-approver                           4.7.0-0.nightly-2021-01-10-070949   True        False         False      4h
[2021-01-12T04:50:29.163Z] machine-config                             4.6.10                              False       True          True       146m
[2021-01-12T04:50:29.163Z] marketplace                                4.7.0-0.nightly-2021-01-10-070949   True        False         False      129m
...

[2021-01-12T04:50:31.078Z] Name:         machine-config
[2021-01-12T04:50:31.078Z] Namespace:    
[2021-01-12T04:50:31.078Z] Labels:       <none>
[2021-01-12T04:50:31.078Z] Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
[2021-01-12T04:50:31.078Z] API Version:  config.openshift.io/v1
[2021-01-12T04:50:31.078Z] Kind:         ClusterOperator
...
...
[2021-01-12T04:50:31.078Z]     Message:               Unable to apply 4.7.0-0.nightly-2021-01-10-070949: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for 99-master-generated-kubelet expected ad33d39ebe36919d96466d426145ff7aa722574f has eab9c35dfbeb0d21be6e1db3887acbbb93592d34: all 3 nodes are at latest configuration rendered-master-49261a4c52aacdde78ed360503334d63, retrying
...

Expected results:
Upgrade to 4.7 successfully.


Additional info:
Related logs:
1, http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8979/console
2, http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/8984/console
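For triage, the "controller version mismatch" message quoted above can be split into its parts. The following is a minimal sketch, assuming the message keeps the wording shown in this report; the function name `parse_mismatch` is hypothetical and not part of any OpenShift tooling.

```python
import re

# Matches the "controller version mismatch" fragment as worded in this report.
MISMATCH_RE = re.compile(
    r"controller version mismatch for (?P<config>\S+) "
    r"expected (?P<expected>[0-9a-f]+) "
    r"has (?P<actual>[0-9a-f]+)"
)


def parse_mismatch(message):
    """Return the config name and both controller hashes, or None if absent."""
    match = MISMATCH_RE.search(message)
    return match.groupdict() if match else None


# Message text copied from the ClusterOperator status in this report.
message = (
    "pool master has not progressed to latest configuration: "
    "controller version mismatch for 99-master-generated-kubelet "
    "expected ad33d39ebe36919d96466d426145ff7aa722574f "
    "has eab9c35dfbeb0d21be6e1db3887acbbb93592d34: "
    "all 3 nodes are at latest configuration "
    "rendered-master-49261a4c52aacdde78ed360503334d63, retrying"
)
info = parse_mismatch(message)
print(info)
```

Here the interesting detail is that the generated kubelet MachineConfig still carries the old controller hash even though all nodes report the latest rendered configuration.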

Comment 4 Ben Howard 2021-01-12 23:45:20 UTC
Created attachment 1746838 [details]
Must Gather

I was able to get the must-gather using the creds in Comment #1.

Comment 5 Kirsten Garrison 2021-01-12 23:48:38 UTC
Have you seen this issue on other nightlies or only 4.7.0-0.nightly-2021-01-10-070949 ?

Comment 7 Ben Howard 2021-01-12 23:57:19 UTC
The error is percolated up from the Kubelet: https://github.com/openshift/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L571-L581

Kicking this over to the node team based on the error.

Comment 8 Wei Sun 2021-01-14 14:00:02 UTC
Hit it again when upgrading from 4.6.0-0.nightly-2021-01-13-215839 to 4.7.0-0.nightly-2021-01-13-124141. Adding the test blocker keyword, since this is blocking the 4.6->4.7 upgrade test.

Comment 10 Ryan Phillips 2021-01-21 14:54:34 UTC
The error "CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" is due to the kubelet not being connected to the API server.

Jan 12 12:07:18.165564 Get "https://api-int.wsun124706.qe.gcp.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/wsun12-tf9jr-m-1.c.openshift-qe.internal?timeout=10s": dial tcp 34.68.182.209:6443: connect: 

openshift-apiserver-operator-77c97c89bf-xrdlk/openshift-apiserver-operator/openshift-apiserver-operator/logs/current.log:

  lots and lots of throttled requests are taking 1.3s+

The etcd-quorum-guard being stopped looks extremely suspicious:

2:21.647859 wsun12-tf9jr-m-0.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:22:21.647784830Z" level=info msg="Stopped container 6cc17d4cb4206ef7c80357d5de0fab62e5e9938195ce76010de71cf0513b61e0: openshift-etcd/etcd-quorum-guard-67cbf954d4-kd29f/guard" id=e5860997-10ca-4f1b-9369-dcae55d701c7 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:38.983587 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:38.983498054Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=9479c621-d1af-409c-b581-6a46bd83af33 name=/runtime.v1alpha2.RuntimeService/StopContainer
2:40.531898 wsun12-tf9jr-m-2.c.openshift-qe.internal crio[1584]: time="2021-01-12 11:22:40.531823640Z" level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=06128c52-7742-41e0-bdf1-3c18248898bb name=/runtime.v1alpha2.RuntimeService/StopContainer
3:05.504527 wsun12-tf9jr-m-1.c.openshift-qe.internal crio[1580]: time="2021-01-12 11:23:05.504405526Z" level=info msg="Stopped container c61268ef524c11f1ee50e93497ceea67e2b0317f3d9afd55b1a0c141564b547a: openshift-etcd/etcd-quorum-guard-67cbf954d4-9p4wf/guard" id=a0d341b7-0057-4e5e-9f59-a14a26775ba8 name=/runtime.v1alpha2.RuntimeService/StopContainer
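To see which pods were stopped and how often, the "Stopped container" lines above can be tallied per pod. This is a rough sketch, assuming the crio message format shown in this comment; `stopped_pods` is a hypothetical helper, and the sample lines are excerpts of the log lines quoted above.

```python
import re
from collections import Counter

# Matches the msg="Stopped container <id>: <namespace>/<pod>/<container>" part.
STOPPED_RE = re.compile(
    r'msg="Stopped container (?P<id>[0-9a-f]+): '
    r'(?P<namespace>[^/]+)/(?P<pod>[^/]+)/(?P<container>[^"]+)"'
)


def stopped_pods(lines):
    """Count StopContainer log entries per (namespace, pod)."""
    counts = Counter()
    for line in lines:
        match = STOPPED_RE.search(line)
        if match:
            counts[(match["namespace"], match["pod"])] += 1
    return counts


# Excerpts of the crio log lines quoted in this comment.
sample = [
    'level=info msg="Stopped container 6cc17d4cb4206ef7c80357d5de0fab62e5e9938195ce76010de71cf0513b61e0: openshift-etcd/etcd-quorum-guard-67cbf954d4-kd29f/guard" id=e5860997-10ca-4f1b-9369-dcae55d701c7',
    'level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=9479c621-d1af-409c-b581-6a46bd83af33',
    'level=info msg="Stopped container 8656bad9901fafd80c1837297ae533dc9ee3240f553e34e357bcb454b8ff14de: openshift-etcd/etcd-quorum-guard-67cbf954d4-j2js7/guard" id=06128c52-7742-41e0-bdf1-3c18248898bb',
]
counts = stopped_pods(sample)
for (namespace, pod), n in sorted(counts.items()):
    print(f"{namespace}/{pod}: {n}")
```

A pod that shows up more than once, as the j2js7 guard does here, received repeated StopContainer requests for the same container ID.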

Comment 11 Ryan Phillips 2021-01-21 20:54:12 UTC
I upgraded from registry.ci.openshift.org/ocp/release:4.6.0-0.ci-2021-01-20-111943 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-012810 without issue.

I also did a clusterbot upgrade from 4.6.13 to registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-01-21-090809 without issue.

Weinan: Can you try and reproduce again with later releases?

Comment 13 Neelesh Agrawal 2021-01-22 19:11:37 UTC
Based on Comment 10.

*** This bug has been marked as a duplicate of bug 1845414 ***

Comment 14 Stefan Schimanski 2021-01-25 11:47:20 UTC
Reopening. Bug 1845414 is about API disruption on Azure; comment 10 is about etcd topics on GCP.

Comment 16 Ryan Phillips 2021-01-25 16:10:27 UTC
Validation was added for the KubeletConfiguration, and a couple of fields were missed. Added a PR to fix it.

Marking as a blocker.
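To illustrate how a missed field can make an otherwise valid config fail validation, here is a minimal sketch. It assumes the validation is an allow-list check on eviction threshold keys, which this report does not confirm; it is not the machine-config-operator's code. The names `KNOWN_SIGNALS` and `unknown_keys` and the sample thresholds are hypothetical. The only detail taken from this bug is the linked PR title, "add imagefs.inodesFree to resourceFields".

```python
# Kubelet eviction signals, as I understand the upstream documentation.
KNOWN_SIGNALS = {
    "memory.available",
    "nodefs.available",
    "nodefs.inodesFree",
    "imagefs.available",
    "imagefs.inodesFree",
    "pid.available",
}


def unknown_keys(thresholds, allowed):
    """Return the threshold keys that the allow-list would reject."""
    return sorted(key for key in thresholds if key not in allowed)


# Hypothetical eviction thresholds from a user-supplied kubelet config.
eviction_hard = {"memory.available": "100Mi", "imagefs.inodesFree": "5%"}

# An allow-list that forgot one valid signal rejects a valid config.
incomplete = KNOWN_SIGNALS - {"imagefs.inodesFree"}
rejected_before = unknown_keys(eviction_hard, incomplete)
rejected_after = unknown_keys(eviction_hard, KNOWN_SIGNALS)
print(rejected_before, rejected_after)
```

With the incomplete list, a config that was accepted on the older release is rejected after the upgrade; restoring the missing key to the list makes it pass again.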

Comment 18 Weinan Liu 2021-01-26 08:58:45 UTC
Blocked by 1920027

Comment 27 Weinan Liu 2021-02-04 07:31:53 UTC
4.6.13-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
4.6.16-x86_64 -> 4.7.0-0.nightly-2021-02-03-165316
Verified as fixed for both upgrade paths.

Comment 30 errata-xmlrpc 2021-02-24 15:52:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 31 W. Trevor King 2021-03-31 04:24:18 UTC
Wei added UpgradeBlocker to this back in January, but it was fixed in 4.7 before GA, so we never ended up blocking any updates on it.  Replacing with Upgrades to remove this bug from our suspect queue [1].

[1]: https://github.com/openshift/enhancements/pull/475

