Bug 2001243 - Enforce OpenShift's defined kubelet version skew policies [NEEDINFO]
Summary: Enforce OpenShift's defined kubelet version skew policies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Luis Sanchez
QA Contact: Rahul Gangwar
URL:
Whiteboard: LifecycleReset
Depends On: 2001244
Blocks:
Reported: 2021-09-04 19:48 UTC by OpenShift BugZilla Robot
Modified: 2022-01-19 13:30 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-19 13:30:41 UTC
Target Upstream Version:
Embargoed:
Flags: mfojtik: needinfo?


Links
System                  ID                                                   Private  Priority  Status  Summary                                                                               Last Updated
Github                  openshift cluster-kube-apiserver-operator pull 1223  0        None      open    [release-4.7] Bug 2001243: Enforce OpenShift's defined kubelet version skew policies  2021-10-29 23:07:08 UTC
Red Hat Product Errata  RHBA-2022:0117                                       0        None      None    None                                                                                  2022-01-19 13:30:55 UTC

Description OpenShift BugZilla Robot 2021-09-04 19:48:19 UTC
+++ This bug was initially created as a clone of Bug #1998552 +++

The API Server Operator will set Upgradeable=False whenever any of the nodes within the cluster are at the skew limit; that is, when an upgrade of the API Server would exceed the allowable kubelet version skew.
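
The rule above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the function names, the node names, the `max_skew` parameter, and its default of two minor versions are all assumptions, since this report does not state the exact threshold the operator applies.

```python
import re


def minor_version(version):
    """Return the minor number from strings like 'v1.18.3+d8ef5ad' or '1.20.11'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(2))


def nodes_blocking_upgrade(api_server_version, kubelet_versions, max_skew=2):
    """List nodes whose kubelet would exceed max_skew after the next minor upgrade.

    max_skew is an assumed limit, not a value taken from the operator.
    """
    next_api_minor = minor_version(api_server_version) + 1
    return sorted(
        node
        for node, kubelet in kubelet_versions.items()
        if next_api_minor - minor_version(kubelet) > max_skew
    )


def is_upgradeable(api_server_version, kubelet_versions, max_skew=2):
    """Model Upgradeable=True/False: True only when no node blocks the upgrade."""
    return not nodes_blocking_upgrade(api_server_version, kubelet_versions, max_skew)


if __name__ == "__main__":
    # Hypothetical node names and versions, for illustration only.
    kubelets = {"worker-0": "v1.18.3", "master-0": "v1.19.16"}
    print(nodes_blocking_upgrade("1.20.11", kubelets))
    print(is_upgradeable("1.20.11", kubelets))
```

Under these assumptions, with the API server at 1.20 a 1.18 kubelet is flagged because the next minor version (1.21) would leave it three versions behind, while a 1.19 kubelet is not.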

Comment 1 Michal Fojtik 2021-10-22 09:01:58 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 2 Michal Fojtik 2021-12-13 15:34:06 UTC
The LifecycleStale keyword was removed because the bug moved to QE.
The bug assignee was notified.

Comment 5 Rahul Gangwar 2021-12-17 17:36:32 UTC
@luis After pausing the worker machine-config-pool, the 4.5->4.6->4.7 upgrade is failing and multiple cluster operators are degraded.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-12-15-152825   True        True          9h      Unable to apply 4.7.0-0.nightly-2021-12-17-022306: an unknown error has occurred: MultipleErrors

NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-12-17-022306   False       True          True       111m
baremetal                                  4.7.0-0.nightly-2021-12-17-022306   True        False         False      117m
cloud-credential                           4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h53m
cluster-autoscaler                         4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h38m
config-operator                            4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h38m
console                                    4.7.0-0.nightly-2021-12-17-022306   True        False         True       114m
csi-snapshot-controller                    4.7.0-0.nightly-2021-12-17-022306   True        False         False      115m
dns                                        4.6.0-0.nightly-2021-12-15-152825   True        False         False      175m
etcd                                       4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h42m
image-registry                             4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h33m
ingress                                    4.7.0-0.nightly-2021-12-17-022306   True        False         True       3h12m
insights                                   4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h39m
kube-apiserver                             4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h41m
kube-controller-manager                    4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h41m
kube-scheduler                             4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h41m
kube-storage-version-migrator              4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h33m
machine-api                                4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h36m
machine-approver                           4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h39m
machine-config                             4.6.0-0.nightly-2021-12-15-152825   True        False         False      152m
marketplace                                4.7.0-0.nightly-2021-12-17-022306   True        False         False      114m
monitoring                                 4.7.0-0.nightly-2021-12-17-022306   True        False         False      164m
network                                    4.6.0-0.nightly-2021-12-15-152825   True        True          True       5h43m
node-tuning                                4.7.0-0.nightly-2021-12-17-022306   True        False         False      115m
openshift-apiserver                        4.7.0-0.nightly-2021-12-17-022306   True        False         False      3h
openshift-controller-manager               4.7.0-0.nightly-2021-12-17-022306   True        False         False      3h1m
openshift-samples                          4.7.0-0.nightly-2021-12-17-022306   True        False         False      115m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h43m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h43m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-12-17-022306   True        False         False      3h1m
service-ca                                 4.7.0-0.nightly-2021-12-17-022306   True        False         False      5h44m
storage                                    4.7.0-0.nightly-2021-12-17-022306   True        False         False      161m
 
Details - http://pastebin.test.redhat.com/1016562
http://pastebin.test.redhat.com/1016594
http://pastebin.test.redhat.com/1016598

Must-gather link: https://drive.google.com/file/d/1PXHgSRiDbliTSqOgYdH5AfV_oo1CkjfJ/view?usp=sharing

Comment 6 W. Trevor King 2021-12-18 07:30:44 UTC
Must gather from comment 5:

$ tar xz --strip-components=1 <must-gather.local.4931375460397206808.tar.gz
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions.yaml | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2021-12-17T08:24:55Z - Partial 4.7.0-0.nightly-2021-12-17-022306
2021-12-17T07:14:20Z 2021-12-17T08:20:10Z Completed 4.6.0-0.nightly-2021-12-15-152825
2021-12-17T04:57:57Z 2021-12-17T05:29:20Z Completed 4.5.0-0.nightly-2021-09-07-164108
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions.yaml | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-12-17T05:29:20Z Available=True : Done applying 4.6.0-0.nightly-2021-12-15-152825
2021-12-17T13:56:13Z Failing=False : 
2021-12-17T08:24:55Z Progressing=True : Working towards 4.7.0-0.nightly-2021-12-17-022306: 505 of 668 done (75% complete)
2021-12-17T04:57:57Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.7.0-0.nightly-2021-12-17-022306 not found in the "stable-4.5" channel
2021-12-17T08:31:49Z Upgradeable=False KubeletMinorVersion_KubeletMinorVersionUnsupported: Cluster operator kube-apiserver cannot be upgraded between minor versions: KubeletMinorVersionUpgradeable: Unsupported kubelet minor versions on nodes ip-10-0-129-213.us-east-2.compute.internal, ip-10-0-186-172.us-east-2.compute.internal, and ip-10-0-213-183.us-east-2.compute.internal are too far behind the target API server version (1.20.11).

Hmm...  No cluster-version operator logs in this must-gather?

$ ls namespaces/openshift-cluster-version/
monitoring.coreos.com

Hard to know for sure without CVO logs, but I suspect bug 2018368 may be involved in part of this.  Doesn't explain why the other operators would be degraded though, and I haven't poked into those.

Anyhow, for the purpose of this backport, you can see that we got far enough into 4.7 to update the Kube API-server operator, and that new API-server operator is appropriately complaining about the old 1.18 kubelets from the stuck-on-4.5 compute nodes:

$ grep -r ' kubeletVersion' cluster-scoped-resources/core/nodes
cluster-scoped-resources/core/nodes/ip-10-0-166-226.us-east-2.compute.internal.yaml:    kubeletVersion: v1.19.16+845f228
cluster-scoped-resources/core/nodes/ip-10-0-152-231.us-east-2.compute.internal.yaml:    kubeletVersion: v1.19.16+845f228
cluster-scoped-resources/core/nodes/ip-10-0-186-172.us-east-2.compute.internal.yaml:    kubeletVersion: v1.18.3+d8ef5ad
cluster-scoped-resources/core/nodes/ip-10-0-129-213.us-east-2.compute.internal.yaml:    kubeletVersion: v1.18.3+d8ef5ad
cluster-scoped-resources/core/nodes/ip-10-0-221-161.us-east-2.compute.internal.yaml:    kubeletVersion: v1.19.16+845f228
cluster-scoped-resources/core/nodes/ip-10-0-213-183.us-east-2.compute.internal.yaml:    kubeletVersion: v1.18.3+d8ef5ad
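
The grouping visible in that output can also be scripted. A minimal sketch, assuming input lines in the same `grep` format as above; the function name and the regular expression are illustrative and not part of any OpenShift tooling.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   .../nodes/<node-name>.yaml:    kubeletVersion: v1.18.3+d8ef5ad
LINE = re.compile(
    r"nodes/(?P<node>[^/:]+)\.yaml:\s*kubeletVersion:\s*v?(?P<major>\d+)\.(?P<minor>\d+)"
)


def nodes_by_kubelet_minor(grep_output):
    """Group node names by kubelet 'major.minor', skipping lines that do not match."""
    groups = defaultdict(list)
    for line in grep_output.splitlines():
        match = LINE.search(line)
        if match:
            key = f"{match['major']}.{match['minor']}"
            groups[key].append(match["node"])
    return {key: sorted(nodes) for key, nodes in groups.items()}


if __name__ == "__main__":
    sample = (
        "cluster-scoped-resources/core/nodes/ip-10-0-166-226.us-east-2.compute.internal.yaml:"
        "    kubeletVersion: v1.19.16+845f228\n"
        "cluster-scoped-resources/core/nodes/ip-10-0-186-172.us-east-2.compute.internal.yaml:"
        "    kubeletVersion: v1.18.3+d8ef5ad\n"
    )
    for minor, nodes in sorted(nodes_by_kubelet_minor(sample).items()):
        print(minor, nodes)
```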

Bug 2018356 is up with some wording suggestions, but that would be its own backport series if those get picked up.

So I think this bug can be marked VERIFIED as it stands, with its successful demonstration that the 4.7 Kube API-server operator complains as we'd expect it to.  And the other issues that kept it from being a nice, clean update can be followed up in other bugs.

Comment 7 Rahul Gangwar 2021-12-20 05:14:59 UTC
@Trevor Please move the bug back to ON_QA; then we can move it to VERIFIED.

Comment 8 Rahul Gangwar 2021-12-20 10:06:01 UTC
@Trevor As per your comment, and after checking with the SDN team, the upgrade is not failing because of the PR code. It is failing because of the SDN issue https://bugzilla.redhat.com/show_bug.cgi?id=1916029. Moving the bug to VERIFIED.

Comment 11 errata-xmlrpc 2022-01-19 13:30:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.41 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0117

