Bug 1786274

Summary: RHEL7 worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.2.12 to 4.3.0
Product: OpenShift Container Platform
Reporter: Weinan Liu <weinliu>
Component: Documentation
Assignee: Kathryn Alexander <kalexand>
Status: CLOSED CURRENTRELEASE
QA Contact: Gaoyun Pei <gpei>
Severity: high
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 4.3.0
CC: amurdaca, aos-bugs, gpei, jokerman, juzhao, kalexand, mifiedle, rphillips, scuppett, sdodson, sjenning, wking, wsun, xtian
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: 4.3.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1792139
Environment:
Last Closed: 2020-01-24 21:02:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1792139    

Description Weinan Liu 2019-12-24 07:23:07 UTC
Description of problem:
Upgrading from 4.2.12 to 4.3.0 may leave RHEL 7.7 worker nodes in the NotReady,SchedulingDisabled state.


How reproducible:
Sometimes

Steps to Reproduce:
Initial status:

A cluster with 3 RHEL 7.7 worker nodes provisioned

$  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        False         85m     Cluster version is 4.2.12

$ oc get no
NAME                                 STATUS   ROLES    AGE    VERSION
weinliu-1223-7674g-compute-0         Ready    worker   94m    v1.14.6+cebabbf4a
weinliu-1223-7674g-compute-1         Ready    worker   94m    v1.14.6+cebabbf4a
weinliu-1223-7674g-control-plane-0   Ready    master   107m   v1.14.6+cebabbf4a
weinliu-1223-7674g-control-plane-1   Ready    master   107m   v1.14.6+cebabbf4a
weinliu-1223-7674g-control-plane-2   Ready    master   107m   v1.14.6+cebabbf4a
weinliu-1223-7674g-rhel-0            Ready    worker   63m    v1.14.6+b69672ada
weinliu-1223-7674g-rhel-1            Ready    worker   63m    v1.14.6+b69672ada
weinliu-1223-7674g-rhel-2            Ready    worker   63m    v1.14.6+b69672ada


1. Perform the upgrade
$ oc adm upgrade --to-image="registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-22-223447" --allow-explicit-upgrade --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-22-223447


Actual results:
The worker nodes failed to get upgraded. The cluster operators were upgraded, but the upgrade failed on one of the RHEL worker nodes:
$ oc get no
NAME                                 STATUS                        ROLES    AGE   VERSION
weinliu-1223-7674g-compute-0         Ready                         worker   22h   v1.16.2
weinliu-1223-7674g-compute-1         Ready                         worker   22h   v1.16.2
weinliu-1223-7674g-control-plane-0   Ready                         master   23h   v1.16.2
weinliu-1223-7674g-control-plane-1   Ready                         master   23h   v1.16.2
weinliu-1223-7674g-control-plane-2   Ready                         master   23h   v1.16.2
weinliu-1223-7674g-rhel-0            NotReady,SchedulingDisabled   worker   22h   v1.14.6+b69672ada
weinliu-1223-7674g-rhel-1            Ready                         worker   22h   v1.14.6+b69672ada
weinliu-1223-7674g-rhel-2            Ready                         worker   22h   v1.14.6+b69672ada

$ oc get no -o wide
NAME                                 STATUS                        ROLES    AGE   VERSION             INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                CONTAINER-RUNTIME
weinliu-1223-7674g-compute-0         Ready                         worker   22h   v1.16.2             10.0.98.54    <none>        Red Hat Enterprise Linux CoreOS 43.81.201912221553.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.16.1-7.dev.rhaos4.3.gitcee3d66.el8
weinliu-1223-7674g-compute-1         Ready                         worker   22h   v1.16.2             10.0.97.252   <none>        Red Hat Enterprise Linux CoreOS 43.81.201912221553.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.16.1-7.dev.rhaos4.3.gitcee3d66.el8
weinliu-1223-7674g-control-plane-0   Ready                         master   23h   v1.16.2             10.0.97.217   <none>        Red Hat Enterprise Linux CoreOS 43.81.201912221553.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.16.1-7.dev.rhaos4.3.gitcee3d66.el8
weinliu-1223-7674g-control-plane-1   Ready                         master   23h   v1.16.2             10.0.98.89    <none>        Red Hat Enterprise Linux CoreOS 43.81.201912221553.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.16.1-7.dev.rhaos4.3.gitcee3d66.el8
weinliu-1223-7674g-control-plane-2   Ready                         master   23h   v1.16.2             10.0.98.160   <none>        Red Hat Enterprise Linux CoreOS 43.81.201912221553.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.16.1-7.dev.rhaos4.3.gitcee3d66.el8
weinliu-1223-7674g-rhel-0            NotReady,SchedulingDisabled   worker   22h   v1.14.6+b69672ada   10.0.98.83    <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.14.11-2.dev.rhaos4.2.git179ea6b.el7
weinliu-1223-7674g-rhel-1            Ready                         worker   22h   v1.14.6+b69672ada   10.0.98.170   <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.14.11-2.dev.rhaos4.2.git179ea6b.el7
weinliu-1223-7674g-rhel-2            Ready                         worker   22h   v1.14.6+b69672ada   10.0.98.65    <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.14.11-2.dev.rhaos4.2.git179ea6b.el7


$  oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-22-223447   True        False         20h     Cluster version is 4.3.0-0.nightly-2019-12-22-223447

$  oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-22-223447   True        False         False      20h
cloud-credential                           4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
cluster-autoscaler                         4.3.0-0.nightly-2019-12-22-223447   True        False         False      20h
console                                    4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
dns                                        4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
image-registry                             4.3.0-0.nightly-2019-12-22-223447   True        False         False      20h
ingress                                    4.3.0-0.nightly-2019-12-22-223447   True        False         False      20h
insights                                   4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
kube-apiserver                             4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
kube-controller-manager                    4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
kube-scheduler                             4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
machine-api                                4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
machine-config                             4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
marketplace                                4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
monitoring                                 4.3.0-0.nightly-2019-12-22-223447   False       True          True       18h
network                                    4.3.0-0.nightly-2019-12-22-223447   True        True          True       21h
node-tuning                                4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
openshift-apiserver                        4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
openshift-controller-manager               4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
openshift-samples                          4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-22-223447   True        False         False      125m
service-ca                                 4.3.0-0.nightly-2019-12-22-223447   True        False         False      21h
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-22-223447   True        False         False      18h
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-22-223447   True        False         False      20h
storage                                    4.3.0-0.nightly-2019-12-22-223447   True        False         False      19h


Expected results:
The upgrade succeeds without errors.

Additional info:
[kubelet logs] on the RHEL worker:
Dec 24 00:33:52 weinliu-1223-7674g-rhel-0 hyperkube[19781]: W1224 00:33:52.415479   19781 options.go:263] unknown 'kubernetes.io' or 'k8s.io' labels specified with --node-labels: [node-role.kubernetes.io/worker]
Dec 24 00:33:52 weinliu-1223-7674g-rhel-0 hyperkube[19781]: W1224 00:33:52.415488   19781 options.go:264] in 1.16, --node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the specifically allowed set (beta.kubernetes.io/arch, beta.kubernetes.io/instance-type, beta.kubernetes.io/os, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone, failure-domain.kubernetes.io/region, failure-domain.kubernetes.io/zone, kubernetes.io/arch, kubernetes.io/hostname, kubernetes.io/instance-type, kubernetes.io/os)
Dec 24 00:33:52 weinliu-1223-7674g-rhel-0 hyperkube[19781]: Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Dec 24 00:33:52 weinliu-1223-7674g-rhel-0 hyperkube[19781]: F1224 00:33:52.417436   19781 server.go:206] unrecognized feature gate: LegacyNodeRoleBehavior
Dec 24 00:34:02 weinliu-1223-7674g-rhel-0 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Dec 24 00:34:02 weinliu-1223-7674g-rhel-0 systemd[1]: Stopped Kubernetes Kubelet.
-- Subject: Unit kubelet.service has finished shutting down
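
For reference, one way to confirm that the offending gate is actually present in the MCO-rendered kubelet configuration on the affected node (a rough sketch; the /etc/kubernetes/kubelet.conf path is assumed from the MCO defaults, and direct SSH access to the node is assumed):

$ ssh weinliu-1223-7674g-rhel-0 'grep -i -A 10 featureGates /etc/kubernetes/kubelet.conf'
# If the 4.3-rendered config has already been written to the node, the output
# should list the new gates, including LegacyNodeRoleBehavior, which the 4.2
# kubelet rejects as unrecognized.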

Comment 6 Antonio Murdaca 2020-01-03 13:29:52 UTC
Dec 24 00:33:52 weinliu-1223-7674g-rhel-0 hyperkube[19781]: F1224 00:33:52.417436   19781 server.go:206] unrecognized feature gate: LegacyNodeRoleBehavior

Ryan, can you take a look at this? I'm not sure whether it's the cause of the upgrade failure on the RHEL nodes, but it's worth checking, as I don't see anything wrong on the MCO side.

Comment 7 Seth Jennings 2020-01-03 17:10:25 UTC
The openshift/api PR that introduced this feature gate (new in 4.3):
https://github.com/openshift/api/pull/467

The kubelet config controller in the MCO currently assumes that the set of OCP feature gates is equal to the set of kube feature gates.  LegacyNodeRoleBehavior, and the others introduced in that PR, are the first OCP feature gates that are not kube feature gates, leading to this issue.
https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/kubelet-config/kubelet_config_features.go#L189-L212
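
As a quick way to inspect the cluster-level feature gate object that this controller consumes (a sketch; on a default cluster the spec is typically empty, with the gates in question coming from the defaults compiled into openshift/api rather than from the CR):

$ oc get featuregate cluster -o yaml
# Shows the FeatureGate CR (config.openshift.io/v1) named "cluster"; the
# kubelet config controller renders the kubelet feature gates from this
# object plus the openshift/api defaults for the selected featureSet.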

Comment 8 Seth Jennings 2020-01-03 18:19:22 UTC
Wow, OK, I was completely wrong.  LegacyNodeRoleBehavior _is_ an upstream feature gate, introduced in 1.16:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/

So this is a skew issue.  The kubelet needs to be updated on the node before the machine-config operator is updated. Updating the MCO updates the MCC and thus the kubelet-config controller, which imports the new openshift/api and includes new feature gates in the rendered config that may be (and in this case are) incompatible with the old kubelet.
https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L113-L124
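
A rough way to see the skew described above from the cluster side (the node name is taken from this report; the jsonpath fields are standard):

$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
$ oc get node weinliu-1223-7674g-rhel-0 -o jsonpath='{.status.nodeInfo.kubeletVersion}{"\n"}'
# The desired cluster version is already 4.3 (kube 1.16) while the RHEL
# worker still reports a 1.14 kubelet, so the newly rendered feature gates
# reach a kubelet that does not know about them.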

Comment 9 Seth Jennings 2020-01-03 18:44:28 UTC
Additionally, we test this upgrade with RHCOS workers in CI and it works. This issue only affects RHEL workers.

I'm pretty sure that when the MachineConfig changes during an RHCOS upgrade, we pivot the ostree first, upgrading the kubelet, then reboot and pick up the new files, including the new kubelet config file.  That explains why we don't see this in the RHCOS worker case.

Antonio, can you confirm this?

Comment 10 Antonio Murdaca 2020-01-06 11:39:25 UTC
(In reply to Seth Jennings from comment #9)
> Additionally, we test this upgrade with RHCOS workers in CI and it works. 
> This issue is only for RHEL workers.
> 
> I'm pretty sure that when the MachineConfig changes during RHCOS upgrade, we
> pivot the ostree first, upgrading the kubelet, then reboot and get the new
> files, including the new kubelet config file.  That explains why we don't
> see this in the RHCOS worker case.
> 
> Antonio, can you confirm this?

That is indeed the case, and it explains why we only see this on RHEL7 workers. Is this something we need to handle in the MCC kubelet controller, or is it the responsibility of the RHEL/Ansible playbooks?

Comment 11 Seth Jennings 2020-01-06 16:06:14 UTC
It seems to me that the RHEL worker upgrade pattern should be to upgrade the workers first, then upgrade the cluster.  A newer kubelet will be compatible with the older config, at least within n-1 skew.

Am I missing some obvious issue with that?

Comment 12 Seth Jennings 2020-01-06 18:47:09 UTC
Attempting this, I found that the playbook currently reads the running cluster version and will only install OpenShift RPMs that match the cluster version, even if a repo that has the newer version is enabled:
https://github.com/openshift/openshift-ansible/blob/91645ed18b8e0b6c84dcc0229d02aee77db3fae2/roles/openshift_node/tasks/install.yml#L23-L60

This currently forces the "upgrade cluster then upgrade workers" ordering.
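
A quick way to see that constraint in practice (a sketch run on a RHEL worker; the openshift-hyperkube package name is assumed from the 4.x RHEL worker repos):

$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'
$ yum --showduplicates list openshift-hyperkube
# Even if 4.3 versions of the RPM are visible in an enabled repo, the
# install task pins to the version that matches the running cluster version.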

Comment 13 Scott Dodson 2020-01-06 19:29:08 UTC
(In reply to Seth Jennings from comment #11)
> It seems to me that the RHEL worker upgrade pattern should be to upgrade the
> workers first, then upgrade the cluster.  Newer kubelet will be compatible
> with the older config, at least n-1 skewed.
> 
> Am I missing some obvious issue with that?

Just that we'd have to ensure that the API server is upgraded prior to the kubelet, because we don't support a kubelet that is newer than the API server.

Comment 14 Scott Dodson 2020-01-07 16:24:29 UTC
The behavior during a 4.2 to 4.3 upgrade is that when the MCO rolls out the new configuration, it cordons and marks unavailable the number of hosts specified by the `maxUnavailable` field on the machine config pool, then applies the new configuration and reboots each host. On a RHEL worker this process does not update the kubelet, so configuration specified by 4.3 is applied to a 4.2 kubelet and the host never returns to the Ready state. This stops the rollout until that host becomes available again. Assuming `maxUnavailable`, which defaults to 1, has been configured at a level acceptable for normal cluster operation, this should not be seen as a critical situation.
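
To check or adjust the field mentioned above (a sketch; the default is 1):

$ oc get machineconfigpool worker -o jsonpath='{.spec.maxUnavailable}{"\n"}'
$ oc patch machineconfigpool worker --type merge -p '{"spec":{"maxUnavailable":1}}'
# An empty result from the first command means the field is unset and the
# default of 1 applies.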

Therefore, we will amend the documentation to make it clear that this happens during 4.2 to 4.3 upgrades in clusters with RHEL workers, and that the admin needs to run the RHEL worker upgrade playbooks to complete the upgrade. Running the upgrade playbooks updates the kubelet on all specified RHEL workers and reboots them one by one. Once a RHEL worker has been updated, it returns to the Ready state and the upgrade completes as expected. This process also ensures that the API server has been upgraded prior to upgrading the kubelets, whereas other patterns may not.

We will evaluate additional changes in the future to make this a more seamless upgrade.
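
A sketch of that remediation, assuming the same openshift-ansible inventory that was used to add the RHEL workers (paths are placeholders):

$ cd /path/to/openshift-ansible
$ ansible-playbook -i /path/to/inventory/hosts playbooks/upgrade.yml
# Updates the OpenShift packages (including the kubelet) on each RHEL worker
# listed in the inventory and reboots the nodes one by one.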

Comment 18 Seth Jennings 2020-01-08 14:30:43 UTC
Junqi, did you install the 4.3 repo on the RHEL worker? I would have thought the upgrade playbook would fail if you had not, but maybe not

Comment 19 Scott Dodson 2020-01-08 14:37:34 UTC
(In reply to Seth Jennings from comment #18)
> Junqi, did you install the 4.3 repo on the RHEL worker? I would have thought
> the upgrade playbook would fail if you had not, but maybe not

Hmm, I'd assumed that was covered in the docs for the RHEL worker upgrade, but I'm not finding it. We need to make sure the repo-toggling steps are added too.

Roughly the same as the subscription manager snippet here: https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#preparing-for-an-automated-upgrade
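
Roughly, the repo toggling on each RHEL worker would look like this (a sketch; repo names are assumed from the 4.2/4.3 RHEL 7 channels):

$ subscription-manager repos --disable=rhel-7-server-ose-4.2-rpms
$ subscription-manager repos --enable=rhel-7-server-ose-4.3-rpms
# Run this before the upgrade playbook so that the 4.3 openshift-* RPMs are
# available to the openshift_node role.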

Comment 29 Kathryn Alexander 2020-01-15 15:01:33 UTC
The PR is here: https://github.com/openshift/openshift-docs/pull/19059

Gaoyun Pei, will you PTAL?

Comment 30 Gaoyun Pei 2020-01-16 07:15:02 UTC
Added a comment to the doc PR.

Comment 32 Gaoyun Pei 2020-01-17 07:06:17 UTC
The proposed doc PR LGTM; moving this bug to VERIFIED for 4.3.0.

Also cloned this bug to 4.4.0 to see if we can come up with a better solution.