Bug 1786993 - RHCOS master and worker nodes may sometimes go to NotReady,SchedulingDisabled while upgrading from 4.2.12 to 4.3.0
Summary: RHCOS master and worker nodes may go to NotReady,SchedulingDisabled while upg...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1791061 (view as bug list)
Depends On: 1789581
Blocks: 1789565
 
Reported: 2019-12-30 08:11 UTC by Cuiping HUO
Modified: 2020-01-23 11:20 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1789565 1789581 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:19:28 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1359 0 None closed Bug 1786993: [release-4.3] Fix osImageURL upgrade race 2021-02-03 10:38:19 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:20:00 UTC

Description Cuiping HUO 2019-12-30 08:11:18 UTC
Description of problem:
RHCOS master and worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.2.12 to 4.3.0 on Azure

Version-Release number of selected component (if applicable):
4.2.12 to 4.3.0-0.nightly-2019-12-29-214527

How reproducible:
Always

Steps to Reproduce:
Initial:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        False         121m    Cluster version is 4.2.12


$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
upgrade-chuo-8xqk6-master-0                  Ready    master   42m     v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-master-1                  Ready    master   43m     v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-master-2                  Ready    master   43m     v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-worker-centralus1-v4kk4   Ready    worker   34m     v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-worker-centralus2-qmm6f   Ready    worker   35m     v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-worker-centralus3-2b7lm   Ready    worker   2m20s   v1.14.6+cebabbf4a
upgrade-chuo-8xqk6-worker-centralus3-xjr5r   Ready    worker   2m1s    v1.14.6+cebabbf4a

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.12    True        False         False      120m
cloud-credential                           4.2.12    True        False         False      136m
cluster-autoscaler                         4.2.12    True        False         False      128m
console                                    4.2.12    True        False         False      123m
dns                                        4.2.12    True        False         False      135m
image-registry                             4.2.12    True        False         False      126m
ingress                                    4.2.12    True        False         False      126m
insights                                   4.2.12    True        False         False      136m
kube-apiserver                             4.2.12    True        False         False      134m
kube-controller-manager                    4.2.12    True        False         False      133m
kube-scheduler                             4.2.12    True        False         False      133m
machine-api                                4.2.12    True        False         False      136m
machine-config                             4.2.12    True        False         False      135m
marketplace                                4.2.12    True        False         False      129m
monitoring                                 4.2.12    True        False         False      124m
network                                    4.2.12    True        False         False      135m
node-tuning                                4.2.12    True        False         False      132m
openshift-apiserver                        4.2.12    True        False         False      131m
openshift-controller-manager               4.2.12    True        False         False      134m
openshift-samples                          4.2.12    True        False         False      128m
operator-lifecycle-manager                 4.2.12    True        False         False      135m
operator-lifecycle-manager-catalog         4.2.12    True        False         False      135m
operator-lifecycle-manager-packageserver   4.2.12    True        False         False      133m
service-ca                                 4.2.12    True        False         False      136m
service-catalog-apiserver                  4.2.12    True        False         False      38m
service-catalog-controller-manager         4.2.12    True        False         False      39m
storage                                    4.2.12    True        False         False      129m


Upgrade: 
oc adm upgrade --to=4.3.0-0.nightly-2019-12-29-214527 --force


Actual results:
One master node and one worker node are in NotReady,SchedulingDisabled status, and the machine-config operator version is still 4.2.12. In addition, the clusterversion status alternates between "Unable to apply 4.3.0-0.nightly-2019-12-29-214527: the cluster operator kube-apiserver is degraded" and "Working towards 4.3.0-0.nightly-2019-12-29-214527: 13% complete".

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          45m     Unable to apply 4.3.0-0.nightly-2019-12-29-214527: the cluster operator kube-apiserver is degraded
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          51m     Working towards 4.3.0-0.nightly-2019-12-29-214527: 13% complete

$ oc get nodes -o wide
NAME                                         STATUS                        ROLES    AGE     VERSION             INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION                CONTAINER-RUNTIME
upgrade-chuo-8xqk6-master-0                  Ready                         master   5h54m   v1.14.6+cebabbf4a   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-master-1                  Ready                         master   5h55m   v1.14.6+cebabbf4a   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-master-2                  NotReady,SchedulingDisabled   master   5h54m   v1.14.6+cebabbf4a   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-worker-centralus1-v4kk4   Ready                         worker   5h45m   v1.14.6+cebabbf4a   10.0.32.5     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-worker-centralus2-qmm6f   NotReady,SchedulingDisabled   worker   5h47m   v1.14.6+cebabbf4a   10.0.32.4     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-worker-centralus3-2b7lm   Ready                         worker   5h13m   v1.14.6+cebabbf4a   10.0.32.6     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
upgrade-chuo-8xqk6-worker-centralus3-xjr5r   Ready                         worker   5h13m   v1.14.6+cebabbf4a   10.0.32.7     <none>        Red Hat Enterprise Linux CoreOS 42.81.20191210.1 (Ootpa)   4.18.0-147.0.3.el8_1.x86_64   cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
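
(To dig into why a node is reporting NotReady, one would typically describe it and check its conditions, and then look at the machine-config-daemon pod scheduled on that node; a rough sketch, not taken from this report:)

$ # node conditions and recent events for the stuck master
$ oc describe node upgrade-chuo-8xqk6-master-2
$ # find the MCD pod running on that node
$ oc -n openshift-machine-config-operator get pods -o wide | grep master-2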

$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h35m
cloud-credential                           4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h50m
cluster-autoscaler                         4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h42m
console                                    4.3.0-0.nightly-2019-12-29-214527   True        False         False      59m
dns                                        4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h50m
image-registry                             4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h41m
ingress                                    4.3.0-0.nightly-2019-12-29-214527   True        False         False      59m
insights                                   4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h50m
kube-apiserver                             4.3.0-0.nightly-2019-12-29-214527   True        False         True       5h49m
kube-controller-manager                    4.3.0-0.nightly-2019-12-29-214527   True        False         True       5h47m
kube-scheduler                             4.3.0-0.nightly-2019-12-29-214527   True        False         True       5h48m
machine-api                                4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h50m
machine-config                             4.2.12                              False       True          True       67m
marketplace                                4.3.0-0.nightly-2019-12-29-214527   True        False         False      72m
monitoring                                 4.3.0-0.nightly-2019-12-29-214527   False       True          True       59m
network                                    4.3.0-0.nightly-2019-12-29-214527   True        True          True       5h49m
node-tuning                                4.3.0-0.nightly-2019-12-29-214527   True        False         False      83m
openshift-apiserver                        4.3.0-0.nightly-2019-12-29-214527   True        False         False      73m
openshift-controller-manager               4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h49m
openshift-samples                          4.3.0-0.nightly-2019-12-29-214527   True        False         False      75m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h49m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h49m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-29-214527   True        False         False      71m
service-ca                                 4.3.0-0.nightly-2019-12-29-214527   True        False         False      5h50m
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-29-214527   True        False         False      57m
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-29-214527   True        False         False      4h13m
storage                                    4.3.0-0.nightly-2019-12-29-214527   True        False         False      83m


Expected results:
Upgrade succeeds without errors

Additional info:
$ oc get co/machine-config -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-30T01:56:06Z"
  generation: 1
  name: machine-config
  resourceVersion: "192890"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 8ad918e2-2aa7-11ea-a1f6-000d3aa4b2a3
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-30T06:39:22Z"
    message: Cluster not available for 4.3.0-0.nightly-2019-12-29-214527
    status: "False"
    type: Available
  - lastTransitionTime: "2019-12-30T06:41:41Z"
    message: Working towards 4.3.0-0.nightly-2019-12-29-214527
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-12-30T06:39:22Z"
    message: 'Unable to apply 4.3.0-0.nightly-2019-12-29-214527: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-6c22c0d3a20a20eb2551cae5b958c120
      expected 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046,
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-30T01:57:01Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'pool master has not progressed to latest configuration: controller
      version mismatch for rendered-master-6c22c0d3a20a20eb2551cae5b958c120 expected
      23a6e6fb37e73501bc3216183ef5e6ebb15efc7a has d780d197a9c5848ba786982c0c4aaa7487297046,
      retrying'
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.2.12
$ oc -n openshift-machine-config-operator get MachineConfigPools
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master   rendered-master-6c22c0d3a20a20eb2551cae5b958c120   False     True       False      3              0                   0                     0
worker   rendered-worker-220ff630d5173233329ea21c9fad6cc6   False     True       False      4              0                   0                     0
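
To cross-check the controller version mismatch reported in the Degraded condition above, one can compare the hashes in the error message against the annotation carried by the rendered config the pool is still targeting. A rough sketch (the .spec.configuration.name field and the generated-by-controller-version annotation key are assumptions from memory, not taken from this report):

$ # which rendered config is the master pool targeting?
$ oc get machineconfigpool master -o jsonpath='{.spec.configuration.name}{"\n"}'
$ # which controller version generated it? (should be the "has" hash from the Degraded message)
$ oc get machineconfig rendered-master-6c22c0d3a20a20eb2551cae5b958c120 -o yaml | grep generated-by-controller-version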

Comment 1 Zhang Cheng 2019-12-30 08:49:31 UTC
Updating version to 4.3

Comment 2 Zhang Cheng 2019-12-30 08:51:48 UTC
Adding "Regression" keyword

Comment 4 Antonio Murdaca 2020-01-02 17:36:57 UTC
It looks like something else happened on some nodes; can you check this out?


2019-12-30T06:49:20.642514Z I1230 06:49:20.642461       1 node_controller.go:433] Pool master: node upgrade-chuo-8xqk6-master-2 is now reporting unready: node upgrade-chuo-8xqk6-master-2 is reporting OutOfDisk=Unknown
2019-12-30T06:49:21.1309067Z I1230 06:49:21.130316       1 node_controller.go:433] Pool worker: node upgrade-chuo-8xqk6-worker-centralus2-qmm6f is now reporting unready: node upgrade-chuo-8xqk6-worker-centralus2-qmm6f is reporting OutOfDisk=Unknown
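
(For reference, the node_controller messages above come from the machine-config-controller pod; assuming the controller still runs as the machine-config-controller deployment, something like this pulls them out of a live cluster:)

$ # machine-config-controller log (node_controller / render_controller messages)
$ oc -n openshift-machine-config-operator logs deployment/machine-config-controller | grep node_controller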

Comment 5 Antonio Murdaca 2020-01-02 17:39:10 UTC
It also appears that the MCC switched to the correct rendered MC for the upgrade at some point:

2019-12-30T06:57:06.6632866Z I1230 06:57:06.662336       1 render_controller.go:516] Pool worker: now targeting: rendered-worker-bf76917619cbd12243da5f3912aa7300
2019-12-30T06:57:06.6632866Z I1230 06:57:06.662640       1 render_controller.go:516] Pool master: now targeting: rendered-master-3b3f60c81f1f6a0a10b731dca180a89e

Any chance the cluster is still around to check if it progressed?

Comment 6 Antonio Murdaca 2020-01-02 17:57:13 UTC
The must-gather shows that a master node is being updated to what the MCO expects, so this could have been the result of a transient issue in rolling out the upgrade. It would be helpful to check back on the cluster.

Comment 7 Cuiping HUO 2020-01-03 01:10:17 UTC
The cluster has been purged. I will launch another env to do the upgrade and leave it for debugging.

Comment 8 Cuiping HUO 2020-01-03 08:05:55 UTC
Launched another Azure cluster and upgraded it from 4.2.12 to 4.3.0-0.nightly-2020-01-03-034242. The upgrade succeeded, so the failure does not seem to happen every time.

Steps to Reproduce:
Initial:

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
upgrade-chuo-nhznh-master-0                  Ready    master   4h9m    v1.14.6+cebabbf4a
upgrade-chuo-nhznh-master-1                  Ready    master   4h10m   v1.14.6+cebabbf4a
upgrade-chuo-nhznh-master-2                  Ready    master   4h10m   v1.14.6+cebabbf4a
upgrade-chuo-nhznh-worker-centralus1-w5f87   Ready    worker   4h2m    v1.14.6+cebabbf4a
upgrade-chuo-nhznh-worker-centralus2-sg92l   Ready    worker   4h2m    v1.14.6+cebabbf4a
upgrade-chuo-nhznh-worker-centralus3-jbr4j   Ready    worker   3h5m    v1.14.6+cebabbf4a
upgrade-chuo-nhznh-worker-centralus3-xgjck   Ready    worker   3h10m   v1.14.6+cebabbf4a

$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.12                              True        False         False      3h54m
cloud-credential                           4.2.12                              True        False         False      4h9m
cluster-autoscaler                         4.2.12                              True        False         False      4h3m
console                                    4.2.12                              True        False         False      166m
dns                                        4.2.12                              True        False         False      4h8m
image-registry                             4.2.12                              True        False         False      3h59m
ingress                                    4.2.12                              True        False         False      4h
insights                                   4.2.12                              True        False         False      4h9m
kube-apiserver                             4.2.12                              True        False         False      4h6m
kube-controller-manager                    4.2.12                              True        False         False      4h6m
kube-scheduler                             4.2.12                              True        False         False      4h6m
machine-api                                4.2.12                              True        False         False      4h9m
machine-config                             4.2.12                              True        False         False      4h5m
marketplace                                4.2.12                              True        False         False      165m
monitoring                                 4.2.12                              True        False         False      168m
network                                    4.2.12                              True        False         False      4h8m
node-tuning                                4.2.12                              True        False         False      165m
openshift-apiserver                        4.2.12                              True        False         False      165m
openshift-controller-manager               4.2.12                              True        False         False      4h6m
openshift-samples                          4.2.12                              True        False         False      4h4m
operator-lifecycle-manager                 4.2.12                              True        False         False      4h8m
operator-lifecycle-manager-catalog         4.2.12                              True        False         False      4h8m
operator-lifecycle-manager-packageserver   4.2.12                              True        False         False      165m
service-ca                                 4.2.12                              True        False         False      4h9m
service-catalog-apiserver                  4.2.12                              True        False         False      25m
service-catalog-controller-manager         4.2.12                              True        False         False      25m
storage                                    4.2.12                              True        False         False      4h4m


Upgrade: 
oc adm upgrade --to=4.3.0-0.nightly-2020-01-03-034242 --force


After upgrade:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-01-03-034242   True        False         29m     Cluster version is 4.3.0-0.nightly-2020-01-03-034242
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h4m
cloud-credential                           4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h19m
cluster-autoscaler                         4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h13m
console                                    4.3.0-0.nightly-2020-01-03-034242   True        False         False      38m
dns                                        4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h18m
image-registry                             4.3.0-0.nightly-2020-01-03-034242   True        False         False      30m
ingress                                    4.3.0-0.nightly-2020-01-03-034242   True        False         False      35m
insights                                   4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h19m
kube-apiserver                             4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h17m
kube-controller-manager                    4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h16m
kube-scheduler                             4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h16m
machine-api                                4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h19m
machine-config                             4.3.0-0.nightly-2020-01-03-034242   True        False         False      30m
marketplace                                4.3.0-0.nightly-2020-01-03-034242   True        False         False      32m
monitoring                                 4.3.0-0.nightly-2020-01-03-034242   True        False         False      31m
network                                    4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h18m
node-tuning                                4.3.0-0.nightly-2020-01-03-034242   True        False         False      33m
openshift-apiserver                        4.3.0-0.nightly-2020-01-03-034242   True        False         False      30m
openshift-controller-manager               4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h17m
openshift-samples                          4.3.0-0.nightly-2020-01-03-034242   True        False         False      65m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h18m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h18m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-01-03-034242   True        False         False      33m
service-ca                                 4.3.0-0.nightly-2020-01-03-034242   True        False         False      5h19m
service-catalog-apiserver                  4.3.0-0.nightly-2020-01-03-034242   True        False         False      31m
service-catalog-controller-manager         4.3.0-0.nightly-2020-01-03-034242   True        False         False      61m
storage                                    4.3.0-0.nightly-2020-01-03-034242   True        False         False      65m

$ oc get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
upgrade-chuo-nhznh-master-0                  Ready    master   5h49m   v1.16.2
upgrade-chuo-nhznh-master-1                  Ready    master   5h49m   v1.16.2
upgrade-chuo-nhznh-master-2                  Ready    master   5h49m   v1.16.2
upgrade-chuo-nhznh-worker-centralus1-w5f87   Ready    worker   5h41m   v1.16.2
upgrade-chuo-nhznh-worker-centralus2-sg92l   Ready    worker   5h41m   v1.16.2
upgrade-chuo-nhznh-worker-centralus3-jbr4j   Ready    worker   4h44m   v1.16.2
upgrade-chuo-nhznh-worker-centralus3-xgjck   Ready    worker   4h49m   v1.16.2

Comment 9 Antonio Murdaca 2020-01-03 13:24:26 UTC
(In reply to Cuiping HUO from comment #8)
> Launched another Azure cluster and upgraded it from 4.2.12 to
> 4.3.0-0.nightly-2020-01-03-034242. The upgrade succeeded, so the failure
> does not seem to happen every time.
> 

I think what you experienced with the first cluster was the result of some temporary error, and from what I understood of the must-gather data it would likely have reconciled itself.

Comment 11 Vadim Rutkovsky 2020-01-06 14:44:57 UTC
This may also happen on masters, which would take the cluster down: the kubelet starts with new parameters, but the OS image has not been updated.

$ oc get machineconfigs rendered-master-91d1f2b8144d9a25c4cc750316fee063 -o yaml
...
osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07204299cf05ca2f09d0ecc444258e35cd229b7618534468a0b10d9ec237b68e

That's the original 4.2 image; the workers, however, have been updated to 4.3:

$ oc get machineconfigs rendered-worker-c0e9a9562b37b4ce573cffed373cea13 -o yaml
...
osImageURL: registry.svc.ci.openshift.org/ocp/4.3-2020-01-05-154126@sha256:5f41036e3db3b6596043251e7d6407924d46206c24ebe4e890bb9156092962ef

^ that's the 4.3 nightly OS image
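
A quick way to see this skew across all rendered configs, and against the configmap the controller renders the value from (the configmap name machine-config-osimageurl is an assumption from memory, not from this report):

$ # osImageURL per rendered MachineConfig -- master and worker should agree after the upgrade
$ oc get machineconfigs -o custom-columns=NAME:.metadata.name,OSIMAGEURL:.spec.osImageURL
$ # value the MCC is supposed to render from
$ oc -n openshift-machine-config-operator get configmap machine-config-osimageurl -o yaml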

Comment 13 Colin Walters 2020-01-06 15:23:48 UTC
This is a variant of the etcd issue https://bugzilla.redhat.com/show_bug.cgi?id=1761557, except it affects the kubelet instead of etcd.

I don't think we have a reliable way to fix this until we land https://github.com/openshift/machine-config-operator/issues/1190

Comment 14 Vadim Rutkovsky 2020-01-06 17:51:26 UTC
It appears to happen when the MCC is being updated: the previous pod generates an invalid master config, which the MCDs may pick up before the proper config is rendered.

Comment 17 Colin Walters 2020-01-07 14:18:40 UTC
> Is this a regression in some way? Seems important, but not something that we must block the 4.3.0 release on. right?

This needs a deeper root cause analysis.  It seems possible that the fundamental MCO issue exists in 4.2 and below, but we're not changing the kubelet config incompatibly between 4.1 and 4.2 or so, so nothing actively breaks?

Given that we're actively hitting this in 4.2 to 4.3 upgrades...well, we may not need to block the 4.3 release, but we probably shouldn't add an edge from 4.2 until we fix this.

Antonio said in Slack:

btw, it looks like definitely a race
this is the new MC rendered w/o the 4.3 image rendered-master-cff37a23cf3c5f33c388505a92d401c8            23a6e6fb37e73501bc3216183ef5e6ebb15efc7a   2.2.0             64m
and minutes after that, a correct one is generated
rendered-master-e90c7a6b509dd1f4858009a3dca4c244            23a6e6fb37e73501bc3216183ef5e6ebb15efc7a   2.2.0             51m
Antonio Murdaca  23 hours ago
but the pool is now stuck

We landed a lot of code in the MCO to try to guard against a race like this I thought, but there may be a gap.
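
(The Slack snippet above is essentially the default output of listing rendered configs; sorting them by creation time makes the stale-then-correct sequence easy to spot, e.g.:)

$ # rendered master configs, oldest first -- the stale one (without the 4.3 image) shows up before the correct one
$ oc get machineconfigs --sort-by=.metadata.creationTimestamp | grep rendered-master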

Comment 18 Antonio Murdaca 2020-01-07 14:27:52 UTC
(In reply to Colin Walters from comment #17)
> > Is this a regression in some way? Seems important, but not something that we must block the 4.3.0 release on. right?
> 
> This needs a deeper root cause analysis.  It seems possible that the
> fundamental MCO issue exists in 4.2 and below, but we're not changing the
> kubelet config incompatibly between 4.1 and 4.2 or so, so nothing actively
> breaks?
> 
> Given that we're actively hitting this in 4.2 to 4.3 upgrades...well, we may
> not need to block the 4.3 release, but we probably shouldn't add an edge
> from 4.2 until we fix this.
> 
> Antonio said in Slack:
> 
> btw, it looks like definitely a race
> this is the new MC rendered w/o the 4.3 image
> rendered-master-cff37a23cf3c5f33c388505a92d401c8           
> 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a   2.2.0             64m
> and minutes after that, a correct one is generated
> rendered-master-e90c7a6b509dd1f4858009a3dca4c244           
> 23a6e6fb37e73501bc3216183ef5e6ebb15efc7a   2.2.0             51m
> Antonio Murdaca  23 hours ago
> but the pool is now stuck
> 
> We landed a lot of code in the MCO to try to guard against a race like this
> I thought, but there may be a gap.

We hit a similar race during the 4.1->4.2 upgrade, but that was properly fixed by versioning the image.json configmap. This time, somehow, the MCC goes ahead and generates a rendered configuration using the old osImageURL (even though the configmap holding that value appears to point to 4.3 once it's computed).

I'm actively investigating this with Vadim as well, who seems to be able to reproduce it (although no luck today yet).

Comment 21 Colin Walters 2020-01-07 15:46:22 UTC
OK, cool, I'll let you guys chase it down. Offhand though, maybe we need versioning in the osimageurl configmap too?

Comment 22 Antonio Murdaca 2020-01-07 15:51:18 UTC
(In reply to Colin Walters from comment #21)
> OK cool will let you guys chase it down.  Offhand though, maybe we need
> versioning in the osimageurl configmap too?

Yes, I've prototyped it already, but I want to check what's going on in a broken cluster first, without shipping something that isn't needed.
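
(Purely as an illustration of the idea being discussed here, not of the prototype or of the fix that eventually shipped: stamping the configmap with the release version it was rendered for would let the controller detect and skip a stale value. All field names and values below are hypothetical.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-config-osimageurl
  namespace: openshift-machine-config-operator
data:
  # hypothetical: the release this osImageURL belongs to; the MCC would refuse to
  # render a new config while this does not match the release version being applied
  releaseVersion: 4.3.0-0.nightly-2019-12-29-214527
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<digest>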

Comment 36 Vadim Rutkovsky 2020-01-09 12:37:13 UTC
This seems to be a CVO bug. Sometimes it only syncs the metrics service:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/1069/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-654cbcccd-hfftq_cluster-version-operator.log

The osimageurl configmap is thus not synced; it still refers to the 4.2 payload and the upgrade breaks.

In some cases CVO does find all manifests:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/1068/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-654cbcccd-8vmmp_cluster-version-operator.log

This install completes correctly.


Not sure what triggers this; it could be that the MCO declares the metrics service as "0001_00" (https://github.com/openshift/machine-config-operator/blob/04cd2198cae247fabcd3154669618d74f124f27f/install/0001_00_machine-config-operator_00_service.yaml), while in the resulting payload this manifest is stored as "0000_50" and the rest of the MCO manifests are "0000_80".

Keeping this assigned to MCO for now
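
(One way to confirm how the manifest is actually ordered in a given payload is to extract the release image locally and look at the generated file names; a sketch, with the payload pullspec left as a placeholder:)

$ oc adm release extract --to=/tmp/4.3-manifests <4.3-release-payload-pullspec>
$ ls /tmp/4.3-manifests | grep machine-config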

Comment 40 Antonio Murdaca 2020-01-14 21:20:02 UTC
*** Bug 1791061 has been marked as a duplicate of this bug. ***

Comment 41 Cuiping HUO 2020-01-15 03:15:30 UTC
Verified.
A total of 15 environments were upgraded from 4.2.0-0.nightly-2020-01-13-060909 to 4.3.0-0.nightly-2020-01-14-000626 successfully.

Comment 43 errata-xmlrpc 2020-01-23 11:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

