Bug 1737678

Summary: machine-config operator cannot be upgraded from 4.1.9 to 4.2
Product: OpenShift Container Platform
Reporter: Hongan Li <hongli>
Component: Machine Config Operator
Assignee: Stephen Greene <sgreene>
Status: CLOSED DUPLICATE
QA Contact: Micah Abbott <miabbott>
Severity: high
Priority: high
Version: 4.2.0
CC: kgarriso, kincljc, mifiedle, schoudha, wsun
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2019-08-20 20:10:49 UTC
Type: Bug

Description Hongan Li 2019-08-06 03:19:44 UTC
Description of problem:
When upgrading a cluster from 4.1.9 to 4.2.0-0.nightly-2019-08-01-113533, every operator upgrades except the machine-config operator.


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-01-113533

How reproducible:
50%

Steps to Reproduce:
1. Install an OCP 4.1.9 cluster.
2. Trigger an upgrade to 4.2.0-0.nightly-2019-08-01-113533 (a command sketch follows below).
3. Watch the cluster operators until the upgrade settles.
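A minimal sketch of steps 2-3, assuming the nightly is supplied as an explicit release-image pullspec (the exact pullspec is not recorded in this report):

$ oc adm upgrade --to-image=<4.2 nightly release image pullspec> --force
# Watch the operators and machine config pools settle on the new version.
$ watch oc get clusteroperators
$ oc get machineconfigpools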

Actual results:
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-08-01-113533   True        False         False      22h
cloud-credential                           4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
cluster-autoscaler                         4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
console                                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
dns                                        4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
image-registry                             4.2.0-0.nightly-2019-08-01-113533   True        False         False      7h19m
ingress                                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
kube-apiserver                             4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
kube-controller-manager                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
kube-scheduler                             4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
machine-api                                4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
machine-config                             4.1.9                               False       True          True       7h50m
marketplace                                4.2.0-0.nightly-2019-08-01-113533   True        False         False      8h
monitoring                                 4.2.0-0.nightly-2019-08-01-113533   False       True          True       7h4m
network                                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
node-tuning                                4.2.0-0.nightly-2019-08-01-113533   True        False         False      8h
openshift-apiserver                        4.2.0-0.nightly-2019-08-01-113533   True        False         False      92m
openshift-controller-manager               4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
openshift-samples                          4.2.0-0.nightly-2019-08-01-113533   True        False         False      18h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-08-01-113533   True        False         False      4h7m
service-ca                                 4.2.0-0.nightly-2019-08-01-113533   True        False         False      23h
service-catalog-apiserver                  4.2.0-0.nightly-2019-08-01-113533   True        False         False      4h6m
service-catalog-controller-manager         4.2.0-0.nightly-2019-08-01-113533   True        False         False      17h
storage                                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      18h
support                                    4.2.0-0.nightly-2019-08-01-113533   True        False         False      18h

$ oc get co/machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-05T03:47:57Z"
  generation: 1
  name: machine-config
  resourceVersion: "1094784"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: d04f2a1b-b733-11e9-ad70-02e77de128dc
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-05T19:05:28Z"
    message: Cluster not available for 4.2.0-0.nightly-2019-08-01-113533
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-05T18:42:37Z"
    message: Working towards 4.2.0-0.nightly-2019-08-01-113533
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-08-05T19:05:28Z"
    message: 'Unable to apply 4.2.0-0.nightly-2019-08-01-113533: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-ce8adbbe7a871e63d2f9fe30bf489c6f
      expected 6e75b3fe9bb02eeef9756d8b6ff1a85e790944e3 has 83392b13a5c17e56656acf3f7b0031e3303fb5c0,
      retrying'
    reason: FailedToSync
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-08-05T19:05:28Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    lastSyncError: 'pool master has not progressed to latest configuration: controller
      version mismatch for rendered-master-ce8adbbe7a871e63d2f9fe30bf489c6f expected
      6e75b3fe9bb02eeef9756d8b6ff1a85e790944e3 has 83392b13a5c17e56656acf3f7b0031e3303fb5c0,
      retrying'
    worker: all 3 nodes are at latest configuration rendered-worker-5f6dd4e5c2ad1322fbf6120f4d0916d7
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: cluster
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.1.9
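For context (not part of the original status output): the two hashes in the Degraded message are machine-config-controller versions; the MCO compares the controller commit it expects against the one recorded on the rendered master config. Assuming the standard annotation name, that recorded value can be read directly:

$ oc get machineconfig rendered-master-ce8adbbe7a871e63d2f9fe30bf489c6f \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}{"\n"}'
$ oc get machineconfigpool master -o yaml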


Expected results:
The machine-config operator can be upgraded from 4.1 to 4.2.

Additional info:
Tried to upgrade another cluster along the same upgrade path and it succeeded.

Comment 2 Jason Kincl 2019-08-07 19:07:39 UTC
This looks similar to the error in BZ#1734531

Comment 3 Antonio Murdaca 2019-08-08 07:29:30 UTC
Does this reconcile eventually? That message says the MCC hasn't generated the newest rendered machineconfigs for the new version, which is OK, as the MCC may not have run yet.
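A quick way to check whether the new machine-config-controller has rolled out, assuming the standard deployment name in the openshift-machine-config-operator namespace:

$ oc -n openshift-machine-config-operator get pods
$ oc -n openshift-machine-config-operator logs deployment/machine-config-controller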

Comment 4 Hongan Li 2019-08-08 08:32:09 UTC
No, it was stuck in that status for more than one day, until the cluster was destroyed.

This issue is not 100% reproducible; another cluster was upgraded successfully.

Comment 5 Mike Fiedler 2019-08-08 11:47:17 UTC
Two other QE engineers (including myself) reproduced this yesterday. Let me know what additional info is required and I can attempt it again.

Comment 6 Antonio Murdaca 2019-08-19 14:17:01 UTC
This could mean the new MCC hasn't rolled out yet. We need a must-gather to check the system logs as well.
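For reference, the standard collection command (the destination directory here is just an example):

$ oc adm must-gather --dest-dir=./must-gather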

Comment 7 Kirsten Garrison 2019-08-20 20:04:25 UTC
@sgreene can you confirm that you are seeing etcd-quorum-guard issues that might be impacting this upgrade? If so, you can mark this as a dupe of bug 1742744.

Comment 8 Stephen Greene 2019-08-20 20:10:23 UTC
Yep, I can confirm that this is an etcd-quorum-guard issue. It is visible in must-gather/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-z8l6v/machine-config-daemon/machine-config-daemon/logs/current.log

Several thousand lines of the following error:

2019-08-06T03:14:57.218366261Z I0806 03:14:57.218315  123397 update.go:89] error when evicting pod "etcd-quorum-guard-8646778784-phtjq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
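That loop is the machine-config daemon draining a master node while the etcd-quorum-guard PodDisruptionBudget cannot tolerate losing another pod. Assuming the 4.1/4.2 layout, where etcd-quorum-guard is managed in the openshift-machine-config-operator namespace, the blocking PDB and its pods can be inspected with:

$ oc -n openshift-machine-config-operator get poddisruptionbudget etcd-quorum-guard
$ oc -n openshift-machine-config-operator get pods -o wide | grep etcd-quorum-guard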


Marking as dupe.

Comment 9 Stephen Greene 2019-08-20 20:10:49 UTC

*** This bug has been marked as a duplicate of bug 1742744 ***