1677198 – upgraded cluster throws "the cluster operator machine-config is failing" error

Bug 1677198 - upgraded cluster throws "the cluster operator machine-config is failing" error

Summary: upgraded cluster throws "the cluster operator machine-config is failing" error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Antonio Murdaca
QA Contact:	Micah Abbott
Docs Contact:
URL:
Whiteboard:	buildcop
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-02-14 09:18 UTC by liujia
Modified:	2019-09-23 02:35 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-09-20 14:14:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Must-gather info (831.58 KB, application/gzip) 2019-03-26 08:26 UTC, zhou ying	no flags	Details
rendered-worker-ec1202835a931d3cf83b34760ee45095.yaml (29.64 KB, text/plain) 2019-03-26 08:51 UTC, Johnny Liu	no flags	Details
rendered-worker-fb0bade95cda29515460a5dddf46bce6.yaml (29.77 KB, text/plain) 2019-03-26 08:51 UTC, Johnny Liu	no flags	Details
rendered-master-13131f3a8f1d80a10d2149723b4bed3f.yaml (40.79 KB, text/plain) 2019-03-26 08:54 UTC, Johnny Liu	no flags	Details
rendered-master-8c108b7752cb2545da64b96e15241d8c.yaml (40.66 KB, text/plain) 2019-03-26 08:55 UTC, Johnny Liu	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:44:07 UTC

Description liujia 2019-02-14 09:18:23 UTC

Description of problem:
Do manually upgrade with a specified update payload, upgrade failed and stunk at the stage of the cluster operator machine-config update.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-13-204401   True        True          90m     Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing

# oc adm upgrade
error: Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing:

  Reason: ClusterOperatorFailing
  Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)


# oc get clusterversion -o json|jq ".items[0].status"
{
  "availableUpdates": null,
  "conditions": [
    {
      "lastTransitionTime": "2019-02-13T10:02:55Z",
      "message": "Done applying 4.0.0-0.nightly-2019-02-12-150919",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-02-14T07:03:30Z",
      "message": "Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-02-14T06:58:40Z",
      "message": "Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-02-14T06:59:30Z",
      "status": "True",
      "type": "RetrievedUpdates"
    }
  ],
  "desired": {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
    "version": "4.0.0-0.nightly-2019-02-13-204401"
  },
  "history": [
    {
      "completionTime": null,
      "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
      "startedTime": "2019-02-14T06:59:30Z",
      "state": "Partial",
      "version": "4.0.0-0.nightly-2019-02-13-204401"
    },
    {
      "completionTime": "2019-02-14T06:59:30Z",
      "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
      "startedTime": "2019-02-14T06:58:40Z",
      "state": "Partial",
      "version": "4.0.0-0.nightly-2019-02-13-204401"
    },
    {
      "completionTime": "2019-02-14T06:58:40Z",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:7bd57da7777e65f6cd4c8aa726b90ab00b6804ce97819cc83093bf9a1841e32b",
      "startedTime": "2019-02-13T09:34:06Z",
      "state": "Completed",
      "version": "4.0.0-0.nightly-2019-02-12-150919"
    }
  ],
  "observedGeneration": 3,
  "versionHash": "C6QROhGXMC8="
}

========================================
Some error logs from cvo pod.
# oc logs -f cluster-version-operator-67b8598cf8-m7th8|grep error
E0214 07:00:52.841603       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Get https://127.0.0.1:6443/apis/config.openshift.io/v1/clusteroperators/machine-config: dial tcp 127.0.0.1:6443: connect: connection refused
E0214 07:02:02.865496       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:03:15.878955       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:04:53.287868       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:06:03.296365       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:07:16.303944       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:09:18.579255       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:10:28.587613       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:11:41.595463       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:14:38.076932       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:15:48.084508       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0214 07:17:01.091884       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
I0214 07:17:01.092000       1 task_graph.go:518] Result of work: [Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)]
E0214 07:17:01.092021       1 sync_worker.go:263] unable to synchronize image (waiting 3m19.747206386s): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0214 07:22:00.358999       1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)


One master get Degraded during upgrade.
[root@preserve-jliu-worker 0213]# oc get clusteroperators machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: 2019-02-13T09:45:45Z
  generation: 1
  name: machine-config
  resourceVersion: "798380"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 22c5bb1c-2f74-11e9-a9af-02a295b947c2
spec: {}
status:
  conditions:
  - lastTransitionTime: 2019-02-13T09:46:23Z
    message: Cluster is available at 4.0.0-0.171.0.1-dirty
    status: "True"
    type: Available
  - lastTransitionTime: 2019-02-14T07:05:26Z
    message: Running resync for 4.0.0-0.171.0.1-dirty
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-02-14T07:15:30Z
    message: 'Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed
      out waiting for the condition during syncRequiredMachineConfigPools: error pool
      master is not ready. status: (total: 3, updated: 0, unavailable: 1)'
    reason: 'error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools:
      error pool master is not ready. status: (total: 3, updated: 0, unavailable:
      1)'
    status: "True"
    type: Failing
  extension:
    master: pool is degraded because of 1 nodes are reporting degraded status on update.
      Cannot proceed.
    worker: all 3 nodes are at latest configuration worker-368aaa977e43afad36e2103a38d1dd6d
  relatedObjects: null
  versions:
  - name: machineconfigcontroller
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c089d6750a907f773eb5f06fcf768f9ce9e33bd54920634420b42a9e31c97f6
  - name: machineconfigdaemon
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c1a503a17c86e3beb47d6006466bfb13788f6a264e44ca08f8a0ce8934c5dd0b
  - name: machineconfigserver
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6c850fba1c46122712ee8d6e5d8c0ba90a6ee020bd488323cde337a4bc60d9d3
  - name: operator
    version: 4.0.0-0.171.0.1-dirty
[root@preserve-jliu-worker 0213]# machineconfigpool master -o yaml
bash: machineconfigpool: command not found
[root@preserve-jliu-worker 0213]# oc get machineconfigpool master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-13T09:45:45Z
  generation: 1
  labels:
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
  name: master
  resourceVersion: "769155"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  uid: 22c991ba-2f74-11e9-a9af-02a295b947c2
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  machineSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  maxUnavailable: null
  paused: false
status:
  conditions:
  - lastTransitionTime: 2019-02-14T07:05:25Z
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: 2019-02-14T07:05:25Z
    message: ""
    reason: All nodes are updating to master-aa5545a458765bfde2ba66c68a13ad3c
    status: "True"
    type: Updating
  - lastTransitionTime: 2019-02-14T07:05:30Z
    message: ""
    reason: 1 nodes are reporting degraded status on update. Cannot proceed.
    status: "True"
    type: Degraded
  configuration:
    name: master-aa5545a458765bfde2ba66c68a13ad3c
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
  machineCount: 3
  observedGeneration: 1
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
[root@preserve-jliu-worker 0213]# oc get nodes -l node-role.kubernetes.io/master=
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-31-39.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37
ip-10-0-47-24.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37
ip-10-0-6-220.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37

[root@preserve-jliu-worker 0213]# oc get nodes -o yaml | grep -e name: -e machineconfiguration
...
      machineconfiguration.openshift.io/currentConfig: master-a7f013187d3f2d0f01781946ed61fe1b
      machineconfiguration.openshift.io/desiredConfig: master-a7f013187d3f2d0f01781946ed61fe1b
      machineconfiguration.openshift.io/ssh: accessed
      machineconfiguration.openshift.io/state: Degraded
      kubernetes.io/hostname: ip-10-0-6-220
    name: ip-10-0-6-220.us-east-2.compute.internal


Version-Release number of the following components:
sh-4.2# cluster-version-operator version
ClusterVersionOperator v4.0.0-0.171.0.0-dirty

How reproducible:
always

Steps to Reproduce:
1. Install ocp with 4.0.0-0.nightly-2019-02-12-150919 succeed.
2. Run upgrade manually with a specified release image
# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401
3.

Actual results:
upgrade failed.

Expected results:
upgrade succeed.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 W. Trevor King 2019-02-15 08:38:19 UTC

> # oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401

I've changed the component from Installer to Upgrade, since the issue is this post-install step.

Comment 2 liujia 2019-03-11 07:39:08 UTC

Hit the issue again when do upgrade manually from 4.0.0-0.nightly-2019-03-04-234414 to 4.0.0-0.nightly-2019-03-06-074438.

The same steps with those in description.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        True          44m     Unable to apply 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing

# oc get clusteroperators machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config             False       False         True      25m

# oc adm upgrade
error: Unable to apply 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing:

  Reason: ClusterOperatorFailing
  Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

# oc describe clusteroperators machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-11T02:54:59Z
  Generation:          1
  Resource Version:    300749
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 0f2b1a8b-43a9-11e9-899e-0626753ec480
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Cluster not available for 4.0.16-1-dirty
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Cluster version is 4.0.16-1-dirty
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    Worker:  pool is degraded because of 4 nodes are reporting degraded status on update. Cannot proceed.
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     machineconfigcontroller
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd23c28661f8e4b885bb52c485eb36f6a844e0e1d43fd19a30d602aed56f237d
    Name:     machineconfigdaemon
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6527c163c19a84c5df242732ead867e2ac6fdf3b9a428fa4a26bc0be42a17c4d
    Name:     machineconfigserver
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c72b03fd8dba3477ca3e6cbb962047e5d7572603dc9bdab41f17ef5c02877a
    Name:     operator
    Version:  4.0.16-1-dirty
Events:       <none>

Please refer to detail cvo log in attachment.

Comment 5 Johnny Liu 2019-03-13 08:42:28 UTC

Today I hit the issue too when do upgrade from quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 to registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438.

Steps:
1. install beta2 released cluster as a base cluster

2. change cincinnati server to my dummy build graph.
# oc get clusterversion -o yaml
<--snip-->
  spec:
    channel: stable-4.0
    clusterID: b98615d8-224c-4737-8934-c822e3dcbb58
    upstream: http://3.86.146.114/cincinnati_build_graph
<--snip-->
# curl http://3.86.146.114/cincinnati_build_graph
{
  "nodes": [
    {
      "version": "4.0.0-0.7",
      "payload": "quay.io/openshift-release-dev/ocp-release:4.0.0-0.7",
      "metadata": {"description": "Beta 2"}
    },
    {
      "version": "4.0.0-0.nightly-2019-03-06-074438",
      "payload": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438"
    }
  ],
  "edges": [
    [
      0, 
      1
    ]
  ]
}

3. Trigger upgrade
# oc adm upgrade --to=4.0.0-0.nightly-2019-03-06-074438
Updating to 4.0.0-0.nightly-2019-03-06-074438

4. Upgrade is completed.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         6m58s   Cluster version is 4.0.0-0.nightly-2019-03-06-074438

5. After 12 hours later, check cluster again.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         23h     Error while reconciling 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing
# oc get clusteroperators machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config             False       False         True      22h

# oc describe clusteroperators machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    940930
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 fcbe9c26-4496-11e9-a8a9-064890e55fee
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Cluster not available for 4.0.16-1-dirty
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Cluster version is 4.0.16-1-dirty
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    Worker:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     machineconfigcontroller
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd23c28661f8e4b885bb52c485eb36f6a844e0e1d43fd19a30d602aed56f237d
    Name:     machineconfigdaemon
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6527c163c19a84c5df242732ead867e2ac6fdf3b9a428fa4a26bc0be42a17c4d
    Name:     machineconfigserver
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c72b03fd8dba3477ca3e6cbb962047e5d7572603dc9bdab41f17ef5c02877a
    Name:     operator
    Version:  4.0.16-1-dirty
Events:       <none>

# oc get machineconfigpool
NAME     CONFIG                                    UPDATED   UPDATING   DEGRADED
master   master-3b03dd678e723c6304a35ca76f8194fd   False     True       True
worker   worker-6d9525da78e51f902c8074b02e95061a   True      False      True

# oc describe machineconfigpool master
Name:         master
Namespace:    
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    94213
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 fcc1e5c8-4496-11e9-a8a9-064890e55fee
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  master
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Master:  
  Max Unavailable:                             <nil>
  Paused:                                      false
Status:
  Conditions:
    Last Transition Time:  2019-03-12T09:27:45Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-03-12T09:27:45Z
    Message:               
    Reason:                All nodes are updating to master-3b03dd678e723c6304a35ca76f8194fd
    Status:                True
    Type:                  Updating
    Last Transition Time:  2019-03-12T09:27:50Z
    Message:               
    Reason:                1 nodes are reporting degraded status on update. Cannot proceed.
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  master-3b03dd678e723c6304a35ca76f8194fd
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-master
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-master-ssh
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-master-fcc1e5c8-4496-11e9-a8a9-064890e55fee-registries
  Machine Count:              3
  Observed Generation:        1
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>


# oc describe machineconfigpool worker
Name:         worker
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    40789
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 fcc305dc-4496-11e9-a8a9-064890e55fee
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  worker
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Worker:  
  Max Unavailable:                             <nil>
  Paused:                                      false
Status:
  Conditions:
    Last Transition Time:  2019-03-12T07:24:15Z
    Message:               
    Reason:                All nodes are updated with worker-6d9525da78e51f902c8074b02e95061a
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-03-12T07:24:15Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-03-12T08:03:52Z
    Message:               
    Reason:                1 nodes are reporting degraded status on update. Cannot proceed.
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  worker-6d9525da78e51f902c8074b02e95061a
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker-ssh
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
  Machine Count:              1
  Observed Generation:        1
  Ready Machine Count:        1
  Unavailable Machine Count:  0
  Updated Machine Count:      1
Events:                       <none>

# oc describe nodes|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: worker-6d9525da78e51f902c8074b02e95061a
                    machineconfiguration.openshift.io/desiredConfig: worker-6d9525da78e51f902c8074b02e95061a
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
                    machineconfiguration.openshift.io/desiredConfig: master-3b03dd678e723c6304a35ca76f8194fd
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
                    machineconfiguration.openshift.io/desiredConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
                    machineconfiguration.openshift.io/desiredConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
                    machineconfiguration.openshift.io/state: Degraded

# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER   IGNITIONVERSION   CREATED
00-master                                                   4.0.16-1-dirty          2.2.0             25h
00-master-ssh                                               4.0.16-1-dirty                            25h
00-worker                                                   4.0.16-1-dirty          2.2.0             25h
00-worker-ssh                                               4.0.16-1-dirty                            25h
01-master-container-runtime                                 4.0.16-1-dirty          2.2.0             25h
01-master-kubelet                                           4.0.16-1-dirty          2.2.0             25h
01-worker-container-runtime                                 4.0.16-1-dirty          2.2.0             25h
01-worker-kubelet                                           4.0.16-1-dirty          2.2.0             25h
99-master-fcc1e5c8-4496-11e9-a8a9-064890e55fee-registries   4.0.16-1-dirty                            25h
99-worker-fcc305dc-4496-11e9-a8a9-064890e55fee-registries   4.0.16-1-dirty                            25h
master-3b03dd678e723c6304a35ca76f8194fd                     4.0.16-1-dirty          2.2.0             23h
master-5268e4d86a7d26076a01ef2ff2ceec96                     4.0.15-1-dirty          2.2.0             25h
worker-6d9525da78e51f902c8074b02e95061a                     4.0.16-1-dirty          2.2.0             25h

# oc logs machine-config-daemon-rxftf -n openshift-machine-config-operator
I0312 09:27:44.004669   98001 start.go:52] Version: 4.0.16-1-dirty
I0312 09:27:44.005952   98001 start.go:88] starting node writer
I0312 09:27:44.015714   98001 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0312 09:27:44.122266   98001 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0312 09:27:44.122444   98001 daemon.go:247] Managing node: ip-10-0-139-185.us-east-2.compute.internal
I0312 09:27:44.145575   98001 start.go:146] Calling chroot("/rootfs")
I0312 09:27:44.145610   98001 run.go:22] Running captured: rpm-ostree status
I0312 09:27:44.171489   98001 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0312 09:27:44.171526   98001 daemon.go:477] In bootstrap mode
I0312 09:27:44.180165   98001 daemon.go:505] Current+desired config: worker-6d9525da78e51f902c8074b02e95061a
I0312 09:27:44.181376   98001 daemon.go:598] Node is degraded; going to sleep

# oc logs machine-config-daemon-s2hr5 -n openshift-machine-config-operator
I0312 09:27:49.193723    5699 start.go:52] Version: 4.0.16-1-dirty
I0312 09:27:49.194096    5699 start.go:88] starting node writer
I0312 09:27:49.200065    5699 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0312 09:27:49.276403    5699 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0312 09:27:49.276671    5699 daemon.go:247] Managing node: ip-10-0-164-78.us-east-2.compute.internal
I0312 09:27:49.300762    5699 start.go:146] Calling chroot("/rootfs")
I0312 09:27:49.300794    5699 run.go:22] Running captured: rpm-ostree status
I0312 09:27:49.326492    5699 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0312 09:27:49.326525    5699 daemon.go:477] In bootstrap mode
I0312 09:27:49.335480    5699 daemon.go:505] Current+desired config: master-5268e4d86a7d26076a01ef2ff2ceec96
I0312 09:27:49.336907    5699 daemon.go:598] Node is degraded; going to sleep

Comment 6 Wenjing Zheng 2019-03-13 09:05:42 UTC

I hit issue from 4.0.0-0.nightly-2019-03-04-234414 to 4.0.0-0.nightly-2019-03-05-045224.

Comment 7 Wei Sun 2019-03-13 09:15:03 UTC

Add testblocker keyword,since once we hit this issue,then the upgrade testing could not be moved forward.

Comment 8 Johnny Liu 2019-03-13 16:40:37 UTC

Maybe this bug is some side effect introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1688321, even before upgrade, base cluster probably already would get into `Degraded` state (have to wait for some time).

Comment 9 W. Trevor King 2019-03-13 16:58:52 UTC

> Maybe this bug is some side effect introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1688321, even before upgrade, base cluster probably already would get into `Degraded` state (have to wait for some time).

That sounds plausible to me.  Try with a RHCOS newer than 47.330 (see [1])?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1688321#c3

Comment 10 Colin Walters 2019-03-14 18:23:34 UTC

Yes, we need to update RHCOS.  Also I think it should be fairly easy to do a quick sanity check for this by mirroring the release payload to any registry that requires auth and injecting the auth required into the pull secret.

Comment 11 Johnny Liu 2019-03-15 05:47:23 UTC

And yesterday I used 400.7.20190306.0 as AMI for based cluster, then upgrade the cluster, did not reproduce such issue any more.

According to https://bugzilla.redhat.com/show_bug.cgi?id=1688321#c6, we would not support beta2 to beta3 upgrade, so this issue is out of our test plan. So I am okay with close it as 'WONTFIX'.

Comment 12 zhou ying 2019-03-25 07:17:11 UTC

Today I hit the same issue when do upgrade from 4.0.0-0.nightly-2019-03-22-191219 to registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709:


Steps:
1. install ocp env from Payload: registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-22-191219,  as a base cluster

2. change cincinnati server to OCP graph.
# oc get clusterversion -o yaml
<--snip-->
  spec:
    channel: stable-4.0
    clusterID: 0623d282-fdfb-4178-b7ac-1889f8213f49
    desiredUpdate:
      image: registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709
      version: 4.0.0-0.nightly-2019-03-23-183709
    upstream: https://openshift-release.svc.ci.openshift.org/graph
<--snip-->


3. Trigger upgrade
[root@dhcp-140-138 yamlfile]# oc adm upgrade  --to 4.0.0-0.nightly-2019-03-23-183709
Updating to 4.0.0-0.nightly-2019-03-23-183709


4. Upgrade is completed.
[root@dhcp-140-138 yamlfile]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-23-183709   True        True          21m     Unable to apply 4.0.0-0.nightly-2019-03-23-183709: the cluster operator machine-config is failing

[root@dhcp-140-138 yamlfile]# oc get clusteroperator machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config   4.0.0-0.nightly-2019-03-23-183709   False       False         True      53m

[root@dhcp-140-138 yamlfile]# oc describe clusteroperators machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-25T01:56:23Z
  Generation:          1
  Resource Version:    232307
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 312bc461-4ea1-11e9-8737-022231078d74
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-25T06:14:37Z
    Message:               Cluster not available for 4.0.0-0.nightly-2019-03-23-183709
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-03-25T06:14:32Z
    Message:               Cluster version is 4.0.0-0.nightly-2019-03-23-183709
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-25T06:14:37Z
    Message:               Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:  0 out of 3 nodes have updated to latest configuration rendered-master-6c540afa18c09b5ed95dff3629b55c7d
    Worker:  0 out of 2 nodes have updated to latest configuration rendered-worker-38aa0c99ae25717b214181710d572963
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.0.0-0.nightly-2019-03-23-183709
Events:       <none>


[root@dhcp-140-138 yamlfile]# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING
master   rendered-master-6c540afa18c09b5ed95dff3629b55c7d   False     True
worker   rendered-worker-38aa0c99ae25717b214181710d572963   False     True



[root@dhcp-140-138 yamlfile]# oc describe machineconfigpool master
Name:         master
Namespace:    
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-25T01:56:23Z
  Generation:          1
  Resource Version:    178001
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 3131f023-4ea1-11e9-8737-022231078d74
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  master
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Master:  
  Max Unavailable:                             <nil>
  Paused:                                      false
Status:
  Conditions:
    Last Transition Time:  2019-03-25T06:14:34Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-03-25T06:14:34Z
    Message:               
    Reason:                All nodes are updating to rendered-master-6c540afa18c09b5ed95dff3629b55c7d
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-6c540afa18c09b5ed95dff3629b55c7d
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-master
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-master-ssh
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-master-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-master-3131f023-4ea1-11e9-8737-022231078d74-registries
  Machine Count:              3
  Observed Generation:        1
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>
[root@dhcp-140-138 yamlfile]# oc describe machineconfigpool worker
Name:         worker
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-25T01:56:23Z
  Generation:          1
  Resource Version:    177952
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 3133fea7-4ea1-11e9-8737-022231078d74
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  worker
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Worker:  
  Max Unavailable:                             <nil>
  Paused:                                      false
Status:
  Conditions:
    Last Transition Time:  2019-03-25T06:14:34Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-03-25T06:14:34Z
    Message:               
    Reason:                All nodes are updating to rendered-worker-38aa0c99ae25717b214181710d572963
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-38aa0c99ae25717b214181710d572963
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker-ssh
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-3133fea7-4ea1-11e9-8737-022231078d74-registries
  Machine Count:              2
  Observed Generation:        1
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>


[root@dhcp-140-138 yamlfile]# oc describe nodes|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-38aa0c99ae25717b214181710d572963
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-6c540afa18c09b5ed95dff3629b55c7d
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
                    machineconfiguration.openshift.io/state: Done


[root@dhcp-140-138 yamlfile]# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER       IGNITIONVERSION   CREATED
00-master                                                   4.0.22-201903220117-dirty   2.2.0             5h15m
00-master-ssh                                               4.0.22-201903220117-dirty   2.2.0             5h15m
00-worker                                                   4.0.22-201903220117-dirty   2.2.0             5h15m
00-worker-ssh                                               4.0.22-201903220117-dirty   2.2.0             5h15m
01-master-container-runtime                                 4.0.22-201903220117-dirty   2.2.0             5h15m
01-master-kubelet                                           4.0.22-201903220117-dirty   2.2.0             5h15m
01-worker-container-runtime                                 4.0.22-201903220117-dirty   2.2.0             5h15m
01-worker-kubelet                                           4.0.22-201903220117-dirty   2.2.0             5h15m
99-master-3131f023-4ea1-11e9-8737-022231078d74-registries   4.0.22-201903220117-dirty   2.2.0             5h14m
99-worker-3133fea7-4ea1-11e9-8737-022231078d74-registries   4.0.22-201903220117-dirty   2.2.0             5h14m
rendered-master-257ba747991bcdc361d0ad6faac39a69            4.0.22-201903220117-dirty   2.2.0             5h15m
rendered-master-6c540afa18c09b5ed95dff3629b55c7d            4.0.22-201903220117-dirty   2.2.0             58m
rendered-worker-38aa0c99ae25717b214181710d572963            4.0.22-201903220117-dirty   2.2.0             58m
rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2            4.0.22-201903220117-dirty   2.2.0             5h15m


oc logs -f po/cluster-version-operator-685c68d958-k5c44 -n openshift-cluster-version |grep E0325
E0325 07:14:12.898269       1 task.go:58] error running apply for clusteroperator "machine-config" (105 of 310): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0325 07:15:22.898834       1 task.go:58] error running apply for clusteroperator "machine-config" (105 of 310): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)


[root@dhcp-140-138 yamlfile]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5d4864f75b-5bpjf   1/1     Running   0          5h19m
machine-config-daemon-2b6fk                  1/1     Running   0          5h17m
machine-config-daemon-498qp                  1/1     Running   0          5h17m
machine-config-daemon-7g9fv                  1/1     Running   0          5h17m
machine-config-daemon-fhzw5                  1/1     Running   0          5h12m
machine-config-daemon-sjd9k                  1/1     Running   0          5h12m
machine-config-operator-579574655f-kvt5x     1/1     Running   0          64m
machine-config-server-75sp8                  1/1     Running   0          5h17m
machine-config-server-pqgc7                  1/1     Running   0          5h17m
machine-config-server-qwv92                  1/1     Running   0          5h17m
[root@dhcp-140-138 yamlfile]# oc logs -f po/machine-config-operator-579574655f-kvt5x -n openshift-machine-config-operator |grep E0325
E0325 06:14:25.827782       1 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"2e11d21d-4ea1-11e9-8737-022231078d74", ResourceVersion:"177804", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689075778, loc:(*time.Location)(0x1d39d00)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-579574655f-kvt5x_f87f94f4-4ec4-11e9-9605-0a580a8200e4\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-03-25T06:14:25Z\",\"renewTime\":\"2019-03-25T06:14:25Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-579574655f-kvt5x_f87f94f4-4ec4-11e9-9605-0a580a8200e4 became leader'
E0325 06:15:57.585267       1 operator.go:249] error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0325 06:17:26.330500       1 operator.go:249] error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

Comment 13 liujia 2019-03-25 08:04:20 UTC

Add image info in case it's still related with ami version.
# oc logs pod/machine-config-daemon-2b6fk -n openshift-machine-config-operator|grep Version
I0325 01:58:53.884669   16503 start.go:54] Version: 4.0.22-201903220117-dirty
                   Version: 410.8.20190320.1 (2019-03-20T21:01:36Z)
                   Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)

# oc get clusterversion -o json|jq ".items[0].status"
{
  "availableUpdates": null,
  "conditions": [
    {
      "lastTransitionTime": "2019-03-25T02:07:00Z",
      "message": "Done applying 4.0.0-0.nightly-2019-03-22-191219",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-03-25T06:20:18Z",
      "message": "Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-03-25T06:04:00Z",
      "message": "Unable to apply 4.0.0-0.nightly-2019-03-23-183709: the cluster operator machine-config is failing",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-03-25T06:02:41Z",
      "status": "True",
      "type": "RetrievedUpdates"
    }
  ],
  "desired": {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709",
    "version": "4.0.0-0.nightly-2019-03-23-183709"
  },
  "history": [
    {
      "completionTime": null,
      "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709",
      "startedTime": "2019-03-25T06:04:00Z",
      "state": "Partial",
      "version": "4.0.0-0.nightly-2019-03-23-183709"
    },
    {
      "completionTime": "2019-03-25T06:04:00Z",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:97ec469af3deb6e5eba521f1188f165b74e5f6891e3e32dfd27c38aeb2bc17ad",
      "startedTime": "2019-03-25T01:51:29Z",
      "state": "Completed",
      "version": "4.0.0-0.nightly-2019-03-22-191219"
    }
  ],
  "observedGeneration": 3,
  "versionHash": "FYrLPpaVLdM="
}

Comment 14 zhou ying 2019-03-25 09:56:33 UTC

On today's test, we have met this frequently. So reopening it.

Comment 15 Colin Walters 2019-03-25 12:50:52 UTC

Whenever you see Degraded systems, look at the MCD logs:

`oc -n openshift-machine-config-operator logs pods/machine-config-daemon-xyzxyz`

Can you get some output from that?

Comment 16 Antonio Murdaca 2019-03-25 12:57:05 UTC

also, we've change the way that check (syncRequiredMachineConfigPool) works and it now instantly reports if it's not finished flipping to failing. So, my question is, does it eventually flip back to Failing: False if you leave the upgrade process running for some more time?

Maybe we shouldn't set Failing if we error on that but keeps retrying.

Comment 17 Antonio Murdaca 2019-03-25 17:39:29 UTC

It would also be nice if you provide informations by just using https://github.com/openshift/must-gather which auto-collects the data we need to debug.

Comment 18 zhou ying 2019-03-26 07:52:28 UTC

[root@dhcp-140-138 ~]# oc logs -f po/machine-config-daemon-94lcb  -n openshift-machine-config-operator |grep E0326
E0326 05:30:02.607247   96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=57, ErrCode=NO_ERROR, debug=""
E0326 05:30:02.607767   96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=57, ErrCode=NO_ERROR, debug=""
E0326 05:32:33.630012   96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""

Comment 19 zhou ying 2019-03-26 08:25:22 UTC

Today in my env:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-25-154140   True        True          5h39m   Unable to apply 4.0.0-0.nightly-2019-03-25-154140: the cluster operator machine-config has not yet successfully rolled out

Comment 20 zhou ying 2019-03-26 08:26:51 UTC

Created attachment 1547944 [details]
Must-gather info

Comment 22 Johnny Liu 2019-03-26 08:48:33 UTC

I also hit the same issue from 4.0.0-0.nightly-2019-03-23-222829 to 4.0.0-0.nightly-2019-03-25-154140 upgrade.

Before upgrade:
# oc describe node|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/state: Done


# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-23-222829   True        False         45m     Cluster version is 4.0.0-0.nightly-2019-03-23-222829

# oc get clusteroperator
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     46m
cluster-autoscaler                    4.0.0-0.nightly-2019-03-23-222829   True        False         False     84m
console                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     51m
dns                                   4.0.0-0.nightly-2019-03-23-222829   True        False         False     83m
image-registry                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     76m
ingress                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     56m
kube-apiserver                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     81m
kube-controller-manager               4.0.0-0.nightly-2019-03-23-222829   True        False         False     81m
kube-scheduler                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     83m
machine-api                           4.0.0-0.nightly-2019-03-23-222829   True        False         False     84m
machine-config                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     83m
marketplace                           4.0.0-0.nightly-2019-03-23-222829   True        False         False     79m
monitoring                            4.0.0-0.nightly-2019-03-23-222829   True        False         False     55m
network                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     84m
node-tuning                           4.0.0-0.nightly-2019-03-23-222829   True        False         False     79m
openshift-apiserver                   4.0.0-0.nightly-2019-03-23-222829   True        False         False     76m
openshift-cloud-credential-operator   4.0.0-0.nightly-2019-03-23-222829   True        False         False     84m
openshift-controller-manager          4.0.0-0.nightly-2019-03-23-222829   True        False         False     80m
openshift-samples                     4.0.0-0.nightly-2019-03-23-222829   True        False         False     75m
operator-lifecycle-manager            4.0.0-0.nightly-2019-03-23-222829   True        False         False     83m
service-ca                            4.0.0-0.nightly-2019-03-23-222829   True        False         False     83m
service-catalog-apiserver             4.0.0-0.nightly-2019-03-23-222829   True        False         False     79m
service-catalog-controller-manager    4.0.0-0.nightly-2019-03-23-222829   True        False         False     79m
storage                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     56m

Trigger upgrade:
# oc adm upgrade --to-latest
Updating to latest version 4.0.0-0.nightly-2019-03-25-154140
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-25-154140   True        True          7s      Working towards 4.0.0-0.nightly-2019-03-25-154140: downloading update



Check after *48 mins*:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-25-154140   True        True          48m     Unable to apply 4.0.0-0.nightly-2019-03-25-154140: the cluster operator machine-config has not yet successfully rolled out

# oc get clusteroperator
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     50m
cluster-autoscaler                    4.0.0-0.nightly-2019-03-23-222829   True        False         False     50m
console                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     107m
dns                                   4.0.0-0.nightly-2019-03-25-154140   True        False         False     140m
image-registry                        4.0.0-0.nightly-2019-03-23-222829   True        False         False     132m
ingress                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     112m
kube-apiserver                        4.0.0-0.nightly-2019-03-25-154140   True        False         False     43m
kube-controller-manager               4.0.0-0.nightly-2019-03-25-154140   True        False         False     40m
kube-scheduler                        4.0.0-0.nightly-2019-03-25-154140   True        False         False     43m
machine-api                           4.0.0-0.nightly-2019-03-25-154140   True        False         False     141m
machine-config                        4.0.0-0.nightly-2019-03-23-222829   False       True          True      36m
marketplace                           4.0.0-0.nightly-2019-03-23-222829   True        False         False     135m
monitoring                            4.0.0-0.nightly-2019-03-23-222829   True        False         False     111m
network                               4.0.0-0.nightly-2019-03-25-154140   True        False         False     140m
node-tuning                           4.0.0-0.nightly-2019-03-23-222829   True        False         False     136m
openshift-apiserver                   4.0.0-0.nightly-2019-03-23-222829   True        False         False     16m
openshift-cloud-credential-operator   4.0.0-0.nightly-2019-03-25-154140   True        False         False     141m
openshift-controller-manager          4.0.0-0.nightly-2019-03-23-222829   True        False         False     48m
openshift-samples                     4.0.0-0.nightly-2019-03-23-222829   True        False         False     132m
operator-lifecycle-manager            4.0.0-0.nightly-2019-03-23-222829   True        False         False     140m
service-ca                            4.0.0-0.nightly-2019-03-25-154140   True        False         False     46m
service-catalog-apiserver             4.0.0-0.nightly-2019-03-23-222829   True        False         False     136m
service-catalog-controller-manager    4.0.0-0.nightly-2019-03-23-222829   True        False         False     49m
storage                               4.0.0-0.nightly-2019-03-23-222829   True        False         False     113m

# oc describe node|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-fb0bade95cda29515460a5dddf46bce6
                    machineconfiguration.openshift.io/state: Unreconcilable
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-13131f3a8f1d80a10d2149723b4bed3f
                    machineconfiguration.openshift.io/state: Unreconcilable
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
                    machineconfiguration.openshift.io/state: Done

# oc logs machine-config-daemon-lt2l2 -n openshift-machine-config-operator
<--snip-->
I0326 08:41:55.762357   48492 run.go:22] Running captured: rpm-ostree status
I0326 08:41:55.874056   48492 daemon.go:738] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dd9128d2031047683071211d883c96f30db50d4d1eb85f153a22302ec47bb16
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190320.1 (2019-03-20T21:01:36Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
              CustomOrigin: Provisioned from oscontainer
                   Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)
I0326 08:41:55.874108   48492 daemon.go:673] Current config: rendered-worker-ec1202835a931d3cf83b34760ee45095
I0326 08:41:55.874121   48492 daemon.go:674] Desired config: rendered-worker-fb0bade95cda29515460a5dddf46bce6
I0326 08:41:55.885297   48492 daemon.go:792] Validated on-disk state
I0326 08:41:55.889954   48492 update.go:194] Checking reconcilable for config rendered-worker-ec1202835a931d3cf83b34760ee45095 to rendered-worker-fb0bade95cda29515460a5dddf46bce6
I0326 08:41:55.889972   48492 update.go:252] Checking if configs are reconcilable
I0326 08:41:55.891849   48492 update.go:715] can't reconcile config rendered-worker-ec1202835a931d3cf83b34760ee45095 with rendered-worker-fb0bade95cda29515460a5dddf46bce6: ignition links section contains changes
E0326 08:41:55.895754   48492 writer.go:97] Marking Unreconcilable due to: can't reconcile config rendered-worker-ec1202835a931d3cf83b34760ee45095 with rendered-worker-fb0bade95cda29515460a5dddf46bce6: ignition links section contains changes: unreconcilable

Comment 23 Johnny Liu 2019-03-26 08:51:18 UTC

Created attachment 1547958 [details]
rendered-worker-ec1202835a931d3cf83b34760ee45095.yaml

Comment 24 Johnny Liu 2019-03-26 08:51:43 UTC

Created attachment 1547960 [details]
rendered-worker-fb0bade95cda29515460a5dddf46bce6.yaml

Comment 25 Johnny Liu 2019-03-26 08:54:17 UTC

Follow up comment 22:
MCD on master also the same issue.

# oc logs machine-config-daemon-psm7q -n openshift-machine-config-operator
<--snip-->
I0326 08:50:41.416626   97441 run.go:22] Running captured: rpm-ostree status
I0326 08:50:41.452273   97441 daemon.go:738] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dd9128d2031047683071211d883c96f30db50d4d1eb85f153a22302ec47bb16
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190320.1 (2019-03-20T21:01:36Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
              CustomOrigin: Provisioned from oscontainer
                   Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)
I0326 08:50:41.452315   97441 daemon.go:673] Current config: rendered-master-8c108b7752cb2545da64b96e15241d8c
I0326 08:50:41.452337   97441 daemon.go:674] Desired config: rendered-master-13131f3a8f1d80a10d2149723b4bed3f
I0326 08:50:41.459478   97441 daemon.go:792] Validated on-disk state
I0326 08:50:41.459979   97441 update.go:194] Checking reconcilable for config rendered-master-8c108b7752cb2545da64b96e15241d8c to rendered-master-13131f3a8f1d80a10d2149723b4bed3f
I0326 08:50:41.460000   97441 update.go:252] Checking if configs are reconcilable
I0326 08:50:41.461579   97441 update.go:715] can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes
E0326 08:50:41.464958   97441 writer.go:97] Marking Unreconcilable due to: can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable
W0326 08:50:41.616341   97441 daemon.go:292] Booting the MCD errored with can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable
I0326 08:50:41.616375   97441 run.go:22] Running captured: rpm-ostree status

Comment 26 Johnny Liu 2019-03-26 08:54:45 UTC

Created attachment 1547962 [details]
rendered-master-13131f3a8f1d80a10d2149723b4bed3f.yaml

Comment 27 Johnny Liu 2019-03-26 08:55:54 UTC

Created attachment 1547963 [details]
rendered-master-8c108b7752cb2545da64b96e15241d8c.yaml

Comment 29 Wei Sun 2019-03-26 09:04:50 UTC

From this Monday,we frequent meet this bug,current QE's upgrade test is blocked by this bug.

Comment 30 Antonio Murdaca 2019-03-26 10:03:17 UTC

>  Booting the MCD errored with can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable


That error is pretty clear though, it looks like between those machineconfigs, the links section has been altered and we can't reconcile.

I'll try to understand what is touching the links section though...I can't connect to the cluster with your kubeconfig, it times out.

Comment 31 Antonio Murdaca 2019-03-26 10:07:07 UTC

Ok, the links section being added is the stopgap we introduced to support pulling the pause image when it's authenticated (ref: https://github.com/openshift/machine-config-operator/pull/535)

the new MachineConfig contains:

```
      links:
      - filesystem: root
        overwrite: false
        path: /root/.docker/config.json
        target: /var/lib/kubelet/config.json
```

and that's causing the issue.

Comment 33 Antonio Murdaca 2019-03-26 10:17:10 UTC

This has been fixed by https://github.com/openshift/machine-config-operator/pull/540 which reverted https://github.com/openshift/machine-config-operator/pull/535 which was adding a symlink which isn't supported for reconcile.

What's happening now is that:

1) you're starting a cluster with an MCO version which contains #535
2) you're upgrading to a payload which doesn't have #535 but have #540

The above means that 1) generates MachineConfigs with an unsupported symlink and when upgrading to 2), the symlink is removed causing drift and an unreconcilable error.


> I also hit the same issue from 4.0.0-0.nightly-2019-03-23-222829 to 4.0.0-0.nightly-2019-03-25-154140 upgrade.

This BZ should be fixed as long as you use a starting cluster from a payload that contains #540

Comment 34 Antonio Murdaca 2019-03-26 10:24:27 UTC

This PR that I've just made is also helping with transitioning this test scenarios from old payload that don't contain #540 https://github.com/openshift/machine-config-operator/pull/580

But I'd consider the BZ fixed as long as you use a newer payload.

Comment 36 liujia 2019-03-27 06:01:58 UTC

Based comment33, QE use a source node including #540 and have a test against latest avialble upgrade path(from 4.0.0-0.nightly-2019-03-25-180911 to 4.0.0-0.nightly-2019-03-26-072833).
Upgrade succeed.

[root@preserve-jliu-worker tmp]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        False         64m     Cluster version is 4.0.0-0.nightly-2019-03-26-072833
[root@preserve-jliu-worker tmp]# oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2019-03-27T04:47:41Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-26-072833",
    "startedTime": "2019-03-27T04:03:21Z",
    "state": "Completed",
    "version": "4.0.0-0.nightly-2019-03-26-072833"
  },
  {
    "completionTime": "2019-03-27T04:03:21Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:2d781cbe28722b6eeb3ff969c5dc68199198fd1f0514a3284eb7215ae0cb4d2f",
    "startedTime": "2019-03-27T02:59:28Z",
    "state": "Completed",
    "version": "4.0.0-0.nightly-2019-03-25-180911"
  }
]
# oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config   4.0.0-0.nightly-2019-03-26-072833   True        False         False     81m

So remove the blocker keywords.

For the comment34, since pr580 is still not avaialble in any green path.
So, keep the bug ON_QA status, will verify it after a new support upgrade path including the build ready for test.

Comment 37 liujia 2019-04-04 02:52:51 UTC

Continue the comment36, since there is not an avaliable update path in the last week(with a start node not including #540, a end node including #580). Currently, all nightly build not including #540 have been removed from [1]. And as for a beta3 release build, which will include #540. So according to the verify in comment36. Verify the bug and change status.
 

[1] https://openshift-release.svc.ci.openshift.org/

Comment 39 errata-xmlrpc 2019-06-04 10:44:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Comment 40 Hongkai Liu 2019-09-20 13:56:36 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7374#0:build-log.txt%3A2493

I saw a similar error msg in the CI:
Sep 20 11:35:35.264 E clusteroperator/machine-config changed Degraded to True: RequiredPoolsFailed: Failed to resync 4.2.0-0.nightly-2019-09-20-102942 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)

Not sure if it is from the same cause. Reopening ... Please close if it is not.

Comment 41 Antonio Murdaca 2019-09-20 14:14:29 UTC

from logs on one of the master:

```
I0920 11:26:35.411199    5381 update.go:89] pod "packageserver-6cc7c655f4-k97r4" removed (evicted)
I0920 11:36:28.680522    5381 update.go:89] pod "downloads-64f8dbd46c-xdgzs" removed (evicted)
```

The timeout to the upgrade is caused by the download pod not evicting and taking ~600s (10m) to stop. That causes a delay which is reflected to upgrade time/roll out time.

The bug to track is: https://bugzilla.redhat.com/show_bug.cgi?id=1745772

Comment 42 liujia 2019-09-23 02:35:02 UTC

Please file a new bug when regression found if previous one is closed. 
So restore the bug's initial status back.

Note You need to log in before you can comment on or make changes to this bug.