Description of problem:
Performed a manual upgrade with a specified update payload; the upgrade failed and got stuck while updating the machine-config cluster operator.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-13-204401   True        True          90m     Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing

# oc adm upgrade
error: Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing:
  Reason: ClusterOperatorFailing
  Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

# oc get clusterversion -o json | jq ".items[0].status"
{
  "availableUpdates": null,
  "conditions": [
    {
      "lastTransitionTime": "2019-02-13T10:02:55Z",
      "message": "Done applying 4.0.0-0.nightly-2019-02-12-150919",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-02-14T07:03:30Z",
      "message": "Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-02-14T06:58:40Z",
      "message": "Unable to apply 4.0.0-0.nightly-2019-02-13-204401: the cluster operator machine-config is failing",
      "reason": "ClusterOperatorFailing",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-02-14T06:59:30Z",
      "status": "True",
      "type": "RetrievedUpdates"
    }
  ],
  "desired": {
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
    "version": "4.0.0-0.nightly-2019-02-13-204401"
  },
  "history": [
    {
      "completionTime": null,
      "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
      "startedTime": "2019-02-14T06:59:30Z",
      "state": "Partial",
      "version": "4.0.0-0.nightly-2019-02-13-204401"
    },
    {
      "completionTime": "2019-02-14T06:59:30Z",
      "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401",
      "startedTime": "2019-02-14T06:58:40Z",
      "state": "Partial",
      "version": "4.0.0-0.nightly-2019-02-13-204401"
    },
    {
      "completionTime": "2019-02-14T06:58:40Z",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:7bd57da7777e65f6cd4c8aa726b90ab00b6804ce97819cc83093bf9a1841e32b",
      "startedTime": "2019-02-13T09:34:06Z",
      "state": "Completed",
      "version": "4.0.0-0.nightly-2019-02-12-150919"
    }
  ],
  "observedGeneration": 3,
  "versionHash": "C6QROhGXMC8="
}

========================================
Some error logs from the CVO pod:
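The status conditions above can be scanned programmatically instead of eyeballed with jq; a minimal sketch (the condition `type`/`status`/`message` fields come from the `oc get clusterversion -o json` output above, while the helper function itself is just our illustration):

```python
import json

def failing_condition(clusterversion_status):
    """Return the message of the Failing condition, if it is True."""
    for cond in clusterversion_status.get("conditions", []):
        if cond.get("type") == "Failing" and cond.get("status") == "True":
            return cond.get("message")
    return None

# Trimmed sample mirroring the clusterversion status output above.
status = json.loads("""
{
  "conditions": [
    {"type": "Available", "status": "True",
     "message": "Done applying 4.0.0-0.nightly-2019-02-12-150919"},
    {"type": "Failing", "status": "True", "reason": "ClusterOperatorFailing",
     "message": "Cluster operator machine-config is reporting a failure"}
  ]
}
""")

print(failing_condition(status))
```

The same loop works for any ClusterOperator status, since they share the condition shape.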
# oc logs -f cluster-version-operator-67b8598cf8-m7th8 | grep error
E0214 07:00:52.841603 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Get https://127.0.0.1:6443/apis/config.openshift.io/v1/clusteroperators/machine-config: dial tcp 127.0.0.1:6443: connect: connection refused
E0214 07:02:02.865496 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:03:15.878955 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:04:53.287868 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:06:03.296365 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:07:16.303944 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:09:18.579255 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:10:28.587613 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:11:41.595463 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:14:38.076932 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config has not yet reported success
E0214 07:15:48.084508 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0214 07:17:01.091884 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
I0214 07:17:01.092000 1 task_graph.go:518] Result of work: [Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)]
E0214 07:17:01.092021 1 sync_worker.go:263] unable to synchronize image (waiting 3m19.747206386s): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
E0214 07:22:00.358999 1 task.go:57] error running apply for clusteroperator "machine-config" (96 of 279): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

One master node went Degraded during the upgrade.
[root@preserve-jliu-worker 0213]# oc get clusteroperators machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: 2019-02-13T09:45:45Z
  generation: 1
  name: machine-config
  resourceVersion: "798380"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: 22c5bb1c-2f74-11e9-a9af-02a295b947c2
spec: {}
status:
  conditions:
  - lastTransitionTime: 2019-02-13T09:46:23Z
    message: Cluster is available at 4.0.0-0.171.0.1-dirty
    status: "True"
    type: Available
  - lastTransitionTime: 2019-02-14T07:05:26Z
    message: Running resync for 4.0.0-0.171.0.1-dirty
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-02-14T07:15:30Z
    message: 'Failed to resync 4.0.0-0.171.0.1-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)'
    reason: 'error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)'
    status: "True"
    type: Failing
  extension:
    master: pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    worker: all 3 nodes are at latest configuration worker-368aaa977e43afad36e2103a38d1dd6d
  relatedObjects: null
  versions:
  - name: machineconfigcontroller
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c089d6750a907f773eb5f06fcf768f9ce9e33bd54920634420b42a9e31c97f6
  - name: machineconfigdaemon
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c1a503a17c86e3beb47d6006466bfb13788f6a264e44ca08f8a0ce8934c5dd0b
  - name: machineconfigserver
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6c850fba1c46122712ee8d6e5d8c0ba90a6ee020bd488323cde337a4bc60d9d3
  - name: operator
    version: 4.0.0-0.171.0.1-dirty

[root@preserve-jliu-worker 0213]# oc get machineconfigpool master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-13T09:45:45Z
  generation: 1
  labels:
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
  name: master
  resourceVersion: "769155"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  uid: 22c991ba-2f74-11e9-a9af-02a295b947c2
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  machineSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  maxUnavailable: null
  paused: false
status:
  conditions:
  - lastTransitionTime: 2019-02-14T07:05:25Z
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: 2019-02-14T07:05:25Z
    message: ""
    reason: All nodes are updating to master-aa5545a458765bfde2ba66c68a13ad3c
    status: "True"
    type: Updating
  - lastTransitionTime: 2019-02-14T07:05:30Z
    message: ""
    reason: 1 nodes are reporting degraded status on update. Cannot proceed.
    status: "True"
    type: Degraded
  configuration:
    name: master-aa5545a458765bfde2ba66c68a13ad3c
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
  machineCount: 3
  observedGeneration: 1
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0

[root@preserve-jliu-worker 0213]# oc get nodes -l node-role.kubernetes.io/master=
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-31-39.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37
ip-10-0-47-24.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37
ip-10-0-6-220.us-east-2.compute.internal   Ready    master   22h   v1.12.4+a532756e37

[root@preserve-jliu-worker 0213]# oc get nodes -o yaml | grep -e name: -e machineconfiguration
...
      machineconfiguration.openshift.io/currentConfig: master-a7f013187d3f2d0f01781946ed61fe1b
      machineconfiguration.openshift.io/desiredConfig: master-a7f013187d3f2d0f01781946ed61fe1b
      machineconfiguration.openshift.io/ssh: accessed
      machineconfiguration.openshift.io/state: Degraded
      kubernetes.io/hostname: ip-10-0-6-220
    name: ip-10-0-6-220.us-east-2.compute.internal

Version-Release number of the following components:
sh-4.2# cluster-version-operator version
ClusterVersionOperator v4.0.0-0.171.0.0-dirty

How reproducible:
Always.

Steps to Reproduce:
1. Install OCP with 4.0.0-0.nightly-2019-02-12-150919; the install succeeds.
2. Run the upgrade manually with a specified release image:
   # oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401

Actual results:
The upgrade fails.

Expected results:
The upgrade succeeds.
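The recurring "error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)" message can be read directly off the MachineConfigPool counts shown above. A rough sketch of the readiness rule as we understand it (the count fields come from the machineconfigpool YAML above; the helper is our illustration, not actual machine-config-operator code):

```python
def pool_ready(machine_count, updated_count, unavailable_count):
    # A required pool counts as ready for the upgrade only when every
    # machine has been updated to the target config and none are
    # unavailable; otherwise the operator keeps waiting and times out.
    return updated_count == machine_count and unavailable_count == 0

# Counts from the degraded master pool above:
# machineCount: 3, updatedMachineCount: 0, unavailableMachineCount: 1
print(pool_ready(3, 0, 1))
```

With one master node stuck Degraded, `updatedMachineCount` can never reach 3, so the sync loop times out indefinitely, which matches the repeating CVO log lines.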
> # oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-13-204401

I've changed the component from Installer to Upgrade, since the issue is in this post-install step.
Hit the issue again when upgrading manually from 4.0.0-0.nightly-2019-03-04-234414 to 4.0.0-0.nightly-2019-03-06-074438, using the same steps as in the description.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        True          44m     Unable to apply 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing

# oc get clusteroperators machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config             False       False         True      25m

# oc adm upgrade
error: Unable to apply 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing:
  Reason: ClusterOperatorFailing
  Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

# oc describe clusteroperators machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-11T02:54:59Z
  Generation:          1
  Resource Version:    300749
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 0f2b1a8b-43a9-11e9-899e-0626753ec480
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Cluster not available for 4.0.16-1-dirty
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Cluster version is 4.0.16-1-dirty
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-11T07:07:02Z
    Message:               Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    Worker:  pool is degraded because of 4 nodes are reporting degraded status on update. Cannot proceed.
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     machineconfigcontroller
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd23c28661f8e4b885bb52c485eb36f6a844e0e1d43fd19a30d602aed56f237d
    Name:     machineconfigdaemon
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6527c163c19a84c5df242732ead867e2ac6fdf3b9a428fa4a26bc0be42a17c4d
    Name:     machineconfigserver
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c72b03fd8dba3477ca3e6cbb962047e5d7572603dc9bdab41f17ef5c02877a
    Name:     operator
    Version:  4.0.16-1-dirty
Events:  <none>

Please refer to the detailed CVO log in the attachment.
Today I hit the issue too, upgrading from quay.io/openshift-release-dev/ocp-release:4.0.0-0.7 to registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438.

Steps:
1. Install a beta2 released cluster as the base cluster.
2. Point the Cincinnati server at my dummy build graph:

# oc get clusterversion -o yaml
<--snip-->
spec:
  channel: stable-4.0
  clusterID: b98615d8-224c-4737-8934-c822e3dcbb58
  upstream: http://3.86.146.114/cincinnati_build_graph
<--snip-->

# curl http://3.86.146.114/cincinnati_build_graph
{
  "nodes": [
    {
      "version": "4.0.0-0.7",
      "payload": "quay.io/openshift-release-dev/ocp-release:4.0.0-0.7",
      "metadata": {"description": "Beta 2"}
    },
    {
      "version": "4.0.0-0.nightly-2019-03-06-074438",
      "payload": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438"
    }
  ],
  "edges": [[0, 1]]
}

3. Trigger the upgrade:
# oc adm upgrade --to=4.0.0-0.nightly-2019-03-06-074438
Updating to 4.0.0-0.nightly-2019-03-06-074438

4. The upgrade completes:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         6m58s   Cluster version is 4.0.0-0.nightly-2019-03-06-074438

5. About 12 hours later, check the cluster again.
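When hand-writing a dummy Cincinnati graph like the one served above, it helps to sanity-check which targets an edge actually exposes. A small sketch (the graph shape, with a "nodes" list and "edges" as [from_index, to_index] pairs, mirrors the curl output above; the checker itself is only our illustration, not Cincinnati code):

```python
import json

def upgrade_targets(graph, current_version):
    """List versions reachable in one hop from current_version."""
    nodes = graph["nodes"]
    idx = next(i for i, n in enumerate(nodes)
               if n["version"] == current_version)
    # Each edge is a [from_index, to_index] pair into the nodes list.
    return [nodes[to]["version"] for frm, to in graph["edges"] if frm == idx]

graph = json.loads("""
{
  "nodes": [
    {"version": "4.0.0-0.7",
     "payload": "quay.io/openshift-release-dev/ocp-release:4.0.0-0.7"},
    {"version": "4.0.0-0.nightly-2019-03-06-074438",
     "payload": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438"}
  ],
  "edges": [[0, 1]]
}
""")

print(upgrade_targets(graph, "4.0.0-0.7"))
```

A graph with a node missing from every edge simply offers that version no updates, which is worth checking before pointing `upstream:` at the file.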
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         23h     Error while reconciling 4.0.0-0.nightly-2019-03-06-074438: the cluster operator machine-config is failing

# oc get clusteroperators machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config             False       False         True      22h

# oc describe clusteroperators machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    940930
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 fcbe9c26-4496-11e9-a8a9-064890e55fee
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Cluster not available for 4.0.16-1-dirty
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Cluster version is 4.0.16-1-dirty
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-12T09:38:20Z
    Message:               Failed to resync 4.0.16-1-dirty because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Reason:                timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
    Status:                True
    Type:                  Failing
  Extension:
    Master:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
    Worker:  pool is degraded because of 1 nodes are reporting degraded status on update. Cannot proceed.
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
  Versions:
    Name:     machineconfigcontroller
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd23c28661f8e4b885bb52c485eb36f6a844e0e1d43fd19a30d602aed56f237d
    Name:     machineconfigdaemon
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6527c163c19a84c5df242732ead867e2ac6fdf3b9a428fa4a26bc0be42a17c4d
    Name:     machineconfigserver
    Version:  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:96c72b03fd8dba3477ca3e6cbb962047e5d7572603dc9bdab41f17ef5c02877a
    Name:     operator
    Version:  4.0.16-1-dirty
Events:  <none>

# oc get machineconfigpool
NAME     CONFIG                                    UPDATED   UPDATING   DEGRADED
master   master-3b03dd678e723c6304a35ca76f8194fd   False     True       True
worker   worker-6d9525da78e51f902c8074b02e95061a   True      False      True

# oc describe machineconfigpool master
Name:         master
Namespace:
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    94213
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 fcc1e5c8-4496-11e9-a8a9-064890e55fee
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  master
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Master:
  Max Unavailable:  <nil>
  Paused:           false
Status:
  Conditions:
    Last Transition Time:  2019-03-12T09:27:45Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-03-12T09:27:45Z
    Message:
    Reason:                All nodes are updating to master-3b03dd678e723c6304a35ca76f8194fd
    Status:                True
    Type:                  Updating
    Last Transition Time:  2019-03-12T09:27:50Z
    Message:
    Reason:                1 nodes are reporting degraded status on update. Cannot proceed.
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  master-3b03dd678e723c6304a35ca76f8194fd
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master-ssh
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-fcc1e5c8-4496-11e9-a8a9-064890e55fee-registries
  Machine Count:              3
  Observed Generation:        1
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:  <none>

# oc describe machineconfigpool worker
Name:         worker
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-03-12T07:18:08Z
  Generation:          1
  Resource Version:    40789
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 fcc305dc-4496-11e9-a8a9-064890e55fee
Spec:
  Machine Config Selector:
    Match Labels:
      Machineconfiguration . Openshift . Io / Role:  worker
  Machine Selector:
    Match Labels:
      Node - Role . Kubernetes . Io / Worker:
  Max Unavailable:  <nil>
  Paused:           false
Status:
  Conditions:
    Last Transition Time:  2019-03-12T07:24:15Z
    Message:
    Reason:                All nodes are updated with worker-6d9525da78e51f902c8074b02e95061a
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-03-12T07:24:15Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-03-12T08:03:52Z
    Message:
    Reason:                1 nodes are reporting degraded status on update. Cannot proceed.
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  worker-6d9525da78e51f902c8074b02e95061a
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker-ssh
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
  Machine Count:              1
  Observed Generation:        1
  Ready Machine Count:        1
  Unavailable Machine Count:  0
  Updated Machine Count:      1
Events:  <none>

# oc describe nodes | grep machineconfig
    machineconfiguration.openshift.io/currentConfig: worker-6d9525da78e51f902c8074b02e95061a
    machineconfiguration.openshift.io/desiredConfig: worker-6d9525da78e51f902c8074b02e95061a
    machineconfiguration.openshift.io/state: Degraded
    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
    machineconfiguration.openshift.io/desiredConfig: master-3b03dd678e723c6304a35ca76f8194fd
    machineconfiguration.openshift.io/state: Degraded
    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
    machineconfiguration.openshift.io/desiredConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
    machineconfiguration.openshift.io/state: Degraded
    machineconfiguration.openshift.io/currentConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
    machineconfiguration.openshift.io/desiredConfig: master-5268e4d86a7d26076a01ef2ff2ceec96
    machineconfiguration.openshift.io/state: Degraded

# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER   IGNITIONVERSION   CREATED
00-master                                                   4.0.16-1-dirty          2.2.0             25h
00-master-ssh                                               4.0.16-1-dirty                            25h
00-worker                                                   4.0.16-1-dirty          2.2.0             25h
00-worker-ssh                                               4.0.16-1-dirty                            25h
01-master-container-runtime                                 4.0.16-1-dirty          2.2.0             25h
01-master-kubelet                                           4.0.16-1-dirty          2.2.0             25h
01-worker-container-runtime                                 4.0.16-1-dirty          2.2.0             25h
01-worker-kubelet                                           4.0.16-1-dirty          2.2.0             25h
99-master-fcc1e5c8-4496-11e9-a8a9-064890e55fee-registries   4.0.16-1-dirty                            25h
99-worker-fcc305dc-4496-11e9-a8a9-064890e55fee-registries   4.0.16-1-dirty                            25h
master-3b03dd678e723c6304a35ca76f8194fd                     4.0.16-1-dirty          2.2.0             23h
master-5268e4d86a7d26076a01ef2ff2ceec96                     4.0.15-1-dirty          2.2.0             25h
worker-6d9525da78e51f902c8074b02e95061a                     4.0.16-1-dirty          2.2.0             25h

# oc logs machine-config-daemon-rxftf -n openshift-machine-config-operator
I0312 09:27:44.004669   98001 start.go:52] Version: 4.0.16-1-dirty
I0312 09:27:44.005952   98001 start.go:88] starting node writer
I0312 09:27:44.015714   98001 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0312 09:27:44.122266   98001 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0312 09:27:44.122444   98001 daemon.go:247] Managing node: ip-10-0-139-185.us-east-2.compute.internal
I0312 09:27:44.145575   98001 start.go:146] Calling chroot("/rootfs")
I0312 09:27:44.145610   98001 run.go:22] Running captured: rpm-ostree status
I0312 09:27:44.171489   98001 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0312 09:27:44.171526   98001 daemon.go:477] In bootstrap mode
I0312 09:27:44.180165   98001 daemon.go:505] Current+desired config: worker-6d9525da78e51f902c8074b02e95061a
I0312 09:27:44.181376   98001 daemon.go:598] Node is degraded; going to sleep

# oc logs machine-config-daemon-s2hr5 -n openshift-machine-config-operator
I0312 09:27:49.193723    5699 start.go:52] Version: 4.0.16-1-dirty
I0312 09:27:49.194096    5699 start.go:88] starting node writer
I0312 09:27:49.200065    5699 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0312 09:27:49.276403    5699 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0312 09:27:49.276671    5699 daemon.go:247] Managing node: ip-10-0-164-78.us-east-2.compute.internal
I0312 09:27:49.300762    5699 start.go:146] Calling chroot("/rootfs")
I0312 09:27:49.300794    5699 run.go:22] Running captured: rpm-ostree status
I0312 09:27:49.326492    5699 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0312 09:27:49.326525    5699 daemon.go:477] In bootstrap mode
I0312 09:27:49.335480    5699 daemon.go:505] Current+desired config: master-5268e4d86a7d26076a01ef2ff2ceec96
I0312 09:27:49.336907    5699 daemon.go:598] Node is degraded; going to sleep
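The node annotations shown above are what the machine-config daemon uses to record its progress; notably, one worker is Degraded even though its currentConfig already matches its desiredConfig. A small sketch that groups nodes by that state annotation (the annotation keys come from the grep output above; the sample data and helper are illustrative only):

```python
# Annotation prefix as seen in the `oc describe nodes | grep machineconfig`
# output above.
PREFIX = "machineconfiguration.openshift.io/"

def degraded_nodes(nodes):
    """Return names of nodes whose MCD state annotation is Degraded."""
    return [name for name, ann in nodes.items()
            if ann.get(PREFIX + "state") == "Degraded"]

# Illustrative sample mirroring two of the nodes above.
nodes = {
    "ip-10-0-139-185": {
        PREFIX + "currentConfig": "worker-6d9525da78e51f902c8074b02e95061a",
        PREFIX + "desiredConfig": "worker-6d9525da78e51f902c8074b02e95061a",
        # Degraded even though current == desired, as in the output above.
        PREFIX + "state": "Degraded",
    },
    "ip-10-0-164-78": {
        PREFIX + "currentConfig": "master-5268e4d86a7d26076a01ef2ff2ceec96",
        PREFIX + "desiredConfig": "master-3b03dd678e723c6304a35ca76f8194fd",
        PREFIX + "state": "Degraded",
    },
}

print(degraded_nodes(nodes))
```

Since the pool only counts a node as updated when its state is Done and current equals desired, either mismatch keeps `updatedMachineCount` at 0.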
I hit the issue upgrading from 4.0.0-0.nightly-2019-03-04-234414 to 4.0.0-0.nightly-2019-03-05-045224 as well.
Adding the testblocker keyword, since once we hit this issue, upgrade testing cannot move forward.
Maybe this bug is a side effect of https://bugzilla.redhat.com/show_bug.cgi?id=1688321: even before the upgrade, the base cluster probably already gets into the `Degraded` state (after waiting for some time).
> Maybe this bug is a side effect of https://bugzilla.redhat.com/show_bug.cgi?id=1688321: even before the upgrade, the base cluster probably already gets into the `Degraded` state (after waiting for some time).

That sounds plausible to me. Try with an RHCOS newer than 47.330 (see [1])?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1688321#c3
Yes, we need to update RHCOS. Also, I think it should be fairly easy to do a quick sanity check for this by mirroring the release payload to any registry that requires auth and injecting the required auth into the pull secret.
Yesterday I used 400.7.20190306.0 as the AMI for the base cluster, then upgraded the cluster, and did not reproduce the issue any more. According to https://bugzilla.redhat.com/show_bug.cgi?id=1688321#c6, we will not support beta2-to-beta3 upgrades, so this issue is out of our test plan. I am okay with closing it as 'WONTFIX'.
Today I hit the same issue when do upgrade from 4.0.0-0.nightly-2019-03-22-191219 to registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709: Steps: 1. install ocp env from Payload: registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-22-191219, as a base cluster 2. change cincinnati server to OCP graph. # oc get clusterversion -o yaml <--snip--> spec: channel: stable-4.0 clusterID: 0623d282-fdfb-4178-b7ac-1889f8213f49 desiredUpdate: image: registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709 version: 4.0.0-0.nightly-2019-03-23-183709 upstream: https://openshift-release.svc.ci.openshift.org/graph <--snip--> 3. Trigger upgrade [root@dhcp-140-138 yamlfile]# oc adm upgrade --to 4.0.0-0.nightly-2019-03-23-183709 Updating to 4.0.0-0.nightly-2019-03-23-183709 4. Upgrade is completed. [root@dhcp-140-138 yamlfile]# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.0.0-0.nightly-2019-03-23-183709 True True 21m Unable to apply 4.0.0-0.nightly-2019-03-23-183709: the cluster operator machine-config is failing [root@dhcp-140-138 yamlfile]# oc get clusteroperator machine-config NAME VERSION AVAILABLE PROGRESSING FAILING SINCE machine-config 4.0.0-0.nightly-2019-03-23-183709 False False True 53m [root@dhcp-140-138 yamlfile]# oc describe clusteroperators machine-config Name: machine-config Namespace: Labels: <none> Annotations: <none> API Version: config.openshift.io/v1 Kind: ClusterOperator Metadata: Creation Timestamp: 2019-03-25T01:56:23Z Generation: 1 Resource Version: 232307 Self Link: /apis/config.openshift.io/v1/clusteroperators/machine-config UID: 312bc461-4ea1-11e9-8737-022231078d74 Spec: Status: Conditions: Last Transition Time: 2019-03-25T06:14:37Z Message: Cluster not available for 4.0.0-0.nightly-2019-03-23-183709 Status: False Type: Available Last Transition Time: 2019-03-25T06:14:32Z Message: Cluster version is 4.0.0-0.nightly-2019-03-23-183709 Status: False Type: Progressing Last 
Transition Time: 2019-03-25T06:14:37Z Message: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1) Reason: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1) Status: True Type: Failing Extension: Master: 0 out of 3 nodes have updated to latest configuration rendered-master-6c540afa18c09b5ed95dff3629b55c7d Worker: 0 out of 2 nodes have updated to latest configuration rendered-worker-38aa0c99ae25717b214181710d572963 Related Objects: Group: Name: openshift-machine-config-operator Resource: namespaces Versions: Name: operator Version: 4.0.0-0.nightly-2019-03-23-183709 Events: <none> [root@dhcp-140-138 yamlfile]# oc get machineconfigpool NAME CONFIG UPDATED UPDATING master rendered-master-6c540afa18c09b5ed95dff3629b55c7d False True worker rendered-worker-38aa0c99ae25717b214181710d572963 False True [root@dhcp-140-138 yamlfile]# oc describe machineconfigpool master Name: master Namespace: Labels: operator.machineconfiguration.openshift.io/required-for-upgrade= Annotations: <none> API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfigPool Metadata: Creation Timestamp: 2019-03-25T01:56:23Z Generation: 1 Resource Version: 178001 Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master UID: 3131f023-4ea1-11e9-8737-022231078d74 Spec: Machine Config Selector: Match Labels: Machineconfiguration . Openshift . Io / Role: master Machine Selector: Match Labels: Node - Role . Kubernetes . 
Io / Master: Max Unavailable: <nil> Paused: false Status: Conditions: Last Transition Time: 2019-03-25T06:14:34Z Message: Reason: Status: False Type: Updated Last Transition Time: 2019-03-25T06:14:34Z Message: Reason: All nodes are updating to rendered-master-6c540afa18c09b5ed95dff3629b55c7d Status: True Type: Updating Configuration: Name: rendered-master-6c540afa18c09b5ed95dff3629b55c7d Source: API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 00-master API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 00-master-ssh API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 01-master-container-runtime API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 01-master-kubelet API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 99-master-3131f023-4ea1-11e9-8737-022231078d74-registries Machine Count: 3 Observed Generation: 1 Ready Machine Count: 0 Unavailable Machine Count: 1 Updated Machine Count: 0 Events: <none> [root@dhcp-140-138 yamlfile]# oc describe machineconfigpool worker Name: worker Namespace: Labels: <none> Annotations: <none> API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfigPool Metadata: Creation Timestamp: 2019-03-25T01:56:23Z Generation: 1 Resource Version: 177952 Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker UID: 3133fea7-4ea1-11e9-8737-022231078d74 Spec: Machine Config Selector: Match Labels: Machineconfiguration . Openshift . Io / Role: worker Machine Selector: Match Labels: Node - Role . Kubernetes . 
Io / Worker: Max Unavailable: <nil> Paused: false Status: Conditions: Last Transition Time: 2019-03-25T06:14:34Z Message: Reason: Status: False Type: Updated Last Transition Time: 2019-03-25T06:14:34Z Message: Reason: All nodes are updating to rendered-worker-38aa0c99ae25717b214181710d572963 Status: True Type: Updating Configuration: Name: rendered-worker-38aa0c99ae25717b214181710d572963 Source: API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 00-worker API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 00-worker-ssh API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 01-worker-container-runtime API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 01-worker-kubelet API Version: machineconfiguration.openshift.io/v1 Kind: MachineConfig Name: 99-worker-3133fea7-4ea1-11e9-8737-022231078d74-registries Machine Count: 2 Observed Generation: 1 Ready Machine Count: 0 Unavailable Machine Count: 1 Updated Machine Count: 0 Events: <none> [root@dhcp-140-138 yamlfile]# oc describe nodes|grep machineconfig machineconfiguration.openshift.io/currentConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2 machineconfiguration.openshift.io/desiredConfig: rendered-worker-38aa0c99ae25717b214181710d572963 machineconfiguration.openshift.io/state: Degraded machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69 machineconfiguration.openshift.io/desiredConfig: rendered-master-6c540afa18c09b5ed95dff3629b55c7d machineconfiguration.openshift.io/state: Degraded machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69 machineconfiguration.openshift.io/desiredConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69 machineconfiguration.openshift.io/state: Done machineconfiguration.openshift.io/currentConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2 machineconfiguration.openshift.io/desiredConfig: 
rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
machineconfiguration.openshift.io/desiredConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
machineconfiguration.openshift.io/state: Done

[root@dhcp-140-138 yamlfile]# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER       IGNITIONVERSION   CREATED
00-master                                                   4.0.22-201903220117-dirty   2.2.0             5h15m
00-master-ssh                                               4.0.22-201903220117-dirty   2.2.0             5h15m
00-worker                                                   4.0.22-201903220117-dirty   2.2.0             5h15m
00-worker-ssh                                               4.0.22-201903220117-dirty   2.2.0             5h15m
01-master-container-runtime                                 4.0.22-201903220117-dirty   2.2.0             5h15m
01-master-kubelet                                           4.0.22-201903220117-dirty   2.2.0             5h15m
01-worker-container-runtime                                 4.0.22-201903220117-dirty   2.2.0             5h15m
01-worker-kubelet                                           4.0.22-201903220117-dirty   2.2.0             5h15m
99-master-3131f023-4ea1-11e9-8737-022231078d74-registries   4.0.22-201903220117-dirty   2.2.0             5h14m
99-worker-3133fea7-4ea1-11e9-8737-022231078d74-registries   4.0.22-201903220117-dirty   2.2.0             5h14m
rendered-master-257ba747991bcdc361d0ad6faac39a69            4.0.22-201903220117-dirty   2.2.0             5h15m
rendered-master-6c540afa18c09b5ed95dff3629b55c7d            4.0.22-201903220117-dirty   2.2.0             58m
rendered-worker-38aa0c99ae25717b214181710d572963            4.0.22-201903220117-dirty   2.2.0             58m
rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2            4.0.22-201903220117-dirty   2.2.0             5h15m

oc logs -f po/cluster-version-operator-685c68d958-k5c44 -n openshift-cluster-version |grep E0325
E0325 07:14:12.898269 1 task.go:58] error running apply for clusteroperator "machine-config" (105 of 310): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready.
status: (total: 3, updated: 0, unavailable: 1)
E0325 07:15:22.898834 1 task.go:58] error running apply for clusteroperator "machine-config" (105 of 310): Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

[root@dhcp-140-138 yamlfile]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5d4864f75b-5bpjf   1/1     Running   0          5h19m
machine-config-daemon-2b6fk                  1/1     Running   0          5h17m
machine-config-daemon-498qp                  1/1     Running   0          5h17m
machine-config-daemon-7g9fv                  1/1     Running   0          5h17m
machine-config-daemon-fhzw5                  1/1     Running   0          5h12m
machine-config-daemon-sjd9k                  1/1     Running   0          5h12m
machine-config-operator-579574655f-kvt5x     1/1     Running   0          64m
machine-config-server-75sp8                  1/1     Running   0          5h17m
machine-config-server-pqgc7                  1/1     Running   0          5h17m
machine-config-server-qwv92                  1/1     Running   0          5h17m

[root@dhcp-140-138 yamlfile]# oc logs -f po/machine-config-operator-579574655f-kvt5x -n openshift-machine-config-operator |grep E0325
E0325 06:14:25.827782 1 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config", UID:"2e11d21d-4ea1-11e9-8737-022231078d74", ResourceVersion:"177804", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689075778, loc:(*time.Location)(0x1d39d00)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil),
Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-operator-579574655f-kvt5x_f87f94f4-4ec4-11e9-9605-0a580a8200e4\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-03-25T06:14:25Z\",\"renewTime\":\"2019-03-25T06:14:25Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-operator-579574655f-kvt5x_f87f94f4-4ec4-11e9-9605-0a580a8200e4 became leader' E0325 06:15:57.585267 1 operator.go:249] error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1) E0325 06:17:26.330500 1 operator.go:249] error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)
Add image info in case it's still related with ami version. # oc logs pod/machine-config-daemon-2b6fk -n openshift-machine-config-operator|grep Version I0325 01:58:53.884669 16503 start.go:54] Version: 4.0.22-201903220117-dirty Version: 410.8.20190320.1 (2019-03-20T21:01:36Z) Version: 400.7.20190306.0 (2019-03-06T22:16:26Z) # oc get clusterversion -o json|jq ".items[0].status" { "availableUpdates": null, "conditions": [ { "lastTransitionTime": "2019-03-25T02:07:00Z", "message": "Done applying 4.0.0-0.nightly-2019-03-22-191219", "status": "True", "type": "Available" }, { "lastTransitionTime": "2019-03-25T06:20:18Z", "message": "Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.nightly-2019-03-23-183709 because: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)", "reason": "ClusterOperatorFailing", "status": "True", "type": "Failing" }, { "lastTransitionTime": "2019-03-25T06:04:00Z", "message": "Unable to apply 4.0.0-0.nightly-2019-03-23-183709: the cluster operator machine-config is failing", "reason": "ClusterOperatorFailing", "status": "True", "type": "Progressing" }, { "lastTransitionTime": "2019-03-25T06:02:41Z", "status": "True", "type": "RetrievedUpdates" } ], "desired": { "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709", "version": "4.0.0-0.nightly-2019-03-23-183709" }, "history": [ { "completionTime": null, "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-23-183709", "startedTime": "2019-03-25T06:04:00Z", "state": "Partial", "version": "4.0.0-0.nightly-2019-03-23-183709" }, { "completionTime": "2019-03-25T06:04:00Z", "image": "registry.svc.ci.openshift.org/ocp/release@sha256:97ec469af3deb6e5eba521f1188f165b74e5f6891e3e32dfd27c38aeb2bc17ad", "startedTime": "2019-03-25T01:51:29Z", "state": "Completed", "version": "4.0.0-0.nightly-2019-03-22-191219" } ], "observedGeneration": 3, "versionHash": "FYrLPpaVLdM=" }
We have hit this frequently in today's testing, so I am reopening it.
Whenever you see Degraded systems, look at the MCD logs: `oc -n openshift-machine-config-operator logs pods/machine-config-daemon-xyzxyz` Can you get some output from that?
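To see which (current, desired) config pairs are stuck before pulling MCD logs, the annotation dump from `oc describe nodes | grep machineconfig` earlier in this report can be filtered mechanically. A small sketch (the helper name and parsing are illustrative, not part of the MCO tooling):

```python
def degraded_configs(describe_output: str):
    """Scan `oc describe nodes` output for machineconfiguration.openshift.io
    annotations and collect (current, desired, state) triples for every node
    whose state is not Done (e.g. Degraded or Unreconcilable)."""
    current = desired = None
    stuck = []
    for line in describe_output.splitlines():
        line = line.strip()
        if "machineconfiguration.openshift.io/currentConfig:" in line:
            current = line.split(":", 1)[1].strip()
        elif "machineconfiguration.openshift.io/desiredConfig:" in line:
            desired = line.split(":", 1)[1].strip()
        elif "machineconfiguration.openshift.io/state:" in line:
            state = line.split(":", 1)[1].strip()
            if state != "Done":
                stuck.append((current, desired, state))
    return stuck

# Sample lines taken from the annotation dump in this report.
sample = """
machineconfiguration.openshift.io/currentConfig: rendered-worker-9791c3bf96d8a9c08fff0b8c184743e2
machineconfiguration.openshift.io/desiredConfig: rendered-worker-38aa0c99ae25717b214181710d572963
machineconfiguration.openshift.io/state: Degraded
machineconfiguration.openshift.io/currentConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
machineconfiguration.openshift.io/desiredConfig: rendered-master-257ba747991bcdc361d0ad6faac39a69
machineconfiguration.openshift.io/state: Done
"""
print(degraded_configs(sample))
```

Each stuck pair tells you which node's MCD log to read next.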
Also, we've changed the way that check (syncRequiredMachineConfigPools) works: it now instantly flips to Failing if the roll-out isn't finished. So my question is: does it eventually flip back to Failing: False if you leave the upgrade process running for some more time? Maybe we shouldn't set Failing when that check errors but we are still retrying.
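The "error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)" message boils down to a readiness condition on the pool's status counts. A simplified sketch of just the counting part (the real syncRequiredMachineConfigPools check in the MCO considers more state than this; the function name here is illustrative):

```python
def pool_is_ready(machine_count: int, updated_count: int,
                  unavailable_count: int) -> bool:
    # A pool only counts as ready when every machine runs the target
    # rendered config and none of them are unavailable.
    return updated_count == machine_count and unavailable_count == 0

# The failing master pool from this report: total 3, updated 0, unavailable 1.
print(pool_is_ready(3, 0, 1))  # False, so the CVO keeps reporting Failing
print(pool_is_ready(3, 3, 0))  # True once the roll-out completes
```

As long as this stays False the operator retries, which is why the question of flipping back to Failing: False matters.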
It would also be nice if you could provide this information by using https://github.com/openshift/must-gather, which auto-collects the data we need to debug.
[root@dhcp-140-138 ~]# oc logs -f po/machine-config-daemon-94lcb -n openshift-machine-config-operator |grep E0326
E0326 05:30:02.607247 96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=57, ErrCode=NO_ERROR, debug=""
E0326 05:30:02.607767 96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=57, ErrCode=NO_ERROR, debug=""
E0326 05:32:33.630012 96951 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=7, ErrCode=NO_ERROR, debug=""
Today in my env:

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-25-154140   True        True          5h39m   Unable to apply 4.0.0-0.nightly-2019-03-25-154140: the cluster operator machine-config has not yet successfully rolled out
Created attachment 1547944 [details] Must-gather info
I also hit the same issue from 4.0.0-0.nightly-2019-03-23-222829 to 4.0.0-0.nightly-2019-03-25-154140 upgrade. Before upgrade: # oc describe node|grep machineconfig machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/state: Done machineconfiguration.openshift.io/currentConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095 machineconfiguration.openshift.io/desiredConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095 machineconfiguration.openshift.io/state: Done machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/state: Done machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/state: Done # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.0.0-0.nightly-2019-03-23-222829 True False 45m Cluster version is 4.0.0-0.nightly-2019-03-23-222829 # oc get clusteroperator NAME VERSION AVAILABLE PROGRESSING FAILING SINCE authentication 4.0.0-0.nightly-2019-03-23-222829 True False False 46m cluster-autoscaler 4.0.0-0.nightly-2019-03-23-222829 True False False 84m console 4.0.0-0.nightly-2019-03-23-222829 True False False 51m dns 4.0.0-0.nightly-2019-03-23-222829 True False False 83m image-registry 4.0.0-0.nightly-2019-03-23-222829 True False False 76m ingress 4.0.0-0.nightly-2019-03-23-222829 True False False 56m kube-apiserver 4.0.0-0.nightly-2019-03-23-222829 True False False 81m kube-controller-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 81m kube-scheduler 4.0.0-0.nightly-2019-03-23-222829 True False 
False 83m machine-api 4.0.0-0.nightly-2019-03-23-222829 True False False 84m machine-config 4.0.0-0.nightly-2019-03-23-222829 True False False 83m marketplace 4.0.0-0.nightly-2019-03-23-222829 True False False 79m monitoring 4.0.0-0.nightly-2019-03-23-222829 True False False 55m network 4.0.0-0.nightly-2019-03-23-222829 True False False 84m node-tuning 4.0.0-0.nightly-2019-03-23-222829 True False False 79m openshift-apiserver 4.0.0-0.nightly-2019-03-23-222829 True False False 76m openshift-cloud-credential-operator 4.0.0-0.nightly-2019-03-23-222829 True False False 84m openshift-controller-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 80m openshift-samples 4.0.0-0.nightly-2019-03-23-222829 True False False 75m operator-lifecycle-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 83m service-ca 4.0.0-0.nightly-2019-03-23-222829 True False False 83m service-catalog-apiserver 4.0.0-0.nightly-2019-03-23-222829 True False False 79m service-catalog-controller-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 79m storage 4.0.0-0.nightly-2019-03-23-222829 True False False 56m Trigger upgrade: # oc adm upgrade --to-latest Updating to latest version 4.0.0-0.nightly-2019-03-25-154140 # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.0.0-0.nightly-2019-03-25-154140 True True 7s Working towards 4.0.0-0.nightly-2019-03-25-154140: downloading update Check after *48 mins*: # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.0.0-0.nightly-2019-03-25-154140 True True 48m Unable to apply 4.0.0-0.nightly-2019-03-25-154140: the cluster operator machine-config has not yet successfully rolled out # oc get clusteroperator NAME VERSION AVAILABLE PROGRESSING FAILING SINCE authentication 4.0.0-0.nightly-2019-03-23-222829 True False False 50m cluster-autoscaler 4.0.0-0.nightly-2019-03-23-222829 True False False 50m console 4.0.0-0.nightly-2019-03-23-222829 True False False 107m dns 
4.0.0-0.nightly-2019-03-25-154140 True False False 140m image-registry 4.0.0-0.nightly-2019-03-23-222829 True False False 132m ingress 4.0.0-0.nightly-2019-03-23-222829 True False False 112m kube-apiserver 4.0.0-0.nightly-2019-03-25-154140 True False False 43m kube-controller-manager 4.0.0-0.nightly-2019-03-25-154140 True False False 40m kube-scheduler 4.0.0-0.nightly-2019-03-25-154140 True False False 43m machine-api 4.0.0-0.nightly-2019-03-25-154140 True False False 141m machine-config 4.0.0-0.nightly-2019-03-23-222829 False True True 36m marketplace 4.0.0-0.nightly-2019-03-23-222829 True False False 135m monitoring 4.0.0-0.nightly-2019-03-23-222829 True False False 111m network 4.0.0-0.nightly-2019-03-25-154140 True False False 140m node-tuning 4.0.0-0.nightly-2019-03-23-222829 True False False 136m openshift-apiserver 4.0.0-0.nightly-2019-03-23-222829 True False False 16m openshift-cloud-credential-operator 4.0.0-0.nightly-2019-03-25-154140 True False False 141m openshift-controller-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 48m openshift-samples 4.0.0-0.nightly-2019-03-23-222829 True False False 132m operator-lifecycle-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 140m service-ca 4.0.0-0.nightly-2019-03-25-154140 True False False 46m service-catalog-apiserver 4.0.0-0.nightly-2019-03-23-222829 True False False 136m service-catalog-controller-manager 4.0.0-0.nightly-2019-03-23-222829 True False False 49m storage 4.0.0-0.nightly-2019-03-23-222829 True False False 113m # oc describe node|grep machineconfig machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c machineconfiguration.openshift.io/state: Done machineconfiguration.openshift.io/currentConfig: rendered-worker-ec1202835a931d3cf83b34760ee45095 machineconfiguration.openshift.io/desiredConfig: rendered-worker-fb0bade95cda29515460a5dddf46bce6 
machineconfiguration.openshift.io/state: Unreconcilable
machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
machineconfiguration.openshift.io/desiredConfig: rendered-master-13131f3a8f1d80a10d2149723b4bed3f
machineconfiguration.openshift.io/state: Unreconcilable
machineconfiguration.openshift.io/currentConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
machineconfiguration.openshift.io/desiredConfig: rendered-master-8c108b7752cb2545da64b96e15241d8c
machineconfiguration.openshift.io/state: Done

# oc logs machine-config-daemon-lt2l2 -n openshift-machine-config-operator
<--snip-->
I0326 08:41:55.762357 48492 run.go:22] Running captured: rpm-ostree status
I0326 08:41:55.874056 48492 daemon.go:738] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dd9128d2031047683071211d883c96f30db50d4d1eb85f153a22302ec47bb16
  CustomOrigin: Managed by pivot tool
  Version: 410.8.20190320.1 (2019-03-20T21:01:36Z)
  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
  CustomOrigin: Provisioned from oscontainer
  Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)
I0326 08:41:55.874108 48492 daemon.go:673] Current config: rendered-worker-ec1202835a931d3cf83b34760ee45095
I0326 08:41:55.874121 48492 daemon.go:674] Desired config: rendered-worker-fb0bade95cda29515460a5dddf46bce6
I0326 08:41:55.885297 48492 daemon.go:792] Validated on-disk state
I0326 08:41:55.889954 48492 update.go:194] Checking reconcilable for config rendered-worker-ec1202835a931d3cf83b34760ee45095 to rendered-worker-fb0bade95cda29515460a5dddf46bce6
I0326 08:41:55.889972 48492 update.go:252] Checking if configs are reconcilable
I0326 08:41:55.891849 48492 update.go:715] can't reconcile config rendered-worker-ec1202835a931d3cf83b34760ee45095 with rendered-worker-fb0bade95cda29515460a5dddf46bce6: ignition links section contains changes
E0326 08:41:55.895754 48492 writer.go:97] Marking Unreconcilable due to: can't reconcile config rendered-worker-ec1202835a931d3cf83b34760ee45095 with rendered-worker-fb0bade95cda29515460a5dddf46bce6: ignition links section contains changes: unreconcilable
Created attachment 1547958 [details] rendered-worker-ec1202835a931d3cf83b34760ee45095.yaml
Created attachment 1547960 [details] rendered-worker-fb0bade95cda29515460a5dddf46bce6.yaml
Following up on comment 22: the MCD on the masters hits the same issue.

# oc logs machine-config-daemon-psm7q -n openshift-machine-config-operator
<--snip-->
I0326 08:50:41.416626 97441 run.go:22] Running captured: rpm-ostree status
I0326 08:50:41.452273 97441 daemon.go:738] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4dd9128d2031047683071211d883c96f30db50d4d1eb85f153a22302ec47bb16
  CustomOrigin: Managed by pivot tool
  Version: 410.8.20190320.1 (2019-03-20T21:01:36Z)
  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
  CustomOrigin: Provisioned from oscontainer
  Version: 400.7.20190306.0 (2019-03-06T22:16:26Z)
I0326 08:50:41.452315 97441 daemon.go:673] Current config: rendered-master-8c108b7752cb2545da64b96e15241d8c
I0326 08:50:41.452337 97441 daemon.go:674] Desired config: rendered-master-13131f3a8f1d80a10d2149723b4bed3f
I0326 08:50:41.459478 97441 daemon.go:792] Validated on-disk state
I0326 08:50:41.459979 97441 update.go:194] Checking reconcilable for config rendered-master-8c108b7752cb2545da64b96e15241d8c to rendered-master-13131f3a8f1d80a10d2149723b4bed3f
I0326 08:50:41.460000 97441 update.go:252] Checking if configs are reconcilable
I0326 08:50:41.461579 97441 update.go:715] can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes
E0326 08:50:41.464958 97441 writer.go:97] Marking Unreconcilable due to: can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable
W0326 08:50:41.616341 97441 daemon.go:292] Booting the MCD errored with can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable
I0326 08:50:41.616375 97441 run.go:22] Running captured: rpm-ostree status
Created attachment 1547962 [details] rendered-master-13131f3a8f1d80a10d2149723b4bed3f.yaml
Created attachment 1547963 [details] rendered-master-8c108b7752cb2545da64b96e15241d8c.yaml
Since this Monday we have been hitting this bug frequently; QE's current upgrade testing is blocked by it.
> Booting the MCD errored with can't reconcile config rendered-master-8c108b7752cb2545da64b96e15241d8c with rendered-master-13131f3a8f1d80a10d2149723b4bed3f: ignition links section contains changes: unreconcilable

That error is pretty clear: between those MachineConfigs the links section has been altered, and we can't reconcile that. I'll try to understand what is touching the links section. I can't connect to the cluster with your kubeconfig, though; it times out.
Ok, the links section being added is the stopgap we introduced to support pulling the pause image when it's authenticated (ref: https://github.com/openshift/machine-config-operator/pull/535) the new MachineConfig contains: ``` links: - filesystem: root overwrite: false path: /root/.docker/config.json target: /var/lib/kubelet/config.json ``` and that's causing the issue.
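The reconcilability rule the daemon applies here can be sketched as follows (a simplification of the check in the MCO's update.go; the function name and dict-based link representation are illustrative):

```python
def reconcilable(old_links, new_links):
    """Mimic the MCD's coarse rule: any change in the Ignition `links`
    section makes the target config unreconcilable, because the daemon
    cannot safely add or remove symlinks on a live system."""
    if old_links != new_links:
        return False, "ignition links section contains changes"
    return True, ""

old = []  # the original rendered config had no links entries
new = [{  # the new config gained the pause-image symlink from PR 535
    "filesystem": "root",
    "overwrite": False,
    "path": "/root/.docker/config.json",
    "target": "/var/lib/kubelet/config.json",
}]
ok, reason = reconcilable(old, new)
print(ok, reason)
```

So the single added link shown above is enough to make every node report Unreconcilable.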
This has been fixed by https://github.com/openshift/machine-config-operator/pull/540, which reverted https://github.com/openshift/machine-config-operator/pull/535; the latter added a symlink that isn't supported for reconcile. What's happening now is that:

1) you're starting a cluster with an MCO version which contains #535
2) you're upgrading to a payload which doesn't have #535 but does have #540

The above means that 1) generates MachineConfigs with an unsupported symlink, and when upgrading to 2), the symlink is removed, causing drift and an unreconcilable error.

> I also hit the same issue from 4.0.0-0.nightly-2019-03-23-222829 to 4.0.0-0.nightly-2019-03-25-154140 upgrade.

This BZ should be fixed as long as you start from a cluster installed with a payload that contains #540.
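To tell whether a starting cluster still carries the #535 symlink (and will therefore hit this on upgrade), the rendered MachineConfigs can be checked for the offending link. A sketch, assuming the object has been fetched as JSON (e.g. `oc get machineconfig <name> -o json`) and follows the Ignition 2.2 layout seen in the attached rendered-* YAMLs; the helper name is illustrative:

```python
def has_pause_symlink(machine_config: dict) -> bool:
    """Return True if a rendered MachineConfig carries the
    /var/lib/kubelet/config.json symlink added by MCO PR 535."""
    links = (machine_config.get("spec", {})
                           .get("config", {})
                           .get("storage", {})
                           .get("links", []) or [])
    return any(l.get("target") == "/var/lib/kubelet/config.json"
               for l in links)

# Minimal fixtures shaped like the affected and clean rendered configs.
affected = {"spec": {"config": {"storage": {"links": [
    {"filesystem": "root", "overwrite": False,
     "path": "/root/.docker/config.json",
     "target": "/var/lib/kubelet/config.json"}]}}}}
clean = {"spec": {"config": {"storage": {}}}}
print(has_pause_symlink(affected), has_pause_symlink(clean))
```

If any rendered config returns True, the cluster was installed from a payload containing #535 and should be reinstalled from a newer payload before upgrade testing.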
The PR I've just opened also helps transition these test scenarios from old payloads that don't contain #540: https://github.com/openshift/machine-config-operator/pull/580 But I'd consider this BZ fixed as long as you use a newer payload.
Based on comment 33, QE used a source build including #540 and tested the latest available upgrade path (from 4.0.0-0.nightly-2019-03-25-180911 to 4.0.0-0.nightly-2019-03-26-072833). The upgrade succeeded.

[root@preserve-jliu-worker tmp]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        False         64m     Cluster version is 4.0.0-0.nightly-2019-03-26-072833

[root@preserve-jliu-worker tmp]# oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2019-03-27T04:47:41Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-26-072833",
    "startedTime": "2019-03-27T04:03:21Z",
    "state": "Completed",
    "version": "4.0.0-0.nightly-2019-03-26-072833"
  },
  {
    "completionTime": "2019-03-27T04:03:21Z",
    "image": "registry.svc.ci.openshift.org/ocp/release@sha256:2d781cbe28722b6eeb3ff969c5dc68199198fd1f0514a3284eb7215ae0cb4d2f",
    "startedTime": "2019-03-27T02:59:28Z",
    "state": "Completed",
    "version": "4.0.0-0.nightly-2019-03-25-180911"
  }
]

# oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config   4.0.0-0.nightly-2019-03-26-072833   True        False         False     81m

So I am removing the blocker keyword. As for comment 34, since PR 580 is not yet available in any green path, I will keep the bug in ON_QA status and verify it once a new supported upgrade path including that build is ready for test.
Continuing from comment 36: there has not been an available update path in the last week (with a start build not including #540 and an end build including #580). Currently, all nightly builds not including #540 have been removed from [1], and the beta3 release build will include #540. So, based on the verification in comment 36, I am verifying the bug and changing its status. [1] https://openshift-release.svc.ci.openshift.org/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7374#0:build-log.txt%3A2493 I saw a similar error message in CI:

Sep 20 11:35:35.264 E clusteroperator/machine-config changed Degraded to True: RequiredPoolsFailed: Failed to resync 4.2.0-0.nightly-2019-09-20-102942 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)

Not sure whether it has the same cause. Reopening ... Please close it if not.
From the logs on one of the masters:
```
I0920 11:26:35.411199 5381 update.go:89] pod "packageserver-6cc7c655f4-k97r4" removed (evicted)
I0920 11:36:28.680522 5381 update.go:89] pod "downloads-64f8dbd46c-xdgzs" removed (evicted)
```
The upgrade timeout is caused by the downloads pod taking ~600s (10m) to evict and stop. That delay is reflected in the upgrade/roll-out time. The bug to track is: https://bugzilla.redhat.com/show_bug.cgi?id=1745772
Please file a new bug when a regression is found if the previous one has been closed. Restoring this bug's original status.