Description of problem:
Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 0, ready: 7, unavailable: 1)

Version-Release number of selected component (if applicable):
4.2.36

How reproducible:

Steps to Reproduce:
1. 4.1.41 cluster
2. Upgrade to 4.2.36 (oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.2.36-x86_64 --force=true --allow-explicit-upgrade=true)

Actual results:
Monitoring cluster operator degraded. Machine config operator still at 4.1.41 version

Expected results:
Upgrades to 4.2.36 and no degraded operators

Additional info:
08:22:02 Name:         monitoring
08:22:02 Namespace:
08:22:02 Labels:       <none>
08:22:02 Annotations:  <none>
08:22:02 API Version:  config.openshift.io/v1
08:22:02 Kind:         ClusterOperator
08:22:02 Metadata:
08:22:02   Creation Timestamp:  2021-03-10T09:24:27Z
08:22:02   Generation:          1
08:22:02   Resource Version:    118766
08:22:02   Self Link:           /apis/config.openshift.io/v1/clusteroperators/monitoring
08:22:02   UID:                 68e2cb41-8182-11eb-8aff-022e0131c968
08:22:02 Spec:
08:22:02 Status:
08:22:02   Conditions:
08:22:02     Last Transition Time:  2021-03-10T10:37:30Z
08:22:02     Message:               Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 0, ready: 7, unavailable: 1)
08:22:02     Reason:                UpdatingnodeExporterFailed
08:22:02     Status:                True
08:22:02     Type:                  Degraded
08:22:02     Last Transition Time:  2021-03-10T13:20:51Z
08:22:02     Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
08:22:02     Reason:                RollOutInProgress
08:22:02     Status:                True
08:22:02     Type:                  Upgradeable
08:22:02     Last Transition Time:  2021-03-10T10:32:24Z
08:22:02     Status:                False
08:22:02     Type:                  Available
08:22:02     Last Transition Time:  2021-03-10T13:20:51Z
08:22:02     Message:               Rolling out the stack.
08:22:02     Reason:                RollOutInProgress
08:22:02     Status:                True
08:22:02     Type:                  Progressing
08:22:02   Extension:               <nil>

08:22:02 Name:         machine-config
08:22:02 Namespace:
08:22:02 Labels:       <none>
08:22:02 Annotations:  <none>
08:22:02 API Version:  config.openshift.io/v1
08:22:02 Kind:         ClusterOperator
08:22:02 Metadata:
08:22:02   Creation Timestamp:  2021-03-10T09:20:08Z
08:22:02   Generation:          1
08:22:02   Resource Version:    24283
08:22:02   Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
08:22:02   UID:                 cf07284c-8181-11eb-8ac9-02e6e7f0839c
08:22:02 Spec:
08:22:02 Status:
08:22:02   Conditions:
08:22:02     Last Transition Time:  2021-03-10T09:20:38Z
08:22:02     Message:               Cluster has deployed 4.1.41
08:22:02     Status:                True
08:22:02     Type:                  Available
08:22:02     Last Transition Time:  2021-03-10T09:20:38Z
08:22:02     Message:               Cluster version is 4.1.41
08:22:02     Status:                False
08:22:02     Type:                  Progressing
08:22:02     Last Transition Time:  2021-03-10T09:20:09Z
08:22:02     Status:                False
08:22:02     Type:                  Degraded
08:22:02   Extension:
08:22:02     Master:  all 3 nodes are at latest configuration rendered-master-280edc5f78074808902bcc763bc8ad0a
08:22:02     Worker:  all 5 nodes are at latest configuration rendered-worker-a3cd245deb264bce3d323aa752916c25
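The Degraded message in this report carries its diagnosis in the trailing counter tuple. A minimal sketch of how those counters could be pulled out of the message for scripted checks follows. The helper names `parse_rollout_status` and `rollout_complete` are hypothetical, and the completion criterion (all desired pods updated and ready, none unavailable) is an assumption for illustration, not taken from the operator source.

```python
import re

# Matches the counter tuple at the end of the rollout error message,
# e.g. "(desired: 8, updated: 0, ready: 7, unavailable: 1)".
STATUS_RE = re.compile(
    r"\(desired: (\d+), updated: (\d+), ready: (\d+), unavailable: (\d+)\)"
)


def parse_rollout_status(message):
    """Return the four counters as a dict, or None if the tuple is absent."""
    match = STATUS_RE.search(message)
    if match is None:
        return None
    keys = ("desired", "updated", "ready", "unavailable")
    return dict(zip(keys, map(int, match.groups())))


def rollout_complete(counts):
    """Assumed criterion: every desired pod is updated and ready."""
    return (
        counts["updated"] == counts["desired"]
        and counts["ready"] == counts["desired"]
        and counts["unavailable"] == 0
    )


# Message text copied from the Degraded condition in this report.
MESSAGE = (
    "daemonset node-exporter is not ready. "
    "status: (desired: 8, updated: 0, ready: 7, unavailable: 1)"
)

counts = parse_rollout_status(MESSAGE)
print(counts, rollout_complete(counts))
```

For the message above, `updated: 0` against `desired: 8` indicates that no pod has picked up the new DaemonSet revision yet, which is a different symptom from pods being updated but not ready.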
We still hit this problem when upgrading from 4.7 to 4.8 nightly, so I have to reopen the bug.

Upgrade command:
./oc adm upgrade --to-image=vmc.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.8.0-0.nightly-2021-04-25-110331 --force=true --allow-explicit-upgrade=true

$ oc get node
NAME              STATUS                        ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
compute-0         Ready                         worker   4h26m   v1.20.0+7d0a2b2   172.31.246.44   172.31.246.44   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
compute-1         NotReady,SchedulingDisabled   worker   4h26m   v1.20.0+7d0a2b2   172.31.246.61   172.31.246.61   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
control-plane-0   Ready                         master   4h39m   v1.20.0+7d0a2b2   172.31.246.28   172.31.246.28   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
control-plane-1   NotReady,SchedulingDisabled   master   4h39m   v1.20.0+7d0a2b2   172.31.246.52   172.31.246.52   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
control-plane-2   Ready                         master   4h39m   v1.20.0+7d0a2b2   172.31.246.41   172.31.246.41   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
baremetal                                  4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
cloud-credential                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h39m
cluster-autoscaler                         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
config-operator                            4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
console                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
csi-snapshot-controller                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      149m
dns                                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
etcd                                       4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
image-registry                             4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
ingress                                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h25m
insights                                   4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h31m
kube-apiserver                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h31m
kube-controller-manager                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h32m
kube-scheduler                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
kube-storage-version-migrator              4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
machine-api                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h25m
machine-approver                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
machine-config                             4.7.0-0.nightly-2021-04-25-102429   False       True          True       132m
marketplace                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h8m
monitoring                                 4.8.0-0.nightly-2021-04-25-110331   False       True          True       83m
network                                    4.8.0-0.nightly-2021-04-25-110331   True        True          True       4h38m
node-tuning                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
openshift-apiserver                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
openshift-controller-manager               4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h33m
openshift-samples                          4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
service-ca                                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
storage                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h10m

$ oc describe co/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-04-25T16:42:19Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-04-25T16:42:19Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-04-25T20:05:22Z
  Resource Version:  141689
  UID:               cdf4f3fa-b3ca-4abe-bdb2-63dff6efbe9a
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-04-25T19:01:50Z
    Message:               Working towards 4.8.0-0.nightly-2021-04-25-110331
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-04-25T20:05:22Z
    Message:               Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-04-25T19:11:51Z
    Message:               Cluster not available for 4.8.0-0.nightly-2021-04-25-110331
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-04-25T19:55:23Z
    Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
    Reason:                PoolUpdating
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8
    Worker:  0 (ready 0) out of 2 nodes are updating to latest configuration rendered-worker-5570462504cf4f902167d5d3ce228ac4
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.0-0.nightly-2021-04-25-102429
Events:       <none>
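With about thirty operators in the `oc get co` output, the interesting rows are easy to miss. The following is a rough sketch of how that table could be triaged from saved text. The function name `triage_operators` is hypothetical, and it assumes exactly the six whitespace-separated columns shown in this comment (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with no empty cells; output with a different layout would need different parsing.

```python
def triage_operators(table_text, target_version):
    """Map operator name -> list of reasons it needs attention.

    Assumes six whitespace-separated columns per row, as in the
    `oc get co` output pasted in this comment.
    """
    findings = {}
    for row in table_text.strip().splitlines():
        fields = row.split()
        if len(fields) < 6 or fields[0] == "NAME":
            continue  # skip the header and anything malformed
        name, version, available, _progressing, degraded = fields[:5]
        reasons = []
        if version != target_version:
            reasons.append("version " + version)
        if available != "True":
            reasons.append("not available")
        if degraded == "True":
            reasons.append("degraded")
        if reasons:
            findings[name] = reasons
    return findings


# A few rows copied from the table in this comment.
SAMPLE = """
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-04-25-110331 True False True 84m
baremetal 4.8.0-0.nightly-2021-04-25-110331 True False False 4h38m
machine-config 4.7.0-0.nightly-2021-04-25-102429 False True True 132m
monitoring 4.8.0-0.nightly-2021-04-25-110331 False True True 83m
network 4.8.0-0.nightly-2021-04-25-110331 True True True 4h38m
"""

TARGET = "4.8.0-0.nightly-2021-04-25-110331"
for name, reasons in triage_operators(SAMPLE, TARGET).items():
    print(name + ": " + ", ".join(reasons))
```

On the sample rows, machine-config is the only operator flagged for a version mismatch, which matches the observation that it is still at the 4.7 nightly.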
I did a quick search in the must-gather logs:

$ grep -nr 'E0425 20:05'
namespaces/openshift-machine-config-operator/pods/machine-config-operator-54b676975d-msxnw/machine-config-operator/machine-config-operator/logs/current.log:627:2021-04-25T20:05:22.414938630Z E0425 20:05:22.414822 1 sync.go:639] Error syncing Required MachineConfigPools: "pool master has not progressed to latest configuration: controller version mismatch for rendered-master-6c5851d411826109697ed6d4b1f404b6 expected 0c69300057bac1ea65d544ab0e22b378690b2488 has ac79a2ffc6002f086b3fa80003b278b635d2055a: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8, retrying"
namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:84:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044560 1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed
namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:85:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044589 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 2 unavailable nodes
...

The third error log, at 20:05:19.044589, shows the root cause of the monitoring cluster operator being DEGRADED: the node-exporter DaemonSet rollout is waiting on 2 unavailable nodes.
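To compare errors across operators, it can help to split each klog header into fields and order the entries by time. Here is a minimal sketch, assuming the header layout visible in the grep hits above (severity letter, MMDD, time, PID, file:line, then the message). The names `parse_klog` and `errors_in_order` are hypothetical, and the sample messages are shortened copies of the lines quoted in this comment.

```python
import re

# Header layout as seen in the grep hits: "E0425 20:05:19.044589 1 operator.go:400] msg"
KLOG_RE = re.compile(
    r"(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<pid>\d+) "
    r"(?P<file>[^\s:]+):(?P<line>\d+)\] (?P<msg>.*)"
)


def parse_klog(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = KLOG_RE.search(line)
    return match.groupdict() if match else None


def errors_in_order(lines):
    """Keep only E-level entries and sort them by month, day, and time."""
    parsed = [p for p in map(parse_klog, lines) if p and p["level"] == "E"]
    return sorted(parsed, key=lambda p: (p["month"], p["day"], p["time"]))


# Shortened copies of two log lines from this comment.
SAMPLE = [
    'E0425 20:05:22.414822 1 sync.go:639] Error syncing Required MachineConfigPools',
    'E0425 20:05:19.044589 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed',
]

for entry in errors_in_order(SAMPLE):
    print(entry["time"], entry["file"], entry["msg"])
```

The sort relies on the time field being fixed-width, so plain string comparison gives chronological order within one day.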
(In reply to Ke Wang from comment #3)
> We still hit this problem when upgrade from 4.7 to 4.8 nightly, so I have to
> reopen the bug.

The network operator is also degraded, and that affects monitoring.