Bug 1937888 - reconciling node-exporter DaemonSet failed when upgrading from 4.1.41 to 4.2.36
Summary: reconciling node-exporter DaemonSet failed when upgrading from 4.1.41 to 4.2.36
Keywords:
Status: CLOSED EOL
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-11 17:18 UTC by Paige Rubendall
Modified: 2021-05-12 13:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-12 13:50:43 UTC
Target Upstream Version:
Embargoed:


Links
System            ID       Private  Priority     Status  Summary                                                                    Last Updated
Red Hat Bugzilla  1792033  1        unspecified  CLOSED  Updating node-exporter failed: reconciling node-exporter DaemonSet failed  2023-12-15 17:11:50 UTC

Description Paige Rubendall 2021-03-11 17:18:38 UTC
Description of problem:
Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 0, ready: 7, unavailable: 1)
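
The counters at the end of this message can be checked mechanically. Below is a minimal sketch, assuming the message keeps the `(desired: N, updated: N, ready: N, unavailable: N)` shape shown above. The names `parse_rollout_status` and `is_rolled_out` are hypothetical helpers and not part of cluster-monitoring-operator, and the readiness rule is an assumption made for illustration.

```python
import re

# Matches the "(desired: 8, updated: 0, ready: 7, unavailable: 1)" tuple.
STATUS_RE = re.compile(
    r"\(desired: (\d+), updated: (\d+), ready: (\d+), unavailable: (\d+)\)"
)


def parse_rollout_status(message):
    """Return the four counters as a dict, or None if the tuple is absent."""
    match = STATUS_RE.search(message)
    if match is None:
        return None
    keys = ("desired", "updated", "ready", "unavailable")
    return dict(zip(keys, (int(group) for group in match.groups())))


def is_rolled_out(status):
    # Assumed rule (illustration only): every desired pod is updated and
    # ready, and no pod is unavailable.
    return (
        status["updated"] == status["desired"]
        and status["ready"] == status["desired"]
        and status["unavailable"] == 0
    )


message = (
    "daemonset node-exporter is not ready. "
    "status: (desired: 8, updated: 0, ready: 7, unavailable: 1)"
)
status = parse_rollout_status(message)
print(status, is_rolled_out(status))
```

For the message in this report, none of the 8 desired pods has been updated and one is unavailable, so under the assumed rule the rollout does not count as complete.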

Version-Release number of selected component (if applicable): 4.2.36


How reproducible: 


Steps to Reproduce:
1. Start with a 4.1.41 cluster
2. Upgrade to 4.2.36 (oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.2.36-x86_64 --force=true --allow-explicit-upgrade=true)

Actual results:
The monitoring cluster operator is degraded, and the machine-config operator is still at version 4.1.41.

Expected results:
The cluster upgrades to 4.2.36 with no degraded operators.

Additional info:

08:22:02  Name:         monitoring
08:22:02  Namespace:    
08:22:02  Labels:       <none>
08:22:02  Annotations:  <none>
08:22:02  API Version:  config.openshift.io/v1
08:22:02  Kind:         ClusterOperator
08:22:02  Metadata:
08:22:02    Creation Timestamp:  2021-03-10T09:24:27Z
08:22:02    Generation:          1
08:22:02    Resource Version:    118766
08:22:02    Self Link:           /apis/config.openshift.io/v1/clusteroperators/monitoring
08:22:02    UID:                 68e2cb41-8182-11eb-8aff-022e0131c968
08:22:02  Spec:
08:22:02  Status:
08:22:02    Conditions:
08:22:02      Last Transition Time:  2021-03-10T10:37:30Z
08:22:02      Message:               Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 0, ready: 7, unavailable: 1)
08:22:02      Reason:                UpdatingnodeExporterFailed
08:22:02      Status:                True
08:22:02      Type:                  Degraded
08:22:02      Last Transition Time:  2021-03-10T13:20:51Z
08:22:02      Message:               Rollout of the monitoring stack is in progress. Please wait until it finishes.
08:22:02      Reason:                RollOutInProgress
08:22:02      Status:                True
08:22:02      Type:                  Upgradeable
08:22:02      Last Transition Time:  2021-03-10T10:32:24Z
08:22:02      Status:                False
08:22:02      Type:                  Available
08:22:02      Last Transition Time:  2021-03-10T13:20:51Z
08:22:02      Message:               Rolling out the stack.
08:22:02      Reason:                RollOutInProgress
08:22:02      Status:                True
08:22:02      Type:                  Progressing
08:22:02    Extension:               <nil>



08:22:02  Name:         machine-config
08:22:02  Namespace:    
08:22:02  Labels:       <none>
08:22:02  Annotations:  <none>
08:22:02  API Version:  config.openshift.io/v1
08:22:02  Kind:         ClusterOperator
08:22:02  Metadata:
08:22:02    Creation Timestamp:  2021-03-10T09:20:08Z
08:22:02    Generation:          1
08:22:02    Resource Version:    24283
08:22:02    Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
08:22:02    UID:                 cf07284c-8181-11eb-8ac9-02e6e7f0839c
08:22:02  Spec:
08:22:02  Status:
08:22:02    Conditions:
08:22:02      Last Transition Time:  2021-03-10T09:20:38Z
08:22:02      Message:               Cluster has deployed 4.1.41
08:22:02      Status:                True
08:22:02      Type:                  Available
08:22:02      Last Transition Time:  2021-03-10T09:20:38Z
08:22:02      Message:               Cluster version is 4.1.41
08:22:02      Status:                False
08:22:02      Type:                  Progressing
08:22:02      Last Transition Time:  2021-03-10T09:20:09Z
08:22:02      Status:                False
08:22:02      Type:                  Degraded
08:22:02    Extension:
08:22:02      Master:  all 3 nodes are at latest configuration rendered-master-280edc5f78074808902bcc763bc8ad0a
08:22:02      Worker:  all 5 nodes are at latest configuration rendered-worker-a3cd245deb264bce3d323aa752916c25
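
The condition blocks in dumps like the two above can be turned into records for easier comparison. This is a rough sketch, assuming the layout shown here: a capture-time prefix such as `08:22:02` followed by two spaces, a `Conditions:` line, and each condition starting at `Last Transition Time`. The function name `parse_conditions` and the abbreviated sample text are illustrative only.

```python
import re

# Strips the "08:22:02  " capture-time prefix seen in the dump above.
PREFIX_RE = re.compile(r"^\d{2}:\d{2}:\d{2}\s{2}")


def parse_conditions(text):
    """Collect each status condition from describe-style output as a dict."""
    conditions = []
    in_conditions = False
    for raw in text.splitlines():
        stripped = PREFIX_RE.sub("", raw).strip()
        if stripped == "Conditions:":
            in_conditions = True
            continue
        if not in_conditions or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Extension":
            break  # the Extension field ends the conditions list
        if key == "Last Transition Time":
            conditions.append({})  # a new condition block starts here
        if conditions:
            conditions[-1][key] = value
    return conditions


# Abbreviated sample in the same layout as the monitoring dump.
sample = """\
08:22:02  Status:
08:22:02    Conditions:
08:22:02      Last Transition Time:  2021-03-10T10:37:30Z
08:22:02      Message:               Failed to rollout the stack.
08:22:02      Reason:                UpdatingnodeExporterFailed
08:22:02      Status:                True
08:22:02      Type:                  Degraded
08:22:02      Last Transition Time:  2021-03-10T10:32:24Z
08:22:02      Status:                False
08:22:02      Type:                  Available
08:22:02    Extension:               <nil>
"""

conditions = parse_conditions(sample)
true_types = [c["Type"] for c in conditions if c.get("Status") == "True"]
print(true_types)
```

The parser stops at the first `Extension` field, so it handles one ClusterOperator dump per call.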

Comment 3 Ke Wang 2021-05-12 11:30:42 UTC
We still hit this problem when upgrading from 4.7 to a 4.8 nightly, so I have to reopen the bug.

Upgrade command: ./oc adm upgrade --to-image=vmc.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.8.0-0.nightly-2021-04-25-110331 --force=true --allow-explicit-upgrade=true

$ oc get node
 NAME              STATUS                        ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
 compute-0         Ready                         worker   4h26m   v1.20.0+7d0a2b2   172.31.246.44   172.31.246.44   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
 compute-1         NotReady,SchedulingDisabled   worker   4h26m   v1.20.0+7d0a2b2   172.31.246.61   172.31.246.61   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
 control-plane-0   Ready                         master   4h39m   v1.20.0+7d0a2b2   172.31.246.28   172.31.246.28   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
 control-plane-1   NotReady,SchedulingDisabled   master   4h39m   v1.20.0+7d0a2b2   172.31.246.52   172.31.246.52   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
 control-plane-2   Ready                         master   4h39m   v1.20.0+7d0a2b2   172.31.246.41   172.31.246.41   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8

$ oc get co
 NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
 authentication                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
 baremetal                                  4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 cloud-credential                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h39m
 cluster-autoscaler                         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 config-operator                            4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 console                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
 csi-snapshot-controller                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      149m
 dns                                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
 etcd                                       4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
 image-registry                             4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
 ingress                                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h25m
 insights                                   4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h31m
 kube-apiserver                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h31m
 kube-controller-manager                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h32m
 kube-scheduler                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
 kube-storage-version-migrator              4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
 machine-api                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h25m
 machine-approver                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 machine-config                             4.7.0-0.nightly-2021-04-25-102429   False       True          True       132m
 marketplace                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h8m
 monitoring                                 4.8.0-0.nightly-2021-04-25-110331   False       True          True       83m
 network                                    4.8.0-0.nightly-2021-04-25-110331   True        True          True       4h38m
 node-tuning                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
 openshift-apiserver                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
 openshift-controller-manager               4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h33m
 openshift-samples                          4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
 operator-lifecycle-manager                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
 service-ca                                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
 storage                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h10m
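
The table above can be filtered for operators that lag the target version or report an unhealthy state. This is a small sketch, assuming whitespace-separated columns with the header `NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE`. The helper names and the four-row excerpt are illustrative, and rows with missing columns are skipped rather than guessed at.

```python
def parse_co_table(text):
    """Turn `oc get co`-style output into a list of dicts keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows with missing columns
        rows.append(dict(zip(header, fields)))
    return rows


def needs_attention(rows, target_version):
    """Names of operators not at target, not available, or degraded."""
    return [
        row["NAME"]
        for row in rows
        if row["VERSION"] != target_version
        or row["AVAILABLE"] != "True"
        or row["DEGRADED"] == "True"
    ]


# Four rows excerpted from the output above.
sample = """\
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
baremetal        4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
machine-config   4.7.0-0.nightly-2021-04-25-102429   False       True          True       132m
monitoring       4.8.0-0.nightly-2021-04-25-110331   False       True          True       83m
"""

rows = parse_co_table(sample)
flagged = needs_attention(rows, "4.8.0-0.nightly-2021-04-25-110331")
print(flagged)
```

In this excerpt machine-config is flagged both for its 4.7 version and for its conditions, while monitoring is at the target version but is unavailable and degraded.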
 
$ oc describe co/machine-config
Name:         machine-config
 Namespace:    
 Labels:       <none>
 Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
               include.release.openshift.io/self-managed-high-availability: true
               include.release.openshift.io/single-node-developer: true
 API Version:  config.openshift.io/v1
 Kind:         ClusterOperator
 Metadata:
   Creation Timestamp:  2021-04-25T16:42:19Z
   Generation:          1
   Managed Fields:
     API Version:  config.openshift.io/v1
     Fields Type:  FieldsV1
     fieldsV1:
       f:metadata:
         f:annotations:
           .:
           f:exclude.release.openshift.io/internal-openshift-hosted:
           f:include.release.openshift.io/self-managed-high-availability:
           f:include.release.openshift.io/single-node-developer:
       f:spec:
       f:status:
         .:
         f:versions:
     Manager:      cluster-version-operator
     Operation:    Update
     Time:         2021-04-25T16:42:19Z
     API Version:  config.openshift.io/v1
     Fields Type:  FieldsV1
     fieldsV1:
       f:status:
         f:conditions:
         f:extension:
           .:
           f:master:
           f:worker:
         f:relatedObjects:
         f:versions:
     Manager:         machine-config-operator
     Operation:       Update
     Time:            2021-04-25T20:05:22Z
   Resource Version:  141689
   UID:               cdf4f3fa-b3ca-4abe-bdb2-63dff6efbe9a
 Spec:
 Status:
   Conditions:
     Last Transition Time:  2021-04-25T19:01:50Z
     Message:               Working towards 4.8.0-0.nightly-2021-04-25-110331
     Status:                True
     Type:                  Progressing
     Last Transition Time:  2021-04-25T20:05:22Z
     Message:               Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)
     Reason:                MachineConfigDaemonFailed
     Status:                True
     Type:                  Degraded
     Last Transition Time:  2021-04-25T19:11:51Z
     Message:               Cluster not available for 4.8.0-0.nightly-2021-04-25-110331
     Status:                False
     Type:                  Available
     Last Transition Time:  2021-04-25T19:55:23Z
     Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
     Reason:                PoolUpdating
     Status:                False
     Type:                  Upgradeable
   Extension:
     Master:  0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8
     Worker:  0 (ready 0) out of 2 nodes are updating to latest configuration rendered-worker-5570462504cf4f902167d5d3ce228ac4
   Related Objects:
     Group:     
     Name:      openshift-machine-config-operator
     Resource:  namespaces
     Group:     machineconfiguration.openshift.io
     Name:      
     Resource:  machineconfigpools
     Group:     machineconfiguration.openshift.io
     Name:      
     Resource:  controllerconfigs
     Group:     machineconfiguration.openshift.io
     Name:      
     Resource:  kubeletconfigs
     Group:     machineconfiguration.openshift.io
     Name:      
     Resource:  containerruntimeconfigs
     Group:     machineconfiguration.openshift.io
     Name:      
     Resource:  machineconfigs
     Group:     
     Name:      
     Resource:  nodes
     Group:     
     Name:      openshift-kni-infra
     Resource:  namespaces
     Group:     
     Name:      openshift-openstack-infra
     Resource:  namespaces
     Group:     
     Name:      openshift-ovirt-infra
     Resource:  namespaces
     Group:     
     Name:      openshift-vsphere-infra
     Resource:  namespaces
   Versions:
     Name:     operator
     Version:  4.7.0-0.nightly-2021-04-25-102429
 Events:       <none>

Comment 5 Ke Wang 2021-05-12 11:37:53 UTC
I did a quick search in the must-gather logs:

$ grep -nr 'E0425 20:05'
namespaces/openshift-machine-config-operator/pods/machine-config-operator-54b676975d-msxnw/machine-config-operator/machine-config-operator/logs/current.log:627:2021-04-25T20:05:22.414938630Z E0425 20:05:22.414822       1 sync.go:639] Error syncing Required MachineConfigPools: "pool master has not progressed to latest configuration: controller version mismatch for rendered-master-6c5851d411826109697ed6d4b1f404b6 expected 0c69300057bac1ea65d544ab0e22b378690b2488 has ac79a2ffc6002f086b3fa80003b278b635d2055a: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8, retrying"

namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:84:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044560       1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed

namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:85:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044589       1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 2 unavailable nodes
...

The third error log, at 20:05:19.044589, is the root cause of the monitoring cluster operator being DEGRADED.
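
To compare such hits across operators, the grep output can be parsed and ordered by the klog timestamp. This is a sketch only, assuming the `path:line:RFC3339 Emmdd hh:mm:ss.uuuuuu pid file:line] message` layout of the lines above and that all hits fall on the same day. `parse_hits` is a hypothetical helper, and the sample messages are shortened.

```python
import re

# One must-gather grep hit: path, line number, container timestamp, klog header, message.
KLOG_RE = re.compile(
    r"^namespaces/(?P<namespace>[^/]+)/pods/(?P<pod>[^/]+)/\S*?:\d+:\S+ "
    r"(?P<severity>[IWEF])\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_hits(lines):
    """Parse grep hits and return them sorted by klog time of day."""
    hits = []
    for line in lines:
        match = KLOG_RE.match(line)
        if match is not None:
            hits.append(match.groupdict())
    # Fixed-width hh:mm:ss.uuuuuu strings sort correctly as text.
    return sorted(hits, key=lambda hit: hit["time"])


# Two hits from the search above, with the messages shortened.
sample = [
    "namespaces/openshift-machine-config-operator/pods/"
    "machine-config-operator-54b676975d-msxnw/machine-config-operator/"
    "machine-config-operator/logs/current.log:627:2021-04-25T20:05:22.414938630Z "
    "E0425 20:05:22.414822       1 sync.go:639] Error syncing Required MachineConfigPools",
    "namespaces/openshift-monitoring/pods/"
    "cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/"
    "cluster-monitoring-operator/logs/current.log:84:2021-04-25T20:05:19.044608892Z "
    'E0425 20:05:19.044560       1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed',
]

for hit in parse_hits(sample):
    print(hit["time"], hit["namespace"], hit["source"])
```

Ordering the hits only shows which error was logged first. It does not by itself establish which failure caused the other.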

Comment 6 Junqi Zhao 2021-05-12 13:35:12 UTC
(In reply to Ke Wang from comment #3)
> We still hit this problem when upgrade from 4.7 to 4.8 nightly, so I have to
> reopen the bug.

The network operator is also degraded, which affects monitoring.

