Bug 2000937 - clusteroperator/machine-config condition/Degraded status/True: pool master has not progressed to latest configuration
Summary: clusteroperator/machine-config condition/Degraded status/True: pool master ha...
Keywords:
Status: CLOSED DUPLICATE of bug 1955300
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: ---
Assignee: Kirsten Garrison
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-03 10:33 UTC by Jan Chaloupka
Modified: 2022-05-18 19:47 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
jobs=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
Last Closed: 2022-05-18 19:47:43 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Jan Chaloupka 2021-09-03 10:33:35 UTC
From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1433645078152220672:
```
1 unexpected clusteroperator state transitions during e2e test run

Sep 03 05:40:33.602 - 360s  E clusteroperator/machine-config condition/Degraded status/True reason/Unable to apply 4.9.0-0.ci-2021-09-03-041553: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: osImageURL mismatch for master in rendered-master-94b6dedbfd9e1d6dce3416430266bb87 expected: registry.ci.openshift.org/ocp/4.9-2021-09-03-041553@sha256:059300e82d52ca56a6db70333fb21760aae18651842c69cad15f365cdfc7df18 got: registry.ci.openshift.org/ocp/4.9-2021-09-01-084637@sha256:d892a013b0902c3e46aea9eb5aa8eed3688509953b56267185af210d03ffeba4, retrying
```

From master nodes:
```
Sep 03 05:30:36.617857 ci-op-x2v3qh91-875d2-2nz7t-master-0 root[174575]: machine-config-daemon[168154]: Starting update from rendered-master-94b6dedbfd9e1d6dce3416430266bb87 to rendered-master-08d7a4ddc516d7b00e3991138b717bfc: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Sep 03 05:32:27.806195 ci-op-x2v3qh91-875d2-2nz7t-master-0 logger[182092]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:32:27.818188 ci-op-x2v3qh91-875d2-2nz7t-master-0 root[182100]: machine-config-daemon[168154]: initiating reboot: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:32:27.832613 ci-op-x2v3qh91-875d2-2nz7t-master-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:32:27.899752 ci-op-x2v3qh91-875d2-2nz7t-master-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:34:54.006376 ci-op-x2v3qh91-875d2-2nz7t-master-0 root[5926]: machine-config-daemon[3379]: Disk currentConfig rendered-master-08d7a4ddc516d7b00e3991138b717bfc overrides node's currentConfig annotation rendered-master-94b6dedbfd9e1d6dce3416430266bb87
Sep 03 05:34:54.194894 ci-op-x2v3qh91-875d2-2nz7t-master-0 logger[5993]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:34:54.359954 ci-op-x2v3qh91-875d2-2nz7t-master-0 root[6037]: machine-config-daemon[3379]: Update completed for config rendered-master-08d7a4ddc516d7b00e3991138b717bfc and node has been successfully uncordoned

Sep 03 05:35:00.384504 ci-op-x2v3qh91-875d2-2nz7t-master-1 root[193045]: machine-config-daemon[173792]: Starting update from rendered-master-94b6dedbfd9e1d6dce3416430266bb87 to rendered-master-08d7a4ddc516d7b00e3991138b717bfc: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Sep 03 05:38:15.527946 ci-op-x2v3qh91-875d2-2nz7t-master-1 logger[204716]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:38:15.537691 ci-op-x2v3qh91-875d2-2nz7t-master-1 root[204718]: machine-config-daemon[173792]: initiating reboot: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:38:15.555313 ci-op-x2v3qh91-875d2-2nz7t-master-1 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:38:15.629863 ci-op-x2v3qh91-875d2-2nz7t-master-1 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:40:42.244999 ci-op-x2v3qh91-875d2-2nz7t-master-1 root[5848]: machine-config-daemon[3413]: Disk currentConfig rendered-master-08d7a4ddc516d7b00e3991138b717bfc overrides node's currentConfig annotation rendered-master-94b6dedbfd9e1d6dce3416430266bb87
Sep 03 05:40:42.368972 ci-op-x2v3qh91-875d2-2nz7t-master-1 logger[5892]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:40:42.414998 ci-op-x2v3qh91-875d2-2nz7t-master-1 root[5914]: machine-config-daemon[3413]: Update completed for config rendered-master-08d7a4ddc516d7b00e3991138b717bfc and node has been successfully uncordoned

Sep 03 05:40:49.154320 ci-op-x2v3qh91-875d2-2nz7t-master-2 root[202993]: machine-config-daemon[167007]: Starting update from rendered-master-94b6dedbfd9e1d6dce3416430266bb87 to rendered-master-08d7a4ddc516d7b00e3991138b717bfc: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Sep 03 05:44:00.170362 ci-op-x2v3qh91-875d2-2nz7t-master-2 logger[213749]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:44:00.178162 ci-op-x2v3qh91-875d2-2nz7t-master-2 root[213751]: machine-config-daemon[167007]: initiating reboot: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:44:00.194951 ci-op-x2v3qh91-875d2-2nz7t-master-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:44:00.277242 ci-op-x2v3qh91-875d2-2nz7t-master-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-08d7a4ddc516d7b00e3991138b717bfc.
Sep 03 05:46:28.024424 ci-op-x2v3qh91-875d2-2nz7t-master-2 root[6179]: machine-config-daemon[3441]: Disk currentConfig rendered-master-08d7a4ddc516d7b00e3991138b717bfc overrides node's currentConfig annotation rendered-master-94b6dedbfd9e1d6dce3416430266bb87
Sep 03 05:46:28.440718 ci-op-x2v3qh91-875d2-2nz7t-master-2 logger[6277]: rendered-master-08d7a4ddc516d7b00e3991138b717bfc
Sep 03 05:46:28.506168 ci-op-x2v3qh91-875d2-2nz7t-master-2 root[6306]: machine-config-daemon[3441]: Update completed for config rendered-master-08d7a4ddc516d7b00e3991138b717bfc and node has been successfully uncordoned
```

All three master nodes eventually got upgraded to rendered-master-08d7a4ddc516d7b00e3991138b717bfc. It just took longer than the operator anticipated.
The first node upgrade (master-0) started at 05:30:36.617857. The last node upgrade (master-2) started at 05:40:49.154320 and ended at 05:46:28.506168.

From the machine-config-operator:
```
I0903 05:46:34.810620       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"20308572-e715-4af9-9e52-8aa6019ca4ae", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.9.0-0.ci-2021-09-01-084637}] to [{operator 4.9.0-0.ci-2021-09-03-041553}]
I0903 05:42:58.001116       1 operator.go:262] Starting MachineConfigOperator
I0903 05:42:57.300620       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
I0903 05:41:01.624357       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0903 05:41:01.585104       1 start.go:43] Version: 4.9.0-0.ci-2021-09-03-041553 (Raw: machine-config-daemon-4.6.0-202006240615.p0-1020-g62d11bb9-dirty, Hash: 62d11bb969db4b43770383344826c58049df3803)
I0903 05:40:33.641450       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"20308572-e715-4af9-9e52-8aa6019ca4ae", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorNotAvailable' Cluster not available for 4.9.0-0.ci-2021-09-03-041553
I0903 05:40:33.574235       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"20308572-e715-4af9-9e52-8aa6019ca4ae", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: RequiredPoolsFailed' Unable to apply 4.9.0-0.ci-2021-09-03-041553: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: osImageURL mismatch for master in rendered-master-94b6dedbfd9e1d6dce3416430266bb87 expected: registry.ci.openshift.org/ocp/4.9-2021-09-03-041553@sha256:059300e82d52ca56a6db70333fb21760aae18651842c69cad15f365cdfc7df18 got: registry.ci.openshift.org/ocp/4.9-2021-09-01-084637@sha256:d892a013b0902c3e46aea9eb5aa8eed3688509953b56267185af210d03ffeba4, retrying
E0903 05:40:33.558643       1 sync.go:638] Error syncing Required MachineConfigPools: "pool master has not progressed to latest configuration: osImageURL mismatch for master in rendered-master-94b6dedbfd9e1d6dce3416430266bb87 expected: registry.ci.openshift.org/ocp/4.9-2021-09-03-041553@sha256:059300e82d52ca56a6db70333fb21760aae18651842c69cad15f365cdfc7df18 got: registry.ci.openshift.org/ocp/4.9-2021-09-01-084637@sha256:d892a013b0902c3e46aea9eb5aa8eed3688509953b56267185af210d03ffeba4, retrying"
I0903 05:27:30.046162       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"20308572-e715-4af9-9e52-8aa6019ca4ae", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.9.0-0.ci-2021-09-01-084637}] to [{operator 4.9.0-0.ci-2021-09-03-041553}]
I0903 05:27:30.037363       1 operator.go:262] Starting MachineConfigOperator
I0903 05:27:29.088762       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
I0903 05:25:33.401249       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0903 05:25:33.398617       1 start.go:43] Version: 4.9.0-0.ci-2021-09-03-041553 (Raw: machine-config-daemon-4.6.0-202006240615.p0-1020-g62d11bb9-dirty, Hash: 62d11bb969db4b43770383344826c58049df3803)
```

The operator registers all master nodes got upgraded at 05:46:34.810620. The condition/Degraded changed to True at 05:40:33.558643. It took ~6 additional minutes to finish the upgrade.

From vendor/github.com/openshift/api/config/v1/types_cluster_operator.go:
```
// Degraded indicates that the operator's current state does not match its
// desired state over a period of time resulting in a lower quality of service.
// The period of time may vary by component, but a Degraded state represents
// persistent observation of a condition.  As a result, a component should not
// oscillate in and out of Degraded state.  A service may be Available even
// if its degraded.  For example, your service may desire 3 running pods, but 1
// pod is crash-looping.  The service is Available but Degraded because it
// may have a lower quality of service.  A component may be Progressing but
// not Degraded because the transition from one state to another does not
// persist over a long enough period to report Degraded.  A service should not
// report Degraded during the course of a normal upgrade.  A service may report
// Degraded in response to a persistent infrastructure failure that requires
// eventual administrator intervention.  For example, if a control plane host
// is unhealthy and must be replaced.  An operator should report Degraded if
// unexpected errors occur over a period, but the expectation is that all
// unexpected errors are handled as operators mature.
OperatorDegraded ClusterStatusConditionType = "Degraded"
```

Mainly:
- Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service.
- A component may be Progressing but not Degraded because the transition from one state to another does not persist over a long enough period to report Degraded.
- A service should not report Degraded during the course of a normal upgrade.

Given the master nodes are getting upgraded, the operator is not expected to change the condition/Degraded to True
since "pool master has not progressed to latest configuration" error does not persist over a long period and occurs during the normal upgrade.

Expected behaviour:
- The machine-config-operator does not change condition/Degraded to True during the normal upgrade.

Comment 1 Jan Chaloupka 2021-09-03 10:59:56 UTC
From the same job:
```
1 unexpected clusteroperator state transitions during e2e test run

Sep 03 09:00:00.489 - 289s  E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.0-0.ci-2021-09-03-073535
```

From the machine-config-operator (through loki):
```
I0903 05:40:33.641450       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"20308572-e715-4af9-9e52-8aa6019ca4ae", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorNotAvailable' Cluster not available for 4.9.0-0.ci-2021-09-03-041553
```

From vendor/github.com/openshift/api/config/v1/types_cluster_operator.go:
```
        // Available indicates that the operand (eg: openshift-apiserver for the
        // openshift-apiserver-operator), is functional and available in the cluster.
        // Available=False means at least part of the component is non-functional,
        // and that the condition requires immediate administrator intervention.
        OperatorAvailable ClusterStatusConditionType = "Available"
```

Mainly:
- Available=False means at least part of the component is non-functional, and that the condition requires immediate administrator intervention.

The component is obviously stil functional so it's incorrect to change the condition/Available to False

From https://github.com/openshift/machine-config-operator/blob/62d11bb969db4b43770383344826c58049df3803/pkg/operator/status.go#L109-L111:
```
	available := configv1.ConditionTrue

	if degraded {
		available = configv1.ConditionFalse
		message = fmt.Sprintf("Cluster not available for %s", optrVersion)
	...
```

When the operator goes Degraded=True, it goes Available=false.

Fixing the condition/Degraded going True avoids the condition/Available going False.

Comment 2 Ke Wang 2021-09-15 08:36:41 UTC
Reproduced the bug, upgrade cluster IPI installed on AWS with FIPS on, OVN network from original build: 4.7.29-x86_64 to target_build: 4.8.0-0.nightly-2021-09-08-225533 

$ oc get node
NAME                                         STATUS                     ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
09-09 20:24:36.168  ip-10-0-136-50.us-east-2.compute.internal    Ready,SchedulingDisabled   master   4h17m   v1.20.0+9689d22   10.0.136.50    <none>        Red Hat Enterprise Linux CoreOS 47.84.202109010857-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-14.rhaos4.7.gitbce257b.el8
09-09 20:24:36.168  ip-10-0-150-141.us-east-2.compute.internal   Ready                      worker   4h2m    v1.21.1+9807387   10.0.150.141   <none>        Red Hat Enterprise Linux CoreOS 48.84.202109080059-0 (Ootpa)   4.18.0-305.17.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-09 20:24:36.168  ip-10-0-162-91.us-east-2.compute.internal    Ready                      master   4h13m   v1.20.0+9689d22   10.0.162.91    <none>        Red Hat Enterprise Linux CoreOS 47.84.202109010857-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.20.4-14.rhaos4.7.gitbce257b.el8
09-09 20:24:36.168  ip-10-0-174-70.us-east-2.compute.internal    Ready                      worker   4h2m    v1.21.1+9807387   10.0.174.70    <none>        Red Hat Enterprise Linux CoreOS 48.84.202109080059-0 (Ootpa)   4.18.0-305.17.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-09 20:24:36.168  ip-10-0-196-207.us-east-2.compute.internal   Ready                      worker   4h3m    v1.21.1+9807387   10.0.196.207   <none>        Red Hat Enterprise Linux CoreOS 48.84.202109080059-0 (Ootpa)   4.18.0-305.17.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-09 20:24:36.169  ip-10-0-213-142.us-east-2.compute.internal   Ready                      master   4h13m   v1.21.1+9807387   10.0.213.142   <none>        Red Hat Enterprise Linux CoreOS 48.84.202109080059-0 (Ootpa)   4.18.0-305.17.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8

$oc get co/machine-config
  ...    
  Status:                                  
    Conditions:                            
      Last Transition Time:  2021-09-09T10:13:46Z    
      Message:               Working towards 4.8.0-0.nightly-2021-09-08-225533    
      Status:                True          
      Type:                  Progressing    
      Last Transition Time:  2021-09-09T10:52:31Z    
      Message:               One or more machine config pools are updating, please see `oc get mcp` for further details    
      Reason:                PoolUpdating    
      Status:                False         
      Type:                  Upgradeable    
      Last Transition Time:  2021-09-09T11:05:16Z                                                                                                                                                   Message:               Unable to apply 4.8.0-0.nightly-2021-09-08-225533: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-b2d7c4921f4e843f7549d8157c8bab94 expected a537783ea4a0cd3b4fe2a02626ab27887307ea51 has d6d26e1f4e1fc0ed49e4c443bf02bdc376e756b3: 1 (ready 1) out of 3 nodes are updating to latest configuration rendered-master-538c4f8e12cf9de1b57bed08f594719f, retrying
      Reason:                RequiredPoolsFailed    
      Status:                True          
      Type:                  Degraded      
      Last Transition Time:  2021-09-09T10:23:47Z    
      Message:               Cluster not available for 4.8.0-0.nightly-2021-09-08-225533    
      Status:                False         
      Type:                  Available     
    Extension:                             
      Master:  1 (ready 1) out of 3 nodes are updating to latest configuration rendered-master-538c4f8e12cf9de1b57bed08f594719f    
      Worker:  all 3 nodes are at latest configuration rendered-worker-c7c5061a6524b097d13fc06601495272    
    Related Objects:    
    ...˽                                                                                                                                                                                      
    Versions:    
      Name:     operator    
      Version:  4.7.29    
    ...

Comment 4 Ke Wang 2021-09-15 08:43:52 UTC
Not every time reproduced, not blocker.

Comment 6 Kirsten Garrison 2022-05-18 19:47:43 UTC

*** This bug has been marked as a duplicate of bug 1955300 ***


Note You need to log in before you can comment on or make changes to this bug.