Description of problem:

Per their business requirements, the customer shuts down their cluster every night and starts it again in the morning, following the procedures we recommend in the documentation:

To stop: https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-shutdown.html
To start: https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-restart.html

After cluster start-up, they see the "openshift-apiserver" operator in a Degraded state:

~~~
$ oc get co openshift-apiserver -o yaml
Status:
  Conditions:
    Last Transition Time:  2021-06-23T03:15:53Z
    Message:               EncryptionMigrationControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionKeyControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionStateControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionPruneControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
    Reason:                EncryptionKeyController_Error::EncryptionMigrationController_Error::EncryptionPruneController_Error::EncryptionStateController_Error
~~~

** This is very reproducible, every morning after cluster start-up. The workaround the customer has found is:
- clean up all pods that are in the InvalidNodeInfo state
- if the operator is still not OK, remove all pods in the openshift-apiserver-operator namespace

Example of pods found in InvalidNodeInfo status:
~~~
$ oc get pods -A | grep InvalidNodeInfo
openshift-apiserver                   apiserver-5b86884c7b-87n8m                    0/2   InvalidNodeInfo   0   28h
openshift-apiserver                   apiserver-5b86884c7b-b67ws                    0/2   InvalidNodeInfo   0   28h
openshift-cloud-credential-operator   cloud-credential-operator-5456696d5f-rcvl2    0/2   InvalidNodeInfo   0   35h
openshift-cluster-csi-drivers         manila-csi-driver-operator-7495dd46cf-cpnlq   0/1   InvalidNodeInfo   0   35h
openshift-cluster-machine-approver    machine-approver-5c8bb77695-mrwxp             0/2   InvalidNodeInfo   0   35h
~~~

~~~
$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-7b47bb7c89-h6x54   2/2     Running   0          12h
apiserver-7b47bb7c89-k822n   2/2     Running   0          2d
apiserver-7b47bb7c89-rtksh   2/2     Running   0          1d
apiserver-7b47bb7c89-x646z   0/2     Failed    0          1d
~~~

~~~
2021-06-24T03:18:02.147008592Z E0624 03:18:02.146999       1 sync_worker.go:348] unable to synchronize image (waiting 21.369562456s): Cluster operator openshift-apiserver is reporting a failure: EncryptionMigrationControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionKeyControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionStateControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionPruneControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
~~~

Version-Release number of selected component (if applicable):
v4.6.18
cloud: openstack

How reproducible:
Every time after shutting down and starting up the cluster

Steps to Reproduce:
1. Stop and start the cluster

Actual results:
1. Multiple pods can be found in the Failed / InvalidNodeInfo state.
2. openshift-apiserver is always found in a Degraded state.
Expected results:
- No operator should be degraded, and no pod should get stuck in a Failed state.
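The cleanup workaround described above can be sketched as a small script. This is only an illustrative sketch, not a documented procedure: the helper name is made up, cluster-admin access is assumed, and the awk filter assumes the default `oc get pods -A` column order (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Illustrative helper: print "namespace pod" for every pod whose STATUS
# column (field 4 of `oc get pods -A --no-headers` output) is InvalidNodeInfo.
list_invalid_node_info_pods() {
  awk '$4 == "InvalidNodeInfo" {print $1, $2}'
}

# Workaround step 1: delete every InvalidNodeInfo pod (commented out here;
# requires a live cluster and cluster-admin access):
# oc get pods -A --no-headers | list_invalid_node_info_pods |
#   while read -r ns pod; do oc delete pod -n "$ns" "$pod"; done

# Workaround step 2, only if the operator is still degraded:
# oc delete pods --all -n openshift-apiserver-operator
```

Deleting the pods simply forces their controllers to recreate them on a node with valid node info; it does not address the root cause.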
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Is there a chance you could provide a must-gather? I'd like to better understand how some openshift-apiserver pods ended up in a Failed state. According to the current restart policy (Always), the pods should have been restarted, and I'd like to know why they weren't. Thanks.
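As a quick triage step (a sketch only, not an established diagnostic), one could flag pods that reached the Failed phase without a single restart, which is exactly the contradiction with restartPolicy Always noted above. The helper name is illustrative, and the awk fields assume the default `oc get pods` columns (NAME READY STATUS RESTARTS AGE).

```shell
# Illustrative filter: print pods whose STATUS is Failed and whose RESTARTS
# count is 0 (fields 3 and 4 of `oc get pods` output, header row skipped).
# With restartPolicy Always, no pod should match.
failed_never_restarted() {
  awk 'NR > 1 && $3 == "Failed" && $4 == 0 {print $1}'
}

# Normally: oc get pods -n openshift-apiserver | failed_never_restarted
```

Run against the customer's output above, this would flag apiserver-7b47bb7c89-x646z (Failed, 0 restarts).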
Thanks, I was able to download the must-gather. At the moment I am trying to clarify the Failed/InvalidNodeInfo pod state with the node team. Specifically, I'd like to know whether a pod in that state will ever be run again.
Before bringing down the cluster:

~~~
$ oc get node
NAME                               STATUS   ROLES                  AGE     VERSION
rgangwar-5d-45kbt-master-0         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-1         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-2         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-9mwgv   Ready    worker                 165m    v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-sw8bd   Ready    worker                 164m    v1.24.0+8c7c967

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      155m
baremetal                                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h38m
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
console                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      160m
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h23m
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      164m
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      164m
insights                                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h28m
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h14m
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h32m
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h32m
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      165m
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
marketplace                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      162m
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h37m
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h8m
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      177m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h10m
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h31m

$ oc get co openshift-apiserver -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-10-04T09:39:29Z"
  generation: 1
  name: openshift-apiserver
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 4982c7c2-8930-489f-aff6-03a563b752f1
  resourceVersion: "35665"
  uid: d769a783-2119-4c4c-a2cf-fde0befc63e9
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-10-04T10:07:17Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-10-04T10:24:45Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-10-04T10:21:48Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-04T09:55:09Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: host-etcd-2
    namespace: openshift-etcd
    resource: endpoints
  - group: controlplane.operator.openshift.io
    name: ""
    namespace: openshift-apiserver
    resource: podnetworkconnectivitychecks
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.12.0-0.nightly-2022-09-28-204419
  - name: openshift-apiserver
    version: 4.12.0-0.nightly-2022-09-28-204419

$ oc get pods -A | grep InvalidNodeInfo
(no output)

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5d78d9d964-4q7vz   2/2     Running   0          3h9m
apiserver-5d78d9d964-6b5g7   2/2     Running   0          3h8m
apiserver-5d78d9d964-m967g   2/2     Running   0          3h7m
~~~

After bringing the cluster down and back up:

~~~
$ oc get node
NAME                               STATUS   ROLES                  AGE     VERSION
rgangwar-5d-45kbt-master-0         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-1         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-2         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-9mwgv   Ready    worker                 3h3m    v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-sw8bd   Ready    worker

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      56s
baremetal                                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h49m
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
console                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      <invalid>
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      175m
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      175m
insights                                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h39m
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h25m
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h43m
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h43m
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      176m
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
marketplace                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      173m
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      59s
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h8m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h21m
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h42m

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5d78d9d964-4q7vz   2/2     Running   0          3h25m
apiserver-5d78d9d964-6b5g7   2/2     Running   0          3h23m
apiserver-5d78d9d964-m967g   2/2     Running   0          3h22m

$ oc get pods -A | grep InvalidNodeInfo
(no output)
~~~
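The post-restart health check performed in the verification above can be scripted. This is only a sketch: the helper name is illustrative, and the awk fields assume the default `oc get co` columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE).

```shell
# Illustrative check: print any cluster operator whose DEGRADED column
# (field 5) is not "False", skipping the header row. An empty result means
# no operator is degraded, matching the verification output above.
degraded_operators() {
  awk 'NR > 1 && $5 != "False" {print $1}'
}

# Normally: oc get co | degraded_operators
```

In the original failure mode, this would have printed openshift-apiserver every morning; after the fix it prints nothing.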
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399