Description of problem:

Per their business requirements, the customer shuts down their cluster every night and starts it again in the morning, following the procedures we recommend in the documentation:

To stop: https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-shutdown.html
To start: https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-restart.html

After cluster start-up, they see the "openshift-apiserver" operator in a Degraded state:

~~~
$ oc get co openshift-apiserver -o yaml
Status:
  Conditions:
    Last Transition Time:  2021-06-23T03:15:53Z
    Message:               EncryptionMigrationControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionKeyControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionStateControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
                           EncryptionPruneControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
    Reason:                EncryptionKeyController_Error::EncryptionMigrationController_Error::EncryptionPruneController_Error::EncryptionStateController_Error
~~~

** This is very reproducible, every morning after cluster start-up. The workaround the customer has found is:
- clean up all pods that are in the InvalidNodeInfo state
- if the operator is still not OK, remove all pods in the openshift-apiserver-operator namespace

Example of pods found in InvalidNodeInfo status:
~~~
$ oc get pods -A | grep InvalidNodeInfo
openshift-apiserver                   apiserver-5b86884c7b-87n8m                    0/2   InvalidNodeInfo   0   28h
openshift-apiserver                   apiserver-5b86884c7b-b67ws                    0/2   InvalidNodeInfo   0   28h
openshift-cloud-credential-operator   cloud-credential-operator-5456696d5f-rcvl2    0/2   InvalidNodeInfo   0   35h
openshift-cluster-csi-drivers         manila-csi-driver-operator-7495dd46cf-cpnlq   0/1   InvalidNodeInfo   0   35h
openshift-cluster-machine-approver    machine-approver-5c8bb77695-mrwxp             0/2   InvalidNodeInfo   0   35h
~~~

~~~
$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-7b47bb7c89-h6x54   2/2     Running   0          12h
apiserver-7b47bb7c89-k822n   2/2     Running   0          2d
apiserver-7b47bb7c89-rtksh   2/2     Running   0          1d
apiserver-7b47bb7c89-x646z   0/2     Failed    0          1d
~~~

~~~
2021-06-24T03:18:02.147008592Z E0624 03:18:02.146999       1 sync_worker.go:348] unable to synchronize image (waiting 21.369562456s): Cluster operator openshift-apiserver is reporting a failure: EncryptionMigrationControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionKeyControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionStateControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
2021-06-24T03:18:02.147008592Z EncryptionPruneControllerDegraded: failed to get converged static pod revision: api server revision 9 has both running and failed pods
~~~

Version-Release number of selected component (if applicable):
v4.6.18
cloud: openstack

How reproducible:
Every time after shutting down and starting up the cluster

Steps to Reproduce:
1. Stop and start the cluster

Actual results:
1. Multiple pods can be found in the Failed / InvalidNodeInfo state.
2. openshift-apiserver is always found in a Degraded state.
Expected results:
- No operator should be degraded, and no pod should get stuck in a Failed state.
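The cleanup workaround described above can be sketched as a small script. This is only an illustrative sketch, not a documented procedure: the helper name is made up, cluster-admin access is assumed, and the awk filter assumes the default `oc get pods -A` column order (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Illustrative helper: print "namespace pod" for every pod whose STATUS
# column (field 4 of `oc get pods -A --no-headers` output) is InvalidNodeInfo.
list_invalid_node_info_pods() {
  awk '$4 == "InvalidNodeInfo" {print $1, $2}'
}

# Workaround step 1: delete every InvalidNodeInfo pod (commented out here;
# requires a live cluster and cluster-admin access):
# oc get pods -A --no-headers | list_invalid_node_info_pods |
#   while read -r ns pod; do oc delete pod -n "$ns" "$pod"; done

# Workaround step 2, only if the operator is still degraded:
# oc delete pods --all -n openshift-apiserver-operator
```

Deleting the pods simply forces their controllers to recreate them on a node with valid node info; it does not address the root cause.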
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Is there a chance you could provide a must-gather? I'd like to better understand how some openshift-apiserver pods ended up in a Failed state. According to the current restart policy (Always), the pods should have been restarted, and I'd like to know why they weren't. Thanks.
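As a quick triage step (a sketch only, not an established diagnostic), one could flag pods that reached the Failed phase without a single restart, which is exactly the contradiction with restartPolicy Always noted above. The helper name is illustrative, and the awk fields assume the default `oc get pods` columns (NAME READY STATUS RESTARTS AGE).

```shell
# Illustrative filter: print pods whose STATUS is Failed and whose RESTARTS
# count is 0 (fields 3 and 4 of `oc get pods` output, header row skipped).
# With restartPolicy Always, no pod should match.
failed_never_restarted() {
  awk 'NR > 1 && $3 == "Failed" && $4 == 0 {print $1}'
}

# Normally: oc get pods -n openshift-apiserver | failed_never_restarted
```

Run against the customer's output above, this would flag apiserver-7b47bb7c89-x646z (Failed, 0 restarts).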
Thanks, I was able to download the must-gather. At the moment I am trying to clarify the Failed/InvalidNodeInfo pod state with the node team. Specifically, I'd like to know whether a pod in that state will ever be run again.
Before bringing down the cluster:

~~~
$ oc get node
NAME                               STATUS   ROLES                  AGE     VERSION
rgangwar-5d-45kbt-master-0         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-1         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-2         Ready    control-plane,master   3h37m   v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-9mwgv   Ready    worker                 165m    v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-sw8bd   Ready    worker                 164m    v1.24.0+8c7c967

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      155m
baremetal                                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h38m
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
console                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      160m
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h23m
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      164m
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      164m
insights                                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h28m
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h14m
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h32m
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h32m
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      165m
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
marketplace                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      162m
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h37m
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h8m
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      177m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h10m
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h35m
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h31m

$ oc get co openshift-apiserver -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-10-04T09:39:29Z"
  generation: 1
  name: openshift-apiserver
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 4982c7c2-8930-489f-aff6-03a563b752f1
  resourceVersion: "35665"
  uid: d769a783-2119-4c4c-a2cf-fde0befc63e9
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-10-04T10:07:17Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-10-04T10:24:45Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-10-04T10:21:48Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-10-04T09:55:09Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: host-etcd-2
    namespace: openshift-etcd
    resource: endpoints
  - group: controlplane.operator.openshift.io
    name: ""
    namespace: openshift-apiserver
    resource: podnetworkconnectivitychecks
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.12.0-0.nightly-2022-09-28-204419
  - name: openshift-apiserver
    version: 4.12.0-0.nightly-2022-09-28-204419

$ oc get pods -A | grep InvalidNodeInfo
(no output)

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5d78d9d964-4q7vz   2/2     Running   0          3h9m
apiserver-5d78d9d964-6b5g7   2/2     Running   0          3h8m
apiserver-5d78d9d964-m967g   2/2     Running   0          3h7m
~~~

After bringing the cluster down and back up:

~~~
$ oc get node
NAME                               STATUS   ROLES                  AGE     VERSION
rgangwar-5d-45kbt-master-0         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-1         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-master-2         Ready    control-plane,master   3h55m   v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-9mwgv   Ready    worker                 3h3m    v1.24.0+8c7c967
rgangwar-5d-45kbt-worker-0-sw8bd   Ready    worker

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      56s
baremetal                                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h49m
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
console                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      <invalid>
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h34m
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      175m
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      175m
insights                                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h39m
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h25m
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h43m
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h43m
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      176m
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
marketplace                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      173m
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      59s
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h8m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h45m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h21m
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h42m

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5d78d9d964-4q7vz   2/2     Running   0          3h25m
apiserver-5d78d9d964-6b5g7   2/2     Running   0          3h23m
apiserver-5d78d9d964-m967g   2/2     Running   0          3h22m

$ oc get pods -A | grep InvalidNodeInfo
(no output)
~~~
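The post-restart health check performed in the verification above can be scripted. This is only a sketch: the helper name is illustrative, and the awk fields assume the default `oc get co` columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE).

```shell
# Illustrative check: print any cluster operator whose DEGRADED column
# (field 5) is not "False", skipping the header row. An empty result means
# no operator is degraded, matching the verification output above.
degraded_operators() {
  awk 'NR > 1 && $5 != "False" {print $1}'
}

# Normally: oc get co | degraded_operators
```

In the original failure mode, this would have printed openshift-apiserver every morning; after the fix it prints nothing.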
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399