While verifying that operators recover from deleted secrets, we observed that a 4.2 cluster had its KCM operator go degraded and not recover:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-09-05T15:42:09Z"
  generation: 1
  name: kube-controller-manager
  resourceVersion: "43743"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/kube-controller-manager
  uid: b8ca5827-cff3-11e9-8ea1-125a9a7d8e26
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-09-05T16:18:06Z"
    message: |-
      InstallerControllerDegraded: missing required resources: secrets: csr-signer-6,kube-controller-manager-client-cert-key-6,service-account-private-key-6
      RevisionControllerDegraded: secrets "service-account-private-key" not found
    reason: MultipleConditionsMatching
    status: "True"
    type: Degraded

$ oc logs -n openshift-kube-controller-manager kube-controller-manager-ip-10-0-129-120.ec2.internal -c kube-controller-manager-5
I0905 17:10:39.923000       1 leaderelection.go:217] attempting to acquire leader lease  kube-system/kube-controller-manager...
E0905 17:10:44.736314       1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

To recreate:
1. Launch a cluster.
2. Delete all secrets in openshift-* namespaces, except openshift-config-*, that do not have the annotation kubernetes.io/service-account.name set (i.e. that were not created by the service-account token controller); one possible query is sketched below.
3. Observe whether the cluster recovers.

The KCM never recovers or rolls out revision 6. Clusters must recover if temporary secrets are deleted.
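A minimal sketch of a deletion loop matching step 2 above (an assumption about the selection logic, not the exact query the reporter used; the commands a tester actually ran are recorded in a later comment):

# Sketch: delete secrets without the service-account annotation in openshift-*
# namespaces, skipping openshift-config*. The namespace filter and jq expression
# are assumptions, not taken from this report.
for ns in $(oc get ns -o name | sed 's|^namespace/||' | grep '^openshift-' | grep -v '^openshift-config'); do
  for s in $(oc get secrets -n "$ns" -o json | jq -r '.items[] | select(.metadata.annotations["kubernetes.io/service-account.name"] == null) | .metadata.name'); do
    oc delete secret "$s" -n "$ns"
  done
done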
It looks like there is a problem with the revision controller: it skips one revision. From the current logs I see:

type: 'Warning' reason: 'RequiredInstallerResourcesMissing' secrets: csr-signer-4,kube-controller-manager-client-cert-key-4,service-account-private-key-4

but the secrets that are actually present are from revision 5. There is also:

type: 'Normal' reason: 'RevisionTriggered' new revision 5 triggered by "secret \"csr-signer-4\" not found"

which suggests it should pick up revision 5, but it does not, and that leaves the cluster degraded.
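To compare the revision that was triggered with what has actually rolled out on the nodes, the static-pod operator status can be inspected (a sketch; the field names follow the static-pod operator status pattern and are assumptions relative to this report):

$ oc get kubecontrollermanager cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
$ oc get kubecontrollermanager cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'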
This is not a high-priority bug (confirmed with David and Michal), so I'm lowering the priority on this for now.
Confirmed with payload 4.2.0-0.nightly-2019-09-11-202233, the KCM operator pod will CrashLoopBackOff:

NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.2.0-0.nightly-2019-09-11-202233   True        True          False      3h27m

[root@192 ~]# oc get po -n openshift-kube-controller-manager-operator
NAME                                                READY   STATUS             RESTARTS   AGE
kube-controller-manager-operator-65d99fbc79-rvw5f   0/1     CrashLoopBackOff   13         3h30m

[root@192 ~]# oc get co openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.2.0-0.nightly-2019-09-11-202233   True        True          False      3h28m

[root@192 ~]# oc describe po/kube-controller-manager-operator-65d99fbc79-rvw5f -n openshift-kube-controller-manager-operator
Name:                 kube-controller-manager-operator-65d99fbc79-rvw5f
Namespace:            openshift-kube-controller-manager-operator
Priority:             2000000000
PriorityClassName:    system-cluster-critical
Node:                 ip-10-0-153-168.eu-west-3.compute.internal/10.0.153.168
Start Time:           Thu, 12 Sep 2019 11:01:10 +0800
Labels:               app=kube-controller-manager-operator
                      pod-template-hash=65d99fbc79
Annotations:          <none>
Status:               Running
IP:                   10.130.0.5
Controlled By:        ReplicaSet/kube-controller-manager-operator-65d99fbc79
Containers:
  kube-controller-manager-operator:
    Container ID:  cri-o://7b69d53e820655d55e1591336642ed57ef2dbbbd146cb363fc6c6e5c998c784e
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      cluster-kube-controller-manager-operator
      operator
    Args:
      --config=/var/run/configmaps/config/config.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:      I0912 06:28:01.214354       1 cmd.go:177] Using service-serving-cert provided certificates
                    I0912 06:28:01.214828       1 observer_polling.go:106] Starting file observer
                    W0912 06:28:01.239394       1 builder.go:181] unable to get owner reference (falling back to namespace): Unauthorized
                    F0912 06:28:31.441569       1 cmd.go:109] Unauthorized
      Exit Code:    255
      Started:      Thu, 12 Sep 2019 14:28:01 +0800
      Finished:     Thu, 12 Sep 2019 14:28:31 +0800
    Ready:          False
    Restart Count:  13
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      IMAGE:                   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:26a3c48d461e1cb41e06bfb07bf921f362368f7f99d990137c99a29787cb69a6
      OPERATOR_IMAGE:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
      OPERATOR_IMAGE_VERSION:  4.2.0-0.nightly-2019-09-11-202233
      OPERAND_IMAGE_VERSION:   1.14.6
      POD_NAME:                kube-controller-manager-operator-65d99fbc79-rvw5f (v1:metadata.name)
    Mounts:
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-controller-manager-operator-token-c2wkb (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-controller-manager-operator-serving-cert
    Optional:    true
  config:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        kube-controller-manager-operator-config
    Optional:    false
  kube-controller-manager-operator-token-c2wkb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-controller-manager-operator-token-c2wkb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:
  Type     Reason       Age                   From                                                 Message
  ----     ------       ----                  ----                                                 -------
  Normal   Created      44m (x5 over 3h29m)   kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Created container kube-controller-manager-operator
  Normal   Started      44m (x5 over 3h29m)   kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Started container kube-controller-manager-operator
  Warning  FailedMount  43m (x9 over 45m)     kubelet, ip-10-0-153-168.eu-west-3.compute.internal  MountVolume.SetUp failed for volume "kube-controller-manager-operator-token-c2wkb" : secrets "kube-controller-manager-operator-token-c2wkb" not found
  Normal   Pulled       25m (x9 over 3h28m)   kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b" already present on machine
  Warning  BackOff      49s (x177 over 45m)   kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Back-off restarting failed container
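The FailedMount and Unauthorized messages above suggest the operator pod is still trying to use its deleted service-account token secret. One way to nudge recovery (an assumption, not a step taken in this report) is to delete the operator pod so it is recreated with a freshly issued token secret:

$ oc delete pod -n openshift-kube-controller-manager-operator -l app=kube-controller-manager-operator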
How was the test performed? I want to see the exact steps. I just ran these two tests on 4.2.0-0.ci-2019-09-12-015306:

1.
for i in $(oc get secret -n openshift-kube-controller-manager -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do
  oc delete secret/$i -n openshift-kube-controller-manager
done

and the kcm-o returned to normal functioning after a few minutes.

2.
for ns in $(oc get ns | grep openshift- | grep -v openshift-config-); do
  for i in $(oc get secret -n $ns -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do
    oc delete secret/$i -n $ns
  done
done

and the kcm-o returned to normal functioning after a longer period of time, but not longer than 10-15 minutes. It is the other components that I see struggling to get back into shape, but it would be worth checking with Clayton the exact query he used for deleting the secrets he mentioned in comment 1.
Maciej Szulik: Thanks for confirming. I forgot to use the "not" option when deleting the secrets. Double-checked with payload 4.2.0-0.nightly-2019-09-15-221449; the issue has been fixed.

[root@dhcp-140-138 ~]# oc get co kube-controller-manager
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-controller-manager   4.2.0-0.nightly-2019-09-15-221449   True        False         False      55m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922