Bug 1749478
| Summary: | KCM does not recover when its temporary secrets are deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.0 | CC: | aos-bugs, mfojtik, rdave |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:40:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-09-05 17:43:28 UTC
It looks like there's a problem with the revision controller: it skips one version. From the current logs I see:

type: 'Warning' reason: 'RequiredInstallerResourcesMissing' secrets: csr-signer-4,kube-controller-manager-client-cert-key-4,service-account-private-key-4

but the secrets that are actually present are from revision 5, and there is:

type: 'Normal' reason: 'RevisionTriggered' new revision 5 triggered by "secret \"csr-signer-4\" not found"

which suggests it should pick up the 5th revision, but it does not, and that leaves the cluster degraded.

This is not a high-priority bug (confirmed with David and Michal); I'm lowering the priority on this for now.
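For anyone debugging a stuck revision like the one above, a quick way to compare the operator's latest available revision with what each master is actually running is sketched below. This is a diagnostic sketch, not part of the original report; the status fields are taken from the operator.openshift.io/v1 KubeControllerManager resource and should be double-checked on your release.

```sh
# Latest revision the operator has generated (assumption: field name
# .status.latestAvailableRevision on the KubeControllerManager/cluster object).
oc get kubecontrollermanager cluster \
  -o jsonpath='{.status.latestAvailableRevision}{"\n"}'

# Revision each master is currently running vs. the one it is rolling towards.
oc get kubecontrollermanager cluster \
  -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'

# Events such as RevisionTriggered / RequiredInstallerResourcesMissing show up
# in the operand and operator namespaces.
oc get events -n openshift-kube-controller-manager --sort-by=.lastTimestamp
oc get events -n openshift-kube-controller-manager-operator --sort-by=.lastTimestamp
```

A node whose currentRevision stays behind latestAvailableRevision while RequiredInstallerResourcesMissing events keep firing matches the behaviour described above.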
Confirmed with payload 4.2.0-0.nightly-2019-09-11-202233: the KCM operator pod goes into CrashLoopBackOff.

NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.2.0-0.nightly-2019-09-11-202233   True        True          False      3h27m

[root@192 ~]# oc get po -n openshift-kube-controller-manager-operator
NAME                                                READY   STATUS             RESTARTS   AGE
kube-controller-manager-operator-65d99fbc79-rvw5f   0/1     CrashLoopBackOff   13         3h30m

[root@192 ~]# oc get co openshift-controller-manager
NAME                           VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   4.2.0-0.nightly-2019-09-11-202233   True        True          False      3h28m

[root@192 ~]# oc describe po/kube-controller-manager-operator-65d99fbc79-rvw5f -n openshift-kube-controller-manager-operator
Name:                 kube-controller-manager-operator-65d99fbc79-rvw5f
Namespace:            openshift-kube-controller-manager-operator
Priority:             2000000000
PriorityClassName:    system-cluster-critical
Node:                 ip-10-0-153-168.eu-west-3.compute.internal/10.0.153.168
Start Time:           Thu, 12 Sep 2019 11:01:10 +0800
Labels:               app=kube-controller-manager-operator
                      pod-template-hash=65d99fbc79
Annotations:          <none>
Status:               Running
IP:                   10.130.0.5
Controlled By:        ReplicaSet/kube-controller-manager-operator-65d99fbc79
Containers:
  kube-controller-manager-operator:
    Container ID:  cri-o://7b69d53e820655d55e1591336642ed57ef2dbbbd146cb363fc6c6e5c998c784e
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      cluster-kube-controller-manager-operator
      operator
    Args:
      --config=/var/run/configmaps/config/config.yaml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:      I0912 06:28:01.214354 1 cmd.go:177] Using service-serving-cert provided certificates
                    I0912 06:28:01.214828 1 observer_polling.go:106] Starting file observer
                    W0912 06:28:01.239394 1 builder.go:181] unable to get owner reference (falling back to namespace): Unauthorized
                    F0912 06:28:31.441569 1 cmd.go:109] Unauthorized
      Exit Code:    255
      Started:      Thu, 12 Sep 2019 14:28:01 +0800
      Finished:     Thu, 12 Sep 2019 14:28:31 +0800
    Ready:          False
    Restart Count:  13
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      IMAGE:                   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:26a3c48d461e1cb41e06bfb07bf921f362368f7f99d990137c99a29787cb69a6
      OPERATOR_IMAGE:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
      OPERATOR_IMAGE_VERSION:  4.2.0-0.nightly-2019-09-11-202233
      OPERAND_IMAGE_VERSION:   1.14.6
      POD_NAME:                kube-controller-manager-operator-65d99fbc79-rvw5f (v1:metadata.name)
    Mounts:
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-controller-manager-operator-token-c2wkb (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-controller-manager-operator-serving-cert
    Optional:    true
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-controller-manager-operator-config
    Optional:  false
  kube-controller-manager-operator-token-c2wkb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-controller-manager-operator-token-c2wkb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 120s
                 node.kubernetes.io/unreachable:NoExecute for 120s
Events:
  Type     Reason       Age                  From                                                 Message
  ----     ------       ----                 ----                                                 -------
  Normal   Created      44m (x5 over 3h29m)  kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Created container kube-controller-manager-operator
  Normal   Started      44m (x5 over 3h29m)  kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Started container kube-controller-manager-operator
  Warning  FailedMount  43m (x9 over 45m)    kubelet, ip-10-0-153-168.eu-west-3.compute.internal  MountVolume.SetUp failed for volume "kube-controller-manager-operator-token-c2wkb" : secrets "kube-controller-manager-operator-token-c2wkb" not found
  Normal   Pulled       25m (x9 over 3h28m)  kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b" already present on machine
  Warning  BackOff      49s (x177 over 45m)  kubelet, ip-10-0-153-168.eu-west-3.compute.internal  Back-off restarting failed container

How was the test performed? I want to see the exact steps. I just did these two tests on 4.2.0-0.ci-2019-09-12-015306:

1. Delete the non-service-account secrets in the KCM namespace only:

for i in $(oc get secret -n openshift-kube-controller-manager -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do oc delete secret/$i -n openshift-kube-controller-manager; done

The kcm-o returned to normal functioning after a few minutes.

2. Delete them in every openshift-* namespace except openshift-config-*:

for ns in $(oc get ns | grep openshift- | grep -v openshift-config-); do for i in $(oc get secret -n $ns -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do oc delete secret/$i -n $ns; done; done

The kcm-o returned to normal functioning after a longer period of time, but not longer than 10-15 minutes. It is the other components that I see struggling to get back into shape, but it would be worth checking with Clayton the exact query he used for deleting the secrets he mentioned in comment 1. (A consolidated sketch of this reproduction appears at the end of this report.)

Maciej Szulik: Thanks for confirming; I forgot to use the "not" option when deleting the secrets.

Double-checked with payload 4.2.0-0.nightly-2019-09-15-221449; the issue has been fixed.

[root@dhcp-140-138 ~]# oc get co kube-controller-manager
NAME                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-controller-manager   4.2.0-0.nightly-2019-09-15-221449   True        False         False      55m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
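For reference, the reproduction and recovery check discussed in the comments above can be scripted roughly as follows. This is a sketch, not part of the verified fix: the jq filter is the one used in the comments, while the recovery wait loop and the jsonpath condition check are assumptions added for illustration.

```sh
#!/usr/bin/env bash
# Sketch: delete the KCM's temporary (non service-account) secrets, then wait
# for the kube-controller-manager clusteroperator to settle again.

ns=openshift-kube-controller-manager

# Delete every secret in the namespace that is NOT a service-account token
# (same jq filter as in the comments above).
for s in $(oc get secret -n "$ns" -o json \
    | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do
  oc delete secret "$s" -n "$ns"
done

# Assumption: poll until the clusteroperator reports Degraded=False; in the
# comments above recovery took a few minutes.
until [ "$(oc get co kube-controller-manager \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')" = "False" ]; do
  echo "kube-controller-manager still recovering..."
  sleep 30
done

oc get co kube-controller-manager
```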