Bug 1749478
| Summary: | KCM does not recover when its temporary secrets are deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.0 | CC: | aos-bugs, mfojtik, rdave |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:40:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It looks like there's a problem with the revision controller: it skips one version. From the current logs I see:

type: 'Warning' reason: 'RequiredInstallerResourcesMissing' secrets: csr-signer-4,kube-controller-manager-client-cert-key-4,service-account-private-key-4

but the secrets present are from revision 5. There is also:

type: 'Normal' reason: 'RevisionTriggered' new revision 5 triggered by "secret \"csr-signer-4\" not found"

which suggests it should pick up the 5th revision, but it does not, which leaves the cluster degraded.

This is not a high-priority bug (confirmed with David and Michal); I'm lowering the priority on this for now.

Confirmed with payload 4.2.0-0.nightly-2019-09-11-202233: the KCM operator pod goes into CrashLoopBackOff:
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
openshift-controller-manager 4.2.0-0.nightly-2019-09-11-202233 True True False 3h27m
[root@192 ~]# oc get po -n openshift-kube-controller-manager-operator
NAME READY STATUS RESTARTS AGE
kube-controller-manager-operator-65d99fbc79-rvw5f 0/1 CrashLoopBackOff 13 3h30m
[root@192 ~]# oc get co openshift-controller-manager
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
openshift-controller-manager 4.2.0-0.nightly-2019-09-11-202233 True True False 3h28m
[root@192 ~]# oc describe po/kube-controller-manager-operator-65d99fbc79-rvw5f -n openshift-kube-controller-manager-operator
Name: kube-controller-manager-operator-65d99fbc79-rvw5f
Namespace: openshift-kube-controller-manager-operator
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node: ip-10-0-153-168.eu-west-3.compute.internal/10.0.153.168
Start Time: Thu, 12 Sep 2019 11:01:10 +0800
Labels: app=kube-controller-manager-operator
pod-template-hash=65d99fbc79
Annotations: <none>
Status: Running
IP: 10.130.0.5
Controlled By: ReplicaSet/kube-controller-manager-operator-65d99fbc79
Containers:
kube-controller-manager-operator:
Container ID: cri-o://7b69d53e820655d55e1591336642ed57ef2dbbbd146cb363fc6c6e5c998c784e
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
Port: 8443/TCP
Host Port: 0/TCP
Command:
cluster-kube-controller-manager-operator
operator
Args:
--config=/var/run/configmaps/config/config.yaml
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: I0912 06:28:01.214354 1 cmd.go:177] Using service-serving-cert provided certificates
I0912 06:28:01.214828 1 observer_polling.go:106] Starting file observer
W0912 06:28:01.239394 1 builder.go:181] unable to get owner reference (falling back to namespace): Unauthorized
F0912 06:28:31.441569 1 cmd.go:109] Unauthorized
Exit Code: 255
Started: Thu, 12 Sep 2019 14:28:01 +0800
Finished: Thu, 12 Sep 2019 14:28:31 +0800
Ready: False
Restart Count: 13
Requests:
cpu: 10m
memory: 50Mi
Environment:
IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:26a3c48d461e1cb41e06bfb07bf921f362368f7f99d990137c99a29787cb69a6
OPERATOR_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b
OPERATOR_IMAGE_VERSION: 4.2.0-0.nightly-2019-09-11-202233
OPERAND_IMAGE_VERSION: 1.14.6
POD_NAME: kube-controller-manager-operator-65d99fbc79-rvw5f (v1:metadata.name)
Mounts:
/var/run/configmaps/config from config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-controller-manager-operator-token-c2wkb (ro)
/var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
serving-cert:
Type: Secret (a volume populated by a Secret)
SecretName: kube-controller-manager-operator-serving-cert
Optional: true
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-controller-manager-operator-config
Optional: false
kube-controller-manager-operator-token-c2wkb:
Type: Secret (a volume populated by a Secret)
SecretName: kube-controller-manager-operator-token-c2wkb
Optional: false
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 120s
node.kubernetes.io/unreachable:NoExecute for 120s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 44m (x5 over 3h29m) kubelet, ip-10-0-153-168.eu-west-3.compute.internal Created container kube-controller-manager-operator
Normal Started 44m (x5 over 3h29m) kubelet, ip-10-0-153-168.eu-west-3.compute.internal Started container kube-controller-manager-operator
Warning FailedMount 43m (x9 over 45m) kubelet, ip-10-0-153-168.eu-west-3.compute.internal MountVolume.SetUp failed for volume "kube-controller-manager-operator-token-c2wkb" : secrets "kube-controller-manager-operator-token-c2wkb" not found
Normal Pulled 25m (x9 over 3h28m) kubelet, ip-10-0-153-168.eu-west-3.compute.internal Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:578e0f1bd0f296da6f8a156ffdf61bd24ba16d10ffe30d1410bdce55679aeb5b" already present on machine
Warning BackOff 49s (x177 over 45m) kubelet, ip-10-0-153-168.eu-west-3.compute.internal Back-off restarting failed container
How was the test performed? I want to see the exact steps.

I just ran these two tests on 4.2.0-0.ci-2019-09-12-015306:
1. for i in $(oc get secret -n openshift-kube-controller-manager -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do oc delete secret/$i -n openshift-kube-controller-manager; done
and the kcm-o returned to normal functioning after a few minutes.
2. for ns in $(oc get ns --no-headers | awk '{print $1}' | grep '^openshift-' | grep -v '^openshift-config-'); do for i in $(oc get secret -n $ns -o json | jq -r '.items[] | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not) | .metadata.name'); do oc delete secret/$i -n $ns; done; done
and the kcm-o returned to normal functioning after a longer period of time, but not longer than 10-15 mins.
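The jq filter shared by both loops above can be sketched against a tiny inline sample (the secret names below are made up for illustration): it selects only secrets that do not carry the kubernetes.io/service-account.name annotation, i.e. the ones that are not service-account token secrets and should be regenerable.

```shell
# Run the same jq filter used in the tests above on a fake secret list.
# Only "csr-signer" lacks the service-account annotation, so only it
# should be printed (i.e. selected for deletion).
jq -r '.items[]
       | select(.metadata.annotations | has("kubernetes.io/service-account.name") | not)
       | .metadata.name' <<'EOF'
{
  "items": [
    {"metadata": {"name": "csr-signer", "annotations": {}}},
    {"metadata": {"name": "default-token-abc",
                  "annotations": {"kubernetes.io/service-account.name": "default"}}}
  ]
}
EOF
```

One caveat: jq's `has` errors out on null, so the one-liners above assume every secret has at least an empty annotations map; writing `(.metadata.annotations // {})` would make the filter robust to secrets with no annotations at all.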
It's the other components that I see struggling to get back into shape, but it'd be worth checking with Clayton the exact query he used for deleting the secrets he mentioned in comment 1.
Maciej Szulik:
Thanks for confirming. I forgot to use the "not" option when deleting the secrets.
Double-checked with payload 4.2.0-0.nightly-2019-09-15-221449; the issue has been fixed.
[root@dhcp-140-138 ~]# oc get co kube-controller-manager
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
kube-controller-manager 4.2.0-0.nightly-2019-09-15-221449 True False False 55m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
While verifying that operators recover from deleted secrets, we observed that a 4.2 cluster had its KCM operator go degraded and not recover:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-09-05T15:42:09Z"
  generation: 1
  name: kube-controller-manager
  resourceVersion: "43743"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/kube-controller-manager
  uid: b8ca5827-cff3-11e9-8ea1-125a9a7d8e26
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-09-05T16:18:06Z"
    message: |-
      InstallerControllerDegraded: missing required resources: secrets: csr-signer-6,kube-controller-manager-client-cert-key-6,service-account-private-key-6
      RevisionControllerDegraded: secrets "service-account-private-key" not found
    reason: MultipleConditionsMatching
    status: "True"
    type: Degraded

$ oc logs -n openshift-kube-controller-manager kube-controller-manager-ip-10-0-129-120.ec2.internal -c kube-controller-manager-5
I0905 17:10:39.923000 1 leaderelection.go:217] attempting to acquire leader lease kube-system/kube-controller-manager...
E0905 17:10:44.736314 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"

To recreate:
1. launch cluster
2. delete all secrets in openshift-* namespaces, except openshift-config-*, that don't have the annotation kubernetes.io/service-account.name set (i.e. are not created by service account controllers)
3. observe the cluster recovering

The KCM never recovers or rolls out revision 6. Clusters must recover if temporary secrets are deleted.
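The recreate steps above can be sketched as a dry-run script. The namespace and secret names here are hard-coded stand-ins for the live `oc get` queries, and commands are echoed rather than executed; replace `OC="echo oc"` with `OC="oc"` (and restore the real queries) to run it against a cluster.

```shell
# Dry-run sketch of recreate step 2: delete every secret NOT created by
# the service-account controller, in every openshift-* namespace except
# openshift-config-*. Commands are echoed instead of executed.
OC="echo oc"

# Stand-in for: oc get ns --no-headers | awk '{print $1}' | grep '^openshift-' | grep -v '^openshift-config-'
NAMESPACES="openshift-kube-controller-manager openshift-kube-apiserver"
# Stand-in for the jq filter over non-service-account secrets in each namespace
SECRETS="csr-signer service-account-private-key"

for ns in $NAMESPACES; do
  for secret in $SECRETS; do
    $OC delete secret/"$secret" -n "$ns"
  done
done
```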