Description of problem:
service-ca pod is in 'ConfigError' state after replacing the default ingress cert.

Version-Release number of selected component (if applicable):
4.7.0-rc2

How reproducible:

Steps to Reproduce:
1. Replace the default ingress cert, following the doc https://docs.openshift.com/container-platform/4.6/security/certificates/replacing-default-ingress-certificate.html
2. service-ca pod is in 'ConfigError' state after replacing the default ingress cert

Actual results:
service-ca pod is in 'ConfigError' state

Expected results:
All pods should be up and running

Additional info:

# oc -n openshift-service-ca get po
NAME                          READY   STATUS                       RESTARTS   AGE
service-ca-76bbb97c47-lwfws   0/1     CreateContainerConfigError   0          4d15h

# oc -n openshift-service-ca get po service-ca-76bbb97c47-lwfws -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    message: 'containers with unready status: [service-ca-controller]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    message: 'containers with unready status: [service-ca-controller]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a79b840f91d8074737e012c8650313534ea01c880d6d544633d0c34febcca2c
    imageID: ""
    lastState: {}
    name: service-ca-controller
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'container has runAsNonRoot and image will run as root (pod: "service-ca-76bbb97c47-lwfws_openshift-service-ca(02846586-8661-4a53-9507-3906acf11179)", container: service-ca-controller)'
        reason: CreateContainerConfigError
  hostIP: 10.0.0.6
  phase: Pending
  podIP: 10.130.0.15
  podIPs:
  - ip: 10.130.0.15
  qosClass: Burstable
  startTime: "2021-02-12T23:30:48Z"

must-gather log location: https://drive.google.com/drive/folders/1DkkzfWwzc51Su09M2r1a3ABfRnCg7vlg?usp=sharing
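For reference, the cert replacement from step 1 amounts to two commands per the linked doc (a sketch; the secret name `custom-ingress-cert` and the `tls.crt`/`tls.key` file names are placeholders, not values from this cluster):

```shell
# Create a TLS secret holding the new cert/key in openshift-ingress
# (secret and file names below are placeholders).
oc create secret tls custom-ingress-cert \
  --cert=tls.crt --key=tls.key \
  -n openshift-ingress

# Point the default IngressController at the new secret.
oc patch ingresscontroller.operator default \
  -n openshift-ingress-operator --type=merge \
  -p '{"spec":{"defaultCertificate":{"name":"custom-ingress-cert"}}}'
```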
Visible from the logs: the service-ca pod is running with the anyuid SCC, but it shouldn't have access to it. Can you provide the kube-apiserver audit logs?
Did you make any changes to RBAC? If not, can you please also get us all the clusterrolebindings containing the "service-ca" service account in their subjects, plus the content of all the clusterroles bound by those clusterrolebindings?
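For reference, one way to pull those out (a sketch; assumes jq is installed):

```shell
# List clusterrolebindings whose subjects include the service-ca SA.
oc get clusterrolebindings -o json \
  | jq -r '.items[]
      | select(any(.subjects[]?;
          .kind == "ServiceAccount"
          and .name == "service-ca"
          and .namespace == "openshift-service-ca"))
      | .metadata.name'

# For each binding found above, dump it and its bound clusterrole:
#   oc get clusterrolebinding <name> -o yaml
#   oc get clusterrole "$(oc get clusterrolebinding <name> -o jsonpath='{.roleRef.name}')" -o yaml
```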
Provided Standa with access to the cluster. No changes were made to RBAC. The problem occurred after changing the default ingress cert, and it causes the RHACM install to fail down the road.

audit log location: https://drive.google.com/file/d/1vyEOVg7JfJB9zCv-n04fZYdsOb8B7edE/view?usp=sharing
Created attachment 1758790 [details] clusterrolebindings
Created attachment 1758791 [details] clusterroles
Created attachment 1758792 [details] clusterrolebindings
I've uploaded the clusterroles and clusterrolebindings from the cluster. In the cluster, I created an IdP to test which capabilities a normal user has. Any user can create a privileged pod, and apparently anyuid is accessible too, since it's being picked up by pods that should otherwise get the restricted SCC. `oc auth can-i use scc/anyuid` returns "no", though.
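One thing worth noting (a sketch; `some-user` is a placeholder): `oc auth can-i` with `--as` impersonation does not automatically include the user's groups, so access granted to a group such as system:authenticated only shows up when the group is passed explicitly.

```shell
# Check the current identity:
oc auth can-i use scc/anyuid

# Impersonate a regular user or the service-ca SA (placeholder names):
oc auth can-i use scc/anyuid --as=some-user
oc auth can-i use scc/anyuid \
  --as=system:serviceaccount:openshift-service-ca:service-ca

# Group-granted access only shows up if the group is included:
oc auth can-i use scc/anyuid --as=some-user --as-group=system:authenticated
```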
Having the same issue when upgrading from 4.6.18 to 4.7.0: the service-ca upgrade ran for a few days and caused failures in other components because no certificate was assigned.

# oc get events
LAST SEEN   TYPE      REASON             OBJECT                               MESSAGE
8m49s       Normal    Scheduled          pod/service-ca-78c87b4f98-lh64b      Successfully assigned openshift-service-ca/service-ca-78c87b4f98-lh64b to master01
8m45s       Normal    AddedInterface     pod/service-ca-78c87b4f98-lh64b      Add eth0 [10.128.0.115/23]
3m42s       Normal    Pulled             pod/service-ca-78c87b4f98-lh64b      Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e401ae9497bb32c6e29d46048fd6caf75d02dc775fef96e4d3a9c2f62f389f57" already present on machine
6m49s       Warning   Failed             pod/service-ca-78c87b4f98-lh64b      Error: container has runAsNonRoot and image will run as root
12m         Normal    Pulled             pod/service-ca-78c87b4f98-wvjxs      Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e401ae9497bb32c6e29d46048fd6caf75d02dc775fef96e4d3a9c2f62f389f57" already present on machine
8m49s       Normal    SuccessfulCreate   replicaset/service-ca-78c87b4f98     Created pod: service-ca-78c87b4f98-lh64b
I've seen the same error on 2 clusters. On the first cluster it was an easy pick: the SCC had been hand-edited; fixed it, no problem. The second cluster has been more problematic. The SCCs, clusterroles, and clusterrolebindings all look good compared to known good working clusters. Same problem on the pod; service-ca is requesting anyuid. From the pod:

    openshift.io/scc: anyuid
    creationTimestamp: "2021-03-18T21:24:16Z"

Dumping the pod yaml and running an scc review:

# oc policy scc-review -z system:serviceaccount:openshift-service-ca:service-ca -f pod.yml
RESOURCE                          SERVICE ACCOUNT   ALLOWED BY
Pod/service-ca-7f68c8cf48-9zn59   service-ca        restricted
Interesting. pmoses, did you run the `scc-review` as the ServiceAccount deploying those pods? If not, might I ask you to do so and share the result here?
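For comparison (a sketch; the user name and `pod.yml` are placeholders): `scc-review` evaluates only the pod's service account, while `scc-subject-review` can evaluate an arbitrary user together with groups, which is where a stray system:authenticated grant would surface.

```shell
# Evaluate only the pod's service account:
oc policy scc-review -z service-ca -f pod.yml

# Evaluate a user plus their groups (placeholder user name):
oc policy scc-subject-review -u some-user -g system:authenticated -f pod.yml
```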
So... for me, this turned out to be a fault on an admin's end.

A cluster role and role binding had been created (unknown to me) granting anyuid to system:authenticated. When we cleaned out the CR and CRB and removed the old service-ca pods, the new pods launched successfully.

This wasn't a fault of service-ca, but it did prevent the pod from launching. I see the PR that was pushed earlier this month, and that is nice as well.

Thanks for the follow-up. To summarize, the two times I've seen this:
1. The default SCC was edited improperly.
2. A CRB and RB were created for system:authenticated and anyuid.

Not a bug in service-ca, but it surely stopped the successful upgrade of service-ca.
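A quick way to audit for that second failure mode (a sketch; assumes jq is installed): list the clusterrolebindings whose subjects include system:authenticated and see which roles they grant. On a healthy cluster only a handful of default system: bindings reference that group; anything pointing at a role that grants `use` on an SCC like anyuid is suspect.

```shell
# Print "binding -> role" for every CRB targeting system:authenticated.
oc get clusterrolebindings -o json \
  | jq -r '.items[]
      | select(any(.subjects[]?; .name == "system:authenticated"))
      | "\(.metadata.name) -> \(.roleRef.name)"'
```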
(In reply to pmoses from comment #12)
> So... for me, this turned out to be a fault on an admins end.
>
> A cluster role and role binding were created (unknown to me) for
> system:authenticated for anyuid.
> When we cleaned out the CR and CRB and removed the old service-ca pods, new
> pods successfully launched.
>
> This wasn't a fault of service-ca but did prevent launch of the pod. I see
> the PR that was pushed earlier this month and that is nice as well.
>
> Thanks for the follow up, to summarize, the two times I've seen this:
> 1. default SCC was edited improperly
> 2. CRB and RB was created for system:authenticated and anyuid.
>
> Not a bug in service-ca but surely stopped the successful upgrade of
> service-ca.

Indeed, after deleting the SCC settings from https://doc.traefik.io/traefik-enterprise/installing/kubernetes/teectl/#security-context-constraints, it's working now! Thanks for the investigation!
This was fixed in https://github.com/openshift/service-ca-operator/pull/136
*** Bug 1934014 has been marked as a duplicate of this bug. ***