Bug 1929801 - service-ca is in 'ConfigError' state after replacing default ingress cert
Summary: service-ca is in 'ConfigError' state after replacing default ingress cert
Keywords:
Status: CLOSED DUPLICATE of bug 1914446
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: service-ca
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Standa Laznicka
QA Contact: scheng
URL:
Whiteboard:
Duplicates: 1934014
Depends On:
Blocks:
 
Reported: 2021-02-17 16:53 UTC by Thuy Nguyen
Modified: 2021-05-18 06:47 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 06:47:34 UTC
Target Upstream Version:
Embargoed:


Attachments
clusterrolebindings (263.95 KB, text/plain), 2021-02-23 08:52 UTC, Standa Laznicka
clusterroles (767.91 KB, text/plain), 2021-02-23 08:53 UTC, Standa Laznicka
clusterrolebindings (265.29 KB, text/plain), 2021-02-23 08:54 UTC, Standa Laznicka

Description Thuy Nguyen 2021-02-17 16:53:01 UTC
Description of problem:
The service-ca pod is in 'CreateContainerConfigError' state after replacing the default ingress cert.

Version-Release number of selected component (if applicable):
4.7.0-rc2


How reproducible:


Steps to Reproduce:
1. Replace the default ingress cert, following the doc https://docs.openshift.com/container-platform/4.6/security/certificates/replacing-default-ingress-certificate.html (a command sketch follows the steps).
2. Observe that the service-ca pod goes into 'CreateContainerConfigError' state.
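
For reference, step 1 boils down to roughly the following (a sketch only; the secret name custom-ingress-cert and the file names are example values, the linked doc is authoritative):

# Create a TLS secret holding the replacement certificate chain and key
# (file names are examples):
oc create secret tls custom-ingress-cert \
    --cert=example.crt --key=example.key \
    -n openshift-ingress

# Point the default IngressController at the new secret:
oc patch ingresscontroller.operator default \
    --type=merge \
    -p '{"spec":{"defaultCertificate": {"name": "custom-ingress-cert"}}}' \
    -n openshift-ingress-operator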


Actual results:
The service-ca pod is in 'CreateContainerConfigError' state.

Expected results:
All pods should be up and running

Additional info:

# oc -n openshift-service-ca get po
NAME                          READY   STATUS                       RESTARTS   AGE
service-ca-76bbb97c47-lwfws   0/1     CreateContainerConfigError   0          4d15h



# oc -n openshift-service-ca get po service-ca-76bbb97c47-lwfws -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    message: 'containers with unready status: [service-ca-controller]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    message: 'containers with unready status: [service-ca-controller]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-02-12T23:30:48Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a79b840f91d8074737e012c8650313534ea01c880d6d544633d0c34febcca2c
    imageID: ""
    lastState: {}
    name: service-ca-controller
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'container has runAsNonRoot and image will run as root (pod: "service-ca-76bbb97c47-lwfws_openshift-service-ca(02846586-8661-4a53-9507-3906acf11179)", container: service-ca-controller)'
        reason: CreateContainerConfigError
  hostIP: 10.0.0.6
  phase: Pending
  podIP: 10.130.0.15
  podIPs:
  - ip: 10.130.0.15
  qosClass: Burstable
  startTime: "2021-02-12T23:30:48Z"



must-gather log location:
https://drive.google.com/drive/folders/1DkkzfWwzc51Su09M2r1a3ABfRnCg7vlg?usp=sharing

Comment 1 Stefan Schimanski 2021-02-18 11:07:43 UTC
Visible from the logs: the service-ca pod is running with the anyuid SCC, but it shouldn't have access to it.

Can you provide kube-apiserver audit logs?
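
For example, the standard tooling should work here (paths and names are placeholders; see the audit-log docs if the layout differs):

# Collect the kube-apiserver audit logs into a local must-gather directory:
oc adm must-gather -- /usr/bin/gather_audit_logs

# Or list and fetch them directly from the control-plane nodes:
oc adm node-logs --role=master --path=kube-apiserver/
oc adm node-logs <master-node-name> --path=kube-apiserver/audit.log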

Comment 2 Standa Laznicka 2021-02-18 11:12:06 UTC
Did you make any changes to RBAC? If not, can you please also get us all the clusterrolebindings that contain the "service-ca" service account in their subjects, plus the content of all the clusterroles bound by those clusterrolebindings? For example:
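
Something along these lines should do it (a sketch assuming jq is available; adjust as needed):

# List clusterrolebindings whose subjects include the service-ca service account,
# together with the clusterrole each one binds:
oc get clusterrolebindings -o json | jq -r '
  .items[]
  | select(any(.subjects[]?;
      .kind == "ServiceAccount"
      and .name == "service-ca"
      and .namespace == "openshift-service-ca"))
  | "\(.metadata.name) -> \(.roleRef.name)"'

# Then dump each bound clusterrole:
oc get clusterrole <role-name> -o yaml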

Comment 3 Thuy Nguyen 2021-02-18 20:19:33 UTC
Provided access to the cluster to Standa. No changes were made to RBAC. The problem occurred after changing the default ingress cert, and it is causing the RHACM install to fail down the road.

audit log location - https://drive.google.com/file/d/1vyEOVg7JfJB9zCv-n04fZYdsOb8B7edE/view?usp=sharing

Comment 5 Standa Laznicka 2021-02-23 08:52:44 UTC
Created attachment 1758790 [details]
clusterrolebindings

Comment 6 Standa Laznicka 2021-02-23 08:53:41 UTC
Created attachment 1758791 [details]
clusterroles

Comment 7 Standa Laznicka 2021-02-23 08:54:19 UTC
Created attachment 1758792 [details]
clusterrolebindings

Comment 8 Standa Laznicka 2021-02-23 09:05:38 UTC
I've uploaded the clusterroles and clusterrolebindings from the cluster. On the cluster, I created an IdP to test which capabilities a normal user has. Any user is able to create a privileged pod, and apparently the anyuid SCC is also accessible, since it is being picked up by pods that should otherwise get the restricted SCC. `oc auth can-i use scc/anyuid` returns "no", though.
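
For anyone reproducing this, the effective access can be cross-checked roughly like this (run as cluster-admin; the impersonated subject is just an example):

# Who is allowed to "use" the anyuid SCC according to RBAC:
oc adm policy who-can use scc anyuid

# Explicit check while impersonating the service-ca service account
# (or any ordinary user via --as=<user>):
oc auth can-i use scc/anyuid --as=system:serviceaccount:openshift-service-ca:service-ca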

Comment 9 sam 2021-03-04 05:00:49 UTC
Having the same issue when upgrading from 4.6.18 to 4.7.0: the service-ca upgrade was stuck for a few days and caused other components to fail because no certificates were assigned.

# oc get events
LAST SEEN   TYPE      REASON             OBJECT                             MESSAGE
8m49s       Normal    Scheduled          pod/service-ca-78c87b4f98-lh64b    Successfully assigned openshift-service-ca/service-ca-78c87b4f98-lh64b to master01
8m45s       Normal    AddedInterface     pod/service-ca-78c87b4f98-lh64b    Add eth0 [10.128.0.115/23]
3m42s       Normal    Pulled             pod/service-ca-78c87b4f98-lh64b    Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e401ae9497bb32c6e29d46048fd6caf75d02dc775fef96e4d3a9c2f62f389f57" already present on machine
6m49s       Warning   Failed             pod/service-ca-78c87b4f98-lh64b    Error: container has runAsNonRoot and image will run as root
12m         Normal    Pulled             pod/service-ca-78c87b4f98-wvjxs    Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e401ae9497bb32c6e29d46048fd6caf75d02dc775fef96e4d3a9c2f62f389f57" already present on machine
8m49s       Normal    SuccessfulCreate   replicaset/service-ca-78c87b4f98   Created pod: service-ca-78c87b4f98-lh64b

Comment 10 pmoses 2021-03-18 22:25:34 UTC
I've seen the same error on 2 clusters.

On the first cluster it was indeed an easy pick: the SCC had been hand-edited, so it was fixed, no problem.

The second cluster has been more problematic. SCCs, clusterroles and clusterrolebindings all look good compared to known good working clusters.

Same problem on the pod: service-ca is requesting anyuid.

from pod:

      openshift.io/scc: anyuid
    creationTimestamp: "2021-03-18T21:24:16Z"


Dumping the pod yaml and running an SCC review:

oc policy scc-review -z system:serviceaccount:openshift-service-ca:service-ca -f pod.yml
RESOURCE                          SERVICE ACCOUNT   ALLOWED BY   
Pod/service-ca-7f68c8cf48-9zn59   service-ca        restricted
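
A quick way to see which SCC a running pod was actually admitted under is the annotation written at admission time (a sketch; the pod name is a placeholder):

# The SCC chosen by the admission plugin is recorded in this annotation:
oc -n openshift-service-ca get pod <service-ca-pod-name> \
    -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'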

Comment 11 Standa Laznicka 2021-03-19 11:34:50 UTC
Interesting. pmoses, did you run the `scc-review` as the ServiceAccount that deploys those pods? If not, could I ask you to do that and share the result here?

Comment 12 pmoses 2021-03-19 16:21:40 UTC
So... for me, this turned out to be a fault on the admin's end.

A cluster role and cluster role binding had been created (unknown to me) granting anyuid to system:authenticated.
When we cleaned out the CR and CRB and removed the old service-ca pods, the new pods launched successfully.

This wasn't a fault of service-ca, but it did prevent the pod from launching. I also see the PR that was pushed earlier this month, which is nice as well.

Thanks for the follow-up. To summarize, the two times I've seen this:
1. the default SCC was edited improperly
2. a CRB and RB were created for system:authenticated and anyuid (cleanup sketch below)

Not a bug in service-ca, but it surely stopped the successful upgrade of service-ca.
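
For anyone in the same situation, the cleanup amounted to roughly the following (object names are hypothetical; identify the real ones first, e.g. with the jq query from comment 2):

# Inspect the suspicious binding before removing it (name is hypothetical):
oc describe clusterrolebinding grant-anyuid-to-authenticated

# Remove the binding (and the custom role, if it is no longer wanted):
oc delete clusterrolebinding grant-anyuid-to-authenticated
oc delete clusterrole grant-anyuid

# Recreate the service-ca pods so they are re-admitted under the restricted SCC:
oc -n openshift-service-ca delete pod --all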

Comment 13 sam 2021-03-23 17:17:49 UTC
(In reply to pmoses from comment #12)
> So... for me, this turned out to be a fault on an admins end. 
> 
> A cluster role and role binding were created (unknown to me) for
> system:authenticated for anyuid. 
> When we cleaned out the CR and CRB and removed the old service-ca pods, new
> pods successfully launched. 
> 
> This wasn't a fault of service-ca but did prevent launch of the pod. I see
> the PR that was pushed earlier this month and that is nice as well. 
> 
> Thanks for the follow up, to summarize, the two times I've seen this:
> 1. default SCC was edited improperly
> 2. CRB and RB was created for system:authenticated and anyuid. 
> 
> Not a bug in service-ca but surely stopped the successful upgrade of
> service-ca.

Indeed, after deleting the SCC setup from https://doc.traefik.io/traefik-enterprise/installing/kubernetes/teectl/#security-context-constraints, it's working now!
Thanks for the investigation!

Comment 14 Standa Laznicka 2021-04-16 12:53:36 UTC
This was fixed in https://github.com/openshift/service-ca-operator/pull/136

Comment 15 Standa Laznicka 2021-04-16 13:03:06 UTC
*** Bug 1934014 has been marked as a duplicate of this bug. ***

