Bug 1803956
| Summary: | Error creating pods right after cluster provisioning - unable to validate against any security context constraint | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Rogerio Bastos <rbastos> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED NOTABUG | QA Contact: | zhou ying <yinzhou> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.2.z | CC: | aos-bugs, cshereme, deads, eparis, jeder, jialiu, jokerman, mfojtik, nmalik, nstielau, slaznick, wsun, xtian, xxia |
| Target Milestone: | --- | Keywords: | ServiceDeliveryImpact |
| Target Release: | 4.5.0 | Flags: | slaznick: needinfo? (rbastos) |
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-02 17:02:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description Rogerio Bastos 2020-02-17 19:32:37 UTC
Adding some additional context to the issue:

1) It's an intermittent issue that has been observed so far in versions 4.2.16 and 4.3.0.
2) The issue never reconciles to a good state. Even after 4+ hours the deployment still had 0 pods, and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC.
3) The deployment and all supporting objects for this component are provisioned using the hive infrastructure, which has cluster-admin permissions in all OSD clusters.
4) The issue hasn't been observed in clusters running 4.4 builds yet; we will keep this BZ updated as we get more examples.

We used to create the SCCs late, when openshift-apiserver came up. That certainly led to those transient errors. This was fixed in https://github.com/openshift/cluster-kube-apiserver-operator/pull/725 and backported to 4.3 in https://github.com/openshift/cluster-kube-apiserver-operator/pull/728.

2) surprises me. The deployment controller should repair that as soon as SCCs exist.

> and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC.

How would that service account work with the SCC normally, i.e. when you don't see the given error?

Normally that service account doesn't need or have any special SCC configuration. When this intermittent issue is not happening, the default security configs are enough for this service to be deployed.
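For reference, a minimal sketch of the manual workaround described above (unsticking the deployment by granting the webhook's ServiceAccount the privileged SCC). The ServiceAccount and namespace names are taken from the deployment definition below:

```sh
# Workaround sketch only: grant the webhook's ServiceAccount access to the
# privileged SCC so its pods pass SCC admission.
oc adm policy add-scc-to-user privileged \
    -z validation-webhook -n openshift-validation-webhook

# To revert once the cluster has reconciled:
oc adm policy remove-scc-from-user privileged \
    -z validation-webhook -n openshift-validation-webhook
```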
Please find below the deployment definition:
```yaml
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    labels:
      app: validation-webhook
      deployment: validation-webhook
    name: validation-webhook
    namespace: openshift-validation-webhook
  spec:
    replicas: 3
    template:
      metadata:
        labels:
          app: validation-webhook
      spec:
        containers:
        - command:
          - gunicorn
          - --config
          - /app/gunicorn.py
          - --ca-certs
          - /service-ca/service-ca.crt
          - --keyfile
          - /service-certs/tls.key
          - --certfile
          - /service-certs/tls.crt
          - --access-logfile
          - '-'
          - webhook:app
          env:
          - name: SUBSCRIPTION_VALIDATION_NAMESPACES
            value: openshift-marketplace
          - name: GROUP_VALIDATION_ADMIN_GROUP
            value: osd-sre-admins,osd-sre-cluster-admins
          - name: GROUP_VALIDATION_PREFIX
            value: osd-sre-
          image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
          imagePullPolicy: Always
          name: validation-webhook
          ports:
          - containerPort: 5000
          volumeMounts:
          - mountPath: /service-certs
            name: service-certs
            readOnly: true
          - mountPath: /service-ca
            name: service-ca
            readOnly: true
        initContainers:
        - command:
          - python
          - /app/init.py
          - -a
          - managed.openshift.io/inject-cabundle-from
          image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
          name: inject-cert
        restartPolicy: Always
        serviceAccountName: validation-webhook
        volumes:
        - name: service-certs
          secret:
            secretName: webhook-cert
        - configMap:
            name: webhook-cert
          name: service-ca
```
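As an aside, a few standard oc commands that could help confirm whether SCC admission is what is blocking pod creation in this namespace (a diagnostic sketch; the annotation check assumes the usual openshift.io/sa.scc.* annotations written by the namespace SCC allocation controller, and no output is reproduced here):

```sh
# Check whether the namespace ever received its SCC UID/MCS allocation
# annotations (written by the namespace SCC allocation controller).
oc get namespace openshift-validation-webhook -o yaml | grep 'openshift.io/sa.scc'

# Look at the ReplicaSet and recent events for the SCC admission error.
oc -n openshift-validation-webhook describe replicaset
oc -n openshift-validation-webhook get events --sort-by=.lastTimestamp
```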
So the reason SCC admission is not working is this error message from the openshift-controller-manager:

```
2020-02-17T11:08:34.205040894Z E0217 11:08:34.204994 1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: Internal error occurred: failed calling webhook "namespace-validation.managed.openshift.io": Post https://validation-webhook.openshift-validation-webhook.svc:443/namespace-validation?timeout=30s: no endpoints available for service "validation-webhook"
```

Did you by any chance create a ValidatingWebhookConfiguration before the underlying service was ready to handle requests, thus breaking any possibility to modify any namespace?

Hi Standa, the ValidatingWebhookConfiguration object is created after the deployment and service are deployed; they're all deployed by a hive SelectorSyncSet. The issue is that the deployment is not able to create its pods, failing with the error:

forbidden: unable to validate against any security context constraint: []

Note: 1) we have no validating webhook covering pods, and 2) the error above seems to be related to SCC, e.g. if an admin logs into one of those stuck clusters, they can unstick the process by assigning the privileged SCC to the webhook deployment. The error is intermittent but was also observed in tests with 4.4 nightly builds.

The message in comment 10 indicates that someone created an admission plugin controlling namespaces which fails to skip its own namespace. That pattern prevents a cluster from restarting, and in this case it prevents a controller from creating the namespace annotations required to create a pod inside that namespace. This is just a race in the workload/admission plugin that you're adding, not a bug in any openshift code.
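To illustrate the diagnosis in the last comment, below is a minimal sketch (not the actual OSD configuration) of how a namespace-targeting webhook can avoid this race: fail open while its backend has no endpoints, and/or exclude its own namespace via a namespaceSelector. The webhook name, service, and path are taken from the error message above; the rules and the exemption label (managed.openshift.io/validation-webhook-exempt, which would have to be applied to the openshift-validation-webhook namespace) are assumptions for illustration only.

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: namespace-validation.managed.openshift.io
webhooks:
- name: namespace-validation.managed.openshift.io
  # Fail open: if the webhook service has no endpoints yet, namespace syncs
  # (including the SCC allocation controller's annotation writes) still succeed.
  failurePolicy: Ignore
  clientConfig:
    service:
      namespace: openshift-validation-webhook
      name: validation-webhook
      path: /namespace-validation
  rules:                      # illustrative only; not the actual OSD rules
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["UPDATE", "DELETE"]
    resources: ["namespaces"]
  # Hypothetical exemption: skip namespaces carrying this label, and apply the
  # label to openshift-validation-webhook so the webhook never gates its own
  # namespace.
  namespaceSelector:
    matchExpressions:
    - key: managed.openshift.io/validation-webhook-exempt
      operator: DoesNotExist
```

Either measure on its own would break the circular dependency between the webhook's own namespace and the pods that back the webhook service.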