Description of problem:
After cluster provisioning, a couple of deployments fail to create pods with the following error message: "forbidden: unable to validate against any security context constraint: []". No reason for the forbidden action/object is provided between the []s, and the same problem is happening with multiple deployments. **This issue is happening when provisioning OSD production clusters, and is blocking provisioning, since pods that expose Mutating Webhook endpoints are not being created as expected.**

Version-Release number of selected component (if applicable):
Cluster version is 4.2.16

How reproducible:
It's an intermittent error that manifests right after provisioning of OSD production clusters.

Actual results:
A summary of failed deployments with the same issue:

```
111m  Warning  FailedCreate  replicaset/splunk-forwarder-operator-5855c598cf  Error creating: pods "splunk-forwarder-operator-5855c598cf-" is forbidden: unable to validate against any security context constraint: []
165m  Warning  FailedCreate  job/builds-pruner-1581948000                     Error creating: pods "builds-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m  Warning  FailedCreate  job/builds-pruner-1581951600                     Error creating: pods "builds-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
165m  Warning  FailedCreate  job/deployments-pruner-1581948000                Error creating: pods "deployments-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m  Warning  FailedCreate  job/deployments-pruner-1581951600                Error creating: pods "deployments-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
165m  Warning  FailedCreate  job/image-pruner-1581948000                      Error creating: pods "image-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m  Warning  FailedCreate  job/image-pruner-1581951600                      Error creating: pods "image-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
121m  Warning  FailedCreate  replicaset/validation-webhook-6f8dcdd9fb         Error creating: pods "validation-webhook-6f8dcdd9fb-" is forbidden: unable to validate against any security context constraint: []
111m  Warning  FailedCreate  replicaset/managed-velero-operator-589fbb966f    Error creating: pods "managed-velero-operator-589fbb966f-" is forbidden: unable to validate against any security context constraint: []
```

Additional info:
A must-gather of the impacted cluster has been uploaded here: https://drive.google.com/file/d/1VQ4uQ77iVTzTC1Er-C-IIc9-VKBKaOp4/view?usp=sharing
Adding some additional context to the issue:
1) It's an intermittent issue that has so far been observed in versions 4.2.16 and 4.3.0.
2) The issue never reconciles to a good state. Even after 4+ hours the deployment still had 0 pods, and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC (see the command sketch below).
3) The deployment and all supporting objects for this component are provisioned using the hive infrastructure, which has cluster-admin permissions in all OSD clusters.
4) The issue hasn't been observed in clusters running 4.4 builds yet; will keep this BZ updated as we get examples.
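For reference, a minimal sketch of the manual workaround described in item 2, assuming the stuck ServiceAccount is `validation-webhook` in the `openshift-validation-webhook` namespace (names taken from the examples later in this BZ; substitute the affected ones):

```sh
# Grant the stuck pod's ServiceAccount access to the privileged SCC
# so its pods stop being rejected by SCC admission.
oc adm policy add-scc-to-user privileged \
    -z validation-webhook \
    -n openshift-validation-webhook

# Once the root cause is addressed, the grant should be reverted:
oc adm policy remove-scc-from-user privileged \
    -z validation-webhook \
    -n openshift-validation-webhook
```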
We used to create the SCCs late, when the openshift-apiserver came up. That certainly led to those transient errors. This was fixed in https://github.com/openshift/cluster-kube-apiserver-operator/pull/725 and backported to 4.3 in https://github.com/openshift/cluster-kube-apiserver-operator/pull/728.

Item 2) surprises me: the deployment controller should repair that as soon as the SCCs exist.

> and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC.

How would that service account work with the SCC normally, i.e. when you don't see the given error?
Normally that service account doesn't need or have any special SCC configuration. When this intermittent issue is not happening, the default security configs are enough for this service to be deployed. Please find the deployment definition below:

```yaml
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    labels:
      app: validation-webhook
      deployment: validation-webhook
    name: validation-webhook
    namespace: openshift-validation-webhook
  spec:
    replicas: 3
    template:
      metadata:
        labels:
          app: validation-webhook
      spec:
        containers:
        - command:
          - gunicorn
          - --config
          - /app/gunicorn.py
          - --ca-certs
          - /service-ca/service-ca.crt
          - --keyfile
          - /service-certs/tls.key
          - --certfile
          - /service-certs/tls.crt
          - --access-logfile
          - '-'
          - webhook:app
          env:
          - name: SUBSCRIPTION_VALIDATION_NAMESPACES
            value: openshift-marketplace
          - name: GROUP_VALIDATION_ADMIN_GROUP
            value: osd-sre-admins,osd-sre-cluster-admins
          - name: GROUP_VALIDATION_PREFIX
            value: osd-sre-
          image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
          imagePullPolicy: Always
          name: validation-webhook
          ports:
          - containerPort: 5000
          volumeMounts:
          - mountPath: /service-certs
            name: service-certs
            readOnly: true
          - mountPath: /service-ca
            name: service-ca
            readOnly: true
        initContainers:
        - command:
          - python
          - /app/init.py
          - -a
          - managed.openshift.io/inject-cabundle-from
          image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
          name: inject-cert
        restartPolicy: Always
        serviceAccountName: validation-webhook
        volumes:
        - name: service-certs
          secret:
            secretName: webhook-cert
        - configMap:
            name: webhook-cert
          name: service-ca
```
So the reason SCC admission is not working is this error message from the openshift-controller-manager:

```
2020-02-17T11:08:34.205040894Z E0217 11:08:34.204994       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: Internal error occurred: failed calling webhook "namespace-validation.managed.openshift.io": Post https://validation-webhook.openshift-validation-webhook.svc:443/namespace-validation?timeout=30s: no endpoints available for service "validation-webhook"
```

Did you by any chance create a ValidatingWebhookConfiguration before the underlying service was ready to handle requests, thus breaking any possibility of modifying any namespace?
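For anyone triaging a similar cluster, a hedged sketch of how to confirm this failure mode (the annotation checked below is the one written by the namespace SCC allocation controller named in the log; the namespace and service names are taken from the examples in this BZ):

```sh
# If the SCC allocation controller cannot sync, the namespace will be missing
# its SCC UID-range annotations, and no pod in it can validate against any SCC.
oc get namespace openshift-validation-webhook -o yaml | grep 'sa\.scc'

# The log complains about "no endpoints available"; verify whether the
# webhook service actually has any ready endpoints to serve admission calls.
oc get endpoints validation-webhook -n openshift-validation-webhook
```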
Hi Standa, the ValidatingWebhookConfiguration object is created after the deployment and service are deployed; they're all deployed by a hive SelectorSyncSet. The issue is that the deployment is not able to create its pods, failing with the error: forbidden: unable to validate against any security context constraint: []

Note:
1) we have no validating webhook covering pods
2) the error above seems to be related to SCC, e.g. if an admin logs into one of those stuck clusters, they can unstick the process by assigning the privileged SCC to the webhook deployment

The error is intermittent but was also observed in tests with 4.4 nightly builds.
The message in comment 10 indicates that someone created an admission webhook controlling namespaces which fails to skip its own namespace. This pattern prevents a cluster from restarting, and in this case it prevents a controller from creating the namespace annotations required to create a pod inside the namespace. This is just a race in the workload/admission webhook that you're adding, not a bug in any openshift code.
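A minimal sketch of the kind of exclusion the comment above calls for, using the v1beta1 API that matches 4.2/4.3-era clusters. The object name, rules, and the `name=openshift-validation-webhook` namespace label are illustrative assumptions, not taken from the actual SelectorSyncSet; the label would have to be applied by the deployment tooling, since it is not set automatically on these releases:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: namespace-validation        # hypothetical; the real object name is not shown in this BZ
webhooks:
- name: namespace-validation.managed.openshift.io
  # Fail open so a webhook with no ready endpoints cannot wedge namespace syncing.
  failurePolicy: Ignore
  # Skip the webhook's own namespace so it can always bootstrap itself.
  namespaceSelector:
    matchExpressions:
    - key: name
      operator: NotIn
      values: ["openshift-validation-webhook"]
  clientConfig:
    service:
      namespace: openshift-validation-webhook
      name: validation-webhook
      path: /namespace-validation   # path taken from the error message in comment 10
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["namespaces"]
```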