Bug 1803956 - Error creating pods right after cluster provisioning - unable to validate against any security context constraint [NEEDINFO]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.2.z
Hardware: All
OS: All
Severity: high
Priority: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-17 19:32 UTC by Rogerio Bastos
Modified: 2020-04-02 17:02 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-02 17:02:35 UTC
Target Upstream Version:
slaznick: needinfo? (rbastos)


Attachments

Description Rogerio Bastos 2020-02-17 19:32:37 UTC
Description of problem:
After cluster provisioning, a couple of deployments fail to create pods with the following error message:
"forbidden: unable to validate against any security context constraint: []"
No reason for the forbidden action is given between the brackets, and the same problem is happening across multiple deployments.
**This issue is happening when provisioning OSD production clusters, and is blocking provisioning, since pods that expose Mutating Webhook endpoints are not being created as expected.**
Must-gather from cluster here: https://drive.google.com/file/d/1VQ4uQ77iVTzTC1Er-C-IIc9-VKBKaOp4/view?usp=sharing
Version-Release number of selected component (if applicable):
Cluster version is 4.2.16
How reproducible:
It's an intermittent error that manifests right after provisioning of OSD production clusters.
Actual results:
A summary of failed deployments with the same issue:
111m        Warning   FailedCreate         replicaset/splunk-forwarder-operator-5855c598cf                   Error creating: pods "splunk-forwarder-operator-5855c598cf-" is forbidden: unable to validate against any security context constraint: []
165m        Warning   FailedCreate       job/builds-pruner-1581948000              Error creating: pods "builds-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m        Warning   FailedCreate       job/builds-pruner-1581951600              Error creating: pods "builds-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
165m        Warning   FailedCreate       job/deployments-pruner-1581948000         Error creating: pods "deployments-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m        Warning   FailedCreate       job/deployments-pruner-1581951600         Error creating: pods "deployments-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
165m        Warning   FailedCreate       job/image-pruner-1581948000               Error creating: pods "image-pruner-1581948000-" is forbidden: unable to validate against any security context constraint: []
111m        Warning   FailedCreate       job/image-pruner-1581951600               Error creating: pods "image-pruner-1581951600-" is forbidden: unable to validate against any security context constraint: []
No resources found.
121m        Warning   FailedCreate        replicaset/validation-webhook-6f8dcdd9fb   Error creating: pods "validation-webhook-6f8dcdd9fb-" is forbidden: unable to validate against any security context constraint: []
111m        Warning   FailedCreate        replicaset/managed-velero-operator-589fbb966f   Error creating: pods "managed-velero-operator-589fbb966f-" is forbidden: unable to validate against any security context constraint: []

Comment 1 Rogerio Bastos 2020-02-18 21:26:48 UTC
Adding some additional context to the issue:

1) It's an intermittent issue that has so far been observed in versions 4.2.16 and 4.3.0.
2) The issue never reconciles back to a good state. Even after 4+ hours the deployment still had 0 pods, and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC.
3) The deployment and all supporting objects for this component are provisioned through the hive infrastructure, which has cluster-admin permissions in all OSD clusters.
4) The issue hasn't been observed in clusters running 4.4 builds yet; we'll keep this BZ updated as we get examples.
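
The manual workaround from item 2 above amounts to adding the pod's ServiceAccount to the privileged SCC's users list. A minimal sketch of what that change looks like, using the namespace and ServiceAccount names from the deployment in comment 3 (illustrative only; in practice `oc adm policy add-scc-to-user` is the safer way to apply it, since it patches the list rather than replacing the object):

```yaml
# Sketch of the workaround: grant the webhook's ServiceAccount the
# privileged SCC. Equivalent to:
#   oc adm policy add-scc-to-user privileged \
#     -z validation-webhook -n openshift-validation-webhook
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: privileged
users:
- system:serviceaccount:openshift-validation-webhook:validation-webhook
```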

Comment 2 Stefan Schimanski 2020-02-20 12:15:27 UTC
We used to create the SCCs late, when openshift-apiserver came up. That certainly led to those kinds of transient errors. This was fixed in https://github.com/openshift/cluster-kube-apiserver-operator/pull/725 and backported to 4.3 in https://github.com/openshift/cluster-kube-apiserver-operator/pull/728

2) surprises me. The deployment controller should repair that as soon as SCCs exist.

> and the only way to unblock it was to manually add the pod's ServiceAccount to the privileged SCC.

How would that service account work with the SCC normally, i.e. when you don't see the given error?

Comment 3 Rogerio Bastos 2020-02-20 12:56:08 UTC
Normally that service account doesn't need or have any special SCC configuration. When this intermittent issue is not happening, the default security configs are sufficient for this service to be deployed.

Please find below the deployment definition:

  - apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      labels:
        app: validation-webhook
        deployment: validation-webhook
      name: validation-webhook
      namespace: openshift-validation-webhook
    spec:
      replicas: 3
      template:
        metadata:
          labels:
            app: validation-webhook
        spec:
          containers:
          - command:
            - gunicorn
            - --config
            - /app/gunicorn.py
            - --ca-certs
            - /service-ca/service-ca.crt
            - --keyfile
            - /service-certs/tls.key
            - --certfile
            - /service-certs/tls.crt
            - --access-logfile
            - '-'
            - webhook:app
            env:
            - name: SUBSCRIPTION_VALIDATION_NAMESPACES
              value: openshift-marketplace
            - name: GROUP_VALIDATION_ADMIN_GROUP
              value: osd-sre-admins,osd-sre-cluster-admins
            - name: GROUP_VALIDATION_PREFIX
              value: osd-sre-
            image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
            imagePullPolicy: Always
            name: validation-webhook
            ports:
            - containerPort: 5000
            volumeMounts:
            - mountPath: /service-certs
              name: service-certs
              readOnly: true
            - mountPath: /service-ca
              name: service-ca
              readOnly: true
          initContainers:
          - command:
            - python
            - /app/init.py
            - -a
            - managed.openshift.io/inject-cabundle-from
            image: quay.io/app-sre/managed-cluster-validating-webhooks:672d0cf
            name: inject-cert
          restartPolicy: Always
          serviceAccountName: validation-webhook
          volumes:
          - name: service-certs
            secret:
              secretName: webhook-cert
          - configMap:
              name: webhook-cert
            name: service-ca

Comment 10 Standa Laznicka 2020-03-23 12:26:30 UTC
So the reason SCC admission is not working is this error message from the openshift-controller-manager:
```
2020-02-17T11:08:34.205040894Z E0217 11:08:34.204994       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: Internal error occurred: failed calling webhook "namespace-validation.managed.openshift.io": Post https://validation-webhook.openshift-validation-webhook.svc:443/namespace-validation?timeout=30s: no endpoints available for service "validation-webhook"
```

Did you by any chance create a ValidatingWebhookConfiguration before the underlying service was ready to handle requests, thus breaking any possibility of modifying any namespace?
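
A registration that would produce the failure mode in the log above might look like the following hedged sketch (the webhook name, service name, and path are taken from the error message; everything else is an assumption): with `failurePolicy: Fail` and rules covering namespaces, every namespace update is rejected for as long as the service has no endpoints.

```yaml
# Hypothetical sketch of the risky registration, reconstructed from the
# error message above. While the backing service has no endpoints, every
# matched namespace request is rejected outright.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: namespace-validation.managed.openshift.io
webhooks:
- name: namespace-validation.managed.openshift.io
  failurePolicy: Fail
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["namespaces"]
  clientConfig:
    service:
      name: validation-webhook
      namespace: openshift-validation-webhook
      path: /namespace-validation
```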

Comment 12 Rogerio Bastos 2020-03-30 13:20:13 UTC
Hi Standa, the ValidatingWebhookConfiguration object is created after the deployment and service are deployed; they're all deployed by a hive SelectorSyncSet. The issue is that the deployment is not able to create its pods, failing with the error:

forbidden: unable to validate against any security context constraint: []

Note:
1) we have no validating webhook covering pods;
2) the error above seems to be related to SCC. For example, if an admin logs into one of those stuck clusters, they can unstick the process by assigning the privileged SCC to the webhook deployment's service account.


The error is intermittent but was also observed in tests with 4.4 nightly builds.

Comment 14 David Eads 2020-04-02 17:02:35 UTC
The message in comment 10 indicates that someone created an admission plugin controlling namespaces which fails to skip its own namespace. This pattern prevents a cluster from restarting, and in this case it prevents a controller from creating the namespace annotations required to create a pod inside the namespace. This is just a race in the workload/admission plugin that you're adding, not a bug in any OpenShift code.
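
One common way to avoid the self-blocking pattern described above is to make the webhook exempt its own namespace via a namespaceSelector (for requests on Namespace objects, the selector is evaluated against the namespace's own labels), or to fail open with `failurePolicy: Ignore`. A hedged sketch of the relevant fragment; the label key is an illustrative assumption, and the openshift-validation-webhook namespace would need to carry the matching label:

```yaml
# Sketch: exclude the webhook's own (labeled) namespace so the cluster can
# still reconcile it while the webhook service is down. The label key is an
# illustrative assumption, not from the original manifests.
webhooks:
- name: namespace-validation.managed.openshift.io
  failurePolicy: Ignore   # alternative safety net: fail open
  namespaceSelector:
    matchExpressions:
    - key: managed.openshift.io/webhook-exempt
      operator: DoesNotExist
```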

