Bug 1861917 - installs lagging due to unable to validate against any security context constraint [NEEDINFO]
Summary: installs lagging due to unable to validate against any security context const...
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.6.0
Assignee: Tomáš Nožička
QA Contact: RamaKasturi
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-07-29 21:38 UTC by Abhinav Dahiya
Modified: 2020-09-22 13:09 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
mfojtik: needinfo?


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 441 None closed Bug 1867494: Explicitly use internal LB for KCM and CPC 2020-09-21 06:25:41 UTC
Github openshift cluster-kube-controller-manager-operator pull 443 None closed bug 1861917: add render --cluster-policy-controller-image to add that controller to bootstrap yaml 2020-09-21 06:25:40 UTC
Github openshift cluster-kube-controller-manager-operator pull 446 None closed Bug 1861917: Render cluster-policy-controller for bootstrap 2020-09-21 06:25:39 UTC
Github openshift cluster-kube-controller-manager-operator pull 449 None closed Bug 1861917: Fix bootstrap cpc 2020-09-21 06:25:39 UTC
Github openshift cluster-kube-controller-manager-operator pull 451 None closed Bug 1861917: Enforce cpc config 2020-09-21 06:25:39 UTC
Github openshift cluster-kube-controller-manager-operator pull 453 None open [WIP] Bug 1861917: Fix bootstrap cpc config name 2020-09-21 12:55:38 UTC
Github openshift installer pull 4131 None closed bug 1861917: bootkube: add image for cluster-policy-controller 2020-09-21 06:25:42 UTC
Github openshift installer pull 4178 None closed Bug 1861917: Add cpc config to bootstrap 2020-09-21 06:25:42 UTC

Description Abhinav Dahiya 2020-07-29 21:38:26 UTC
Description of problem:

During installs I'm seeing operators pods failing to be accepted due an error

```
Warning  FailedCreate      6m26s (x17 over 11m)  replicaset-controller  Error creating: pods "machine-api-operator-7787b5cbc6-" is forbidden: unable to validate against any security context constraint: []
```

This causes all the pods creation to cause delays from 5 to 11mins..


Version-Release number of selected component (if applicable):

4.5.z and 4.6

How reproducible:

All clusters.

Steps to Reproduce:

If you look at any CI job you can easily see this error.

```
EVENTS_URL=<events.json URL from CI artifcats>
curl -s "$EVENTS_URL" | jq '.items[] | select(.involvedObject.kind == "ReplicaSet") | select (.message | contains("unable to validate against any security context constraint")) | {name: .metadata.name, namespace: .metadata.namespace, count: .count, firstTimestamp: .firstTimestamp, lastTimestamp: .lastTimestamp}' 
```

Some examples are:

> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1285719373872369664/artifacts/e2e-azure/events.json

```
{
  "name": "cloud-credential-operator-74bcd7cf6.1623e9b1b641f5e5",
  "namespace": "openshift-cloud-credential-operator",
  "count": 10,
  "firstTimestamp": "2020-07-21T23:54:02Z",
  "lastTimestamp": "2020-07-21T23:59:29Z"
}
{
  "name": "cluster-storage-operator-8c9bb9d97.1623e9b22a6e69d3",
  "namespace": "openshift-cluster-storage-operator",
  "count": 7,
  "firstTimestamp": "2020-07-21T23:54:04Z",
  "lastTimestamp": "2020-07-21T23:59:27Z"
}
{
  "name": "cluster-image-registry-operator-74cd744888.1623e9b65ff9d56c",
  "namespace": "openshift-image-registry",
  "count": 17,
  "firstTimestamp": "2020-07-21T23:54:22Z",
  "lastTimestamp": "2020-07-21T23:59:50Z"
}
{
  "name": "ingress-operator-5ddffb844c.1623e9b546ea80dd",
  "namespace": "openshift-ingress-operator",
  "count": 17,
  "firstTimestamp": "2020-07-21T23:54:17Z",
  "lastTimestamp": "2020-07-21T23:59:45Z"
}
{
  "name": "insights-operator-85846d6568.1623e9b237f063eb",
  "namespace": "openshift-insights",
  "count": 17,
  "firstTimestamp": "2020-07-21T23:54:04Z",
  "lastTimestamp": "2020-07-21T23:59:32Z"
}
{
  "name": "machine-api-operator-7496cc5cc6.1623e9b967659d24",
  "namespace": "openshift-machine-api",
  "count": 17,
  "firstTimestamp": "2020-07-21T23:54:35Z",
  "lastTimestamp": "2020-07-22T00:00:03Z"
}
{
  "name": "marketplace-operator-bddd5df.1623e9b1aa7b6b3c",
  "namespace": "openshift-marketplace",
  "count": 13,
  "firstTimestamp": "2020-07-21T23:54:02Z",
  "lastTimestamp": "2020-07-21T23:59:29Z"
}
{
  "name": "cluster-monitoring-operator-d95dcd94b.1623e9b2323cc69d",
  "namespace": "openshift-monitoring",
  "count": 7,
  "firstTimestamp": "2020-07-21T23:54:04Z",
  "lastTimestamp": "2020-07-21T23:59:27Z"
}
```

> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1285813868605476864/artifacts/e2e-azure/events.json

```
{
  "name": "cloud-credential-operator-58b8bdd8b6.1623fdffb710e70e",
  "namespace": "openshift-cloud-credential-operator",
  "count": 10,
  "firstTimestamp": "2020-07-22T06:06:07Z",
  "lastTimestamp": "2020-07-22T06:11:34Z"
}
{
  "name": "cluster-storage-operator-5c99fd6bcf.1623fe000c5cbc92",
  "namespace": "openshift-cluster-storage-operator",
  "count": 7,
  "firstTimestamp": "2020-07-22T06:06:09Z",
  "lastTimestamp": "2020-07-22T06:11:31Z"
}
{
  "name": "cluster-image-registry-operator-7d9ffffc9d.1623fe047f6b0998",
  "namespace": "openshift-image-registry",
  "count": 17,
  "firstTimestamp": "2020-07-22T06:06:28Z",
  "lastTimestamp": "2020-07-22T06:11:56Z"
}
{
  "name": "ingress-operator-b46767f65.1623fe0366503bea",
  "namespace": "openshift-ingress-operator",
  "count": 17,
  "firstTimestamp": "2020-07-22T06:06:23Z",
  "lastTimestamp": "2020-07-22T06:11:51Z"
}
{
  "name": "insights-operator-7897b9856b.1623fe00486df264",
  "namespace": "openshift-insights",
  "count": 17,
  "firstTimestamp": "2020-07-22T06:06:10Z",
  "lastTimestamp": "2020-07-22T06:11:37Z"
}
{
  "name": "machine-api-operator-9876cc94b.1623fe06d4224053",
  "namespace": "openshift-machine-api",
  "count": 17,
  "firstTimestamp": "2020-07-22T06:06:38Z",
  "lastTimestamp": "2020-07-22T06:12:06Z"
}
{
  "name": "marketplace-operator-8f55fdd4c.1623fdffb8d2eb0b",
  "namespace": "openshift-marketplace",
  "count": 10,
  "firstTimestamp": "2020-07-22T06:06:07Z",
  "lastTimestamp": "2020-07-22T06:11:34Z"
}
{
  "name": "cluster-monitoring-operator-769bff7884.1623fe00307249fd",
  "namespace": "openshift-monitoring",
  "count": 7,
  "firstTimestamp": "2020-07-22T06:06:09Z",
  "lastTimestamp": "2020-07-22T06:11:32Z"
}
```

We should not be seeing these error because CVO creates the SCC policies and RBAC for operator pods before everything else.

Comment 1 Michal Fojtik 2020-08-28 21:59:26 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 2 Abhinav Dahiya 2020-08-28 22:04:50 UTC
The team has no comment on this bug since the initial report. So marking this stale or even decreasing the priority is incorrect in my opinion as it looks like the team never actually did anything to look into this.

Comment 3 Abhinav Dahiya 2020-08-28 22:07:34 UTC
This is still reproducible in CI from a run from today https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1299416934579703808

EVENTS_URL="https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1299416934579703808/artifacts/e2e-gcp/events.json"
curl -s "$EVENTS_URL" | jq '.items[] | select(.involvedObject.kind == "ReplicaSet") | select (.message | contains("unable to validate against any security context constraint")) | {name: .metadata.name, namespace: .metadata.namespace, count: .count, firstTimestamp: .firstTimestamp, lastTimestamp: .lastTimestamp}' | xclip -sel c -i


{
  "name": "cloud-credential-operator-577ffdc9f8.162f831c7c00ca1b",
  "namespace": "openshift-cloud-credential-operator",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:11Z",
  "lastTimestamp": "2020-08-28T18:54:39Z"
}
{
  "name": "cloud-credential-operator-577ffdc9f8.162f838de9bfd00d",
  "namespace": "openshift-cloud-credential-operator",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "cluster-samples-operator-5b758b645b.162f83a08ec78abb",
  "namespace": "openshift-cluster-samples-operator",
  "count": 10,
  "firstTimestamp": "2020-08-28T18:58:38Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "csi-snapshot-controller-5779f584d5.162f836f9ba3a635",
  "namespace": "openshift-cluster-storage-operator",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:55:08Z",
  "lastTimestamp": "2020-08-28T18:56:30Z"
}
{
  "name": "csi-snapshot-controller-5779f584d5.162f838dea1d4090",
  "namespace": "openshift-cluster-storage-operator",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "cluster-image-registry-operator-795848d64b.162f831e41818a73",
  "namespace": "openshift-image-registry",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:18Z",
  "lastTimestamp": "2020-08-28T18:54:46Z"
}
{
  "name": "cluster-image-registry-operator-795848d64b.162f838de9ee410f",
  "namespace": "openshift-image-registry",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "ingress-operator-577b779c44.162f831dfe85fe1c",
  "namespace": "openshift-ingress-operator",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:17Z",
  "lastTimestamp": "2020-08-28T18:54:45Z"
}
{
  "name": "ingress-operator-577b779c44.162f838de9ddade4",
  "namespace": "openshift-ingress-operator",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "insights-operator-8896789cf.162f831e38bfabfe",
  "namespace": "openshift-insights",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:18Z",
  "lastTimestamp": "2020-08-28T18:54:46Z"
}
{
  "name": "insights-operator-8896789cf.162f838de9dbd32f",
  "namespace": "openshift-insights",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "cluster-autoscaler-operator-65c4959c6f.162f831e2d24414e",
  "namespace": "openshift-machine-api",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:18Z",
  "lastTimestamp": "2020-08-28T18:54:46Z"
}
{
  "name": "cluster-autoscaler-operator-65c4959c6f.162f838dec179200",
  "namespace": "openshift-machine-api",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "machine-api-operator-5cd4899dd7.162f831e92a1135b",
  "namespace": "openshift-machine-api",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:20Z",
  "lastTimestamp": "2020-08-28T18:54:47Z"
}
{
  "name": "machine-api-operator-5cd4899dd7.162f838dec10a363",
  "namespace": "openshift-machine-api",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "marketplace-operator-678cb86df8.162f831ca509a38a",
  "namespace": "openshift-marketplace",
  "count": 7,
  "firstTimestamp": "2020-08-28T18:49:11Z",
  "lastTimestamp": "2020-08-28T18:54:34Z"
}
{
  "name": "marketplace-operator-678cb86df8.162f838ded2cc26d",
  "namespace": "openshift-marketplace",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
{
  "name": "cluster-monitoring-operator-7c486d94fb.162f831c89199357",
  "namespace": "openshift-monitoring",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:11Z",
  "lastTimestamp": "2020-08-28T18:54:39Z"
}
{
  "name": "cluster-monitoring-operator-7c486d94fb.162f838dec19e917",
  "namespace": "openshift-monitoring",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}

Comment 4 Michal Fojtik 2020-08-28 22:59:30 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 5 Stefan Schimanski 2020-08-31 13:11:57 UTC
I have checked all pods with events in https://bugzilla.redhat.com/show_bug.cgi?id=1861917#c3:

  All of them use the "restricted" SCC.

Doesn't look like a coincidence.

Comment 6 Venkata Siva Teja Areti 2020-09-01 22:08:32 UTC
My initial theory on this..

Specific run that I am debugging

origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.6/1299416934579703808/artifacts/e2e-gcp

Event that I am tracing

```
❯ cat all/e2e-gcp/events.json | jq '.items[] | select(.involvedObject.kind == "ReplicaSet") | select (.message | contains("unable to validate against any security context constraint")) |  select(.metadata.namespace == "openshift-monitoring") |{name: .metadata.name, namespace: .metadata.namespace, count: .count, firstTimestamp: .firstTimestamp, lastTimestamp: .lastTimestamp}' 
{
  "name": "cluster-monitoring-operator-7c486d94fb.162f831c89199357",
  "namespace": "openshift-monitoring",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:11Z",
  "lastTimestamp": "2020-08-28T18:54:39Z"
}
{
  "name": "cluster-monitoring-operator-7c486d94fb.162f838dec19e917",
  "namespace": "openshift-monitoring",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
```

Restricted SCC is created at `creationTimestamp: "2020-08-28T18:49:11Z"`

What is the user/service account that created cluster monitoring operator? 
It is created by cluster version operator.

What is the service account used by this operator?
default

What are the SCCs that this service account can access?
restricted

cluster-version-operator uses default service account to create cluster-monitoring operator. Default service account must get restricted scc. cluster monitoring operator deployment has empty security context. For empty security context, the policy as stated in documentation

	When a container or pod does not request a user ID under which it should be run, the effective UID depends on the SCC that emits this pod. Because restricted SCC is granted to all authenticated users by default, it will be available to all users and service accounts and used in most cases. The restricted SCC uses MustRunAsRange strategy for constraining and defaulting the possible values of the securityContext.runAsUser field. The admission plug-in will look for the openshift.io/sa.scc.uid-range annotation on the current project to populate range fields, as it does not provide this range. In the end, a container will have runAsUser equal to the first value of the range that is hard to predict because every project has different ranges.


openshift.io/sa.scc.uid-range annotation is added to cluster-monitoring namespace at `time: "2020-08-28T18:59:21Z"` by cluster-policy-controller.

From above log, the last instance of this failure is seen on the namespace at `"lastTimestamp": "2020-08-28T18:58:40Z"`. The delay in adding the annotation could be causing this message

To verify this, let's check another component.

```
❯ cat all/e2e-gcp/events.json | jq '.items[] | select(.involvedObject.kind == "ReplicaSet") | select (.message | contains("unable to validate against any security context constraint")) |  select(.metadata.namespace == "openshift-ingress-operator") |{name: .metadata.name, namespace: .metadata.namespace, count: .count, firstTimestamp: .firstTimestamp, lastTimestamp: .lastTimestamp}' 
{
  "name": "ingress-operator-577b779c44.162f831dfe85fe1c",
  "namespace": "openshift-ingress-operator",
  "count": 17,
  "firstTimestamp": "2020-08-28T18:49:17Z",
  "lastTimestamp": "2020-08-28T18:54:45Z"
}
{
  "name": "ingress-operator-577b779c44.162f838de9ddade4",
  "namespace": "openshift-ingress-operator",
  "count": 15,
  "firstTimestamp": "2020-08-28T18:57:18Z",
  "lastTimestamp": "2020-08-28T18:58:40Z"
}
```

OpenShift ingress operator is created by cluster-version-operator at `time: "2020-08-28T18:49:17Z"`. There is no security context mentioned in the deployment spec. So, we must look the annotations on the namespace.

the ranges are annotated by the cluster-policy-controller at `"2020-08-28T18:59:22Z"`. That timestamp is after the last time the validation error was seen.

From event filter, I could see the first time cluster-policy-controller became leader is `18:59:11`

Cluster policy controller runs in parallel with kube-controller-manager after bootstrap. 

time="2020-08-28T19:00:39Z" level=debug msg="Bootstrap status: complete"

It looks like cluster-policy-controller was only started when bootstrap was close to complete.

Then why are the clusteroperators created before the bootstrap is complete?

I am guessing it has something to do with this PR? https://github.com/openshift/cluster-version-operator/commit/2a469e37c1c10c7a6cc4dd71ad264eff89913eb5

Comment 7 Maciej Szulik 2020-09-02 14:28:58 UTC
From a slack discussion we've decided to do both of the following:
1. start cluster-policy-controller in bootstrap
2. use internal LB for cluster-policy-controller

1. requires only updating kcm-o to include cpc container
2. requires changing cpc to be able to pass --master and then update kcm-o accordingly.

Comment 8 David Eads 2020-09-04 13:14:00 UTC
still need the last bit to render the container in the bootstrap pod.

Comment 13 RamaKasturi 2020-09-15 10:12:16 UTC
Clearing the need info as the same has been provided in slack, also based on above comment moving the bug back to assigned state.

Comment 15 RamaKasturi 2020-09-21 08:53:34 UTC
Moving the bug back to assigned because i still see a crash in the cluster-policy-controller.log file as below.

Log file created at: 2020/09/21 05:37:13
Running on machine: ip-10-0-11-229
Binary: Built with gc go1.14.7 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0921 05:37:13.562230       1 cmd.go:55] open /etc/kubernetes/config//assets/kube-controller-manager-bootstrap/cpc-config: no such file or directory
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc0001a1301, 0xc0005a4000, 0x95, 0xeb)
        k8s.io/klog/v2@v2.3.0/klog.go:996 +0xb8
k8s.io/klog/v2.(*loggingT).output(0x360a420, 0xc000000003, 0x0, 0x0, 0xc0005822a0, 0x353b96e, 0x6, 0x37, 0x17b6d00)
        k8s.io/klog/v2@v2.3.0/klog.go:945 +0x19d
k8s.io/klog/v2.(*loggingT).printDepth(0x360a420, 0xc000000003, 0x0, 0x0, 0x1, 0xc000697b88, 0x1, 0x1)
        k8s.io/klog/v2@v2.3.0/klog.go:718 +0x15e
k8s.io/klog/v2.(*loggingT).print(...)
        k8s.io/klog/v2@v2.3.0/klog.go:703
k8s.io/klog/v2.Fatal(...)
        k8s.io/klog/v2@v2.3.0/klog.go:1443
github.com/openshift/cluster-policy-controller/pkg/cmd/cluster-policy-controller.NewClusterPolicyControllerCommand.func1(0xc0001482c0, 0xc0000aa0f0, 0x0, 0x5)
        github.com/openshift/cluster-policy-controller/pkg/cmd/cluster-policy-controller/cmd.go:55 +0x3e5
github.com/spf13/cobra.(*Command).execute(0xc0001482c0, 0xc0000aa0a0, 0x5, 0x5, 0xc0001482c0, 0xc0000aa0a0)
        github.com/spf13/cobra@v1.0.0/command.go:846 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xc000148000, 0xc000148000, 0x0, 0x0)
        github.com/spf13/cobra@v1.0.0/command.go:950 +0x349
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.0.0/command.go:887
main.main()
        github.com/openshift/cluster-policy-controller/cmd/cluster-policy-controller/main.go:67 +0x2cc


Note You need to log in before you can comment on or make changes to this bug.