Bug 1817099 - 4.4. e2e flake: unable to validate against any security context constraint when creating pods
Keywords:
Status: CLOSED DUPLICATE of bug 1820687
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Maciej Szulik
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-25 15:02 UTC by Gabe Montero
Modified: 2020-05-11 16:23 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-11 16:23:33 UTC
Target Upstream Version:



Description Gabe Montero 2020-03-25 15:02:04 UTC
Description of problem:

I've broken this off from https://bugzilla.redhat.com/show_bug.cgi?id=1803956 because my triage to date has turned up enough discrepancies with the debugging there to make me think this is a separate issue.

Intermittently, some image-ecosystem e2e tests fail to create pods with an SCC-related error.

Latest incarnation:  https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-samples-operator/242/pull-ci-openshift-cluster-samples-operator-release-4.4-e2e-aws-image-ecosystem/5

One of the failures:

[image_ecosystem][Slow] openshift images should be SCL enabled returning s2i usage when running the image "centos/ruby-23-centos7" should print the usage [Suite:openshift] (11s)
fail [github.com/openshift/origin/test/extended/image_ecosystem/scl.go:38]: Unexpected error:
    <*errors.StatusError | 0xc0018caa00>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "pods \"test-pod-bae89b70-329c-48b1-b945-696d33657490\" is forbidden: unable to validate against any security context constraint: []",
            Reason: "Forbidden",
            Details: {
                Name: "test-pod-bae89b70-329c-48b1-b945-696d33657490",
                Group: "",
                Kind: "pods",
                UID: "",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
    pods "test-pod-bae89b70-329c-48b1-b945-696d33657490" is forbidden: unable to validate against any security context constraint: []
occurred
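
Side note for anyone trying to catch this class of failure programmatically: the rejection surfaces to clients as a standard Forbidden StatusError (as dumped above), so it can be matched with the apimachinery helpers instead of string comparison. A minimal, self-contained Go sketch of mine (not from the e2e code) that reconstructs the error shape:

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Reconstruct the shape of the failure above: SCC admission rejections
	// come back as a *errors.StatusError with Reason=Forbidden and code 403.
	err := apierrors.NewForbidden(
		schema.GroupResource{Resource: "pods"},
		"test-pod-bae89b70-329c-48b1-b945-696d33657490",
		fmt.Errorf("unable to validate against any security context constraint: []"),
	)

	// A test (or controller) can distinguish this class of error with the
	// standard helper rather than matching on the message text. Note the
	// empty list ("[]") in the message: no SCCs matched (or were visible
	// to) the admission plugin at all.
	if apierrors.IsForbidden(err) {
		fmt.Println("forbidden by admission:", err)
	}
}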

Version-Release number of selected component (if applicable):

To date, I have only seen/noticed this on 4.4 e2e runs.

How reproducible:

intermittent 


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


So, in making another attempt to cross-reference this bug against the 4.4 e2e flakes using Standa's analysis at https://bugzilla.redhat.com/show_bug.cgi?id=1803956#c10, I discovered that namespace_scc_allocation_controller.go was moved to github.com/openshift/cluster-policy-controller in 4.3.

When I look in that pod's logs, I see 

I0313 16:42:43.021643       1 cert_rotation.go:137] Starting client certificate rotation controller
I0313 16:42:43.023554       1 policy_controller.go:41] Starting controllers on 0.0.0.0:10357 (v0.0.0-unknown)
I0313 16:42:43.028080       1 standalone_apiserver.go:103] Started health checks at 0.0.0.0:10357
I0313 16:42:43.028252       1 leaderelection.go:242] attempting to acquire leader lease  openshift-kube-controller-manager/cluster-policy-controller...
E0313 16:42:45.999535       1 leaderelection.go:331] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: configmaps "cluster-policy-controller" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "openshift-kube-controller-manager"

So presumably it did not get far enough along to even reach the problem Standa saw.
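
As a way to confirm the RBAC gap that leader-election error points at, the apiserver can be asked directly via a SubjectAccessReview. A minimal diagnostic sketch of mine, assuming a recent client-go and a kubeconfig allowed to create access reviews; the user, namespace, and configmap names are taken from the log above:

package main

import (
	"context"
	"fmt"
	"log"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (assumed to have rights
	// to create access reviews).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Ask the apiserver the exact question the leader-election error
	// implies: can system:kube-controller-manager get the
	// cluster-policy-controller configmap in
	// openshift-kube-controller-manager?
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: "system:kube-controller-manager",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Namespace: "openshift-kube-controller-manager",
				Verb:      "get",
				Resource:  "configmaps",
				Name:      "cluster-policy-controller",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(
		context.TODO(), sar, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("allowed=%v reason=%q\n", resp.Status.Allowed, resp.Status.Reason)
}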

Taking a stab at the "openshift-kube-controller-manager" pods, with the disclaimer that I've never looked at those before, it would seem there is some host-level pain. I see variants in the different logs, but as an example, from openshift-kube-controller-manager_kube-controller-manager-ip-10-0-143-82.us-west-2.compute.internal_kube-controller-manager.log:


W0313 16:43:32.713660       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-146-44.us-west-2.compute.internal" does not exist
W0313 16:43:32.713676       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-158-56.us-west-2.compute.internal" does not exist
W0313 16:43:32.713686       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-135-140.us-west-2.compute.internal" does not exist
W0313 16:43:32.713695       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-136-197.us-west-2.compute.internal" does not exist
W0313 16:43:32.713704       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-140-104.us-west-2.compute.internal" does not exist
W0313 16:43:32.713717       1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="ip-10-0-143-82.us-west-2.compute.internal" does not exist

Lastly, I asked in various Slack channels how to debug "unable to validate against any security context constraint" errors, and David Eads said adding a dump of the SCCs on test failure would help.
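
For illustration, here is a minimal Go sketch of mine of the kind of dump that suggestion implies (not the PR's actual code): list every SecurityContextConstraints object, which is cluster-scoped, so a failing test records what the admission plugin could have matched against. It assumes a recent client-go and openshift/client-go, plus a kubeconfig with rights to read SCCs:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	securityclient "github.com/openshift/client-go/security/clientset/versioned"
)

func main() {
	// Build the OpenShift security client from the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := securityclient.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List every SCC in the cluster. An empty list here at the moment of a
	// pod rejection would line up with the "[]" in the error message above.
	sccs, err := client.SecurityV1().SecurityContextConstraints().List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, scc := range sccs.Items {
		prio := "<none>"
		if scc.Priority != nil {
			prio = fmt.Sprintf("%d", *scc.Priority)
		}
		fmt.Printf("SCC %s (priority=%s)\n", scc.Name, prio)
	}
}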

I have PR https://github.com/openshift/origin/pull/24703 up, but it is awaiting another round of review/approval.

Comment 1 Maciej Szulik 2020-05-11 16:23:33 UTC

*** This bug has been marked as a duplicate of bug 1820687 ***

