Bug 1970331 - [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] [NEEDINFO]
Summary: [sig-auth][Feature:SCC][Early] should not have pod creation failures during i...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sergiusz Urbaniak
QA Contact: Yash Tripathi
URL:
Whiteboard: tag-ci LifecycleReset
Duplicates: 1928839 1948890 1991628 2010995 (view as bug list)
Depends On:
Blocks: 2017020
 
Reported: 2021-06-10 09:27 UTC by Sergiusz Urbaniak
Modified: 2022-03-10 16:04 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2017020 (view as bug list)
Environment:
job=periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial
Last Closed: 2022-03-10 16:03:59 UTC
Target Upstream Version:
Embargoed:
Flags: mfojtik: needinfo?


Links
Github openshift/apiserver-library-go pull 66 (Merged): Bug 1970331: pkg/securitycontextconstraints: wait for MCS/UID labels on namespaces, bump verbosity on errors (last updated 2021-10-28 22:02:59 UTC)
Github openshift/kubernetes pull 1017 (Merged): Bug 1970331: UPSTREAM: <drop>: bump apiserver-library-go (last updated 2021-10-28 22:03:01 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:04:25 UTC)

Description Sergiusz Urbaniak 2021-06-10 09:27:02 UTC
test: [sig-auth][Feature:SCC][Early] should not have pod creation failures during install

We still see a significant number of failures related to this test:

https://search.ci.openshift.org/?search=unable+to+validate+against+any+security+context+constraint&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Sergiusz Urbaniak 2021-06-18 08:43:44 UTC
raising severity while we are working on it.

Comment 3 Michal Fojtik 2021-07-23 19:23:03 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 4 Eran Cohen 2021-07-28 10:08:31 UTC
This test is still flaky; it seems to fail a lot due to SCC errors.
https://search.ci.openshift.org/?search=should+not+have+pod+creation+failures+during+install&maxAge=168h&context=-1&type=bug%2Bjunit&name=&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job
Removing "LifecycleStale" from the whiteboard so the bug will stay open.

Comment 5 W. Trevor King 2021-08-04 17:46:40 UTC
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=should+not+have+pod+creation+failures+during+install' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort

has too many hits, so restricting to just 4.9:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=should+not+have+pod+creation+failures+during+install' | grep '4[.]9.*failures match' | grep -v 'rehearse-\|pull-ci-' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 63 runs, 83% failed, 10% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 16 runs, 94% failed, 27% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 16 runs, 100% failed, 31% of failures match = 31% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-azure (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi (all) - 17 runs, 71% failed, 17% of failures match = 12% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-ovirt (all) - 11 runs, 73% failed, 25% of failures match = 18% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-upi (all) - 9 runs, 78% failed, 14% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
release-openshift-ocp-installer-e2e-azure-serial-4.9 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.9 (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
release-openshift-ocp-installer-e2e-metal-compact-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
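(For reference, the impact figures work out to failure rate times match rate; e.g. for the first gcp-upgrade row, 83% of runs failed and 10% of those failures matched, so roughly 8% of all runs hit this test.)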

I dunno if this quite rises to "urgent", but it certainly seems like something that needs more attention than this bug is getting today.  And possibly an origin exception to make the test non-fatal in the meantime?

Comment 6 Sergiusz Urbaniak 2021-08-05 07:33:50 UTC
Agreed, raising priority and investigating now. Leaving severity as-is, since we don't have data showing that this renders clusters dysfunctional.

Comment 7 Sergiusz Urbaniak 2021-08-05 12:02:24 UTC
I need to find out whether this is happening at admission time or at validation time as the same error message is used in both places:

1. https://github.com/openshift/apiserver-library-go/blob/9445ab4ce8ed1e914cec6178c70650c799001710/pkg/securitycontextconstraints/sccadmission/admission.go#L109
2. https://github.com/openshift/apiserver-library-go/blob/9445ab4ce8ed1e914cec6178c70650c799001710/pkg/securitycontextconstraints/sccadmission/admission.go#L131

The other strange symptom is that there are actually no underlying errors: in all cases I saw, the error slice passed to NewForbidden was empty ("[]").

Comment 8 Sergiusz Urbaniak 2021-08-05 12:04:12 UTC
What is apparent from both code blocks, though, is that no check for the presence of an error or for the length of the error slice is performed. Maybe there should be an `if err != nil` or `if len(errs) > 0` check before that, but I need to verify.
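To make the suggested guard concrete, here is a minimal, self-contained Go sketch; the function and variable names are hypothetical and a plain error stands in for admission.NewForbidden, so this is not the actual admission.go code:

package main

import (
	"errors"
	"fmt"
)

// admit stands in for the admission path that aggregates per-provider
// validation errors before building the Forbidden response.
func admit(allowed bool, validationErrs []error) error {
	if allowed {
		return nil
	}
	// The check suspected to be missing: without it, an empty error slice
	// is rendered as "[]" in the Forbidden message and hides the real cause.
	if len(validationErrs) == 0 {
		return errors.New("forbidden: no SCC admitted the pod, but no provider reported an error (providers may have been dropped earlier)")
	}
	return fmt.Errorf("forbidden: unable to validate against any security context constraint: %v", validationErrs)
}

func main() {
	fmt.Println(admit(false, nil)) // empty slice -> explicit message instead of "[]"
	fmt.Println(admit(false, []error{errors.New(`provider "restricted": not usable`)}))
}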

Comment 9 Sergiusz Urbaniak 2021-08-09 14:47:47 UTC
*** Bug 1991628 has been marked as a duplicate of this bug. ***

Comment 10 Sergiusz Urbaniak 2021-08-11 09:55:04 UTC
Further analysis shows that these failures

a) all reconcile at the end
b) happen mostly during bootstrap time

The problem is that we don't get audit logs from the bootstrap apiserver in e2e runs.

Late creation of default SCCs was ruled out, as the creation timestamps are 10 minutes before the actual failures. The optimization has already been merged as part of https://github.com/openshift/cluster-kube-apiserver-operator/pull/1049.

The current hypothesis is that the SCCs are present, but role bindings take time to take effect, causing delays in SCC admission, especially at bootstrap time.
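One way to probe that hypothesis on a live cluster would be to ask the apiserver directly whether a given service account is already allowed to "use" an SCC. A rough client-go sketch follows; the openshift-ingress/router service account and the "restricted" SCC are just example names, not taken from a specific failure:

package main

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask whether the example service account may "use" the restricted SCC,
	// i.e. whether the relevant role bindings have taken effect yet.
	sar := &authv1.SubjectAccessReview{
		Spec: authv1.SubjectAccessReviewSpec{
			User: "system:serviceaccount:openshift-ingress:router",
			ResourceAttributes: &authv1.ResourceAttributes{
				Group:    "security.openshift.io",
				Resource: "securitycontextconstraints",
				Name:     "restricted",
				Verb:     "use",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(
		context.TODO(), sar, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("allowed=%v reason=%q\n", resp.Status.Allowed, resp.Status.Reason)
}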

Comment 11 Sergiusz Urbaniak 2021-08-16 12:13:22 UTC
sprint review: this issue is being worked on actively and prioritized

Comment 12 Sergiusz Urbaniak 2021-08-16 12:49:30 UTC
*** Bug 1948890 has been marked as a duplicate of this bug. ***

Comment 13 Sergiusz Urbaniak 2021-08-16 12:55:08 UTC
*** Bug 1928839 has been marked as a duplicate of this bug. ***

Comment 14 Gabe Montero 2021-08-17 14:54:49 UTC
If it helps, I have a PR-related failure in openshift/origin on e2e-gcp hitting this:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26399/pull-ci-openshift-origin-master-e2e-gcp/1427618592735629312

log snippets:

fail [github.com/openshift/origin/test/extended/authorization/scc.go:73]: 3 pods failed before test on SCC errors
Error creating: pods "router-default-7b5df4f667-" is forbidden: unable to validate against any security context constraint: [provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for ReplicaSet.apps/v1/router-default-7b5df4f667 -n openshift-ingress happened 13 times
Error creating: pods "router-default-7b5df4f667-" is forbidden: unable to validate against any security context constraint: [provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for ReplicaSet.apps/v1/router-default-7b5df4f667 -n openshift-ingress happened 14 times
Error creating: pods "router-default-7b5df4f667-" is forbidden: unable to validate against any security context constraint: [provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for ReplicaSet.apps/v1/router-default-7b5df4f667 -n openshift-ingress happened 11 times

Comment 16 Stephen Benjamin 2021-08-31 17:09:06 UTC
I opened a PR to relax the threshold:
  https://github.com/openshift/origin/pull/26437

It seems about 5% of jobs are hitting this. If you look at Sippy, upgrade jobs seem to hit it more often:

https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-auth][Feature:SCC][Early]%20should%20not%20have%20pod%20creation%20failures%20during%20install%20[Suite:openshift/conformance/parallel]

Comment 17 Stephen Benjamin 2021-08-31 17:12:17 UTC
Bugzilla does the wrong thing with that Sippy URL; the trailing ] is part of it. https://bit.ly/38toDPu is the above, shortened.

Comment 18 Sergiusz Urbaniak 2021-09-03 11:58:43 UTC
we're working actively on this issue

Comment 19 Michal Fojtik 2021-10-03 12:30:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 21 Sergiusz Urbaniak 2021-10-06 09:00:21 UTC
*** Bug 2010995 has been marked as a duplicate of this bug. ***

Comment 22 Sergiusz Urbaniak 2021-10-20 14:32:30 UTC
We finally have a good grip on, and reasoning for, what is happening:

1. SCC admission internally converts SCC resources into internal types named "providers".
2. During that conversion, errors can occur because API calls have to be executed, and these calls may fail, especially in early cluster bootstrap phases.
3. If an error occurs, it is ignored and the associated SCC is also dropped from pod admission.

This explains why some SCCs, e.g. restricted, despite being present in the system, are not considered for admission.
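To make the failure mode concrete, here is a toy reproduction of that control flow in Go; the types and names are stand-ins, not the real apiserver-library-go code (per the PR linked above, the actual fix is to wait for the namespace MCS/UID labels instead of silently dropping the SCC):

package main

import (
	"errors"
	"fmt"
)

type scc struct{ name string }
type provider struct{ name string }

// newProvider mimics the SCC-to-provider conversion, which may need API data
// (e.g. namespace MCS/UID annotations) that is not yet available early in
// cluster bootstrap.
func newProvider(c scc, apiCallFails bool) (provider, error) {
	if apiCallFails {
		return provider{}, errors.New("namespace labels not available yet")
	}
	return provider{name: c.name}, nil
}

// providersFor shows the problematic pattern: a conversion error is dropped
// and the SCC silently disappears from admission.
func providersFor(sccs []scc, apiCallFails bool) []provider {
	var out []provider
	for _, c := range sccs {
		p, err := newProvider(c, apiCallFails)
		if err != nil {
			continue // error ignored; even "restricted" can no longer admit the pod
		}
		out = append(out, p)
	}
	return out
}

func main() {
	sccs := []scc{{"restricted"}, {"privileged"}}
	fmt.Println("during bootstrap:", providersFor(sccs, true))  // [] -> pod is forbidden with no underlying errors
	fmt.Println("after bootstrap: ", providersFor(sccs, false)) // both providers present
}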

Comment 23 Michal Fojtik 2021-10-21 14:38:37 UTC
The LifecycleStale keyword was removed because the bug moved to QE.
The bug assignee was notified.

Comment 26 Xingxing Xia 2021-10-28 02:21:05 UTC
We will try to verify it today.

Comment 27 Yash Tripathi 2021-10-28 11:20:56 UTC
Verified using https://search.ci.openshift.org

Comment 32 errata-xmlrpc 2022-03-10 16:03:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

