Bug 1995779

Summary: pipelines-scc fsGroup.type is set to MustRunAs, which is causing Pipeline pod timeouts. It should be set to RunAsAny
Product: Red Hat OpenShift Pipelines
Component: Operator
Version: 1.4
Reporter: Novonil Choudhuri <nchoudhu>
Assignee: Nikhil Thomas <nikthoma>
QA Contact: Nobody <nobody>
CC: kbaig, pgarg, pradkuma, sashture, shmingla, smukhade, sosarkar, vdemeest
Status: CLOSED WORKSFORME
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2022-10-27 09:53:08 UTC

Description Novonil Choudhuri 2021-08-19 18:21:42 UTC
Description of problem: 

Users are reporting issues wherein pipelines randomly fail as a result of volume mounts of type "ReadWriteMany" failing. Further investigation surfaced this error in the pod logs:

 

"Aug 18 14:57:48 pd103-7h7tj-worker-f-2ft2l hyperkube[2771]: W0818 14:57:48.856233    2771 volume_linux.go:51] Setting volume ownership for /var/lib/kubelet/pods/ddee86f0-e50a-4af0-a794-b7200c7e99e8/volumes/kubernetes.io~portworx-volume/pvc-a6dbd1bd-1c6f-42af-98f1-7c8bf3a316b7 and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699"

 

This is happening because Kubernetes/OpenShift tries to recursively chmod/chown the volume contents to match the random UID/GID assigned by OpenShift, since the "pipelines-scc" has "fsGroup.type" set to "MustRunAs" instead of "RunAsAny".
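
For reference, the relevant part of the SCC looks roughly like the fragment below. This is a sketch from memory of the shipped "pipelines-scc", showing only the two fields discussed here (other fields omitted), not a verbatim copy:

  apiVersion: security.openshift.io/v1
  kind: SecurityContextConstraints
  metadata:
    name: pipelines-scc
  fsGroup:
    type: MustRunAs          # forces a GID onto the pod, so the kubelet recursively chowns the volume
  supplementalGroups:
    type: RunAsAny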

 

It looks like Kubernetes/OpenShift allows a fixed time of 2 minutes for a volume to mount, and whenever an fsGroup is set it performs a recursive chmod+chown of the volume; so if the volume contains many files, we see volume mount timeouts.

 

- https://github.com/kubernetes/kubernetes/issues/69699

- https://github.com/openshift/origin/blob/release-4.7/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/volume_manager.go#L71-L80

- https://access.redhat.com/solutions/4900491

- https://bugzilla.redhat.com/show_bug.cgi?id=1503906

 

This issue does not occur if we switch to a PVC with accessMode set to "ReadWriteOnce"; not only that, the pipeline also completes substantially faster, since the "pipelines-scc" has "supplementalGroups.type" set to "RunAsAny".
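
As a workaround, a PVC along these lines is what we switched to (illustrative sketch; the name, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: pipeline-workspace     # hypothetical name
  spec:
    accessModes:
      - ReadWriteOnce            # avoids the slow recursive ownership change seen with ReadWriteMany here
    resources:
      requests:
        storage: 1Gi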

 

This leads me to the conclusion that the "pipelines-scc" installed by the OpenShift Pipelines operator must be updated to "fsGroup.type=RunAsAny".
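
Concretely, the proposed change would be along these lines (illustrative fragment only, not a tested patch):

  fsGroup:
    type: RunAsAny             # stop forcing a recursive chmod/chown of mounted volumes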

 

Version-Release number of selected component (if applicable): 

OpenShift v4.7.21
OpenShift Pipelines v1.4


How reproducible: 

1. Run parallel Pipeline tasks on a "ReadWriteMany" volume, with some of the tasks producing many files that are chmod/chown'ed by the running pods.
2. Kubernetes/OpenShift tries to recursively chmod/chown the volume to match the random UID/GID assigned by OpenShift, since the "pipelines-scc" has "fsGroup.type" set to "MustRunAs" instead of "RunAsAny".
3. The error shown in the description appears in the event logs.


Actual results: Pipeline failures as described above.


Expected results: "pipelines-scc" installed by OpenShift-Pipelines operator must updated to "fsGroup.type=RunAsAny".

Additional info: