Bug 1995779

Summary: pipelines-scc fsGroup.type is set to MustRunAs, which is causing Pipeline pod timeouts. It should be set to RunAsAny
Product: Red Hat OpenShift Pipelines
Component: Operator
Version: 1.4
Reporter: Novonil Choudhuri <nchoudhu>
Assignee: Nikhil Thomas <nikthoma>
QA Contact: Nobody <nobody>
CC: kbaig, pgarg, pradkuma, sashture, shmingla, smukhade, sosarkar, vdemeest
Status: CLOSED WORKSFORME
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2022-10-27 09:53:08 UTC

Description Novonil Choudhuri 2021-08-19 18:21:42 UTC
Description of problem: 

Users are reporting issues wherein pipelines randomly fail as a result of volume mounts of type "ReadWriteMany" failing. Further investigation surfaced this error in the pod logs:

 

"Aug 18 14:57:48 pd103-7h7tj-worker-f-2ft2l hyperkube[2771]: W0818 14:57:48.856233    2771 volume_linux.go:51] Setting volume ownership for /var/lib/kubelet/pods/ddee86f0-e50a-4af0-a794-b7200c7e99e8/volumes/kubernetes.io~portworx-volume/pvc-a6dbd1bd-1c6f-42af-98f1-7c8bf3a316b7 and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699"

 

This is happening because Kubernetes/OpenShift tries to recursively chmod/chown the volume contents to match the random UID/GID assigned by OpenShift, since the "pipelines-scc" has "fsGroup.type" set to "MustRunAs" instead of "RunAsAny".
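
For reference, the relevant part of the SCC looks roughly like the fragment below. This is a sketch from memory of the shipped "pipelines-scc", showing only the two fields discussed here (other fields omitted), not a verbatim copy:

  apiVersion: security.openshift.io/v1
  kind: SecurityContextConstraints
  metadata:
    name: pipelines-scc
  fsGroup:
    type: MustRunAs          # forces a GID onto the pod, so the kubelet recursively chowns the volume
  supplementalGroups:
    type: RunAsAny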

 

It looks like Kubernetes/OpenShift allows a fixed time of 2 minutes for a volume to mount, and whenever an fsGroup is set it performs a recursive chmod+chown of the volume; so if the volume contains many files, we see volume mount timeouts.

 

- https://github.com/kubernetes/kubernetes/issues/69699

- https://github.com/openshift/origin/blob/release-4.7/vendor/k8s.io/kubernetes/pkg/kubelet/volumemanager/volume_manager.go#L71-L80

- https://access.redhat.com/solutions/4900491

- https://bugzilla.redhat.com/show_bug.cgi?id=1503906

 

This issue does not occur if we switch to a PVC with accessMode set to "ReadWriteOnce"; not only that, the pipeline also completes substantially faster, since the "pipelines-scc" has "supplementalGroups.type" set to "RunAsAny".
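
As a workaround, a PVC along these lines is what we switched to (illustrative sketch; the name, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: pipeline-workspace     # hypothetical name
  spec:
    accessModes:
      - ReadWriteOnce            # avoids the slow recursive ownership change seen with ReadWriteMany here
    resources:
      requests:
        storage: 1Gi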

 

This leads me to the conclusion that the "pipelines-scc" installed by the OpenShift Pipelines operator must be updated to "fsGroup.type=RunAsAny".
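
Concretely, the proposed change would be along these lines (illustrative fragment only, not a tested patch):

  fsGroup:
    type: RunAsAny             # stop forcing a recursive chmod/chown of mounted volumes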

 

Version-Release number of selected component (if applicable): 

OpenShift v4.7.21
OpenShift Pipelines v1.4


How reproducible: 

1. Run parallel Pipeline tasks on a "ReadWriteMany" volume, with some of the tasks producing many files that are chmod/chown'ed by the running pods.
2. Kubernetes/OpenShift tries to recursively chmod/chown the volume to match the random UID/GID assigned by OpenShift, since the "pipelines-scc" has "fsGroup.type" set to "MustRunAs" instead of "RunAsAny".
3. The error shown in the description appears in the event logs.


Actual results: Pipeline failures as described above.


Expected results: "pipelines-scc" installed by OpenShift-Pipelines operator must updated to "fsGroup.type=RunAsAny".

Additional info: