Bug 1515907
Summary: | "Unable to mount volume" for volume containing large number of files | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Carsten Lichy-Bittendorf <clichybi> |
Component: | Storage | Assignee: | aos-storage-staff <aos-storage-staff> |
Storage sub component: | Storage | QA Contact: | Wei Duan <wduan> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | low | ||
Priority: | low | CC: | aos-bugs, aos-storage-staff, bmilne, chaoyang, clichybi, ekuric, erich, fshaikh, hekumar, jlee, jsafrane, Mathias.Merscher, pdwyer, rhowe, srangana, tidawson |
Version: | 3.6.1 | ||
Target Milestone: | --- | ||
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Known Issue |
Doc Text: |
Cause:
Whenever a pod mounts a volume with the FSGroup SecurityContext set, the GID ownership must be updated recursively for all files on the volume.
Consequence:
The ownership change takes time, and for volumes with a very large number of files it may mean the pod takes a long time to start.
Workaround (if any):
No workaround is known yet.
Result:
Pods using volumes with a large number of files and the FSGroup SecurityContext setting may take a very long time to start.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-07-13 17:11:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Carsten Lichy-Bittendorf
2017-11-21 15:14:34 UTC
The proposal PR: https://github.com/kubernetes/community/pull/1717

It doesn't seem to get much attention. One of the reasons is that while we may be able to work around the problem with recursive chown-ing, there is a similar issue with SELinux relabelling, which is done by the container runtime (docker) and over which we have no control. That means the proposal is still not a complete remedy for the problem. A tweakable timeout with exponential backoff or something similar is the only thing that might mitigate the issue.

I ran some tests to be sure: the pod events show the timeout messages while the volume files are being chowned, but once this is done the volume mount succeeds and the pod starts...

I understand this is really inconvenient; however, it's good to point out that the proposal from comment #15 would also mean the user would have to wait for some other (init) container to do the work (albeit asynchronously). I can try to add some more events ("Still changing file ownership, please wait") which would at least keep the user informed about what is going on. But a generic solution that would not traverse the fs and still make sure the files have proper ownership and labels, without having to wait... I simply have no idea how I would do that.

There might be no generic solution. So we should at least go for:
- enhance the logging to give good pointers on where the time gets consumed
- enhance the documentation to explain to our customers that this can happen and how to tune around it
m2c

Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/61550

*** Bug 1761938 has been marked as a duplicate of this bug. ***

*** Bug 1725275 has been marked as a duplicate of this bug. ***

We're tracking this issue in our JIRA, https://jira.coreos.com/browse/STOR-267. It requires an API change and must go through the alpha/beta/GA process upstream. For the time being, we do not have a really useful workaround; the best is not to use fsGroup in pods that use volumes with a large number of files.

Good news: we have the Kubernetes enhancement merged: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-skip-permission-change.md
Bad news: it will take some time to implement, as it probably needs to go through the alpha/beta stages.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
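For reference, a minimal sketch of the pod configuration that triggers the behavior described in the Doc Text above: setting `securityContext.fsGroup` makes the kubelet recursively change the group ownership (and permissions) of every file on the mounted volume before the containers start. The pod, image, and claim names below are illustrative and not taken from the reporter's environment.

```yaml
# Illustrative pod spec: fsGroup is what triggers the recursive ownership change
# described in this bug. All object names here are made up for the example.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example
spec:
  securityContext:
    fsGroup: 2000        # kubelet recursively chgrp/chmods every file on the volume to GID 2000
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: big-volume-claim   # a claim holding a very large number of files makes startup slow
```

On a volume with millions of small files this recursive walk is what produces the "Unable to mount volume" timeout events; omitting fsGroup (the interim advice above) avoids the walk entirely.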
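And a sketch of how the merged skip-permission-change enhancement is expected to be used once implemented: a pod-level `fsGroupChangePolicy` of `OnRootMismatch` lets the kubelet skip the recursive ownership change when the volume's root already has the expected GID. The field name comes from that KEP; whether it is honored depends on the cluster's Kubernetes version, feature gates, and volume type, and the object names are again illustrative.

```yaml
# Sketch of the mitigation described in the skip-permission-change KEP linked above;
# availability depends on the Kubernetes version and feature gates of the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-onrootmismatch-example
spec:
  securityContext:
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch   # skip the recursive walk if the volume root already matches; "Always" keeps the old behavior
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: big-volume-claim
```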