Description of problem:
When a pod mounts a volume containing a very large number of files, the startup times out.
Version-Release number of selected component (if applicable):
all, including the latest OCP 3.6.x
Steps to Reproduce:
1. Let a pod create a very large number of files on a mounted volume; at the customer this broke on AWS at about 8M files using gp2 storage.
2. Force a reschedule of the pod, so that the volume gets mounted into a new pod.
3. Monitor what happens:
- with an increasing number of files in the volume, the startup time increases
- beyond a certain number of files in the volume, the pod starts to fail to start up
Expected results: there should be no number of files in a volume that causes a pod to fail during startup.
Because every pod is started in a new namespace, and the user ID of the namespace differs on every start, the function "SetVolumeOwnership" in the volume package gets called. It performs a recursive "chown" and "chmod" on the volume. On file systems containing a large number of files this causes a huge I/O load and takes a lot of time.
Beyond a certain number of files, this takes longer than the startup timeout for the pod/container.
The proposal PR:
It doesn't seem to be getting much attention. One reason is that while we may be able to work around the recursive chown-ing, there is a similar issue with SELinux relabeling, which is done by the container runtime (docker) and which we have no control over. That means the proposal is still not a complete remedy for the problem.
A tunable timeout with exponential backoff, or something similar, is the only thing that might mitigate the issue.
I ran some tests to be sure: the pod events show the timeout messages while the volume files are being chowned, but once that is done the volume mount succeeds and the pod starts.
I understand this is really inconvenient; however, it's worth pointing out that the proposal from comment #15 would also mean the user has to wait for some other (init) container to do the work, albeit asynchronously.
I can try to add some more events ("Still changing file ownership, please wait") so that the user is at least informed about what is going on. But a generic solution that would not traverse the file system and would still ensure the files have proper ownership and labels, without having to wait... I simply have no idea how I would do that.
There might be no generic solution.
So we should at least:
- enhance the logging to give good pointers on where the time is spent
- enhance the documentation to explain to our customers that this can happen and how to tune around it
Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/61550
*** Bug 1761938 has been marked as a duplicate of this bug. ***
*** Bug 1725275 has been marked as a duplicate of this bug. ***
We're tracking this issue in our JIRA, https://jira.coreos.com/browse/STOR-267. It requires an API change and must go through the alpha/beta/GA process upstream. For the time being we do not have a really useful workaround; the best option is not to use fsGroup in pods that use volumes with a large number of files.
Good news: we have Kubernetes enhancement merged: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-skip-permission-change.md
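Assuming the enhancement is implemented as described in the KEP (the field is new and its exact shape may change while it is alpha), a pod would be able to opt out of the full recursive walk with a spec along these lines; the pod/volume names and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: large-volume-pod
spec:
  securityContext:
    fsGroup: 1000
    # From the skip-permission-change KEP: only perform the recursive
    # chown/chmod when the ownership of the volume root does not already
    # match fsGroup, instead of on every mount.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: big-data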
Bad news: it will take some time to implement, as it probably needs to go through the alpha/beta stages.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.