1597867 – [WIP] PVCs with large numbers of files take significant time to attach and/or cause pod init to timeout

Keywords:
Status:	CLOSED DUPLICATE of bug 1459106
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Containers
Sub Component:
Version:	3.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Mrunal Patel
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-07-03 18:02 UTC by Mike McLane
Modified:	2018-07-03 19:06 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-07-03 19:06:59 UTC
Target Upstream Version:
Embargoed:

Description Mike McLane 2018-07-03 18:02:17 UTC

[Bug details work in progress] -- (collecting data around timings and generating a reproducer)

Description of problem:
When a network-attached PVC contains tens of thousands of files, the time required to attach the PVC to a pod increases significantly as the file count grows. This can lead to pod initialization timeouts.

On OpenShift.io, IDE (Che) workspaces commonly contain tens of thousands of files. We have observed that with <10k files on the volume, the Che pod starts successfully with no additional pod start parameters. With >=30k files, the Che pod fails to start because the mount time causes an init timeout.

While mounting a gluster-subvol volume, we worked with the storage team to observe the operations performed on the volume during attachment. There appears to be a recursive ownership change on every attach/mount event.

Version-Release number of selected component (if applicable):

$ oc version
oc v3.9.14
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.starter-us-east-2.openshift.com:443
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657

How reproducible:
Every time.

Steps to Reproduce:
1. Start a pod (without a deployment config) with an attached PVC containing 35,000 files.

Actual results:
Pod initialization times out

Expected results:
Pod initialization will succeed

Additional info:

It looks like the Kubernetes volume code that applies ownership and permissions to PVC contents is here [1]. When a deployment config is used to start the pod, the replication controller/deploy pods appear to allow recovery from longer pod initialization times, leading to more successful spin-ups.

[1] https://github.com/kubernetes/kubernetes/blob/692b34825f4e505b403c063270d1e007ee139ea8/pkg/volume/volume_linux.go#L35-L91

Comment 1 Mrunal Patel 2018-07-03 19:06:59 UTC


*** This bug has been marked as a duplicate of bug 1459106 ***
