Description of problem:
A persistent volume formatted with ext4 (-T option) contains a large number of files (>6,000,000). OSE is not able to launch pods with this volume attached. The issue appears even if the application is not accessing the storage - it is just mounted.

Version-Release number of selected component (if applicable): 3.4

How reproducible:

Steps to Reproduce:
1. Create a PV similar to:

# oc export pv pv-vol-05fad0544281b9a79
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: null
  labels:
    failure-domain.beta.kubernetes.io/region: eu-west-1
    failure-domain.beta.kubernetes.io/zone: eu-west-1c
  name: pv-vol-05fad0544281b9a79
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://eu-west-1c/vol-05fad0544281b9a79
  capacity:
    storage: 400Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-vol-05fad0544281b9a79
    namespace: small-files
    resourceVersion: "23371"
    uid: 3be1c9c1-49f1-11e7-a3dc-06578a1a613c
  persistentVolumeReclaimPolicy: Retain
status: {}

2. Create files on the storage:

# for i in {1..10000000}; do echo $i > ./$i.txt; done

3. Scale up a deployment with this volume attached. The system takes several minutes to start the application. The following is reported in the events:

Events:
  FirstSeen  LastSeen  Count  From                                                    SubobjectPath              Type     Reason       Message
  ---------  --------  -----  ----                                                    -------------              ----     ------       -------
  13m        13m       1      {default-scheduler }                                                               Normal   Scheduled    Successfully assigned test-app-3-eyhvf to ip-172-31-27-242.eu-west-1.compute.internal
  11m        6m        3      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}                              Warning  FailedMount  Unable to mount volumes for pod "test-app-3-eyhvf_small-files(edc7aab8-49f9-11e7-a3dc-06578a1a613c)": timeout expired waiting for volumes to attach/mount for pod "test-app-3-eyhvf"/"small-files". list of unattached/unmounted volumes=[small-files]
  11m        6m        3      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}                              Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "test-app-3-eyhvf"/"small-files". list of unattached/unmounted volumes=[small-files]
  6m         6m        1      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}   spec.containers{test-app}  Normal   Pulling      pulling image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
  6m         6m        1      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}   spec.containers{test-app}  Normal   Pulled       Successfully pulled image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
  6m         6m        1      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}   spec.containers{test-app}  Normal   Created      Created container with docker id 3b229a3c2095; Security:[seccomp=unconfined]
  4m         4m        1      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}   spec.containers{test-app}  Warning  Failed       Failed to start container with docker id 3b229a3c2095 with error: operation timeout: context deadline exceeded
  4m         4m        1      {kubelet ip-172-31-27-242.eu-west-1.compute.internal}                              Warning  FailedSync   Error syncing pod, skipping: failed to "StartContainer" for "test-app" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"

Actual results:

Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
Most likely, IMO, it is the recursive chown + chmod for fsGroup (which ensures the pod can write to the volume) that is causing this slowdown. Can you please provide the output of 'oc get pod -o yaml' to confirm whether the pod has an fsGroup set?
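For context, when fsGroup is set, the volume plugin does a recursive ownership/permission pass over the entire volume at mount time. Roughly speaking (an illustrative sketch only, not the exact kubelet code; the path and placeholders are hypothetical):

  VOLDIR=/var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~aws-ebs/<pv-name>
  chgrp -R <fsGroup> "$VOLDIR"        # every file gets the pod's fsGroup
  chmod -R g+rwX "$VOLDIR"            # group read/write, setgid on directories

With millions of inodes, each of these passes has to touch every file, so the time grows with the file count rather than the data size.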
=== In Red Hat Customer Portal Case 01862362 ===
--- Comment by Koksharov, Alexander on 6/7/2017 9:49 AM ---

Yes, fsGroup is set for the pod, but it is not set in the dc.

# oc get dc test-app | grep -i fsGroup
#

# oc get pod test-app-3-2qb48 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"small-files","name":"test-app-3","uid":"601c90f0-49f1-11e7-a3dc-06578a1a613c","apiVersion":"v1","resourceVersion":"66439"}}
    openshift.io/container.test-app.image.entrypoint: '["container-entrypoint","/bin/sh","-c","$STI_SCRIPTS_PATH/usage"]'
    openshift.io/deployment-config.latest-version: "3"
    openshift.io/deployment-config.name: test-app
    openshift.io/deployment.name: test-app-3
    openshift.io/generated-by: OpenShiftNewApp
    openshift.io/scc: restricted
  creationTimestamp: 2017-06-07T07:44:59Z
  generateName: test-app-3-
  labels:
    app: test-app
    deployment: test-app-3
    deploymentconfig: test-app
  name: test-app-3-2qb48
  namespace: small-files
  resourceVersion: "66506"
  selfLink: /api/v1/namespaces/small-files/pods/test-app-3-2qb48
  uid: 34f92c81-4b55-11e7-a116-06578a1a613c
spec:
  containers:
  - image: 172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    imagePullPolicy: Always
    name: test-app
    ports:
    - containerPort: 8080
      protocol: TCP
    resources: {}
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
        - SYS_CHROOT
      privileged: false
      runAsUser: 1000070000
      seLinuxOptions:
        level: s0:c8,c7
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /data
      name: small-files
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-ooelj
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: default-dockercfg-j0w2s
  nodeName: ip-172-31-27-242.eu-west-1.compute.internal
  restartPolicy: Always
  securityContext:
    fsGroup: 1000070000
    seLinuxOptions:
      level: s0:c8,c7
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: small-files
    persistentVolumeClaim:
      claimName: pvc-vol-05fad0544281b9a79
  - name: default-token-ooelj
    secret:
      defaultMode: 420
      secretName: default-token-ooelj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:44:59Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:46:52Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:44:59Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://bcf9fcce7bee45654210c96fd34dbc71a6c67274a0457a64520472e1adbdfa2b
    image: 172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    imageID: docker-pullable://172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    lastState: {}
    name: test-app
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-07T07:46:51Z
  hostIP: 172.31.27.242
  phase: Running
  podIP: 10.131.0.20
  startTime: 2017-06-07T07:44:59Z
OK, the easiest way to test whether fsGroup is responsible for the delay is to prevent OpenShift from automatically setting the pod's fsGroup and observe. Here's one way to do it:

1. Create a serviceAccount fstest and specify it in the deployment at spec.template.spec.serviceAccount
2. Add an SCC to fstest that *doesn't* automatically set pods' fsGroups (one with fsGroup: RunAsAny), such as the nonroot SCC:

   $ oadm policy add-scc-to-user nonroot system:serviceaccount:default:fstest

3. Recreate the deployment if needed (a full command sketch is at the end of this comment).

Then the pod should not have an fsGroup automatically set, and we can observe whether there is a long delay before ContainerStarted.

In my brief testing, fsGroup is indeed responsible for the bulk of the time it takes to mount a volume. I only tested with about 1M inodes, not 6M, so I don't know if 6M could cause minutes-long delays; regardless, the difference is clear with and without fsGroup.

Average time between "MountVolume.MountDevice succeeded" and "MountVolume.Setup succeeded":
  without fsGroup: 0s
  with fsGroup:    45s
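For completeness, the whole test sequence would look roughly like this (a sketch using the small-files project and test-app dc from this bug; the exact oc syntax may vary slightly between versions):

  $ oc project small-files
  $ oc create serviceaccount fstest
  $ oadm policy add-scc-to-user nonroot system:serviceaccount:small-files:fstest
  $ oc patch dc/test-app -p '{"spec":{"template":{"spec":{"serviceAccountName":"fstest"}}}}'
  $ oc get pod <new-pod> -o yaml | grep fsGroup    # should print nothing

Patching the template should trigger a new deployment on its own if the dc has a config-change trigger; otherwise roll one out manually.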
Hello Matthew,

There is some improvement if I set my deployment to use the nonroot SCC. The events about timeouts while attaching storage disappear. However, deployment still takes 3-4 minutes. Here is some info from my last test:

- with 6M files:

# oc describe pod test-app-8-eyhix | grep "Security Policy:"
Security Policy:  nonroot

# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          2d
test-app-8-eyhix   0/1       ContainerCreating   0          2m
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-eyhix   1/1       Running   0          4m

# oc get event -o go-template="{{ range .items }} {{.metadata.creationTimestamp}} : {{.message }} ; {{ end }}" | tr ';' '\n' | sort
...
2017-06-08T11:39:33Z : Created pod: test-app-8-eyhix
2017-06-08T11:39:33Z : Successfully assigned test-app-8-eyhix to ip-172-31-27-242.eu-west-1.compute.internal
2017-06-08T11:39:52Z : Created container with docker id 17d6353592e3
2017-06-08T11:39:52Z : pulling image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
2017-06-08T11:39:52Z : Successfully pulled image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
2017-06-08T11:41:52Z : Error syncing pod, skipping: failed to "StartContainer" for "test-app" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"
2017-06-08T11:41:52Z : Failed to start container with docker id 17d6353592e3 with error: operation timeout: context deadline exceeded

# oc rsh test-app-8-eyhix
sh-4.2$ cd /data/
sh-4.2$ ls -la | wc -l
6000004

- with 2.7M files:

sh-4.2$ ls -la /data/ | wc -l
2742013

# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          3d
test-app-8-ooay4   0/1       ContainerCreating   0          10s
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-ooay4   1/1       Running   0          2m

- with no files, only 45s:

sh-4.2$ perl -e 'chdir("/data") or die; opendir(D, "."); while($f = readdir D) {unlink($f);}'

# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          3d
test-app-8-6wisk   0/1       ContainerCreating   0          4s
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-6wisk   1/1       Running   0          45s

It looks like there is something else that scans the filesystem.
I'm certain now that the non-fsGroup slowdown is due to SELinux relabeling (a recursive chcon on the entire contents of the disk). From observation/profiling, OpenShift more or less idles during the time between pod creation and container start, while the Docker daemon reads heavily from the EBS disk and uses a lot of CPU. So I did a basic test with and without SELinux relabeling (outside of OpenShift), and the difference was clear:

$ docker run -v /ebs-mountpoint:/test:Z ...
takes minutes to start, whereas

$ docker run -v /ebs-mountpoint:/test ...
takes a second.

Unfortunately, at the moment disabling OpenShift's (Docker's) recursive chcon is not possible; all EBS disks will undergo it.

I should also note, regarding the fsGroup slowdown, that removing fsGroup from pods that use the volume is only somewhat of a workaround: without OpenShift automatically chown+chmod'ing the disk, it becomes the administrator's responsibility to ensure pods that use the volume have the right permissions to read/write it, e.g. ensure every file on that disk is owned by group 1234 and ensure every pod that uses that disk has supplemental group 1234. Hope that's acceptable.
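To make that workaround concrete, a minimal sketch (assuming group 1234 and the /data mount path used in this bug, and subject to the supplemental-group ranges allowed by the SCC):

  # one-time, wherever the volume is mounted:
  chgrp -R 1234 /data
  chmod -R g+rwX /data

  # then in the pod/deployment template, instead of relying on fsGroup:
  securityContext:
    supplementalGroups:
    - 1234

The one-time chgrp is just as slow as the automatic one, but it only has to happen once instead of on every mount.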
As I was saying in IRC - perhaps we can mitigate this a little bit by using :z (lowercase) and hence avoid the recursive relabelling.
Docker recursively sets the SELinux label on the volume, and this takes time: if it's 1 file it probably won't take much time; if it's a million directories of a million files each, it will take a lot of time. There is no way around this except to disable SELinux or buy faster hardware. In this case the POD does eventually start.

There's nothing we can do to optimize security, performance, and usefulness of the volume all at once: if we don't perform a recursive label on the volume, some files won't be accessible; if we do, it may take time to perform the operation.

All that to say, this turns into an expectation problem. The events indicate that things aren't happening as fast as usual, but the POD eventually comes up (after 14 mins). The fix here, if possible, is to give better event messages and document that starting a POD may take some time for volumes with a lot of files.

If there is a feature request around this (allow customizing or switching off SELinux labels), we can take that into consideration as an RFE. Lowering priority for now.
Docker takes a long time to recursively apply the SELinux label (:Z). The pod does appear to start correctly eventually despite "Failed to start container with docker id", but there are two possible consequences of this delay between CreateContainer and StartContainer:

- "Symlink container logs to the legacy container log location for cluster logging support." does not occur.
- Any pod container post-start lifecycle hooks, if any are specified, are never run.

This is because SyncPod appears to only run once.
Transferring this to the kubernetes component, as the storage portion is working as intended but the kubelet is behaving oddly due to the longer startup time of the docker container.

Summary:
- The Docker container takes a long time to perform the operation & start the container, but does not error out. The operation is expected to be slow (using a volume with lots of files and applying the SELinux label recursively to the volume is expected to take some time).
- Kube warns of errors during this long wait. Events are marked as "Warning" but the message has language of "Failure". This isn't ideal, but not a blocker problem.
- The Docker container eventually starts and so does the POD, but the POD behavior isn't correct: post-start lifecycle hooks aren't run and container logs aren't linked (as Matthew mentions in comment #2).

Also upping priority for node triage.
From a cursory look at the situation, this will not be easy to address.

We have the basic issue of how long we wait for a container to start before we consider it a failure. There is no way to detect whether the runtime is hung or actually making forward progress. With a sufficiently slow disk, it could take hours.

Post-start hooks and creation of the log symlink are synchronous with StartContainer, so if that errors out, they will not be done.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L121-L154

When the container is started later, IIRC it just gets picked up like a pod would on node restart: a container we don't have state about but that should be running, so we just reconcile the state and keep the pod running. However, we assume the post-start hooks and log symlink have already been done by the node process in a previous life.

tl;dr I don't think this is something we can address for 3.6. It will take some significant reworking.
The only thing that can really be done here is to manually adjust the runtime-request-timeout node option. This controls the timeout on the StartContainer call and every other runtime-related call.
This option can be set in the /etc/origin/node/node-config.yaml file. For example, to set the timeout to 10 minutes:

kubeletArguments:
  ...
  runtime-request-timeout:
  - "10m"
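One additional assumption about this setup, in case it isn't obvious: node-config.yaml is only read at startup, so the node service has to be restarted for the change to take effect, e.g.:

  # systemctl restart atomic-openshift-node    (origin-node on OpenShift Origin)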
Is it possible to create a configuration switch that controls the automatic SELinux relabeling of attached storage? I think it could be a solution to this issue if we could switch off automatic relabeling and delegate that responsibility to the administrator of the application.
AFAIK, there is no such switch today. I imagine it could be created, but it would involve plumbing through the stack. If you would like to suggest the introduction of such a feature, please open an RFE bug requesting it.
I am reopening this case as an RFE instead of creating a new one because it already has all the context about the issue.

To summarize:
- The pod is using block storage which contains a lot of files (several million).
- Every time a container is created, the system runs an SELinux relabelling task, which takes tens of minutes.
- Such a delay from docker causes a timeout on the OpenShift side.
- There is no way to disable relabelling at the moment.

The RFE is to make this possible: allow the administrator to control SELinux settings and disable the automatic actions.

The customer also did some research in this area and found an interesting article which may potentially help to resolve this issue without 'plumbing through the stack' [1].

[1] https://danwalsh.livejournal.com/38157.html
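For illustration only - this is an assumption about the general direction of [1], not a summary of it - one technique discussed in this area is to mount the filesystem with a fixed SELinux context so that every file already carries a container-accessible label and no per-file relabeling is needed, e.g.:

  # mount -o context="system_u:object_r:svirt_sandbox_file_t:s0" /dev/xvdf /ebs-mountpoint

With a context= mount the label is applied at the superblock level, so a recursive chcon over millions of inodes becomes unnecessary; the trade-off is that all containers sharing the mount see the same label (no per-pod MCS isolation). OpenShift does not expose anything like this today.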
(In reply to Seth Jennings from comment #11)
> The only thing that can really be done here is manually adjust the
> runtime-request-timeout node option.
>
> This controls the timeout on the StartContainer call and every other runtime
> related call.

Is this achievable with OCP 3.5? The option below appears to apply only to 3.6 and later.

(In reply to Seth Jennings from comment #12)
> This option can be set in the /etc/origin/node/node-config.yaml file.
>
> For example, to set the timeout to 10 minutes:
>
> kubeletArguments:
>   ...
>   runtime-request-timeout:
>   - "10m"
*** Bug 1597867 has been marked as a duplicate of this bug. ***
Where does this issue stand? In 3.9, when I have a GlusterFS PVC with 8k files it might take <1 min to attach, but if I have a GlusterFS PVC with >=30k files, it may take several minutes to attach. This can cause timeouts for an RC deploying a pod if the container takes too long to start.
I think it's important to mention that this issue happens even when only an empty subPath of the volume is mounted [1] and the thousands of files live in other subpaths of the same PV (i.e. they are not accessible from within the container).

[1] https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath
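For reference, the subPath case mentioned above looks roughly like this in the pod spec (a generic sketch, not the customer's actual manifest):

  volumeMounts:
  - name: small-files
    mountPath: /data
    subPath: empty-dir          # only this (empty) directory is visible in the container
  volumes:
  - name: small-files
    persistentVolumeClaim:
      claimName: pvc-vol-05fad0544281b9a79

The relabeling/chown runs against the whole volume when it is mounted on the node, before the subPath is bind-mounted into the container, which is why the delay is the same even though the files are not visible inside the container.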
To mloriedo's point, shouldn't the flexvol [1a] code (which gluster-subvol uses) prevent repeat chown operations [1b] when they don't need to execute? It looks like there is a gid check performed (release-1.9 & release-1.10) before the gid change + pvc attachment, but only for local file system types?

[mmclane@mmclane volume]$ git diff origin/master volume_linux.go
diff --git a/pkg/volume/volume_linux.go b/pkg/volume/volume_linux.go
index ef1f452..d67ee4a 100644
--- a/pkg/volume/volume_linux.go
+++ b/pkg/volume/volume_linux.go
@@ -89,3 +89,17 @@ func SetVolumeOwnership(mounter Mounter, fsGroup *int64) error {
 		return nil
 	})
 }
+
+// IsSameFSGroup is called only for requests to mount an already mounted
+// volume. It checks if fsGroup of new mount request is the same or not.
+// It returns false if it not the same. It also returns current Gid of a path
+// provided for dir variable.
+func IsSameFSGroup(dir string, fsGroup int64) (bool, int, error) {
+	info, err := os.Stat(dir)
+	if err != nil {
+		glog.Errorf("Error getting stats for %s (%v)", dir, err)
+		return false, 0, err
+	}
+	s := info.Sys().(*syscall.Stat_t)
+	return int(s.Gid) == int(fsGroup), int(s.Gid), nil
+}

[mmclane@mmclane volume]$ find . -type f -exec egrep -Hi issamefsgroup {} \;
./volume_unsupported.go:func IsSameFSGroup(dir string, fsGroup int64) (bool, int, error) {
./volume_linux.go:// IsSameFSGroup is called only for requests to mount an already mounted
./volume_linux.go:func IsSameFSGroup(dir string, fsGroup int64) (bool, int, error) {
./local/local.go:	fsGroupSame, fsGroupOld, err := volume.IsSameFSGroup(m.globalPath, fsGroupNew)

[mmclane@mmclane volume]$ git status
# On branch release-1.9
nothing to commit, working directory clean

Looks like the fsGroup bits were initially added here [2]. Of note, the chown is skipped if fsGroup isn't set -- which is probably why this isn't as visible/problematic for Jenkins.

[1a] https://github.com/kubernetes/kubernetes/blob/release-1.10/pkg/volume/flexvolume/mounter.go#L96
[1b] https://github.com/kubernetes/kubernetes/blob/release-1.9/pkg/volume/volume_linux.go#L38
[2] https://github.com/kubernetes/kubernetes/pull/49988/
Am I correct that changing the flexvolume code will affect only the chown operation, but not SELinux relabeling? Do we know what part of the time is taken by the relabeling and what part by the chown operation?

As far as I understand PR https://github.com/kubernetes/kubernetes/pull/49988/, that flag was added to prevent repeated remappings when a volume is already mounted on a node. It wasn't supposed to cover deciding when we need to change the GID of files and when not. And, actually, I think that it is not a straightforward operation, because we would have to respect chown execution failures and interruptions. Here is what I have in mind:

1. On volume mount, get the last GID for the volume from some attribute (label or so) for this particular volume.
2. If the last GID is the same as the current one, do not chown.
3. Otherwise (the last GID is unknown or not the same):
   3.1. Clear the last GID attribute.
   3.2. chown the files.
   3.3. Set the new last GID attribute.

In this case interruption of the chown, the need to change the GID, and no need to change the GID would all be handled. I'm not sure whether we can do that inside of a script; we might need to implement this in the code because interaction with k8s entities would be needed. I also don't know where we can store this last-GID info about the volume. Can we do that in some volume metadata? This line in the git volume looks similar to that:
https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/volume/git_repo/git_repo.go#L242
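On the 'inside of a script' question: a flexvolume-style driver could, in principle, record the last-applied GID in an extended attribute on the volume root and skip the recursive chown when it matches. A rough sketch (the user.last_fsgroup attribute name is made up here, and failure/interruption handling is only hinted at):

  #!/bin/bash
  # skip the recursive chown if the volume was already chowned to this GID
  MNT="$1"        # volume mount point
  FSGROUP="$2"    # requested fsGroup / GID

  last=$(getfattr -n user.last_fsgroup --only-values "$MNT" 2>/dev/null)
  if [ "$last" = "$FSGROUP" ]; then
      exit 0      # nothing to do
  fi

  # clear the marker first so an interrupted chown forces a re-run next time
  setfattr -x user.last_fsgroup "$MNT" 2>/dev/null || true
  chgrp -R "$FSGROUP" "$MNT" && chmod -R g+rwX "$MNT" || exit 1
  setfattr -n user.last_fsgroup -v "$FSGROUP" "$MNT"

An on-disk xattr has the advantage of travelling with the data, but whether volume metadata (e.g. PV annotations) would be a better place is exactly the open question above.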
@Oleksandr

> Am I correct that changing of flexvolume code will affect only chown operation,
> but not selinux relabeling? Do we know what part of the time is taken by the
> relabeling and what by chown operation?

FYI, SELinux relabeling is not performed, as the driver explicitly disables it.
*** Bug 1677362 has been marked as a duplicate of this bug. ***
With the introduction of OpenShift 4, Red Hat has delivered or roadmapped a substantial number of features based on feedback from our customers. Many of the enhancements encompass specific RFEs which have been requested, or deliver a comparable solution to a customer problem, rendering an RFE redundant.

This bz (RFE) has been identified as a feature request not yet planned or scheduled for an OpenShift release and is being closed. If this feature is still an active request that needs to be tracked, Red Hat Support can assist in filing a request in the new JIRA RFE system, as well as provide you with updates as the RFE progresses within our planning processes.

Opening a new support case: https://access.redhat.com/support/cases/#/case/new

As the new JIRA RFE system is not yet public, Red Hat Support can help answer your questions about your RFEs via the same support case system.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days