Bug 1459106 - Deployments timing out if storage contains lots of small files
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Kubernetes
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Seth Jennings
QA Contact: DeShuai Ma
Depends On:
Blocks:
 
Reported: 2017-06-06 06:22 EDT by Alexander Koksharov
Modified: 2017-08-08 13:00 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Alexander Koksharov 2017-06-06 06:22:52 EDT
Description of problem:
A persistent volume contains a large number of files (>6,000,000) and is formatted with ext4 (using the -T option). OSE is not able to launch pods with this volume attached.

The issue appears even if the application is not accessing this storage - it only has to be mounted.


Version-Release number of selected component (if applicable):
3.4

How reproducible:

Steps to Reproduce:
1. Create a PV similar to:
# oc export pv pv-vol-05fad0544281b9a79 
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: null
  labels:
    failure-domain.beta.kubernetes.io/region: eu-west-1
    failure-domain.beta.kubernetes.io/zone: eu-west-1c
  name: pv-vol-05fad0544281b9a79
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://eu-west-1c/vol-05fad0544281b9a79
  capacity:
    storage: 400Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-vol-05fad0544281b9a79
    namespace: small-files
    resourceVersion: "23371"
    uid: 3be1c9c1-49f1-11e7-a3dc-06578a1a613c
  persistentVolumeReclaimPolicy: Retain
status: {}

2. Create files on the storage:
# for i in {1..10000000}; do echo $i > ./$i.txt; done

3. Scale up a deployment with this volume attached. The system takes several minutes to start the application. The following is reported in the events:
Events:
  FirstSeen	LastSeen	Count	From							SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----							-------------			--------	------		-------
  13m		13m		1	{default-scheduler }									Normal		Scheduled	Successfully assigned test-app-3-eyhvf to ip-172-31-27-242.eu-west-1.compute.internal
  11m		6m		3	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}					Warning		FailedMount	Unable to mount volumes for pod "test-app-3-eyhvf_small-files(edc7aab8-49f9-11e7-a3dc-06578a1a613c)": timeout expired waiting for volumes to attach/mount for pod "test-app-3-eyhvf"/"small-files". list of unattached/unmounted volumes=[small-files]
  11m		6m		3	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}					Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "test-app-3-eyhvf"/"small-files". list of unattached/unmounted volumes=[small-files]
  6m		6m		1	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}	spec.containers{test-app}	Normal		Pulling		pulling image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
  6m		6m		1	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}	spec.containers{test-app}	Normal		Pulled		Successfully pulled image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4"
  6m		6m		1	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}	spec.containers{test-app}	Normal		Created		Created container with docker id 3b229a3c2095; Security:[seccomp=unconfined]
  4m		4m		1	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}	spec.containers{test-app}	Warning		Failed		Failed to start container with docker id 3b229a3c2095 with error: operation timeout: context deadline exceeded
  4m		4m		1	{kubelet ip-172-31-27-242.eu-west-1.compute.internal}					Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "test-app" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"


Actual results:


Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
Comment 1 Matthew Wong 2017-06-06 18:14:43 EDT
Most likely, IMO, it is the recursive chown + chmod for fsGroup (which ensures the pod can write to the volume) that is causing this slowdown; can you please provide the output of 'oc get pod -o yaml' to confirm whether the pod has an fsGroup set?
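For reference, what fsGroup ownership management amounts to is roughly the following shell sketch (an approximation, not the actual kubelet code; the volume path, pod UID and GID below are illustrative placeholders):

# approximation of the recursive pass the kubelet makes when a pod has fsGroup set
VOLPATH=/var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~aws-ebs/<pv-name>
chown -R :<fsGroup-gid> "$VOLPATH"                 # give every file the fsGroup GID
find "$VOLPATH" -type d -exec chmod g+rwxs {} +    # group rwx plus setgid on directories
find "$VOLPATH" -type f -exec chmod g+rw {} +      # group read/write on regular files
# With millions of inodes, these recursive passes alone can take minutes.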
Comment 2 Alexander Koksharov 2017-06-07 03:49:49 EDT
 === In Red Hat Customer Portal Case 01862362 ===
--- Comment by Koksharov, Alexander on 6/7/2017 9:49 AM ---

Yes, fsGroup is set for the pod, but it is not set in the dc.

# oc get dc test-app | grep -i fsGroup
# 


# oc get pod test-app-3-2qb48 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"small-files","name":"test-app-3","uid":"601c90f0-49f1-11e7-a3dc-06578a1a613c","apiVersion":"v1","resourceVersion":"66439"}}
    openshift.io/container.test-app.image.entrypoint: '["container-entrypoint","/bin/sh","-c","$STI_SCRIPTS_PATH/usage"]'
    openshift.io/deployment-config.latest-version: "3"
    openshift.io/deployment-config.name: test-app
    openshift.io/deployment.name: test-app-3
    openshift.io/generated-by: OpenShiftNewApp
    openshift.io/scc: restricted
  creationTimestamp: 2017-06-07T07:44:59Z
  generateName: test-app-3-
  labels:
    app: test-app
    deployment: test-app-3
    deploymentconfig: test-app
  name: test-app-3-2qb48
  namespace: small-files
  resourceVersion: "66506"
  selfLink: /api/v1/namespaces/small-files/pods/test-app-3-2qb48
  uid: 34f92c81-4b55-11e7-a116-06578a1a613c
spec:
  containers:
  - image: 172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    imagePullPolicy: Always
    name: test-app
    ports:
    - containerPort: 8080
      protocol: TCP
    resources: {}
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
        - SYS_CHROOT
      privileged: false
      runAsUser: 1000070000
      seLinuxOptions:
        level: s0:c8,c7
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /data
      name: small-files
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-ooelj
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: default-dockercfg-j0w2s
  nodeName: ip-172-31-27-242.eu-west-1.compute.internal
  restartPolicy: Always
  securityContext:
    fsGroup: 1000070000
    seLinuxOptions:
      level: s0:c8,c7
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: small-files
    persistentVolumeClaim:
      claimName: pvc-vol-05fad0544281b9a79
  - name: default-token-ooelj
    secret:
      defaultMode: 420
      secretName: default-token-ooelj
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:44:59Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:46:52Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2017-06-07T07:44:59Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://bcf9fcce7bee45654210c96fd34dbc71a6c67274a0457a64520472e1adbdfa2b
    image: 172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    imageID: docker-pullable://172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4
    lastState: {}
    name: test-app
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2017-06-07T07:46:51Z
  hostIP: 172.31.27.242
  phase: Running
  podIP: 10.131.0.20
  startTime: 2017-06-07T07:44:59Z
Comment 3 Matthew Wong 2017-06-07 17:36:32 EDT
OK, the easiest way to test if fsGroup is responsible for the delay is to prevent OpenShift from automatically setting the pod's fsGroup and observe. Here's one way to do it:

1. Create a serviceAccount fstest and specify it in the deployment at spec.template.spec.serviceAccount
2. Add an SCC to fstest that *doesn't* automatically set pods' fsGroups (one with fsGroup: RunAsAny), such as the nonroot SCC
$ oadm policy add-scc-to-user nonroot system:serviceaccount:default:fstest
3. Recreate the deployment if needed

Then the pod should not have an fsGroup automatically set, and we can observe whether there is still a long delay before ContainerStarted.
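A minimal command sketch of the above (the project name small-files and dc name test-app are taken from this case; adjust as needed):

$ oc create serviceaccount fstest -n small-files
$ oadm policy add-scc-to-user nonroot system:serviceaccount:small-files:fstest
$ oc patch dc/test-app -n small-files -p '{"spec":{"template":{"spec":{"serviceAccount":"fstest","serviceAccountName":"fstest"}}}}'
$ oc deploy test-app --latest -n small-files    # redeploy so new pods pick up the service account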

 
In my brief testing, fsGroup is indeed responsible for the bulk of the time it takes to mount a volume. I only tested with about 1M inodes though, not 6M, so I don't know if 6M could cause minutes-long delays; regardless, the difference is clear with and without fsGroup.

Avg time between "MountVolume.MountDevice succeeded" and "MountVolume.Setup succeeded":
without fsGroup: 0s
with fsGroup: 45s
Comment 4 Alexander Koksharov 2017-06-08 10:30:36 EDT
Hello Matthew,

There is some improvement if I set my deployment to use the nonroot SCC. The events about timeouts while attaching the storage disappear.

However, the deployment still takes 3-4 minutes. Here is some info from my last test:

- with 6M files
# oc describe pod test-app-8-eyhix | grep "Security Policy:"
Security Policy:	nonroot

# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          2d
test-app-8-eyhix   0/1       ContainerCreating   0          2m
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-eyhix   1/1       Running   0          4m

# oc get event -o go-template="{{ range  .items }} {{.metadata.creationTimestamp}} : {{.message }} ; {{ end }}" | tr ';' '\n' | sort
...
  2017-06-08T11:39:33Z : Created pod: test-app-8-eyhix 
  2017-06-08T11:39:33Z : Successfully assigned test-app-8-eyhix to ip-172-31-27-242.eu-west-1.compute.internal 
  2017-06-08T11:39:52Z : Created container with docker id 17d6353592e3
  2017-06-08T11:39:52Z : pulling image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4" 
  2017-06-08T11:39:52Z : Successfully pulled image "172.30.139.143:5000/small-files/test-app@sha256:34a69513e8551d069d1cd63e56982c247b3b7f3d81f54e04a7b1c167d6b624e4" 
  2017-06-08T11:41:52Z : Error syncing pod, skipping: failed to "StartContainer" for "test-app" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"
  2017-06-08T11:41:52Z : Failed to start container with docker id 17d6353592e3 with error: operation timeout: context deadline exceeded 

# oc rsh test-app-8-eyhix
sh-4.2$ cd /data/
sh-4.2$ ls -la | wc -l
6000004


- with 2.7M files
sh-4.2$ ls -la /data/ | wc -l
2742013

# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          3d
test-app-8-ooay4   0/1       ContainerCreating   0          10s
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-ooay4   1/1       Running   0          2m


- with no files: only 45s
sh-4.2$ perl -e 'chdir("/data") or die; opendir(D, "."); while($f = readdir D) {unlink($f);}'
# oc get pods -w
NAME               READY     STATUS              RESTARTS   AGE
test-app-3-build   0/1       Completed           0          3d
test-app-8-6wisk   0/1       ContainerCreating   0          4s
NAME               READY     STATUS    RESTARTS   AGE
test-app-8-6wisk   1/1       Running   0          45s



It looks like there is something else that scans the filesystem.
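One way to confirm what is doing the scanning is to watch per-process disk I/O and CPU on the node while the pod sits in ContainerCreating (a sketch; assumes the sysstat package is installed on the node):

# on the node hosting the pod, during ContainerCreating
$ pidstat -d 5    # per-process disk reads/writes; look for docker/dockerd reading heavily
$ pidstat -u 5    # per-process CPU usage during the same window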
Comment 5 Matthew Wong 2017-06-08 17:21:46 EDT
I'm certain now that the non-fsGroup slowdown is due to SELinux relabeling (recursive chcon on the entire contents of the disk).

From observation/profiling, openshift more or less idles during the time between pod creation and container start. Meanwhile, the Docker daemon does significant reading from the EBS disk & uses a lot of CPU.

So I did a basic test with and without SELinux relabeling (outside of openshift) and the difference was clear.

$ docker run -v /ebs-mountpoint:/test:Z ... takes minutes to start, whereas
$ docker run -v /ebs-mountpoint:/test ... takes a second.
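The relabel cost can also be reproduced without Docker at all; something along these lines shows the same walk over every inode (the type and MCS pair are illustrative of what Docker applies on this host, taken from the pod's seLinuxOptions above):

$ time chcon -R -t svirt_sandbox_file_t -l s0:c8,c7 /ebs-mountpoint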

Unfortunately, at the moment it is not possible to disable openshift's (docker's) recursive chcon. All EBS disks will undergo it.


I should also note, regarding the fsGroup slowdown, that removing fsGroup from pods that use the volume is somewhat of a workaround: without openshift automatically chown+chmod'ing the disk, it becomes the administrator's responsibility to ensure that pods using the volume have the right permissions to read/write it, e.g. ensure every file on that disk is owned by group 1234 and ensure every pod that uses that disk has supplemental group 1234. Hope that's acceptable.
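A rough sketch of that manual-permissions workaround, using group 1234 from the example above (the mountpoint and resource names are illustrative):

# one-time, on the volume contents (from a node or a privileged pod)
chgrp -R 1234 /ebs-mountpoint
chmod -R g+rwX /ebs-mountpoint

# then give every pod that mounts the volume that supplemental group instead of relying on fsGroup
$ oc patch dc/test-app -p '{"spec":{"template":{"spec":{"securityContext":{"supplementalGroups":[1234]}}}}}'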
Comment 6 Hemant Kumar 2017-06-08 19:31:10 EDT
As I was saying in IRC - perhaps we can mitigate this a little bit by using :z (lowercase) and hence avoid the recursive relabelling.
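For reference, the lowercase variant of the earlier test would look like this (a sketch; with :z the contents keep one shared label across pods instead of being relabelled to a new per-container MCS pair each time):

$ docker run -v /ebs-mountpoint:/test:z ...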
Comment 7 Bradley Childs 2017-06-21 16:21:05 EDT
Docker recursively sets the SELinux label on the volume, and this will take time. If it's one file, it probably won't take much time; if it's a million directories of a million files each, it will take a lot of time. There is no way around this except to disable SELinux or buy faster hardware.

In this case the POD does eventually start. There is no way to optimize security, performance, and usefulness of the volume all at once: if we don't perform a recursive relabel of the volume, some files won't be accessible; if we do, the operation may take time.

All that is to say, this turns into an expectations problem. The events indicate that things aren't happening as fast as usual, but the POD eventually comes up (after 14 mins). The fix here, if possible, is to give better Event messages and to document that starting a POD may take some time for volumes with a lot of files.

If there is a feature request around this (allowing the SELinux labeling to be customized or switched off), we can take that into consideration as an RFE.

Lowering priority for now.
Comment 8 Matthew Wong 2017-06-22 10:29:09 EDT
Docker takes a long time to recursively apply the SELinux label (:Z). The pod does appear to start correctly eventually despite "Failed to start container with docker id", but because SyncPod appears to only run once, there are two possible consequences of this delay between CreateContainer and StartContainer:

1. "Symlink container logs to the legacy container log location for cluster logging support." does not occur.

2. Any post-start lifecycle hooks specified for the pod's containers are never run.
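One quick way to check the first consequence is to look for the legacy log symlink on the node (a sketch; /var/log/containers is the location cluster logging reads from, and the pod name is from this case):

# on the node running the pod
$ ls -l /var/log/containers/ | grep test-app    # no symlink appears for the affected container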
Comment 9 Bradley Childs 2017-06-22 14:05:18 EDT
Transferring this to the kubernetes component, as the storage portion is working as intended but the kubelet is behaving oddly due to the longer startup time of the docker container.

Summary:

- The Docker container takes a long time to perform the operation and start, but does not error out. The operation is expected to be slow (using a volume with lots of files and applying the SELinux label recursively to the volume is expected to take some time).

- Kube warns of errors during this 'long wait'. Events are marked as "Warning" but the messages use "Failure" language. This isn't ideal, but it is not a blocker problem.

- The Docker container eventually starts and so does the POD, but the POD's behavior isn't correct. Post-start lifecycle hooks aren't run and container logs aren't symlinked (as Matthew mentions in comment #8).

Also upping priority for node triage.
Comment 10 Seth Jennings 2017-06-22 23:09:57 EDT
From a cursory look at the situation, this will not be easy to address.

We have the basic issue of how long we should wait for a container to start before we consider it a failure. There is no way to tell whether something in the runtime is hung or is actually making forward progress. With a sufficiently slow disk, it could take hours.

Post-start hooks and creation of the log symlink are synchronous with StartContainer, so if that errors out, they will not be done.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L121-L154

When the container is started later, IIRC it just gets pulled in the way a pod would be on a node restart: a container that we don't have state about but that should be running, so we reconcile the state and keep the pod running. However, we assume the post-start hooks and log symlinks were already created by the node process in a previous life.

tl;dr I don't think this is something we can address for 3.6.  It will take some significant reworking.
