Bug 1392357 - Volume Attach/Mount is slow on AWS
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Hemant Kumar
QA Contact: chaoyang
Depends On:
Blocks:
 
Reported: 2016-11-07 05:06 EST by chaoyang
Modified: 2017-08-16 15 EDT
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Improve the efficiency of the attach operation on AWS and reduce the overall number of API calls being made.
Reason: Attach on AWS could be slow because duplicate API calls were being made, and frequent polling of volume information could cause the API quota to be exceeded.
Result: Removed duplicate API calls from the AWS code path. Also implemented bulk polling of AWS volumes to avoid API quota problems.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:15:47 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
log message (20.00 KB, application/x-tar)
2016-11-09 05:42 EST, chaoyang


External Trackers
Tracker ID: Red Hat Product Errata RHEA-2017:1716
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.6 RPM Release Advisory
Last Updated: 2017-08-10 05:02:50 EDT

Description chaoyang 2016-11-07 05:06:28 EST
Description of problem:
6 of the pods do not reach Running state when 50 pods using EBS volumes are created at the same time.

Version-Release number of selected component (if applicable):
openshift v3.4.0.22+5c56720
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
[root@ip-172-18-7-175 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)

How reproducible:
80%

Steps to Reproduce:
1. Update the environment to support 50 EBS volumes per node (a rough sketch of steps 1 and 2 follows the pod description below).
2. Create 50 PVs, PVCs, and pods.
3. After some time, 6 pods are still not running.
4. Describe one of the stuck pods:
[root@ip-172-18-12-250 test]# oc describe pods mypod50
Name:			mypod50
Namespace:		default
Security Policy:	anyuid
Node:			ip-172-18-7-175.ec2.internal/172.18.7.175
Start Time:		Mon, 07 Nov 2016 04:14:27 -0500
Labels:			name=frontendhttp
Status:			Pending
IP:			
Controllers:		<none>
Containers:
  myfrontend:
    Container ID:	
    Image:		aosqe/hello-openshift
    Image ID:		
    Port:		80/TCP
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /tmp from aws (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8qnna (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  aws:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	ebs50
    ReadOnly:	false
  default-token-8qnna:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-8qnna
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----					-------------	--------	------		-------
  24m		24m		1	{default-scheduler }					Normal		Scheduled	Successfully assigned mypod50 to ip-172-18-7-175.ec2.internal
  22m		2m		10	{kubelet ip-172-18-7-175.ec2.internal}			Warning		FailedMount	Unable to mount volumes for pod "mypod50_default(92172d4a-a4ca-11e6-bf36-0e259d96f826)": timeout expired waiting for volumes to attach/mount for pod "mypod50"/"default". list of unattached/unmounted volumes=[aws]
  22m		2m		10	{kubelet ip-172-18-7-175.ec2.internal}			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mypod50"/"default". list of unattached/unmounted volumes=[aws]
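
For illustration only, a rough sketch of steps 1 and 2 above (the KUBE_MAX_PD_VOLS override and the template file name are assumptions on my side, not taken from this report):

  # Step 1 (sketch): raise the scheduler's per-node EBS volume limit, e.g. by
  # exporting KUBE_MAX_PD_VOLS=50 in the master/scheduler environment and
  # restarting the controller services.

  # Step 2 (sketch): create 50 PV/PVC/pod sets from a parameterized template.
  for i in $(seq 1 50); do
    sed "s/NUM/$i/g" ebs-pv-pvc-pod-template.yaml | oc create -f -
  done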



Actual results:
Some of the pods do not reach Running state.

Expected results:
All 50 pods should be running.

Additional info:
[root@ip-172-18-7-175 ~]# grep -r ddd7c443 /var/log/messages
Nov  7 04:14:35 ip-172-18-7-175 atomic-openshift-node: I1107 04:14:35.146015   78899 reconciler.go:225] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443" (spec.Name: "ebs19") pod "92172d4a-a4ca-11e6-bf36-0e259d96f826" (UID: "92172d4a-a4ca-11e6-bf36-0e259d96f826")
Nov  7 04:14:35 ip-172-18-7-175 atomic-openshift-node: E1107 04:14:35.146133   78899 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443\"" failed. No retries permitted until 2016-11-07 04:14:43.146109937 -0500 EST (durationBeforeRetry 8s). Error: Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443" (spec.Name: "ebs19") pod "92172d4a-a4ca-11e6-bf36-0e259d96f826" (UID: "92172d4a-a4ca-11e6-bf36-0e259d96f826") has not yet been added to the list of VolumesInUse in the node's volume status.
Nov  7 04:16:35 ip-172-18-7-175 atomic-openshift-node: I1107 04:16:35.538277   78899 reconciler.go:225] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443" (spec.Name: "ebs19") pod "92172d4a-a4ca-11e6-bf36-0e259d96f826" (UID: "92172d4a-a4ca-11e6-bf36-0e259d96f826")
Nov  7 04:16:35 ip-172-18-7-175 atomic-openshift-node: E1107 04:16:35.544628   78899 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443\"" failed. No retries permitted until 2016-11-07 04:18:35.544615109 -0500 EST (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-ddd7c443" (spec.Name: "ebs19") pod "92172d4a-a4ca-11e6-bf36-0e259d96f826" (UID: "92172d4a-a4ca-11e6-bf36-0e259d96f826") is not yet attached according to node status.
Comment 1 Bradley Childs 2016-11-08 12:44:46 EST
This looks like a scenario where the AWS API quota is being exhausted and the attach calls for the PVs are failing due to the lack of quota.
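
One quick way to check for this (the log path and error string here are assumptions on my side) is to count AWS throttling errors in the node and controller logs, e.g.:

  grep -c "RequestLimitExceeded" /var/log/messages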

Can you attach the entire logs?

Thanks,
Comment 2 chaoyang 2016-11-09 05:42 EST
Created attachment 1218889 [details]
log message
Comment 3 chaoyang 2016-11-09 05:45:55 EST
Hi, I found that the pods do eventually reach Running state, but I am not sure what is affecting the attach/mount time. Please see the attached log; the PV name is pvc-3d764334-a64e-11e6-94f8-0e178310007a.
Comment 4 Hemant Kumar 2016-11-09 13:11:39 EST
What is the format of the log file? I untarred it and it seems to have created a .bak file which looks like a binary file.
Comment 5 Troy Dawson 2016-11-09 14:53:58 EST
Moving to "Assigned".  Modified is for when the fix is in origin and/or OCP.
Comment 6 Hemant Kumar 2016-11-09 15:50:23 EST
Also the `.bak` file appears to be empty.
Comment 7 chaoyang 2016-11-10 03:58:22 EST
The log is too big to upload, so I mailed it to you as an attachment.
Comment 8 Hemant Kumar 2016-11-14 15:56:14 EST
According to AWS, *40* is the number of allowed EBS volumes on a single instance.  

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html

Creating 50 or more volumes will likely lead to the API limit problems we are seeing here, so we have to tread carefully to avoid this.

Having said that, I am going to investigate further and see how high we can go before the limit kicks in.
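
For illustration, one way to check how many EBS volumes are currently attached to a node (the instance ID is a placeholder):

  aws ec2 describe-volumes \
      --filters Name=attachment.instance-id,Values=<instance-id> \
      --query 'length(Volumes)'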
Comment 10 chaoyang 2016-11-15 01:45:06 EST
These pods do reach Running state after some time, so I changed the bug title.
I also don't think this is caused by the AWS API quota limit; I hit the same issue when using NFS PVs.

There may be some other factor affecting the attach and mount time.
Comment 11 Jan Safranek 2016-11-15 10:33:59 EST
chaoyang, AWS is pretty slow when creating + attaching volumes to a node and you're hitting AWS API call throttling. How slow is actually slow?
Comment 12 chaoyang 2016-11-16 02:58:45 EST
It did not reproduce on OCP openshift v3.4.0.26+f7e109e.
I created 50 PVs, 50 PVCs, and 50 pods, and all pods were running within 3 minutes.

But on OCP openshift v3.4.0.22+5c56720,
6 of the 50 pods stayed in "ContainerCreating" for more than 7 minutes and hit the "timeout expired waiting for volumes to attach/mount" error message.

(In reply to Jan Safranek from comment #11)
> chaoyang, AWS is pretty slow when creating + attaching volumes to a node and
> you're hitting AWS API call throttling. How slow is actually slow?
Comment 14 Eric Paris 2017-01-28 13:05:45 EST
@hemant I see that 39842 has merged upstream. Is it in 1.5.2? If not, will you backport it to origin?

Is there anything we can/should do to make 39564 go faster?
Comment 17 Hemant Kumar 2017-03-29 15:32:58 EDT
The code that removes duplicate AWS calls has been merged into 1.5 - https://github.com/openshift/origin/pull/12740

Also, the code for recording metrics for AWS is here - https://github.com/kubernetes/kubernetes/pull/43477
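
For reference, the "bulk polling" mentioned in the doc text amounts to querying many volumes in a single DescribeVolumes call instead of one call per volume - roughly the CLI equivalent of (volume IDs are placeholders):

  aws ec2 describe-volumes --volume-ids vol-aaaaaaaa vol-bbbbbbbb vol-cccccccc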
Comment 19 chaoyang 2017-06-15 02:49:41 EDT
The test passed on the following version:
openshift v3.6.105
kubernetes v1.6.1+5115d708d7

1. Create 50 PVs, PVCs, and pods.
2. All pods are running.
3. Delete the 50 pods, PVCs, and PVs.
Repeated the above steps several times and all pods reached Running state each time (a sketch of the delete/recreate cycle is below).
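
A sketch of the verification cycle, assuming the object names follow the mypodN/ebsN pattern seen earlier in this bug (the PV name pattern is a placeholder):

  for i in $(seq 1 50); do
    oc delete pod mypod$i
    oc delete pvc ebs$i
    oc delete pv pv$i        # placeholder PV name
  done
  # then recreate the 50 PV/PVC/pod sets and confirm all pods reach Running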
Comment 21 errata-xmlrpc 2017-08-10 01:15:47 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
