Bug 1392650

Summary:	[Preview] Slow provisioning of PVs
Product:	OpenShift Online	Reporter:	Stefanie Forrester <dakini>
Component:	Storage	Assignee:	Abhishek Gupta <abhgupta>
Status:	CLOSED WORKSFORME	QA Contact:	zhaliu
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.x	CC:	agoldste, aos-bugs, bchilds, bingli, bparees, dev, erich, jordiszrz, jsafrane, kwoodson, mturansk, pep, yasun
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-12-20 10:52:02 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1303130

Description Stefanie Forrester 2016-11-08 00:51:50 UTC

Description of problem:

When creating a PV-backed app, persistent volumes take about 7-8 minutes to bind to the PVC. It takes even longer for the PV to get an AWS volume ID.

Version-Release number of selected component (if applicable):

atomic-openshift-3.3.1.3-1.git.0.86dc49a.el7.x86_64
oso-volume-provisioner:v3.3.0.0-4

How reproducible:
100%

Steps to Reproduce:
1. oadm new-project <projectname>
2. oc new-app cakephp-mysql-example -n <projectname>
3. oc get pvc -n <projectname> -w

Actual results:

It takes about 7-8 minutes for the PVC to bind. And the PV is bound without a real AWS volume ID.

Expected results:

PVC should bind within about a minute and have a real AWS volume ID.

Additional info:

# These logs show the process of creating a new project and app.
# Even though I've never used this project name before, it said it already exists. This particular behavior is not very reproducible, but the overall behavior here is (PVs are slow to provision/bind).

[root@preview-master-e69da ~]# oadm new-project dakinitestpvs
Error from server: namespaces "dakinitestpvs" already exists

[root@preview-master-e69da ~]# oc new-app cakephp-mysql-example -n dakinitestpvs

# after 8 minutes, the PV still does not have an AWS volume ID.

[root@preview-master-e69da ~]# oc get pvc -n dakinitestpvs -w
NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
mysql     Pending                                      54s
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mysql     Pending   pv-aws-vy3qj   0                        5m
mysql     Bound     pv-aws-vy3qj   1Gi       RWO       5m

[root@preview-master-e69da ~]# oc get pvc -n dakinitestpvs -w
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-aws-vy3qj   1Gi        RWO           8m


[root@preview-master-e69da ~]# oc describe pv pv-aws-vy3qj
Name:           pv-aws-vy3qj
Labels:         <none>
Status:         Bound
Claim:          dakinitestpvs/mysql
Reclaim Policy: Delete
Access Modes:   RWO
Capacity:       1Gi
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   placeholder-for-provisioning
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
No events.


# after 10-11 minutes I noticed it finally has an AWS volume ID

[root@preview-master-e69da ~]# oc get pvc -n dakinitestpvs -w
The connection to the server internal.api.preview.openshift.com was refused - did you specify the right host or port?

[root@preview-master-e69da ~]# oc get pvc -n dakinitestpvs 
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-aws-vy3qj   1Gi        RWO           11m


[root@preview-master-e69da ~]# oc describe pv pv-aws-vy3qj
Name:           pv-aws-vy3qj
Labels:         failure-domain.beta.kubernetes.io/region=us-east-1
                failure-domain.beta.kubernetes.io/zone=us-east-1c
Status:         Bound
Claim:          dakinitestpvs/mysql
Reclaim Policy: Delete
Access Modes:   RWO
Capacity:       1Gi
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-east-1c/vol-01472e33d2fc71a03
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
No events.


# PV provisioner pod logs

[root@preview-master-e69da ~]# oc logs  --tail=500 online-volume-provisioner-4-kq3o0 -n openshift-infra  -f |grep dakinitestpvs

E1107 19:08:10.189364       1 provisioner.go:362] PersistentVolumeClaim[dakinitestpvs/mysql] error updating claim. This could lead to a duplicate volumes being provisioned: Operation cannot be fulfilled on persistentvolumeclaims "mysql": the object has been modified; please apply your changes to the latest version and try again
I1107 19:08:10.189504       1 event.go:211] Event(api.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dakinitestpvs", Name:"mysql", UID:"9b972792-a546-11e6-8c72-0ebeb1070c7f", APIVersion:"v1", ResourceVersion:"322792491", FieldPath:""}): type: 'Normal' reason: 'ProvisioningStarted' Volume pv-aws-vy3qj is being created and provisioned for this claim.
E1107 19:08:10.541872       1 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"mysql.1484e9d7ab882fcf", GenerateName:"", Namespace:"dakinitestpvs", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dakinitestpvs", Name:"mysql", UID:"9b972792-a546-11e6-8c72-0ebeb1070c7f", APIVersion:"v1", ResourceVersion:"322792491", FieldPath:""}, Reason:"ProvisioningStarted", Message:"Volume pv-aws-vy3qj is being created and provisioned for this claim.", Source:api.EventSource{Component:"online-volume-provisioner", Host:""}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63614160490, nsec:189434831, loc:(*time.Location)(0x279ed20)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63614160490, nsec:189434831, loc:(*time.Location)(0x279ed20)}}, Count:1, Type:"Normal"}': 'User "system:serviceaccount:openshift-infra:volume-provisioner" cannot create events in project "dakinitestpvs"' (will not retry!)

# at this point, about 5 minutes after the app was created, the placeholder-for-provisioning is still present in the PV

I1107 19:13:10.383860       1 event.go:211] Event(api.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dakinitestpvs", Name:"mysql", UID:"9b972792-a546-11e6-8c72-0ebeb1070c7f", APIVersion:"v1", ResourceVersion:"322832521", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSuccessful' A volume has been provisioned for this claim.
E1107 19:13:10.394895       1 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"mysql.1484ea1d9081a9b2", GenerateName:"", Namespace:"dakinitestpvs", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dakinitestpvs", Name:"mysql", UID:"9b972792-a546-11e6-8c72-0ebeb1070c7f", APIVersion:"v1", ResourceVersion:"322832521", FieldPath:""}, Reason:"ProvisioningSuccessful", Message:"A volume has been provisioned for this claim.", Source:api.EventSource{Component:"online-volume-provisioner", Host:""}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63614160790, nsec:383733170, loc:(*time.Location)(0x279ed20)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63614160790, nsec:383733170, loc:(*time.Location)(0x279ed20)}}, Count:1, Type:"Normal"}': 'User "system:serviceaccount:openshift-infra:volume-provisioner" cannot create events in project "dakinitestpvs"' (will not retry!)


# controller logs

[sedgar@use-tower1 ~]$  opssh -i -c preview "journalctl -lu atomic-openshift-master-controllers --since=today |grep dakinitestpvs" --v3 -t master                                                                  
[1] 19:22:30 [FAILURE] 54.211.248.61 Exited with error code 1
[2] 19:22:30 [SUCCESS] 54.83.159.229
Nov 08 00:01:37 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[57968]: I1108 00:01:37.076227   57968 vnids_master.go:119] Allocated netid 3370675 for namespace "dakinitestpvs"
Nov 08 00:02:35 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[57968]: I1108 00:02:35.575129   57968 factory.go:212] Replication controller "dakinitestpvs/" has been deleted
Nov 08 00:03:10 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[57968]: I1108 00:03:10.085651   57968 factory.go:212] Replication controller "dakinitestpvs/" has been deleted
[3] 19:22:30 [FAILURE] 54.173.224.22 Exited with error code 1


# related pods never start up
[root@preview-master-e69da ~]# oc get pods -n dakinitestpvs
NAME                            READY     STATUS    RESTARTS   AGE
cakephp-mysql-example-1-build   0/1       Error     0          20m
mysql-1-deploy                  0/1       Error     0          20m

# this could be related to some controller crashes

[sedgar@use-tower1 ~]$  opssh -i -c preview "journalctl -lu atomic-openshift-master-controllers --since=today |grep 'failed state'" --v3 -t master
[1] 19:40:59 [SUCCESS] 54.211.248.61
Nov 08 00:22:20 ip-172-31-10-23.ec2.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 08 00:26:42 ip-172-31-10-23.ec2.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
[...]

[sedgar@use-tower1 ~]$  opssh -i -c preview "journalctl -lu atomic-openshift-master-controllers --since 2016-11-07 --until 2016-11-08 |grep -c 'failed state'" --v3 -t master 
[1] 19:47:02 [SUCCESS] 54.211.248.61
101
[2] 19:47:05 [SUCCESS] 54.173.224.22
112
[3] 19:47:08 [SUCCESS] 54.83.159.229
108

Comment 2 Stefanie Forrester 2016-11-09 18:05:49 UTC

This is the raw data for the API DoS issue in Preview. It appears to be mostly caused by DescribeInstances calls. DescribeVolumes is also happening often enough to cause issues.


[dakini@samadhi ~]$ for error in $(grep errorCode ~/preview_cloudtrail_20161108.txt|sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode ~/preview_cloudtrail_20161108.
txt |grep -c $error ; done
-------- AccessPointNotFoundException ---------
1625
-------- Client.InvalidParameterValue ---------
21
-------- Client.InvalidVolume.NotFound ---------
1
-------- Client.RequestLimitExceeded ---------
269900
-------- Client.UnauthorizedOperation ---------
20
-------- Client.VolumeInUse ---------
102747
-------- Server.InternalError ---------
116
-------- Server.Unavailable ---------
9

[dakini@samadhi ~]$ for event in $(grep eventName ~/preview_cloudtrail_20161108.txt|sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $event ---------" ; grep eventName ~/preview_cloudtrail_20161108.txt |grep -c $event ; done
-------- AssumeRoleWithSAML ---------
1
-------- AttachVolume ---------
59691
-------- CheckMfa ---------
1
-------- ConsoleLogin ---------
1
-------- CreateSnapshot ---------
1374
-------- CreateTags ---------
2922
-------- CreateVolume ---------
133
-------- DeleteAccessKey ---------
1
-------- DeleteLoginProfile ---------
1
-------- DeleteSnapshot ---------
66
-------- DeleteUser ---------
1
-------- DeleteVolume ---------
46528
-------- DeregisterInstancesFromLoadBalancer ---------
5
-------- DescribeConfigurationRecorders ---------
3
-------- DescribeConfigurationRecorderStatus ---------
3
-------- DescribeInstanceHealth ---------
15
-------- DescribeInstances ---------
622988
-------- DescribeLoadBalancers ---------
1643
-------- DescribeReservedInstances ---------
1
-------- DescribeSnapshots ---------
23
-------- DescribeSubnets ---------
20
-------- DescribeTrails ---------
8
-------- DescribeVolumes ---------
211991
-------- DetachVolume ---------
207
-------- GetBucketLocation ---------
2
-------- GetBucketVersioning ---------
1
-------- GetUser ---------
1
-------- ListAccessKeys ---------
2
-------- ListBuckets ---------
4
-------- LookupEvents ---------
3
-------- RegisterInstancesWithLoadBalancer ---------
10
-------- RemoveUserFromGroup ---------
1

If we can find a way to reduce either of those calls, Preview should stabilize. But I'm not sure how to do that offhand. I'll have to look through more logs to see at what points openshift is calling DescribeInstances.

I might be able to reduce the DescribeVolumes calls by finding and deleting any volumes that haven't been properly cleaned up by the cluster.

Comment 3 Stefanie Forrester 2016-11-10 00:30:08 UTC

I wasn't able to work around the issue by detaching volumes. It turns out the affected volumes are attaching repeatedly and sometimes succeeding. So we need to figure out why they're attempting to attach so often, even when their previous attach attempts were successful.

This shows one volume where the attach succeeded 73 times today, and only failed 11 times. Why does it keep trying to attach? Maybe due to node evacuations?

[root@preview-master-e69da ~]# journalctl -lu atomic-openshift-master-controllers --since=today |grep vol-08030aec4f6f0e0e3 |grep 'Attach succeeded' -c
73

[root@preview-master-e69da ~]# journalctl -lu atomic-openshift-master-controllers --since=today |grep vol-08030aec4f6f0e0e3 |grep 'Attach failed' -c
11

The number of attach attempts seems to line up with the events listed in the project:

[root@preview-master-e69da ~]# oc get events -w -n swd-test-project
LASTSEEN               FIRSTSEEN              COUNT     NAME              KIND      SUBOBJECT   TYPE      REASON        SOURCE                                  MESSAGE
2016-11-10T00:22:01Z   2016-11-09T21:35:42Z   76        jenkins-1-pftp6   Pod                   Warning   FailedMount   {kubelet ip-172-31-2-89.ec2.internal}   Unable to mount volumes for pod "jenkins-1-pftp6_swd-test-project(2fa9aed3-a6c4-11e6-8286-0ebeb1070c7f)": timeout expired waiting for volumes to attach/mount for pod "jenkins-1-pftp6"/"swd-test-project". list of unattached/unmounted volumes=[jenkins-data]
2016-11-10T00:22:01Z   2016-11-09T21:35:42Z   76        jenkins-1-pftp6   Pod                   Warning   FailedSync    {kubelet ip-172-31-2-89.ec2.internal}   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "jenkins-1-pftp6"/"swd-test-project". list of unattached/unmounted volumes=[jenkins-data]

In the default project, I see a lot of node NotReady events and tons of RegisteredNode events, but no evacuations yet.

In general, etcd is struggling to keep up on this cluster. I noticed this around the same time we started having trouble with the PVs, though I'm not sure if it's related.

We could try adding a couple more etcd nodes (or replace the storage on our current nodes) and see if it does a better job keeping up with all the events on the cluster. Here are some etcd errors I'm seeing often:

etcd[1675]: failed to send out heartbeat on time (deadline exceeded for 44.589112ms)
etcd[1675]: server is likely overloaded
etcd[1675]: etcdserver: request timed out, possibly due to previous leader failure [merged 17 repeated lines in 1s]

Comment 4 yasun 2016-11-10 11:06:58 UTC

I have run on dev-preview-int, the pvc are created and bound to a pv quickly, the time cost is 42 seconds

Version-Release number for dev-preview-int:
openshift v3.4.0.21+ca4702d
kubernetes v1.4.0+776c994

1. [root@ymsun origin]# oc new-project dakinitestpvs
2. [root@ymsun origin]# oc new-app cakephp-mysql-example 
--> Deploying template "openshift/cakephp-mysql-example" to project dakinitestpvs
....
<skips>
3. [root@ymsun origin]#  oc get pvc -n dakinitestpvs -w
NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
mysql     Bound     pvc-e4bcaf0f-a733-11e6-9c35-0a30d6fc46dc   1Gi        RWO           42s

Comment 6 Kenny Woodson 2016-11-10 20:20:34 UTC

Just adding more numbers.

Statistics for the entire day look like this:

Count: 86    Key: CreateVolume
Count: 2    Key: ListAliases
Count: 4    Key: LookupEvents
Count: 6    Key: DescribeServices
Count: 1019850    Key: DescribeInstances
Count: 4    Key: DescribeConfigurationRecorders
Count: 445    Key: DeleteSnapshot
Count: 4    Key: DescribeConfigurationRecorderStatus
Count: 9    Key: DescribeTrails
Count: 6    Key: DescribeCases
Count: 6    Key: DescribeAlarms
Count: 1    Key: DescribeReservedInstances
Count: 2401    Key: DescribeLoadBalancers
Count: 1    Key: CreateCase
Count: 1    Key: CreateNetworkInterface
Count: 4    Key: DescribeSeverityLevels
Count: 1    Key: ConsoleLogin
Count: 41    Key: DetachVolume
Count: 211972    Key: AttachVolume
Count: 25    Key: DescribeSnapshots
Count: 2    Key: ListBuckets
Count: 10703    Key: DeleteVolume
Count: 2101    Key: CreateTags
Count: 246051    Key: DescribeVolumes
Count: 3    Key: DeleteNetworkInterface
Count: 1    Key: CheckMfa
Count: 1    Key: DescribeAccountLimits
Count: 997    Key: CreateSnapshot

That is for around 20 hours of logging in cloudtrail.  This comes out to be 20.7601 api calls per second.

This does not include the api calls that were disallowed due to the limiting.

Comment 7 Stefanie Forrester 2016-11-10 22:24:39 UTC

AWS support says we should implement an exponential backoff if we don't have one already.

http://docs.aws.amazon.com/general/latest/gr/api-retries.html

We should look at the controller logic and see if something like this can be implemented with regards to our PV attaches.

Comment 8 Abhishek Gupta 2016-11-12 00:36:24 UTC

*** Bug 1394428 has been marked as a duplicate of this bug. ***

Comment 9 Jan Safranek 2016-11-14 14:12:03 UTC

There are couple of things here.

1) OpenShift Online uses its own provisioner and not the OpenShift/Kubernetes one. Mark Turansky should know more if there is exponential backoff or not.

2) Looking at the API call numbers, DescribeVolumes and AttachVolume is called a lot. It's used to attach a volume to a node and to check if the volume is attached there (DescribeVolumes). Maybe we could tune the timeouts here to ask less, now we ask AWS every 10 seconds. Os is here an event that would tell Kubernetes that a volume got attached?

Comment 12 Stefanie Forrester 2016-11-14 19:04:44 UTC

Today I did some more testing and showed some new apps to Mark who determined that we have a race condition between the storage controller and the PV provisioning pod. My understanding is that the controller moves the PVC into Bound state before it's ready, which updates the object's resourceVersion, which prevents the provisioning pod from provisioning the volume because it's always 1 resourceVersion behind. The controller is also trying to attach the PV before it has a valid AWS VolumeID, so it's trying to mount a volume that doesn't exist yet, which also updates the resourceVersion and keeps the race going.

Eventually the provisioning pod is able to provision the volume, but it can take 10-20+ minutes before it finally wins the race between itself and the controller. The race condition is causing extra AWS API calls and contributing to our AWS API throttling, which contributes to the instability of the controller. (We're still seeing frequent controller restarts).

So next we either need to come up with a potential bug fix for the race condition, or get 3.4 out to prod so we can use the built-in provisioning instead.

Comment 13 Abhishek Gupta 2016-11-15 19:33:52 UTC

*** Bug 1392189 has been marked as a duplicate of this bug. ***

Comment 15 yasun 2016-11-17 07:00:03 UTC

Testing on dev-preview-stg, pvc can be bounded to pv quickly.

Version-Release number for dev-preview-stg:
openshift v3.3.1.4
kubernetes v1.3.0+52492b4

[root@ymsun default-stg]# oc get pvc
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-aws-ry30z   1Gi        RWO           8s

Comment 17 yasun 2016-11-22 02:04:01 UTC

Testing on Dev_preview_Prod, pvc can be bounded to pv quickly. 

Version-Release number for Dev_preview_Prod:
openshift v3.3.1.3
kubernetes v1.3.0+52492b4

[root@ymsun online-test]# oc get pvc
NAME      STATUS    VOLUME                                            CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-1-preview-master-afbb8-vol-04750624d98803d7c   1Gi        RWO           9s