Bug 1392650
Summary: | [Preview] Slow provisioning of PVs | |
---|---|---|---
Product: | OpenShift Online | Reporter: | Stefanie Forrester <dakini>
Component: | Storage | Assignee: | Abhishek Gupta <abhgupta>
Status: | CLOSED WORKSFORME | QA Contact: | zhaliu
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.x | CC: | agoldste, aos-bugs, bchilds, bingli, bparees, dev, erich, jordiszrz, jsafrane, kwoodson, mturansk, pep, yasun
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-12-20 10:52:02 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1303130 | |
Description
Stefanie Forrester
2016-11-08 00:51:50 UTC
This is the raw data for the API DoS issue in Preview. It appears to be caused mostly by DescribeInstances calls; DescribeVolumes is also happening often enough to cause issues.

Error counts from the CloudTrail log:

```
[dakini@samadhi ~]$ for error in $(grep errorCode ~/preview_cloudtrail_20161108.txt|sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $error ---------" ; grep errorCode ~/preview_cloudtrail_20161108.txt |grep -c $error ; done
AccessPointNotFoundException           1625
Client.InvalidParameterValue             21
Client.InvalidVolume.NotFound             1
Client.RequestLimitExceeded          269900
Client.UnauthorizedOperation             20
Client.VolumeInUse                   102747
Server.InternalError                    116
Server.Unavailable                        9
```

Event counts:

```
[dakini@samadhi ~]$ for event in $(grep eventName ~/preview_cloudtrail_20161108.txt|sort -u|awk '{print $2}'|tr -d \":,) ; do echo "-------- $event ---------" ; grep eventName ~/preview_cloudtrail_20161108.txt |grep -c $event ; done
AssumeRoleWithSAML                        1
AttachVolume                          59691
CheckMfa                                  1
ConsoleLogin                              1
CreateSnapshot                         1374
CreateTags                             2922
CreateVolume                            133
DeleteAccessKey                           1
DeleteLoginProfile                        1
DeleteSnapshot                           66
DeleteUser                                1
DeleteVolume                          46528
DeregisterInstancesFromLoadBalancer       5
DescribeConfigurationRecorders            3
DescribeConfigurationRecorderStatus       3
DescribeInstanceHealth                   15
DescribeInstances                    622988
DescribeLoadBalancers                  1643
DescribeReservedInstances                 1
DescribeSnapshots                        23
DescribeSubnets                          20
DescribeTrails                            8
DescribeVolumes                      211991
DetachVolume                            207
GetBucketLocation                         2
GetBucketVersioning                       1
GetUser                                   1
ListAccessKeys                            2
ListBuckets                               4
LookupEvents                              3
RegisterInstancesWithLoadBalancer        10
RemoveUserFromGroup                       1
```

If we can find a way to reduce either of those calls, Preview should stabilize, but I'm not sure how to do that offhand. I'll have to look through more logs to see at which points OpenShift is calling DescribeInstances. I might be able to reduce the DescribeVolumes calls by finding and deleting any volumes that haven't been properly cleaned up by the cluster.

---

I wasn't able to work around the issue by detaching volumes. It turns out the affected volumes are attaching repeatedly and sometimes succeeding, so we need to figure out why they attempt to attach so often even when their previous attach attempts were successful. This shows one volume where the attach succeeded 73 times today and failed only 11 times. Why does it keep trying to attach? Maybe due to node evacuations?

```
[root@preview-master-e69da ~]# journalctl -lu atomic-openshift-master-controllers --since=today |grep vol-08030aec4f6f0e0e3 |grep 'Attach succeeded' -c
73
[root@preview-master-e69da ~]# journalctl -lu atomic-openshift-master-controllers --since=today |grep vol-08030aec4f6f0e0e3 |grep 'Attach failed' -c
11
```

The number of attach attempts seems to line up with the events listed in the project:

```
[root@preview-master-e69da ~]# oc get events -w -n swd-test-project
LASTSEEN               FIRSTSEEN              COUNT   NAME              KIND   SUBOBJECT   TYPE      REASON        SOURCE                                   MESSAGE
2016-11-10T00:22:01Z   2016-11-09T21:35:42Z   76      jenkins-1-pftp6   Pod                Warning   FailedMount   {kubelet ip-172-31-2-89.ec2.internal}    Unable to mount volumes for pod "jenkins-1-pftp6_swd-test-project(2fa9aed3-a6c4-11e6-8286-0ebeb1070c7f)": timeout expired waiting for volumes to attach/mount for pod "jenkins-1-pftp6"/"swd-test-project". list of unattached/unmounted volumes=[jenkins-data]
2016-11-10T00:22:01Z   2016-11-09T21:35:42Z   76      jenkins-1-pftp6   Pod                Warning   FailedSync    {kubelet ip-172-31-2-89.ec2.internal}    Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "jenkins-1-pftp6"/"swd-test-project". list of unattached/unmounted volumes=[jenkins-data]
```

In the default project, I see a lot of node NotReady events and tons of RegisteredNode events, but no evacuations yet.

---

In general, etcd is struggling to keep up on this cluster. I noticed this around the same time we started having trouble with the PVs, though I'm not sure whether it's related. We could try adding a couple more etcd nodes (or replacing the storage on our current nodes) and see if it does a better job keeping up with all the events on the cluster. Here are some etcd errors I'm seeing often:

```
etcd[1675]: failed to send out heartbeat on time (deadline exceeded for 44.589112ms)
etcd[1675]: server is likely overloaded
etcd[1675]: etcdserver: request timed out, possibly due to previous leader failure [merged 17 repeated lines in 1s]
```

---

I have run this on dev-preview-int; the PVC is created and bound to a PV quickly (42 seconds).

Version-Release number for dev-preview-int:
openshift v3.4.0.21+ca4702d
kubernetes v1.4.0+776c994

```
1. [root@ymsun origin]# oc new-project dakinitestpvs
2. [root@ymsun origin]# oc new-app cakephp-mysql-example
--> Deploying template "openshift/cakephp-mysql-example" to project dakinitestpvs
.... <skips>
3. [root@ymsun origin]# oc get pvc -n dakinitestpvs -w
NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
mysql     Bound     pvc-e4bcaf0f-a733-11e6-9c35-0a30d6fc46dc   1Gi        RWO           42s
```

---

Just adding more numbers.
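As an aside, the per-key tallies in these comments re-scan the log once per key with `grep -c`; the same counts can be produced in a single pass. A sketch in Python, assuming the CloudTrail export has one JSON record per line with an `eventName` field (the file layout is an assumption, not verified against the actual export):

```python
# One-pass tally of CloudTrail event names.
# Assumes one JSON object per line, each with an "eventName" field.
import json
from collections import Counter


def count_events(lines):
    """Return a Counter mapping eventName -> number of occurrences."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        counts[record.get("eventName", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Replace with: open("preview_cloudtrail_20161108.txt") in real use.
    sample = [
        '{"eventName": "DescribeInstances"}',
        '{"eventName": "DescribeVolumes"}',
        '{"eventName": "DescribeInstances"}',
    ]
    for name, n in count_events(sample).most_common():
        print(name, n)
```

This avoids the O(keys × file size) cost of the shell loop, which matters when the export contains over a million records.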
Statistics for the entire day look like this (sorted by count):

```
Key                                    Count
DescribeInstances                    1019850
DescribeVolumes                       246051
AttachVolume                          211972
DeleteVolume                           10703
DescribeLoadBalancers                   2401
CreateTags                              2101
CreateSnapshot                           997
DeleteSnapshot                           445
CreateVolume                              86
DetachVolume                              41
DescribeSnapshots                         25
DescribeTrails                             9
DescribeServices                           6
DescribeCases                              6
DescribeAlarms                             6
DescribeConfigurationRecorders             4
DescribeConfigurationRecorderStatus        4
LookupEvents                               4
DescribeSeverityLevels                     4
DeleteNetworkInterface                     3
ListAliases                                2
ListBuckets                                2
DescribeReservedInstances                  1
CreateCase                                 1
CreateNetworkInterface                     1
ConsoleLogin                               1
CheckMfa                                   1
DescribeAccountLimits                      1
```

That is for around 20 hours of logging in CloudTrail, which comes out to 20.7601 API calls per second. This does not include the API calls that were disallowed due to the rate limiting.

AWS support says we should implement exponential backoff if we don't have it already: http://docs.aws.amazon.com/general/latest/gr/api-retries.html. We should look at the controller logic and see whether something like this can be implemented for our PV attaches.

*** Bug 1394428 has been marked as a duplicate of this bug. ***

---

There are a couple of things here:

1) OpenShift Online uses its own provisioner, not the OpenShift/Kubernetes one. Mark Turansky should know whether it has exponential backoff or not.

2) Looking at the API call numbers, DescribeVolumes and AttachVolume are called a lot. AttachVolume is used to attach a volume to a node, and DescribeVolumes to check whether the volume is attached there. Maybe we could tune the timeouts here to ask less often; right now we ask AWS every 10 seconds.
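The AWS guidance linked above recommends retrying throttled calls with exponentially growing, randomized delays ("full jitter"). A minimal sketch of that pattern, not the controller's actual code (names and parameters are illustrative; `sleep` is injectable so the delays can be stubbed out in tests):

```python
# Illustrative exponential backoff with full jitter, per the AWS
# api-retries guidance. Not the actual OpenShift controller logic.
import random
import time


def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(); on failure, retry with a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)]. Re-raise after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Real code would catch only throttling errors
            # (e.g. Client.RequestLimitExceeded), not every exception.
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0.0, min(cap, base * (2 ** attempt)))
            sleep(delay)
```

With parameters like these, five attempts spread over roughly a minute in the worst case, instead of hammering the API at a fixed 10-second cadence.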
Or is there an event that would tell Kubernetes that a volume got attached?

---

Today I did some more testing and showed some new apps to Mark, who determined that we have a race condition between the storage controller and the PV provisioning pod. My understanding is that the controller moves the PVC into the Bound state before it's ready, which updates the object's resourceVersion, which prevents the provisioning pod from provisioning the volume because it's always one resourceVersion behind. The controller is also trying to attach the PV before it has a valid AWS VolumeID, so it's trying to mount a volume that doesn't exist yet, which also updates the resourceVersion and keeps the race going. Eventually the provisioning pod is able to provision the volume, but it can take 10-20+ minutes before it finally wins the race against the controller.

The race condition is causing extra AWS API calls and contributing to our AWS API throttling, which in turn contributes to the instability of the controller (we're still seeing frequent controller restarts). So next we either need to come up with a potential bug fix for the race condition, or get 3.4 out to prod so we can use the built-in provisioning instead.

*** Bug 1392189 has been marked as a duplicate of this bug. ***

---

Testing on dev-preview-stg: the PVC is bound to a PV quickly.

Version-Release number for dev-preview-stg:
openshift v3.3.1.4
kubernetes v1.3.0+52492b4

```
[root@ymsun default-stg]# oc get pvc
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-aws-ry30z   1Gi        RWO           8s
```

Testing on Dev_preview_Prod: the PVC is bound to a PV quickly.

Version-Release number for Dev_preview_Prod:
openshift v3.3.1.3
kubernetes v1.3.0+52492b4

```
[root@ymsun online-test]# oc get pvc
NAME      STATUS    VOLUME                                            CAPACITY   ACCESSMODES   AGE
mysql     Bound     pv-1-preview-master-afbb8-vol-04750624d98803d7c   1Gi        RWO           9s
```
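The race described in the comments above hinges on Kubernetes-style optimistic concurrency: every write must name the resourceVersion it read, and a writer whose cached copy is perpetually one version behind keeps getting rejected until it re-reads. A toy in-memory sketch of that conflict-and-retry mechanic (this is invented illustration code, not Kubernetes or the provisioner):

```python
# Toy model of optimistic concurrency with a resourceVersion-style counter.
# Hypothetical classes/names; illustrates the conflict, not real K8s code.
class Store:
    def __init__(self):
        self.version = 1
        self.data = {}

    def get(self):
        """Return (current version, a copy of the object)."""
        return self.version, dict(self.data)

    def update(self, expected_version, data):
        """Accept the write only if the caller saw the latest version."""
        if expected_version != self.version:
            raise ValueError("conflict: object was modified concurrently")
        self.version += 1
        self.data = data
        return self.version


def update_with_retry(store, mutate, max_attempts=10):
    """Read-modify-write loop: on conflict, re-read and try again,
    instead of blindly resubmitting the stale version."""
    for _ in range(max_attempts):
        version, data = store.get()
        try:
            return store.update(version, mutate(data))
        except ValueError:
            continue  # someone else won the race; refresh and retry
    raise RuntimeError("gave up after repeated conflicts")
```

In the bug's scenario the controller plays the part of the "someone else" that keeps bumping the version, so the provisioning pod's writes conflict repeatedly; re-reading before each write (as in `update_with_retry`) is what eventually lets it win.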