Description of problem:
A pod that uses a dynamically provisioned AWS EBS volume occasionally gets stuck in "ContainerCreating".

Version-Release number of selected component (if applicable):
[root@ip-172-18-6-39 configs]# oc version
oc v3.3.1.5
kubernetes v1.3.0+52492b4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-6-39.ec2.internal:8443
openshift v3.3.1.5
kubernetes v1.3.0+52492b4

How reproducible:
The following steps are not a reliable way to reproduce this problem. It just happens sometimes and sometimes it doesn't.

Steps to Reproduce:
1. Use the alpha storage-class annotation to create a dynamic PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dyn-claim
  annotations:
    volume.alpha.kubernetes.io/storage-class: "bar"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

2. Create a pod that uses this volume:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
  labels:
    name: test
spec:
  restartPolicy: Never
  containers:
    - resources:
        limits:
          cpu: 0.5
      image: gcr.io/google_containers/busybox
      command:
        - "/bin/sh"
        - "-c"
        - "while true; do date; date >>/mnt/test/date; sleep 1; done"
      name: busybox
      volumeMounts:
        - name: vol
          mountPath: /mnt/test
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: dyn-claim

3. The pod is stuck in "ContainerCreating".

Actual results:
The pod doesn't get created and is stuck. The error in the node logs looks like:

Nov 29 15:53:18 ip-212.23.323.123.ec2.internal atomic-openshift-node[33783]: E1129 15:53:18.643084 33783 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-21b50db0\"" failed. No retries permitted until 2016-11-29 15:55:18.643063014 -0500 EST (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-21b50db0" (spec.Name: "pvc-743d2611-b675-11e6-a61c-0e852813636a") pod "8ef25258-b675-11e6-a61c-0e852813636a" (UID: "8ef25258-b675-11e6-a61c-0e852813636a") is not yet attached according to node status.

Expected results:
The pod should have been created.

Additional info:
Pod scheduling on AWS seems to be a bit fragile in openshift-3.3.
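While the pod is stuck, state can be gathered with standard oc commands (the output was not captured for this report):

# oc describe pod testpod
# oc describe pvc dyn-claim
# oc get events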
You are running out of AWS API quota. This is 'not a bug', as the resolution is to increase the API quota for the account. Is it a shared account? Maybe another user is consuming all of the quota. If it's not shared, then some other component may be hammering the system and using up quota; an invalid password, key, etc. can cause that. Once the API quota is used up, storage attach/detach will fail, and this is expected. After a set amount of time, API calls can be made again (the timestamp in the description).
I'm going to reopen to make sure we look in this area, possibly during 3.5. https://trello.com/c/JHgdxVw0/356-rfe-better-handle-of-aws-api-quota
Hmm, now that I understand this a bit more, the error above is not really an API quota problem either. It is just an error that gets printed when an operation is retried too quickly and the exponential backoff logic determines that the operation shouldn't fire yet. So this is just the exponential backoff mechanism working as intended, not a bug.
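The mechanism can be illustrated with a small standalone Go sketch. This is not the actual nestedpendingoperations.go code; the type and function names and the 500ms initial delay are invented for illustration, and only the 2-minute cap matches the durationBeforeRetry seen in the log:

// Simplified illustration of per-operation exponential backoff: after each
// failure the wait before the next attempt doubles (up to a cap), and any
// attempt made before the deadline is rejected with a "no retries permitted
// until ..." style error like the one in the node log.
package main

import (
	"fmt"
	"time"
)

// backoff tracks when an operation may next be attempted.
type backoff struct {
	delay    time.Duration // current wait between attempts
	maxDelay time.Duration // cap on the wait
	nextTry  time.Time     // earliest time the next attempt is allowed
}

// allowed reports whether the operation may run now.
func (b *backoff) allowed(now time.Time) error {
	if now.Before(b.nextTry) {
		return fmt.Errorf("no retries permitted until %v (durationBeforeRetry %v)",
			b.nextTry, b.delay)
	}
	return nil
}

// failed records a failure and doubles the wait, up to maxDelay.
func (b *backoff) failed(now time.Time) {
	if b.delay == 0 {
		b.delay = 500 * time.Millisecond
	} else {
		b.delay *= 2
		if b.delay > b.maxDelay {
			b.delay = b.maxDelay
		}
	}
	b.nextTry = now.Add(b.delay)
}

func main() {
	b := &backoff{maxDelay: 2 * time.Minute}
	now := time.Now()
	for i := 0; i < 10; i++ {
		b.failed(now)               // pretend the attach attempt failed
		fmt.Println(b.allowed(now)) // an immediate retry is refused
		now = b.nextTry             // wait out the backoff, then try again
	}
}

Each failed attach doubles the wait, so after a few failures the retry is held off for up to two minutes, which is exactly the "No retries permitted until ..." window in the log above.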
I am going to close this bug. The messages we see in the logs are not real errors; they are printed because exponential backoff is preventing an operation from running again too quickly.