Description of problem:
A pod that uses a dynamically provisioned AWS EBS volume occasionally gets stuck in "ContainerCreating".

Version-Release number of selected component (if applicable):
[root@ip-172-18-6-39 configs]# oc version
oc v3.3.1.5
kubernetes v1.3.0+52492b4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-6-39.ec2.internal:8443
openshift v3.3.1.5
kubernetes v1.3.0+52492b4

How reproducible:
The following steps are not a reliable way to reproduce this problem. It just happens sometimes and sometimes it doesn't.

Steps to Reproduce:
1. Use the alpha storage-class annotation to create a dynamic PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dyn-claim
  annotations:
    volume.alpha.kubernetes.io/storage-class: "bar"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

2. Create a pod that uses this volume:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
  labels:
    name: test
spec:
  restartPolicy: Never
  containers:
    - resources:
        limits:
          cpu: 0.5
      image: gcr.io/google_containers/busybox
      command:
        - "/bin/sh"
        - "-c"
        - "while true; do date; date >>/mnt/test/date; sleep 1; done"
      name: busybox
      volumeMounts:
        - name: vol
          mountPath: /mnt/test
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: dyn-claim

3. The pod is stuck in "ContainerCreating".

Actual results:
The pod doesn't get created and is stuck. The error in the node logs looks like:

Nov 29 15:53:18 ip-212.23.323.123.ec2.internal atomic-openshift-node[33783]: E1129 15:53:18.643084 33783 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-21b50db0\"" failed. No retries permitted until 2016-11-29 15:55:18.643063014 -0500 EST (durationBeforeRetry 2m0s). Error: Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-21b50db0" (spec.Name: "pvc-743d2611-b675-11e6-a61c-0e852813636a") pod "8ef25258-b675-11e6-a61c-0e852813636a" (UID: "8ef25258-b675-11e6-a61c-0e852813636a") is not yet attached according to node status.

Expected results:
The pod should have been created.

Additional info:
Pod scheduling on AWS seems to be a bit fragile in openshift-3.3.
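While the pod is stuck, state can be gathered with standard oc commands (the output was not captured for this report):

# oc describe pod testpod
# oc describe pvc dyn-claim
# oc get events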
You are running out of AWS API quota. This is 'not a bug', as the resolution is to increase the API quota for the account. Is it a shared account? Maybe another user is consuming all of the quota. If it's not shared, then some other component may be hammering the system and using up quota; an invalid password, key, etc. can cause that. Once the API quota is used up, storage attach/detach will fail, and this is expected. After a set amount of time, API calls can be made again (the timestamp in the description).
I'm going to reopen to make sure we look in this area, possibly during 3.5. https://trello.com/c/JHgdxVw0/356-rfe-better-handle-of-aws-api-quota
Hmm, now that I understand this a bit more, the error above is not really an API quota problem either. It is just an error that gets printed when an operation is retried too quickly and the exponential backoff logic determines that the operation shouldn't fire yet. So this is just the exponential backoff mechanism working as intended, not a bug.
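The mechanism can be illustrated with a small standalone Go sketch. This is not the actual nestedpendingoperations.go code; the type and function names and the 500ms initial delay are invented for illustration, and only the 2-minute cap matches the durationBeforeRetry seen in the log:

// Simplified illustration of per-operation exponential backoff: after each
// failure the wait before the next attempt doubles (up to a cap), and any
// attempt made before the deadline is rejected with a "no retries permitted
// until ..." style error like the one in the node log.
package main

import (
	"fmt"
	"time"
)

// backoff tracks when an operation may next be attempted.
type backoff struct {
	delay    time.Duration // current wait between attempts
	maxDelay time.Duration // cap on the wait
	nextTry  time.Time     // earliest time the next attempt is allowed
}

// allowed reports whether the operation may run now.
func (b *backoff) allowed(now time.Time) error {
	if now.Before(b.nextTry) {
		return fmt.Errorf("no retries permitted until %v (durationBeforeRetry %v)",
			b.nextTry, b.delay)
	}
	return nil
}

// failed records a failure and doubles the wait, up to maxDelay.
func (b *backoff) failed(now time.Time) {
	if b.delay == 0 {
		b.delay = 500 * time.Millisecond
	} else {
		b.delay *= 2
		if b.delay > b.maxDelay {
			b.delay = b.maxDelay
		}
	}
	b.nextTry = now.Add(b.delay)
}

func main() {
	b := &backoff{maxDelay: 2 * time.Minute}
	now := time.Now()
	for i := 0; i < 10; i++ {
		b.failed(now)               // pretend the attach attempt failed
		fmt.Println(b.allowed(now)) // an immediate retry is refused
		now = b.nextTry             // wait out the backoff, then try again
	}
}

Each failed attach doubles the wait, so after a few failures the retry is held off for up to two minutes, which is exactly the "No retries permitted until ..." window in the log above.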
I am going to close this bug. The messages we see in the logs are not real errors; they are printed because exponential backoff is preventing an operation from running again too quickly.