Bug 1405214
| Summary: | [paid][online-int][free][prod]Pods fail to mount a PV when a detach of same PV is followed by an attach. | ||
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | bernard |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yasun |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.x | CC: | abhgupta, aos-bugs, bernard, bingli, dakini, hchen, jgoulding, lxia, sampah_budi, xtian, yasun, yufchang, zhezli |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | --- | Flags: | yasun: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-11-09 18:54:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description (bernard, 2016-12-15 21:26:01 UTC)
ps. I first posted on the Google group and received the advice there to create a bug report, as this apparently is a known issue that requires the operations team to intervene. The advice I received was:

> This is most likely due to the fact that you have "Rolling" as your deployment strategy and your pods require a volume. In Developer Preview, the PVCs are backed by EBS volumes, and these cannot be mounted on two different nodes. In a rolling deployment, the new pod comes up first as a canary, and only when that succeeds does the deployment proceed and the old pod get taken down. The new pod (from the new deployment) will try to mount the volume and fail, since the old pod (from the current/existing deployment) still has the volume mounted. These pods can (and most likely will) be scheduled on different nodes, and hence this is not going to work. The solution is to use "Recreate" as the deployment strategy when the pods rely on PVCs. Can you confirm this to be the issue and that the suggested change resolves this?

I have a similar issue to the one the original poster experienced. I changed the YAML of the deployment (Rolling to Recreate) and recreated the PVC, and it worked 2-3 times. Now the PVC can't even be used anymore. Here is the log:

```
--> Scaling beta-23 down to zero
--> Scaling beta-24 to 1 before performing acceptance check
--> Waiting up to 10m0s for pods in deployment beta-24 to become ready
error: update acceptor rejected beta-24: pods for deployment "beta-24" took longer than 600 seconds to become ready
```

Let me know if you need anything else, thanks. My OpenShift email account is the same as my Bugzilla email address, and I haven't deleted my pod, so you can look at it. Thanks.

Yeah, this seems similar to the problem in https://bugzilla.redhat.com/show_bug.cgi?id=1404811, but the storage team has made a lot of improvements in the attach/detach code path for AWS. If you can, upgrade to 3.4 and try again; if it doesn't work, let us know.

The Online environment has recently been upgraded.
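The suggested change from "Rolling" to "Recreate" ensures the old pod is stopped (and its EBS volume detached) before the new pod starts. A minimal sketch of that strategy stanza in a DeploymentConfig follows; the name and timeout here are illustrative assumptions, not taken from the reporter's actual config:

```yaml
# Illustrative DeploymentConfig fragment: Recreate stops the old pod first,
# so the RWO/EBS-backed volume is free before the new pod tries to mount it.
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: myapp            # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate       # instead of Rolling
    recreateParams:
      timeoutSeconds: 600
```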
Can you please confirm whether this is still an issue?

It worked better (perhaps even flawlessly) until a few days ago; now I can't even mount a volume to a pod. This is the error that I got:

```
W0210 03:01:37.993356 1 reflector.go:330] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:468: watch of *api.Pod ended with: too old resource version: 850065961 (850090417)
```

Please look into this, thanks.

Can you post more details:
1. What do your pod and PV YAMLs look like? If you can't post them publicly, you can email them to me.
2. What kind of volume type were you using?
3. How did you deploy OpenShift?
4. The exact version of OpenShift you were using.
5. Also, more logs around that error would be helpful.

1. Here is my pod YAML:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: beta-2-deploy
  namespace: divvy
  selfLink: /api/v1/namespaces/divvy/pods/beta-2-deploy
  uid: 0343a47a-ef46-11e6-b599-0e63b9c1c48f
  resourceVersion: '850334773'
  creationTimestamp: '2017-02-10T04:04:26Z'
  labels:
    openshift.io/deployer-pod-for.name: beta-2
  annotations:
    kubernetes.io/limit-ranger: >-
      LimitRanger plugin set: cpu, memory request for container deployment; cpu,
      memory limit for container deployment
    openshift.io/deployment.name: beta-2
    openshift.io/scc: restricted
spec:
  volumes:
    - name: deployer-token-o8eyc
      secret:
        secretName: deployer-token-o8eyc
        defaultMode: 420
  containers:
    - name: deployment
      image: 'registry.ops.openshift.com/openshift3/ose-deployer:v3.4.1.2'
      env:
        - name: KUBERNETES_MASTER
          value: 'https://ip-172-31-10-24.ec2.internal'
        - name: OPENSHIFT_MASTER
          value: 'https://ip-172-31-10-24.ec2.internal'
        - name: BEARER_TOKEN_FILE
          value: /var/run/secrets/kubernetes.io/serviceaccount/token
        - name: OPENSHIFT_CA_DATA
          value: |
            -----BEGIN CERTIFICATE-----
            MIIC5jCCAdCgAwIBAgIBATALBgkqhkiG9w0BAQswJjEkMCIGA1UEAwwbb3BlbnNo
            aWZ0LXNpZ25lckAxNDYzMTU2NTg2MB4XDTE2MDUxMzE2MjMwNloXDTIxMDUxMjE2
            MjMwN1owJjEkMCIGA1UEAwwbb3BlbnNoaWZ0LXNpZ25lckAxNDYzMTU2NTg2MIIB
            IjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEArp4BlumhbaZiJxnPJPd78jqp
            scHOa71PnC8Pd/Uzg/cr6kCz8cqFadVpHyAYxR2MVPzwGEjJ2ScP2f5iVby8w10n
            408WfAv3HelPCcw5z1yp4pb2WnFNy1eglGl2fQp7Z/Od8TgO2OOpeVvLfxSL/K9V
            OXYmt9HFnfhO/0c5Cv5T7OJc997h3++006yi/qt0lGTHgeF/eUCmnZ0tosjCRhAS
            7AJrYAXN8ERI3s91mrzDMC4q3FjOLlWVa9ZrXeUrbvJYCYgbdtgG2wup2ETy2nFJ
            6meeYRYF/7JaVXsOZWkJYfH2K6Lg1wGjFyOXNZkA2jLqOlRMUZWHNnA/DTpL3wID
            AQABoyMwITAOBgNVHQ8BAf8EBAMCAKQwDwYDVR0TAQH/BAUwAwEB/zALBgkqhkiG
            9w0BAQsDggEBADQPZ3eyz2OtWdsxzG//lq1DXguV7T5KUfgp76mkZuDjp5ermC42
            m1DjFtEP8HvFTZgz+LYsAIhv7MShe/bZOieHnz4A/vc3oFi6uVrcLffR+CVjdlSP
            UDKZzOkf7/jTxOzSQImNk3AQAuIeVCcMXF4v4zVRlyMaWcTtOuNGWdEmLZUhUrjT
            E5Gh+KQOW1jFDYKeZ1RGkAMCL8aD6p7jNvmxVGzQasIleKylDteGblcEdn8M3Xjp
            hHUVIWnru5CBTwCxCqSXkxMFUsZqSIy+hiMeJPFmkDIdSBb7n2BwgcG0cXu/Zuju
            2PKZGzVqvgHhcIlwFZ2g9g1S/SwlVEGUvZs=
            -----END CERTIFICATE-----
        - name: OPENSHIFT_DEPLOYMENT_NAME
          value: beta-2
        - name: OPENSHIFT_DEPLOYMENT_NAMESPACE
          value: divvy
      resources:
        limits:
          cpu: '1'
          memory: 512Mi
        requests:
          cpu: 60m
          memory: 307Mi
      volumeMounts:
        - name: deployer-token-o8eyc
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      imagePullPolicy: Always
      securityContext:
        capabilities:
          drop:
            - KILL
            - MKNOD
            - NET_RAW
            - SETGID
            - SETUID
            - SYS_CHROOT
        privileged: false
        seLinuxOptions:
          level: 's0:c227,c194'
        runAsUser: 1051690000
  restartPolicy: Never
  terminationGracePeriodSeconds: 10
  activeDeadlineSeconds: 3600
  dnsPolicy: ClusterFirst
  nodeSelector:
    type: compute
  serviceAccountName: deployer
  serviceAccount: deployer
  nodeName: ip-172-31-10-175.ec2.internal
  securityContext:
    seLinuxOptions:
      level: 's0:c227,c194'
    fsGroup: 1051690000
  imagePullSecrets:
    - name: deployer-dockercfg-vk2yt
status:
  phase: Failed
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:04:26Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:14:56Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [deployment]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:04:26Z'
  hostIP: 172.31.10.175
  podIP: 10.1.102.15
  startTime: '2017-02-10T04:04:26Z'
  containerStatuses:
    - name: deployment
      state:
        terminated:
          exitCode: 1
          reason: Error
          startedAt: '2017-02-10T04:04:54Z'
          finishedAt: '2017-02-10T04:14:55Z'
          containerID: 'docker://2875722a329fe71b7b2eefe416395e0274f8e7aa623d2ec5a17995bf4dc65c9a'
      lastState: {}
      ready: false
      restartCount: 0
      image: 'registry.ops.openshift.com/openshift3/ose-deployer:v3.4.1.2'
      imageID: 'docker-pullable://registry.ops.openshift.com/openshift3/ose-deployer@sha256:37adf782e29f09c815ae0bd91299e99ae84e2849b25de100c6581df36c6a7920'
      containerID: 'docker://2875722a329fe71b7b2eefe416395e0274f8e7aa623d2ec5a17995bf4dc65c9a'
```
I don't know how to get the PV YAML; is it the same as the deployment YAML?
2. I've tried both RWO (ReadWriteOnce) & RWX (ReadWriteMany)
3. I am using the web interface (as opposed to the oc command line)
4. OpenShift 3
5. Here is the log from the failed pod:

```
--> Scaling beta-1 down to zero
--> Scaling beta-2 to 1 before performing acceptance check
--> Waiting up to 10m0s for pods in deployment beta-2 to become ready
W0210 04:14:06.692864 1 reflector.go:330] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:468: watch of *api.Pod ended with: too old resource version: 850302230 (850328358)
error: update acceptor rejected beta-2: pods for deployment "beta-2" took longer than 600 seconds to become ready
```
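The RWO/RWX distinction mentioned above matters here: AWS EBS volumes support only the ReadWriteOnce access mode, so an EBS-backed claim can never be mounted on two nodes at once, regardless of what the PVC requests. For reference, a minimal PVC sketch follows; the size is an illustrative assumption (only the claim name "data" and the namespace appear elsewhere in this report):

```yaml
# Hypothetical PVC sketch. EBS-backed storage honours only ReadWriteOnce,
# so requesting RWX does not make the volume mountable on multiple nodes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: divvy
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi       # illustrative size
```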
I don't see any persistent volumes mounted in your pod. What you are seeing is probably a different bug, unrelated to this one, which was specifically opened for persistent volumes.

I'm including the YAML from the deployment, which shows the persistent volume; hope this helps:
```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: beta
  namespace: divvy
  selfLink: /oapi/v1/namespaces/divvy/deploymentconfigs/beta
  uid: 7d91daff-ef45-11e6-b125-0e3d364e19a5
  resourceVersion: '850335639'
  generation: 5
  creationTimestamp: '2017-02-10T04:00:41Z'
  labels:
    app: beta
  annotations:
    openshift.io/generated-by: OpenShiftWebConsole
spec:
  strategy:
    type: Recreate
    recreateParams:
      timeoutSeconds: 600
    rollingParams:
      updatePeriodSeconds: 1
      intervalSeconds: 1
      timeoutSeconds: 600
      maxUnavailable: 25%
      maxSurge: 25%
    resources: {}
  triggers:
    - type: ImageChange
      imageChangeParams:
        automatic: true
        containerNames:
          - beta
        from:
          kind: ImageStreamTag
          namespace: divvy
          name: 'beta:latest'
        lastTriggeredImage: >-
          172.30.47.227:5000/divvy/beta@sha256:91ed279cee18e4f1ce31ae00a46d49192f7270c2c6253cf129cdfc7f56323e3e
    - type: ConfigChange
  replicas: 1
  test: false
  selector:
    deploymentconfig: beta
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: beta
        deploymentconfig: beta
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data
      containers:
        - name: beta
          image: >-
            172.30.47.227:5000/divvy/beta@sha256:91ed279cee18e4f1ce31ae00a46d49192f7270c2c6253cf129cdfc7f56323e3e
          ports:
            - containerPort: 8080
              protocol: TCP
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /data
          terminationMessagePath: /dev/termination-log
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
status:
  latestVersion: 2
  observedGeneration: 5
  replicas: 1
  availableReplicas: 1
  details:
    message: config change
    causes:
      - type: ConfigChange
  conditions:
    - type: Progressing
      status: 'False'
      lastTransitionTime: '2017-02-10T04:14:57Z'
      reason: ProgressDeadlineExceeded
      message: Replication controller "beta-2" has failed progressing
    - type: Available
      status: 'True'
      lastTransitionTime: '2017-02-10T04:15:12Z'
      message: Deployment config has minimum availability.
```
Okay, thank you for the deployment YAML. But I don't see any errors in the Deployment object you posted. It shows "availableReplicas: 1" and "replicas: 1", which means whatever pods were required for running the deployment are running correctly. Are you certain that the Deployment you posted above is not running properly and is stuck because it is unable to mount volumes? If that is indeed the case, can you also find the pods that were created for that deployment (you can do `oc get pods`) and then post the output of the following commands:

~> `oc describe pod <pod_name_from_above>`
~> `oc logs <pod_name_from_above>`

@budi - If you are still affected by this bug, can you open a new bug with the following items:
1. Steps to reproduce
2. Whatever logs you have (such as oc logs)
3. Output of `oc describe pod` for the pod that is stuck

Also, you give the OpenShift version as "3". Can you be more specific? There have been a lot of fixes between 3.3 and 3.4.

I was able to reproduce the bug consistently before (not sure which version of OpenShift), but I wasn't able to reproduce the error today (using OpenShift 3.4.0.13).

*** Bug 1441602 has been marked as a duplicate of this bug. ***

Testing on free-int:
1. Create a persistent application with the template mysql-persistent, and create some data on the persistent volume
2. After the pod is ready, scale down the pod to 0
3. Wait for a while, then scale up the pod to 1
4. Repeat steps 2-3 100 times

The PV can be attached successfully. Waiting for the online-int env to be ready for testing.