Bug 1335293 - EBS volumes remain attached to the wrong instance
Summary: EBS volumes remain attached to the wrong instance
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: hchen
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-05-11 19:32 UTC by Stefanie Forrester
Modified: 2016-08-17 13:20 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-17 13:20:45 UTC
Target Upstream Version:
Embargoed:


Attachments
event log showing error (6.40 KB, text/plain)
2016-05-11 19:32 UTC, Stefanie Forrester

Description Stefanie Forrester 2016-05-11 19:32:42 UTC
Created attachment 1156272 [details]
event log showing error

Description of problem:

Sometimes a pod is unable to start because its PV is already attached to another instance. I see several occurrences of this in dev-preview-int; one of them has been in this state for 2 days so far.
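
As a workaround until the attachment is released, the volume can be detached manually. A sketch, assuming AWS CLI access and that nothing on the old node still has the volume mounted; vol-xxxxxxxx is a placeholder for the stuck volume's ID:

$ # Confirm where the volume is currently attached
$ aws ec2 describe-volumes --volume-ids vol-xxxxxxxx \
    --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State}'
$ # Detach it from the old instance; add --force only as a last resort,
$ # since forcing a detach while the filesystem is mounted risks corruption
$ aws ec2 detach-volume --volume-id vol-xxxxxxxx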

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Xingxing Xia 2016-05-18 03:25:48 UTC
Encountered this error too, in dev-preview-stg.
Steps to reproduce:
1. oc login and create project (xxia-proj)

2. Create a dc using https://raw.githubusercontent.com/openshift/origin/master/examples/gitserver/gitserver.yaml
Due to https://bugzilla.redhat.com/show_bug.cgi?id=1336318#c1, prepare the PVCs first:

$ cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: null
  name: mypvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
status: {}

$ cat pvc2.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: null
  name: mypvc2
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
status: {}

$ oc create -f pvc.yaml
$ oc create -f pvc2.yaml
$ oc get pvc
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mypvc     Bound     pv-aws-vq862   1Gi        RWO           20h
mypvc2    Bound     pv-aws-3y95l   1Gi        RWO           20h

$ wget https://raw.githubusercontent.com/openshift/origin/master/examples/gitserver/gitserver.yaml

Change gitserver.yaml "volumeMounts" and "volumes" as follows:
        volumeMounts:
        - mountPath: /var/lib/git
          name: git
        - mountPath: /var/lib/origin
          name: origin
      ......
      volumes:
      - name: git
        persistentVolumeClaim:
          claimName: mypvc
      - name: origin
        persistentVolumeClaim:
          claimName: mypvc2

Then create it:
$ oc create -f gitserver.yaml
deploymentconfig "git" created

$ oc get pod -l deploymentconfig=git -o wide
NAME          READY     STATUS              RESTARTS   AGE       NODE
git-1-6rs6o   1/1       Running             1          19h       ip-172-31-9-165.ec2.internal

$ oc edit dc git # Trigger re-deployment
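
The same re-deployment can be triggered non-interactively (a sketch, assuming the oc deploy subcommand of this release):
$ oc deploy git --latest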

Check which nodes the pods are scheduled to:
$ oc get pod -l deploymentconfig=git -o wide
NAME          READY     STATUS              RESTARTS   AGE       NODE
git-1-6rs6o   1/1       Running             1          19h       ip-172-31-9-165.ec2.internal
git-2-xn715   0/1       ContainerCreating   0          7m        ip-172-31-9-167.ec2.internal

Check the pod events:
$ oc describe pod/git-2-xn715
Name:		git-2-xn715
Namespace:	xxia-proj
Node:		ip-172-31-9-167.ec2.internal/172.31.9.167
Start Time:	Wed, 18 May 2016 10:26:10 +0800
Labels:		deployment=git-2,deploymentconfig=git,run-container=git
Status:		Pending
IP:		
Controllers:	ReplicationController/git-2
Containers:
  git:
    Container ID:	
    Image:		openshift/origin-gitserver:latest
    Image ID:		
    Port:		8080/TCP
    QoS Tier:
      memory:	Burstable
      cpu:	Burstable
    Limits:
      cpu:	500m
      memory:	256Mi
    Requests:
      cpu:		30m
      memory:		153Mi
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Environment Variables:
      POD_NAMESPACE:		xxia-proj (v1:metadata.namespace)
      PUBLIC_URL:		http://git.$(POD_NAMESPACE).svc.cluster.local:8080
      INTERNAL_URL:		http://git:8080
      GIT_HOME:			/var/lib/git
      HOOK_PATH:		/var/lib/git-hooks
      GENERATE_ARTIFACTS:	true
      DETECTION_SCRIPT:		
      ALLOW_GIT_PUSH:		true
      ALLOW_GIT_HOOKS:		true
      ALLOW_LAZY_CREATE:	true
      ALLOW_ANON_GIT_PULL:	true
      REQUIRE_SERVER_AUTH:	-
      AUTH_NAMESPACE:		$(POD_NAMESPACE)
      REQUIRE_GIT_AUTH:		
      AUTOLINK_KUBECONFIG:	-
      AUTOLINK_NAMESPACE:	$(POD_NAMESPACE)
      AUTOLINK_HOOK:		
Conditions:
  Type		Status
  Ready 	False 
Volumes:
  git:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	mypvc
    ReadOnly:	false
  origin:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	mypvc2
    ReadOnly:	false
  git-token-080j1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	git-token-080j1
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----					-------------	--------	------		-------
  8m		8m		1	{default-scheduler }					Normal		Scheduled	Successfully assigned git-2-xn715 to ip-172-31-9-167.ec2.internal
  7m		13s		7	{kubelet ip-172-31-9-167.ec2.internal}			Warning		FailedMount	Unable to mount volumes for pod "git-2-xn715_xxia-proj(e28c4344-1c9f-11e6-ae12-0ee251450653)": Could not attach EBS Disk "aws://us-east-1c/vol-9e9d393b": Error attaching EBS volume: VolumeInUse: vol-9e9d393b is already attached to an instance
		status code: 400, request id: 
  7m		13s	7	{kubelet ip-172-31-9-167.ec2.internal}		Warning	FailedSync	Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1c/vol-9e9d393b": Error attaching EBS volume: VolumeInUse: vol-9e9d393b is already attached to an instance
		status code: 400, request id:

Finally, the re-deployment failed:
$ oc get pod
git-1-6rs6o                 1/1       Running     1          20h
git-2-deploy                0/1       Error       0          26m

Comment 2 Xingxing Xia 2016-05-19 01:23:02 UTC
Relevant: bug 1329040

Comment 3 Jianwei Hou 2016-05-23 10:54:48 UTC
I investigated the issue with @xxia; the likely cause is:
1. `oc edit dc` triggered a re-deployment.
2. The original pod was deleted and the new pod was scheduled to another node, whereas the volume was still attached to the previous instance (see the check sketched below).
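
The diagnosis can be confirmed from the AWS side. A sketch, assuming AWS CLI access; vol-9e9d393b is the volume from the event above, and i-xxxxxxxx is a placeholder for the instance ID returned by the first command:

$ # Which instance is the EBS volume attached to?
$ aws ec2 describe-volumes --volume-ids vol-9e9d393b \
    --query 'Volumes[0].Attachments[].InstanceId'
$ # Map that instance ID back to a node name
$ aws ec2 describe-instances --instance-ids i-xxxxxxxx \
    --query 'Reservations[0].Instances[0].PrivateDnsName'

If the first command still returns the instance backing ip-172-31-9-165.ec2.internal after git-2-xn715 was scheduled to ip-172-31-9-167.ec2.internal, the volume was never detached from the old node.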

Comment 5 Troy Dawson 2016-07-22 19:56:37 UTC
This has been merged and is in OSE v3.3.0.9 or newer.

Comment 6 Chao Yang 2016-07-25 07:55:12 UTC
This still fails on
openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Steps are as below:
1. Create a PV 
oc get pv -o yaml

apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2016-07-25T05:45:25Z
    labels:
      failure-domain.beta.kubernetes.io/region: us-east-1
      failure-domain.beta.kubernetes.io/zone: us-east-1d
      type: local
    name: ebs
    resourceVersion: "3977"
    selfLink: /api/v1/persistentvolumes/ebs
    uid: fc06fb71-522a-11e6-bf9c-0ef1eb2be359
  spec:
    accessModes:
    - ReadWriteOnce
    awsElasticBlockStore:
      fsType: ext4
      volumeID: aws://us-east-1d/vol-2f40058b
    capacity:
      storage: 1Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: ebs
      namespace: chao
      resourceVersion: "3975"
      uid: b9a3d667-522b-11e6-bf9c-0ef1eb2be359
    persistentVolumeReclaimPolicy: Retain
  status:
    phase: Bound
kind: List
metadata: {}
2. Create a PVC in the namespace chao
3. wget https://raw.githubusercontent.com/openshift/origin/master/examples/deployment/recreate-example.yaml
Add the following to this file:
          volumeMounts:
          - mountPath: /var/lib/test
            name: test
        volumes:
        - name: test
          persistentVolumeClaim:
            claimName: ebs
4. oc create -f recreate-example.yaml
5. The pod is running:
NAME                       READY     STATUS    RESTARTS   AGE       NODE
recreate-example-1-xufes   1/1       Running   0          9m        ip-172-18-0-79.ec2.internal
6. oadm manage-node ip-172-18-0-79.ec2.internal --schedulable=false
7. Edit the dc from recreate-example:latest to recreate-example:v1 to trigger a re-deployment:
    from:
        kind: ImageStreamTag
        name: recreate-example:v1
[root@dhcp-128-8 ~]# oc status
In project chao on server https://ec2-52-90-208-19.compute-1.amazonaws.com:443

http://recreate-example-chao.0725-hu0.qe.rhcloud.com (svc/recreate-example)
  dc/recreate-example deploys istag/recreate-example:v1 
    deployment #2 running for 2 minutes - 1 pod
    deployment #1 deployed about an hour ago

8. Check pod status
[root@dhcp-128-8 ~]# oc describe pods recreate-example-2-a00r3
Name:		recreate-example-2-a00r3
Namespace:	chao
Node:		ip-172-18-9-202.ec2.internal/172.18.9.202
Start Time:	Mon, 25 Jul 2016 15:40:49 +0800
Labels:		deployment=recreate-example-2,deploymentconfig=recreate-example
Status:		Pending
IP:		
Controllers:	ReplicationController/recreate-example-2
Containers:
  deployment-example:
    Container ID:	
    Image:		openshift/deployment-example@sha256:c505b916f7e5143a356ff961f2c21aee40fbd2cd906c1e3feeb8d5e978da284b
    Image ID:		
    Port:		8080/TCP
    QoS Tier:
      cpu:		BestEffort
      memory:		BestEffort
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Environment Variables:
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  test:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	ebs
    ReadOnly:	false
  default-token-6p2xr:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-6p2xr
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----					-------------	--------	------		-------
  11m		11m		1	{default-scheduler }					Normal		Scheduled	Successfully assigned recreate-example-2-a00r3 to ip-172-18-9-202.ec2.internal
  9m		17s		5	{kubelet ip-172-18-9-202.ec2.internal}			Warning		FailedMount	Unable to mount volumes for pod "recreate-example-2-a00r3_chao(1b242d80-523b-11e6-bf9c-0ef1eb2be359)": timeout expired waiting for volumes to attach/mount for pod "recreate-example-2-a00r3"/"chao". list of unattached/unmounted volumes=[test]
  9m		17s		5	{kubelet ip-172-18-9-202.ec2.internal}			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "recreate-example-2-a00r3"/"chao". list of unattached/unmounted volumes=[test]

Comment 7 hchen 2016-07-25 14:30:03 UTC
"oc describe pods" output doesn't indicate the EBS volume is attached to the wrong node. 

Can you get the openshift node log or kubelet log from ip-172-18-9-202.ec2.internal?
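
For example, a sketch assuming the node runs the atomic-openshift-node systemd unit (as on OSE 3.3; the unit name differs on Origin):

$ # On ip-172-18-9-202.ec2.internal, filter the node log for the volume
$ journalctl -u atomic-openshift-node | grep -i 'vol-2f40058b'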

Comment 8 Chao Yang 2016-07-26 01:46:22 UTC
First, the EBS volume is attached to the node ip-172-18-0-79.ec2.internal; after re-deploying the dc, the EBS volume should be attached to the node ip-172-18-9-202.ec2.internal.
But the EBS volume is still attached to the node ip-172-18-0-79.ec2.internal.

Please see the detailed info at the end of https://github.com/kubernetes/kubernetes/issues/28671 .

Comment 9 hchen 2016-08-16 13:52:22 UTC
Per the latest comment [1], is the issue resolved?

1. https://github.com/kubernetes/kubernetes/issues/28671#issuecomment-240039479

Comment 10 Chao Yang 2016-08-17 08:08:29 UTC
I cannot reproduce the issue on OCP right now.

Comment 11 hchen 2016-08-17 13:20:45 UTC
Thanks. I am closing it for now. If it is still a problem, please reopen it.

