Bug 1335293
| Summary: | EBS volumes remain attached to the wrong instance | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stefanie Forrester <dakini> |
| Component: | Storage | Assignee: | hchen |
| Status: | CLOSED WORKSFORME | QA Contact: | Chao Yang <chaoyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.2.0 | CC: | aos-bugs, chaoyang, jokerman, mmccomas, tdawson, wmeng, xxia |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-17 13:20:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Encountered this error too, in dev-preview-stg. Steps to reproduce:

1. oc login and create a project (xxia-proj)
2. Create a dc using https://raw.githubusercontent.com/openshift/origin/master/examples/gitserver/gitserver.yaml

Due to https://bugzilla.redhat.com/show_bug.cgi?id=1336318#c1, prepare the PVCs first:

```yaml
$ cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: null
  name: mypvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
status: {}

$ cat pvc2.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: null
  name: mypvc2
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
status: {}
```

```
$ oc create -f pvc.yaml
$ oc create -f pvc2.yaml
$ oc get pvc
NAME      STATUS    VOLUME         CAPACITY   ACCESSMODES   AGE
mypvc     Bound     pv-aws-vq862   1Gi        RWO           20h
mypvc2    Bound     pv-aws-3y95l   1Gi        RWO           20h

$ wget https://raw.githubusercontent.com/openshift/origin/master/examples/gitserver/gitserver.yaml
```

Change the "volumeMounts" and "volumes" sections of gitserver.yaml as follows:

```
        volumeMounts:
        - mountPath: /var/lib/git
          name: git
        - mountPath: /var/lib/origin
          name: origin
......
      volumes:
      - name: git
        persistentVolumeClaim:
          claimName: mypvc
      - name: origin
        persistentVolumeClaim:
          claimName: mypvc2
```

Then create it:

```
$ oc create -f gitserver.yaml
deploymentconfig "git" created

$ oc get pod -l deploymentconfig=git -o wide
NAME          READY     STATUS    RESTARTS   AGE       NODE
git-1-6rs6o   1/1       Running   1          19h       ip-172-31-9-165.ec2.internal

$ oc edit dc git   # Trigger re-deployment
```

Check which nodes the pods are scheduled to:

```
$ oc get pod -l deploymentconfig=git -o wide
NAME          READY     STATUS              RESTARTS   AGE       NODE
git-1-6rs6o   1/1       Running             1          19h       ip-172-31-9-165.ec2.internal
git-2-xn715   0/1       ContainerCreating   0          7m        ip-172-31-9-167.ec2.internal
```

Check the pod events:

```
$ oc describe pod/git-2-xn715
Name:           git-2-xn715
Namespace:      xxia-proj
Node:           ip-172-31-9-167.ec2.internal/172.31.9.167
Start Time:     Wed, 18 May 2016 10:26:10 +0800
Labels:         deployment=git-2,deploymentconfig=git,run-container=git
Status:         Pending
IP:
Controllers:    ReplicationController/git-2
Containers:
  git:
    Container ID:
    Image:          openshift/origin-gitserver:latest
    Image ID:
    Port:           8080/TCP
    QoS Tier:
      memory:       Burstable
      cpu:          Burstable
    Limits:
      cpu:          500m
      memory:       256Mi
    Requests:
      cpu:          30m
      memory:       153Mi
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment Variables:
      POD_NAMESPACE:        xxia-proj (v1:metadata.namespace)
      PUBLIC_URL:           http://git.$(POD_NAMESPACE).svc.cluster.local:8080
      INTERNAL_URL:         http://git:8080
      GIT_HOME:             /var/lib/git
      HOOK_PATH:            /var/lib/git-hooks
      GENERATE_ARTIFACTS:   true
      DETECTION_SCRIPT:
      ALLOW_GIT_PUSH:       true
      ALLOW_GIT_HOOKS:      true
      ALLOW_LAZY_CREATE:    true
      ALLOW_ANON_GIT_PULL:  true
      REQUIRE_SERVER_AUTH:  -
      AUTH_NAMESPACE:       $(POD_NAMESPACE)
      REQUIRE_GIT_AUTH:
      AUTOLINK_KUBECONFIG:  -
      AUTOLINK_NAMESPACE:   $(POD_NAMESPACE)
      AUTOLINK_HOOK:
Conditions:
  Type      Status
  Ready     False
Volumes:
  git:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mypvc
    ReadOnly:   false
  origin:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mypvc2
    ReadOnly:   false
  git-token-080j1:
    Type:       Secret (a volume populated by a Secret)
    SecretName: git-token-080j1
Events:
  FirstSeen  LastSeen  Count  From                                    SubobjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                    -------------  ----     ------       -------
  8m         8m        1      {default-scheduler }                                   Normal   Scheduled    Successfully assigned git-2-xn715 to ip-172-31-9-167.ec2.internal
  7m         13s       7      {kubelet ip-172-31-9-167.ec2.internal}                 Warning  FailedMount  Unable to mount volumes for pod "git-2-xn715_xxia-proj(e28c4344-1c9f-11e6-ae12-0ee251450653)": Could not attach EBS Disk "aws://us-east-1c/vol-9e9d393b": Error attaching EBS volume: VolumeInUse: vol-9e9d393b is already attached to an instance status code: 400, request id:
  7m         13s       7      {kubelet ip-172-31-9-167.ec2.internal}                 Warning  FailedSync   Error syncing pod, skipping: Could not attach EBS Disk "aws://us-east-1c/vol-9e9d393b": Error attaching EBS volume: VolumeInUse: vol-9e9d393b is already attached to an instance status code: 400, request id:
```

Finally, the re-deployment failed:

```
$ oc get pod
git-1-6rs6o    1/1   Running   1   20h
git-2-deploy   0/1   Error     0   26m
```

Relevant bug: 1329040

Tried to investigate the issue with @xxia; the possible reason we found is:

1. `oc edit dc` triggered a re-deployment.
2. The original pod was deleted and the new pod was scheduled to another node, while the volume was still attached to the previous instance.

Should be addressed by https://github.com/kubernetes/kubernetes/pull/25502 and https://github.com/kubernetes/kubernetes/pull/25888

This has been merged and is in OSE v3.3.0.9 or newer.

This still fails on:
openshift v3.3.0.9
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git
Steps are as below:

1. Create a PV:

```
$ oc get pv -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2016-07-25T05:45:25Z
    labels:
      failure-domain.beta.kubernetes.io/region: us-east-1
      failure-domain.beta.kubernetes.io/zone: us-east-1d
      type: local
    name: ebs
    resourceVersion: "3977"
    selfLink: /api/v1/persistentvolumes/ebs
    uid: fc06fb71-522a-11e6-bf9c-0ef1eb2be359
  spec:
    accessModes:
    - ReadWriteOnce
    awsElasticBlockStore:
      fsType: ext4
      volumeID: aws://us-east-1d/vol-2f40058b
    capacity:
      storage: 1Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: ebs
      namespace: chao
      resourceVersion: "3975"
      uid: b9a3d667-522b-11e6-bf9c-0ef1eb2be359
    persistentVolumeReclaimPolicy: Retain
  status:
    phase: Bound
kind: List
metadata: {}
```
2. Create a pvc in the namespace chao
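The PVC manifest for this step is not included in the report; a minimal sketch consistent with the PV's claimRef (name ebs, namespace chao) and 1Gi capacity might look like this:

```yaml
# Hypothetical reconstruction (not from the report): a claim whose name and
# namespace match the PV's claimRef above and which requests the PV's 1Gi capacity.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs
  namespace: chao
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```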
3. wget https://raw.githubusercontent.com/openshift/origin/master/examples/deployment/recreate-example.yaml and add the following to this file:

```yaml
volumeMounts:
- mountPath: /var/lib/test
  name: test
volumes:
- name: test
  persistentVolumeClaim:
    claimName: ebs
```
4. oc create -f recreate-example.yaml
5. Pod is running:

```
NAME                       READY     STATUS    RESTARTS   AGE       NODE
recreate-example-1-xufes   1/1       Running   0          9m        ip-172-18-0-79.ec2.internal
```
6. oadm manage-node ip-172-18-0-79.ec2.internal --schedulable=false
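Before triggering the re-deployment, one might confirm the original node is cordoned; a minimal sketch (not part of the original report):

```
# The node hosting recreate-example-1-xufes should now report SchedulingDisabled,
# which forces the re-deployed pod onto a different node.
oc get nodes
```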
7. Edit the dc, changing recreate-example:latest to recreate-example:v1 to trigger a re-deployment:

```yaml
from:
  kind: ImageStreamTag
  name: recreate-example:v1
```
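For context, the `from:` block edited in step 7 sits under the dc's image change trigger; the following is only a sketch assuming the example's standard trigger layout, with the container name taken from the pod description below:

```yaml
# Sketch of the surrounding trigger block; only the tag in "name" is changed
# from recreate-example:latest to recreate-example:v1.
triggers:
- type: ImageChange
  imageChangeParams:
    automatic: true
    containerNames:
    - deployment-example
    from:
      kind: ImageStreamTag
      name: recreate-example:v1
```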
```
[root@dhcp-128-8 ~]# oc status
In project chao on server https://ec2-52-90-208-19.compute-1.amazonaws.com:443

http://recreate-example-chao.0725-hu0.qe.rhcloud.com (svc/recreate-example)
  dc/recreate-example deploys istag/recreate-example:v1
    deployment #2 running for 2 minutes - 1 pod
    deployment #1 deployed about an hour ago
```
8. Check pod status:

```
[root@dhcp-128-8 ~]# oc describe pods recreate-example-2-a00r3
Name:           recreate-example-2-a00r3
Namespace:      chao
Node:           ip-172-18-9-202.ec2.internal/172.18.9.202
Start Time:     Mon, 25 Jul 2016 15:40:49 +0800
Labels:         deployment=recreate-example-2,deploymentconfig=recreate-example
Status:         Pending
IP:
Controllers:    ReplicationController/recreate-example-2
Containers:
  deployment-example:
    Container ID:
    Image:          openshift/deployment-example@sha256:c505b916f7e5143a356ff961f2c21aee40fbd2cd906c1e3feeb8d5e978da284b
    Image ID:
    Port:           8080/TCP
    QoS Tier:
      cpu:          BestEffort
      memory:       BestEffort
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment Variables:
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  test:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  ebs
    ReadOnly:   false
  default-token-6p2xr:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-6p2xr
Events:
  FirstSeen  LastSeen  Count  From                                    SubobjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                    -------------  ----     ------       -------
  11m        11m       1      {default-scheduler }                                   Normal   Scheduled    Successfully assigned recreate-example-2-a00r3 to ip-172-18-9-202.ec2.internal
  9m         17s       5      {kubelet ip-172-18-9-202.ec2.internal}                 Warning  FailedMount  Unable to mount volumes for pod "recreate-example-2-a00r3_chao(1b242d80-523b-11e6-bf9c-0ef1eb2be359)": timeout expired waiting for volumes to attach/mount for pod "recreate-example-2-a00r3"/"chao". list of unattached/unmounted volumes=[test]
  9m         17s       5      {kubelet ip-172-18-9-202.ec2.internal}                 Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "recreate-example-2-a00r3"/"chao". list of unattached/unmounted volumes=[test]
```
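To confirm where the EBS volume actually is at this point, one could query AWS directly; a minimal sketch, assuming the AWS CLI is configured for the cluster's account and using the volume ID from the PV above:

```
# Shows the attachment state of the volume backing the "ebs" PV. The returned
# InstanceId can be compared against the EC2 instances backing
# ip-172-18-0-79.ec2.internal and ip-172-18-9-202.ec2.internal.
aws ec2 describe-volumes --volume-ids vol-2f40058b --query 'Volumes[0].Attachments'
```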
"oc describe pods" output doesn't indicate the EBS volume is attached to the wrong node. Can you get the openshift node log or kubelet log from ip-172-18-9-202.ec2.internal? First, the ebs volume is mounted to the node ip-172-18-0-79.ec2.internal, and after re-deploy the dc, the ebs volume should mount to the node ip-172-18-9-202.ec2.internal. But the ebs volume still attached to the node ip-172-18-0-79.ec2.internal Please see the detailed info in the end of this https://github.com/kubernetes/kubernetes/issues/28671 , per latest comment[1], is the issue resolved? 1. https://github.com/kubernetes/kubernetes/issues/28671#issuecomment-240039479 I could not reproduce the wrong info on OCP right now. Thanks. I close it for now. if it is still a problem, please reopen it. |
Created attachment 1156272 [details]
event log showing error

Description of problem:
Sometimes a pod is unable to start because its PV is already attached to another instance. I see several occurrences of this happening in dev-preview-int. One of them has been in this state for 2 days so far.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
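For reference, the kind of check used to collect the attached event log could look like this; a sketch only, with `<project>` standing in for the affected dev-preview-int project:

```
# Surface the FailedMount / VolumeInUse events for pods stuck in ContainerCreating.
oc get events -n <project> | grep -iE 'FailedMount|VolumeInUse'
```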