Description of problem:
Volumes cannot be unmounted and detached from a node if the node service is stopped.

Version-Release number of selected component (if applicable):
openshift v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

How reproducible:
Always

Steps to Reproduce:
1. Create a dynamic PVC:
{
    "kind": "PersistentVolumeClaim",
    "apiVersion": "v1",
    "metadata": {
        "name": "test",
        "annotations": {
            "volume.alpha.kubernetes.io/storage-class": "foo"
        },
        "labels": {
            "name": "dynamic-pvc"
        }
    },
    "spec": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "resources": {
            "requests": {
                "storage": "3Gi"
            }
        }
    }
}

2. Add the claim to the registry-console deployment config:
oc volume dc/registry-console --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=test --overwrite

3. Check pod status:
registry-console-1-69vb2   1/1   Terminating   0   21m   10.129.0.15   ip-172-18-0-26.ec2.internal
registry-console-2-zj2mu   1/1   Running       0   2m    10.128.0.18   ip-172-18-0-27.ec2.internal
The new pod is running on node ip-172-18-0-27.ec2.internal.

4. Verify the volume is mounted on that node:
[root@ip-172-18-0-27 ~]# mount | grep ebs
/dev/xvdba on /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1d/vol-5d9e38cc type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvdba on /var/lib/origin/openshift.local.volumes/pods/2eb0df6b-b14d-11e6-85dc-0e44f2d5a65e/volumes/kubernetes.io~aws-ebs/pvc-5f4bf8ee-b14d-11e6-85dc-0e44f2d5a65e type ext4 (rw,relatime,seclabel,data=ordered)

5. Stop the node service on node ip-172-18-0-27.ec2.internal:
systemctl stop atomic-openshift-node

6. The pod is rescheduled to another node but never starts running:
registry-console-2-86vfu   0/1   ContainerCreating   0   21m   <none>   ip-172-18-0-26.ec2.internal

[root@ip-172-18-1-18 ~]# oc describe pods registry-console-2-86vfu
Name:            registry-console-2-86vfu
Namespace:       default
Security Policy: restricted
Node:            ip-172-18-0-26.ec2.internal/172.18.0.26
Start Time:      Wed, 23 Nov 2016 02:28:30 -0500
Labels:          app=registry-console
                 deployment=registry-console-2
                 deploymentconfig=registry-console
                 name=registry-console
Status:          Pending
IP:
Controllers:     ReplicationController/registry-console-2
Containers:
  registry-console:
    Container ID:
    Image:         registry.ops.openshift.com/openshift3/registry-console:3.3
    Image ID:
    Port:          9090/TCP
    State:         Waiting
      Reason:      ContainerCreating
    Ready:         False
    Restart Count: 0
    Liveness:      http-get http://:9090/ping delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:     http-get http://:9090/ping delay=0s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /opt1 from v1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bi146 (ro)
    Environment Variables:
      OPENSHIFT_OAUTH_PROVIDER_URL: https://ec2-54-211-199-58.compute-1.amazonaws.com:8443
      OPENSHIFT_OAUTH_CLIENT_ID:    cockpit-oauth-client
      KUBERNETES_INSECURE:          false
      COCKPIT_KUBE_INSECURE:        false
      REGISTRY_ONLY:                true
      REGISTRY_HOST:                docker-registry-default.1123-7iz.qe.rhcloud.com
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  v1:
    Type:      PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: test
    ReadOnly:  false
  default-token-bi146:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-bi146
QoS Class:   BestEffort
Tolerations: <none>
Events:
  FirstSeen  LastSeen  Count  From                                   SubobjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                   -------------  ----     ------       -------
  6m         6m        1      {default-scheduler }                                  Normal   Scheduled    Successfully assigned registry-console-2-86vfu to ip-172-18-0-26.ec2.internal
  4m         4m        1      {kubelet ip-172-18-0-26.ec2.internal}                 Warning  FailedMount  Unable to mount volumes for pod "registry-console-2-86vfu_default(6e48e897-b14e-11e6-85dc-0e44f2d5a65e)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  4m         4m        1      {kubelet ip-172-18-0-26.ec2.internal}                 Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  2m         10s       2      {kubelet ip-172-18-0-26.ec2.internal}                 Warning  FailedMount  Unable to mount volumes for pod "registry-console-2-86vfu_default(6e48e897-b14e-11e6-85dc-0e44f2d5a65e)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  2m         10s       2      {kubelet ip-172-18-0-26.ec2.internal}                 Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]

7. The volume is still mounted on the original node.

Actual results:
The volume is not detached from the original node and the new pod cannot run.

Expected results:
The new pod should be running.

Additional info:
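One way to confirm from the AWS side that the volume stays attached to the stopped node (a sketch only; it assumes the AWS CLI is configured with credentials for this account, and reuses the volume ID from step 4):

# Show the current attachment of the EBS volume backing the PVC; it still lists the
# instance behind ip-172-18-0-27.ec2.internal even after atomic-openshift-node is stopped.
aws ec2 describe-volumes --volume-ids vol-5d9e38cc --query 'Volumes[].Attachments'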
Per hchen on IRC, this is fixed upstream here: https://github.com/kubernetes/kubernetes/pull/36840
which is covered by this PR for OpenShift: https://github.com/openshift/origin/pull/12024
I see this merged in OSE-3.3. Where is the PR for origin-1.4?
The patch is merged in 3.3: https://github.com/openshift/origin/pull/12024 and in 3.4: https://github.com/openshift/origin/pull/12124. Please re-test.
This has been merged into ocp and is in OCP v3.4.0.33 or newer.
This failed on:
openshift v3.4.0.33+71c05b2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Some log output from the master is below:
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498886 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498902 13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498918 13926 reconciler.go:168] Started AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.641721 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.641738 13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.761132 13926 aws.go:1241] Assigned mount device ba -> volume vol-3455cca5
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.822992 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.823005 13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.037276 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.037292 13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.161153 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.161170 13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.254002 13926 aws.go:1260] Releasing in-process attachment entry: ba -> volume vol-3455cca5
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: E1206 22:38:31.254017 13926 attacher.go:72] Error attaching volume "aws://us-east-1d/vol-3455cca5": Error attaching EBS volume: VolumeInUse: vol-3455cca5 is already attached to an instance
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: E1206 22:38:31.254098 13926 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5\"" failed. No retries permitted until 2016-12-06 22:38:31.754083838 -0500 EST (durationBeforeRetry 500ms). Error: recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:
Dec 6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.299487 13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
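To make the failure easier to spot when scanning the master log, a filter along these lines can be used (a sketch; it assumes the master log lands in /var/log/messages on the master host, as in the output above):

# Show only the lines for the affected EBS volume plus the panic recovered while logging the attach error
grep -E 'vol-3455cca5|recovered from panic' /var/log/messages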
https://github.com/openshift/origin/pull/12205
This failed on atomic-openshift-3.4.0.35-1.git.0.86b11df.el7.x86_64. Please check the logs at https://bugzilla.redhat.com/show_bug.cgi?id=1397693#c21 and https://bugzilla.redhat.com/show_bug.cgi?id=1397693#c22
@hchen They are stopped and terminated
This has been merged into ocp and is in OCP v3.4.1.7 or newer.
This failed on:
openshift v3.4.1.7
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

  FirstSeen  LastSeen  Count  From                                    SubobjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                    -------------  ----     ------       -------
  5m         5m        1      {default-scheduler }                                   Normal   Scheduled    Successfully assigned registry-console-2-y5kgt to ip-172-18-15-88.ec2.internal
  5m         1m        10     {controller-manager }                                  Warning  FailedMount  Failed to attach volume "pvc-5b55fcd2-f191-11e6-9747-0e758138dc84" on node "ip-172-18-15-88.ec2.internal" with: Error attaching EBS volume: VolumeInUse: vol-087a2a79f623c5358 is already attached to an instance status code: 400, request id:
  3m         1m        2      {kubelet ip-172-18-15-88.ec2.internal}                 Warning  FailedMount  Unable to mount volumes for pod "registry-console-2-y5kgt_default(df6cf382-f192-11e6-9747-0e758138dc84)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-y5kgt"/"default". list of unattached/unmounted volumes=[v1]
  3m         1m        2      {kubelet ip-172-18-15-88.ec2.internal}                 Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-y5kgt"/"default". list of unattached/unmounted volumes=[v1]

I will re-test it on a newer OCP build.
Can you elaborate on how it failed? As I said - if atomic-openshift-node is not running on the node, the volume being used won't be unmounted and hence it can't be detached. This is by design. I opened a Trello card - https://trello.com/c/hNeu0drR/418-enable-force-detach-when-volume-is-attached-to-a-node-and-node-is-down - to fix that problem, because fixing this behavior is more than a quick bug fix.

To recap - none of the code we pushed fixes the originally reported bug in this ticket. The bug is asking for a feature which does not exist. What we *did* fix is the panic that happens when the error is logged via an event. So the following panic should be gone:

"Error: recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:"
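Until that force-detach capability exists, a manual workaround along these lines might unblock a pod stuck in ContainerCreating (a sketch only, not verified; it reuses the mount paths and volume ID from step 4 of the description and assumes the AWS CLI is available with credentials):

# On the original node (where atomic-openshift-node is stopped), release the mounts the kubelet left behind:
umount /var/lib/origin/openshift.local.volumes/pods/2eb0df6b-b14d-11e6-85dc-0e44f2d5a65e/volumes/kubernetes.io~aws-ebs/pvc-5f4bf8ee-b14d-11e6-85dc-0e44f2d5a65e
umount /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1d/vol-5d9e38cc
# Then force-detach the EBS volume so the attach/detach controller can attach it to the new node:
aws ec2 detach-volume --volume-id vol-5d9e38cc --force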
Sorry for the misunderstanding. I re-tested this bug on OCP v3.4.1.7. While the pod using the EBS volume is stuck in "ContainerCreating" status, the error "runtime error: invalid memory address or nil pointer dereference" no longer appears in /var/log/messages.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0289