Bug 1397693 - EBS volume could not be detached from node if node service is stopped
Summary: EBS volume could not be detached from node if node service is stopped
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.4.z
Assignee: Hemant Kumar
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-23 08:15 UTC by Chao Yang
Modified: 2017-02-22 18:10 UTC
CC: 11 users

Fixed In Version: atomic-openshift-3.4.0.35-1.git.0.86b11df.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-22 18:10:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0289 0 normal SHIPPED_LIVE OpenShift Container Platform 3.4.1.7, 3.3.1.14, and 3.2.1.26 bug fix update 2017-02-22 23:10:04 UTC

Description Chao Yang 2016-11-23 08:15:47 UTC
Description of problem:
Volumes could not be detached and unmounted from the node if the node service is stopped

Version-Release number of selected component (if applicable):
openshift v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0


How reproducible:
Always

Steps to Reproduce:
1. Create a dynamic PVC:
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "test",
    "annotations": {
        "volume.alpha.kubernetes.io/storage-class": "foo"
    },
    "labels": {
        "name": "dynamic-pvc"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    }
  }
}
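The claim above can then be created from a file, for example (assuming the JSON is saved as pvc.json; the file name is only illustrative), and should become Bound once dynamic provisioning completes:

oc create -f pvc.json
oc get pvc test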
2. oc volume dc/registry-console --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=test --overwrite
3. Check pod status:
registry-console-1-69vb2   1/1       Terminating   0          21m       10.129.0.15   ip-172-18-0-26.ec2.internal
registry-console-2-zj2mu   1/1       Running       0          2m        10.128.0.18   ip-172-18-0-27.ec2.internal

The new pod is running on node ip-172-18-0-27.ec2.internal.
4. Check the EBS mounts on the new node:
[root@ip-172-18-0-27 ~]# mount | grep ebs
/dev/xvdba on /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1d/vol-5d9e38cc type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvdba on /var/lib/origin/openshift.local.volumes/pods/2eb0df6b-b14d-11e6-85dc-0e44f2d5a65e/volumes/kubernetes.io~aws-ebs/pvc-5f4bf8ee-b14d-11e6-85dc-0e44f2d5a65e type ext4 (rw,relatime,seclabel,data=ordered)

5. Stop the node service on ip-172-18-0-27.ec2.internal:
systemctl stop atomic-openshift-node
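Optionally, the attachment state can also be confirmed from the AWS side; a minimal sketch, assuming the AWS CLI is configured with credentials for the cluster's account and using the volume ID seen in step 4:

aws ec2 describe-volumes --volume-ids vol-5d9e38cc --query 'Volumes[0].Attachments'

With the node service stopped, the volume remains reported as attached to the instance backing ip-172-18-0-27.ec2.internal.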

6. The pod is rescheduled to another node but cannot reach the Running state:
registry-console-2-86vfu   0/1       ContainerCreating   0          21m       <none>        ip-172-18-0-26.ec2.internal

[root@ip-172-18-1-18 ~]# oc describe pods registry-console-2-86vfu
Name:			registry-console-2-86vfu
Namespace:		default
Security Policy:	restricted
Node:			ip-172-18-0-26.ec2.internal/172.18.0.26
Start Time:		Wed, 23 Nov 2016 02:28:30 -0500
Labels:			app=registry-console
			deployment=registry-console-2
			deploymentconfig=registry-console
			name=registry-console
Status:			Pending
IP:			
Controllers:		ReplicationController/registry-console-2
Containers:
  registry-console:
    Container ID:	
    Image:		registry.ops.openshift.com/openshift3/registry-console:3.3
    Image ID:		
    Port:		9090/TCP
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Liveness:		http-get http://:9090/ping delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:		http-get http://:9090/ping delay=0s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /opt1 from v1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bi146 (ro)
    Environment Variables:
      OPENSHIFT_OAUTH_PROVIDER_URL:	https://ec2-54-211-199-58.compute-1.amazonaws.com:8443
      OPENSHIFT_OAUTH_CLIENT_ID:	cockpit-oauth-client
      KUBERNETES_INSECURE:		false
      COCKPIT_KUBE_INSECURE:		false
      REGISTRY_ONLY:			true
      REGISTRY_HOST:			docker-registry-default.1123-7iz.qe.rhcloud.com
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  v1:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	test
    ReadOnly:	false
  default-token-bi146:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-bi146
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----					-------------	--------	------		-------
  6m		6m		1	{default-scheduler }					Normal		Scheduled	Successfully assigned registry-console-2-86vfu to ip-172-18-0-26.ec2.internal
  4m		4m		1	{kubelet ip-172-18-0-26.ec2.internal}			Warning		FailedMount	Unable to mount volumes for pod "registry-console-2-86vfu_default(6e48e897-b14e-11e6-85dc-0e44f2d5a65e)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  4m		4m		1	{kubelet ip-172-18-0-26.ec2.internal}			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  2m		10s		2	{kubelet ip-172-18-0-26.ec2.internal}			Warning		FailedMount	Unable to mount volumes for pod "registry-console-2-86vfu_default(6e48e897-b14e-11e6-85dc-0e44f2d5a65e)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]
  2m		10s		2	{kubelet ip-172-18-0-26.ec2.internal}			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-86vfu"/"default". list of unattached/unmounted volumes=[v1]

7. The volume is still mounted on the original node.
Actual results:
The volume could not be detached from the original node, and the new pod could not reach the Running state.

Expected results:
The new pod should be running.

Additional info:

Comment 1 Bradley Childs 2016-11-28 16:47:47 UTC
Per hchen on IRC: 

This is fixed here: https://github.com/kubernetes/kubernetes/pull/36840

Which is covered in this PR for openshift:

https://github.com/openshift/origin/pull/12024

Comment 2 Eric Paris 2016-12-02 20:12:02 UTC
I see this merged in OSE-3.3.

Where is the PR for origin-1.4?

Comment 5 Jan Safranek 2016-12-05 08:25:13 UTC
Patch is merged in 3.3: https://github.com/openshift/origin/pull/12024
and 3.4: https://github.com/openshift/origin/pull/12124

Please re-test.

Comment 6 Troy Dawson 2016-12-06 20:30:17 UTC
This has been merged into OCP and is in OCP v3.4.0.33 or newer.

Comment 8 Chao Yang 2016-12-07 05:35:23 UTC
This still fails on:
openshift v3.4.0.33+71c05b2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Some logs from the master are below:
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498886   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498902   13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.498918   13926 reconciler.go:168] Started AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.641721   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.641738   13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:30 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.761132   13926 aws.go:1241] Assigned mount device ba -> volume vol-3455cca5
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.822992   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:30.823005   13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.037276   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.037292   13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.161153   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.161170   13926 reconciler.go:165] Attempting to start AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5" to node "ip-172-18-7-27.ec2.internal"
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.254002   13926 aws.go:1260] Releasing in-process attachment entry: ba -> volume vol-3455cca5
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: E1206 22:38:31.254017   13926 attacher.go:72] Error attaching volume "aws://us-east-1d/vol-3455cca5": Error attaching EBS volume: VolumeInUse: vol-3455cca5 is already attached to an instance
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: E1206 22:38:31.254098   13926 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5\"" failed. No retries permitted until 2016-12-06 22:38:31.754083838 -0500 EST (durationBeforeRetry 500ms). Error: recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:
Dec  6 22:38:31 ip-172-18-11-221 atomic-openshift-master: I1206 22:38:31.299487   13926 reconciler.go:161] Volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-3455cca5"/Node "ip-172-18-7-28.ec2.internal" is attached--touching.

Comment 18 Eric Paris 2016-12-09 16:44:56 UTC
https://github.com/openshift/origin/pull/12205

Comment 23 Chao Yang 2016-12-13 05:50:26 UTC
This still fails on atomic-openshift-3.4.0.35-1.git.0.86b11df.el7.x86_64.
Please check the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1397693#c21 and https://bugzilla.redhat.com/show_bug.cgi?id=1397693#c22

Comment 28 Chao Yang 2016-12-16 01:49:18 UTC
@hchen They are stopped and terminated

Comment 35 Troy Dawson 2017-02-10 22:58:57 UTC
This has been merged into OCP and is in OCP v3.4.1.7 or newer.

Comment 37 Chao Yang 2017-02-13 02:27:57 UTC
This still fails on:
openshift v3.4.1.7
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
  FirstSeen	LastSeen	Count	From			SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----			-------------	--------	------		-------
  5m		5m		1	{default-scheduler }			Normal		Scheduled	Successfully assigned registry-console-2-y5kgt to ip-172-18-15-88.ec2.internal
  5m		1m		10	{controller-manager }			Warning		FailedMount	Failed to attach volume "pvc-5b55fcd2-f191-11e6-9747-0e758138dc84" on node "ip-172-18-15-88.ec2.internal" with: Error attaching EBS volume: VolumeInUse: vol-087a2a79f623c5358 is already attached to an instance
		status code: 400, request id: 
  3m		1m	2	{kubelet ip-172-18-15-88.ec2.internal}		Warning	FailedMount	Unable to mount volumes for pod "registry-console-2-y5kgt_default(df6cf382-f192-11e6-9747-0e758138dc84)": timeout expired waiting for volumes to attach/mount for pod "registry-console-2-y5kgt"/"default". list of unattached/unmounted volumes=[v1]
  3m		1m	2	{kubelet ip-172-18-15-88.ec2.internal}		Warning	FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "registry-console-2-y5kgt"/"default". list of unattached/unmounted volumes=[v1]


I will re-test it on a newer OCP build.

Comment 38 Hemant Kumar 2017-02-13 15:14:46 UTC
Can you elaborate on how it failed? As I said, if atomic-openshift-node is not running on the node, a volume that is in use won't be unmounted and hence can't be detached. This is by design. I opened a Trello card - https://trello.com/c/hNeu0drR/418-enable-force-detach-when-volume-is-attached-to-a-node-and-node-is-down - to track that problem, because fixing this behavior is more than a quick bug fix.

To recap, what I am saying is: none of the code we pushed fixes the originally reported bug in this ticket. The bug is asking for a feature that does not exist.

What we *did* fix is the panic that happens when the error is logged via an event. So the following panic:

" Error: recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:"

should be gone.
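If the node service cannot be brought back (which would let the kubelet unmount the volume normally), the volume can be freed manually outside of OpenShift. A workaround sketch, stated as an assumption rather than part of the fix: it requires the AWS CLI, and forcing a detach of an in-use volume risks data corruption, so it should only be done once the instance is stopped or the filesystem is no longer in use. The volume ID below is a placeholder taken from the earlier log:

aws ec2 detach-volume --volume-id vol-3455cca5 --force

After the forced detach, the attach/detach controller can attach the volume to the node where the pod was rescheduled.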

Comment 39 Chao Yang 2017-02-14 05:24:49 UTC
Sorry for the misunderstanding.
I re-tested this bug on OCP v3.4.1.7.
While the pod using the EBS volume was in "ContainerCreating" status, the error "runtime error: invalid memory address or nil pointer dereference" could not be found in /var/log/messages.

Comment 41 errata-xmlrpc 2017-02-22 18:10:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0289

