Created attachment 1285221 [details]
unmount failure

Description of problem:
In a containerized OCP installation, when a Pod is rescheduled, its EBS volume cannot be unmounted from the old node. The unmount always fails with the following error:

```
Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```

Version-Release number of selected component (if applicable):
openshift v3.5.5.23

How reproducible:
Always

Steps to Reproduce:
1. Create a PVC/PV (an example PVC sketch is included at the end of this report). Create an RC using this PVC, with replicas=1:

{
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {
        "name": "ebs"
    },
    "spec": {
        "replicas": 1,
        "selector": {
            "app": "ebs"
        },
        "template": {
            "metadata": {
                "name": "ebs",
                "labels": {
                    "app": "ebs"
                }
            },
            "spec": {
                "containers": [{
                    "name": "myfrontend",
                    "image": "aosqe/hello-openshift",
                    "imagePullPolicy": "IfNotPresent",
                    "ports": [{
                        "containerPort": 80,
                        "name": "http-server"
                    }],
                    "volumeMounts": [{
                        "mountPath": "/mnt/rbd",
                        "name": "pvol"
                    }]
                }],
                "volumes": [{
                    "name": "pvol",
                    "persistentVolumeClaim": {
                        "claimName": "ebsc"
                    }
                }]
            }
        }
    }
}

2. After the Pod is running, stop the node service on the node where the Pod is scheduled.
3. Wait for the Pod to be rescheduled to another node.

# oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m

4. Recover the node service and expect the old Pod to be deleted and the new Pod to become 'Running'.

Actual results:
4. The old Pod is deleted, but its volume is not successfully unmounted. The new Pod never reaches the 'Running' status. /var/log/messages records repeated unmount failures:

# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```

Expected results:
The volume is unmounted and detached from the old node, then attached to the new node. The new Pod should become 'Running'.

Additional info:
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1455675
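The PVC/PV definitions used in step 1 are not attached. A minimal PVC matching the claimName "ebsc" referenced by the RC above might look like the following sketch; the access mode and requested size are assumptions, and the claim is expected to bind to a pre-created EBS-backed PV.

```
# Hypothetical PVC sketch (not the exact object used in this report).
# "ebsc" matches the claimName in the RC above; access mode and size are
# assumptions, and an EBS-backed PV is expected to exist for it to bind to.
oc create -f - <<EOF
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "ebsc"
    },
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}
EOF
```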
Hm, I now think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1427807, which is being handled as a containers issue, not storage. The node container holds onto graphdriver mountpoints and prevents the unmount from succeeding. It seems we have not found a solution yet, but let's track it there.

*** This bug has been marked as a duplicate of bug 1427807 ***
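Follow-up note (not part of the original comments): one way to check, on the affected node, whether another mount namespace such as the containerized node or another container is still holding the volume mount and causing the "device or resource busy" error is sketched below. The PVC directory name is copied from the error in this report and would differ in other environments.

```
# List processes whose mount namespaces still reference the volume directory.
# Each match is /proc/<pid>/mountinfo for a process that still sees the mount.
grep -l pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50 /proc/*/mountinfo

# Map the matching PIDs back to process names (and from there to containers):
for f in $(grep -l pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50 /proc/*/mountinfo); do
    pid=$(echo "$f" | cut -d/ -f3)
    echo "PID $pid: $(cat /proc/$pid/comm)"
done
```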