Bug 1459006

Summary: [AWS] EBS volume failed to unmount in containerized OCP, error "device or resource busy"

Product: OpenShift Container Platform
Component: Storage
Version: 3.5.1
Reporter: Jianwei Hou <jhou>
Assignee: Bradley Childs <bchilds>
QA Contact: Jianwei Hou <jhou>
CC: aos-bugs, mawong
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-06-06 18:54:07 UTC
Type: Bug

Attachments: unmount failure

Description Jianwei Hou 2017-06-06 06:10:01 UTC
Created attachment 1285221 [details]
unmount failure

Description of problem:
In a containerized OCP installation, when a Pod is rescheduled to another node, its EBS volume cannot be unmounted from the original node. The unmount always fails with the following error:
```
Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```

Version-Release number of selected component (if applicable):
openshift v3.5.5.23

How reproducible:
Always

Steps to Reproduce:
1. Create a PVC/PV. Create an RC that uses this PVC, with replicas=1 (RC definition below; a sample PVC sketch follows it).
{
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {
        "name": "ebs"
    },
    "spec": {
        "replicas": 1,
        "selector": {
            "app": "ebs"
        },
        "template":{
            "metadata": {
                "name": "ebs",
                "labels": {
                    "app": "ebs"
                }
            },
            "spec": {
                "containers": [{
                    "name": "myfrontend",
                    "image": "aosqe/hello-openshift",
                    "imagePullPolicy": "IfNotPresent",
                    "ports": [{
                        "containerPort": 80,
                        "name": "http-server"
                    }],
                    "volumeMounts": [{
                        "mountPath": "/mnt/rbd",
                        "name": "pvol"
                    }]
                }],
                "volumes": [{
                    "name": "pvol",
                    "persistentVolumeClaim": {
                        "claimName": "ebsc"
                    }
                }]
            }
        }
    }
}
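
For reference, the PVC named "ebsc" used in step 1 is not included in this report; a minimal sketch matching the claimName above (the storage size and access mode are assumptions) would be:
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "ebsc"
    },
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}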


2. After the Pod is running, stop the node service on the node where the Pod is scheduled.
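On a containerized install the node service runs as a systemd unit. Assuming the unit name matches the log prefix seen in /var/log/messages below (atomic-openshift-node), stopping it amounts to:
# systemctl stop atomic-openshift-node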
3. Wait for Pod to be rescheduled to another node
# oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m

4. Recover the node service; expect the old Pod to be deleted and the new Pod to become 'Running'.
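Under the same unit-name assumption as in step 2, recovering the service is:
# systemctl start atomic-openshift-node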

Actual results:
4. The old Pod is deleted, but its volume is never successfully unmounted. The new Pod never reaches the 'Running' status. /var/log/messages records repeated unmount failures.

# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```


Expected results:
The volume is unmounted and detached from the old node, then attached and mounted on the new node. The new Pod becomes 'Running'.

Additional info:
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1455675
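
One way to see which process still holds the mount busy (a diagnostic sketch using the PVC directory name from the log above; these commands are not from the original report) is to search every process's mount namespace:
# grep pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50 /proc/*/mountinfo
Each matching /proc/<pid>/mountinfo identifies a process whose mount namespace still references the volume directory, which is exactly what makes the kubelet's remove fail with "device or resource busy".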

Comment 1 Matthew Wong 2017-06-06 18:54:07 UTC
Hm, I now think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1427807, which is being handled as a containers issue, not storage. The node container holds onto graphdriver mountpoints and prevents the unmount from succeeding. It seems we have not found a solution yet, but let's track it there.

*** This bug has been marked as a duplicate of bug 1427807 ***