Created attachment 1285221 [details]
unmount failure

Description of problem:
In a containerized OCP installation, when a Pod is rescheduled, its EBS volume cannot be unmounted from the old node. The unmount always fails with the following error:

```
Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```

Version-Release number of selected component (if applicable):
openshift v3.5.5.23

How reproducible:
Always

Steps to Reproduce:
1. Create a PVC/PV (an example PVC sketch is included at the end of this report). Create an RC using this PVC, with replicas=1:

{
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {
        "name": "ebs"
    },
    "spec": {
        "replicas": 1,
        "selector": {
            "app": "ebs"
        },
        "template": {
            "metadata": {
                "name": "ebs",
                "labels": {
                    "app": "ebs"
                }
            },
            "spec": {
                "containers": [{
                    "name": "myfrontend",
                    "image": "aosqe/hello-openshift",
                    "imagePullPolicy": "IfNotPresent",
                    "ports": [{
                        "containerPort": 80,
                        "name": "http-server"
                    }],
                    "volumeMounts": [{
                        "mountPath": "/mnt/rbd",
                        "name": "pvol"
                    }]
                }],
                "volumes": [{
                    "name": "pvol",
                    "persistentVolumeClaim": {
                        "claimName": "ebsc"
                    }
                }]
            }
        }
    }
}

2. After the Pod is running, stop the node service on the node where the Pod is scheduled.
3. Wait for the Pod to be rescheduled to another node.

# oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m

4. Recover the node service and expect the old Pod to be deleted and the new Pod to become 'Running'.

Actual results:
4. The old Pod is deleted, but its volume is not successfully unmounted. The new Pod never reaches the 'Running' status. /var/log/messages records repeated unmount failures:

# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```

Expected results:
The volume is unmounted and detached from the old node, then attached to the new node. The new Pod should become 'Running'.

Additional info:
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1455675
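The PVC/PV definitions used in step 1 are not attached. A minimal PVC matching the claimName "ebsc" referenced by the RC above might look like the following sketch; the access mode and requested size are assumptions, and the claim is expected to bind to a pre-created EBS-backed PV.

```
# Hypothetical PVC sketch (not the exact object used in this report).
# "ebsc" matches the claimName in the RC above; access mode and size are
# assumptions, and an EBS-backed PV is expected to exist for it to bind to.
oc create -f - <<EOF
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "ebsc"
    },
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}
EOF
```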
Hm, I now think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1427807, which is being handled as a containers issue, not storage. The node container holds onto graphdriver mountpoints and prevents the unmount from succeeding. It seems we have not found a solution yet, but let's track it there.

*** This bug has been marked as a duplicate of bug 1427807 ***
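Follow-up note (not part of the original comments): one way to check, on the affected node, whether another mount namespace such as the containerized node or another container is still holding the volume mount and causing the "device or resource busy" error is sketched below. The PVC directory name is copied from the error in this report and would differ in other environments.

```
# List processes whose mount namespaces still reference the volume directory.
# Each match is /proc/<pid>/mountinfo for a process that still sees the mount.
grep -l pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50 /proc/*/mountinfo

# Map the matching PIDs back to process names (and from there to containers):
for f in $(grep -l pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50 /proc/*/mountinfo); do
    pid=$(echo "$f" | cut -d/ -f3)
    echo "PID $pid: $(cat /proc/$pid/comm)"
done
```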