Tested on OpenShift v3.5.5.23.

1. Create the PVC/PV and rc.
2. Stop the node service on the node the pod is scheduled to.
3. The pod is rescheduled to a new node, but is stuck in the 'ContainerCreating' state:

```
oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m
```

4. Bring the node service back; the pod ebs-dpz5p is deleted.
5. The volume is not unmounted and detached successfully, so the new pod never reaches 'Running':

```
oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-s2h7b   0/1       ContainerCreating   0          29m
```

# grep vol-0ed79e9051f8d1d4d /var/log/messages

```
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```
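For anyone debugging the "device or resource busy" failure on the node, it typically means some mount namespace or process still holds the pod's volume directory. A minimal diagnostic sketch, run on the affected node; the pod UID and PVC name are taken from the log above, everything else is standard tooling:

```
# Pod UID and PVC name from the TearDown error above
POD_UID=a87c9660-49e1-11e7-b5c0-0e14d6e6ec50
PVC=pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50
VOL_DIR=/var/lib/origin/openshift.local.volumes/pods/$POD_UID/volumes/kubernetes.io~aws-ebs/$PVC

grep "$VOL_DIR" /proc/mounts        # is the directory still a mount point?
findmnt --target "$VOL_DIR"         # which device backs the mount?
fuser -vm "$VOL_DIR" || true        # which processes keep it busy, if any?
```

If the directory shows up in /proc/mounts with no process holding it, the mount is likely pinned by another mount namespace (e.g. a container's), which matches the containerized-node theory discussed below.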
Created attachment 1285030: grep vol-0ed79e9051f8d1d4d /var/log/messages
No, the original cluster wasn't containerized. None of the clusters in openshift.io are containerized. Is the cluster Jianwei is using containerized?
Yes, nsenter_mount is doing the mounting. IMO we should test this bug against a non-containerized env and open a new bug against the containerized one.
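For anyone retesting, a rough way to tell whether the node is containerized (and therefore using the nsenter-based mounter) is to check how the node service is started; this is a sketch assuming the atomic-openshift-node unit name seen in the logs above:

```
# An RPM (non-containerized) install runs the node binary directly;
# a containerized install starts it via a container runtime, which is
# what forces the nsenter mount path.
systemctl cat atomic-openshift-node | grep -i ExecStart

# Cross-check: is the openshift node process running on the host
# or inside a container?
ps -ef | grep -v grep | grep openshift
```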
Yes, I agree. If the failure was caused by atomic-openshift-node running in a container, we might have to double-check several different code paths to fix it. If this bug is fixed as-is in non-containerized environments, we should go ahead and accept it as VERIFIED. @jianwei, would you agree to that?
I agree. I have verified this is fixed on an RPM-installed OCP cluster, v3.5.5.23: the EBS volume was unmounted and detached from the old node, then attached and mounted on the new node. Could you please move it to ON_QA status? I'll open a new bug against containerized OCP.
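For the record, the attach/detach side of this can be confirmed from outside the cluster as well; a sketch of one possible check using the AWS CLI (the volume ID is from this report, the region is inferred from the us-east-1d zone in the logs, and this is an illustrative check rather than the exact verification procedure used):

```
# The volume should show a single attachment on the instance that is
# running the new pod, with State "attached".
aws ec2 describe-volumes --region us-east-1 \
    --volume-ids vol-0ed79e9051f8d1d4d \
    --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State}'

# And the new pod should reach Running on the new node:
oc get pods -o wide
```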
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1459006 to track the containerized env issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1425