Cause: The periodic 'verify volumes are attached' routine does not consider non-running AWS instances to be nodes, so volumes attached to those instances are incorrectly marked as detached.
Consequence: Volumes that are incorrectly marked as detached are never detached later, even when they need to be.
Fix: Consider non-running AWS instances to be nodes in the 'verify volumes are attached' routine.
Result: Volumes attached to non-running AWS instances are correctly tracked as attached and are detached when they later need to be.
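The cause and fix described above can be illustrated with a minimal sketch. This is not the actual Kubernetes/OpenShift code; the `Instance` class, the `verify_volumes_attached` function, the `include_non_running` flag, and the node and volume names are all hypothetical and exist only to show how filtering out non-running instances makes their volumes look detached.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for an AWS instance as seen by the cloud provider."""
    node_name: str
    state: str  # for example "running" or "stopped"
    attached_volumes: set = field(default_factory=set)


def verify_volumes_attached(instances, expected, include_non_running):
    """Return {node_name: {volume_id: still_attached}} for the expected volumes.

    include_non_running=False models the behaviour before the fix: an instance
    that is not running is not treated as a node, so every volume expected on
    it is reported as detached.
    """
    by_name = {
        inst.node_name: inst
        for inst in instances
        if include_non_running or inst.state == "running"
    }
    result = {}
    for node_name, volume_ids in expected.items():
        inst = by_name.get(node_name)
        attached = inst.attached_volumes if inst is not None else set()
        result[node_name] = {vol: vol in attached for vol in volume_ids}
    return result


# Hypothetical data: one stopped instance that still has a volume attached.
instances = [Instance("node-a", "stopped", {"vol-1"})]
expected = {"node-a": ["vol-1"]}

before_fix = verify_volumes_attached(instances, expected, include_non_running=False)
after_fix = verify_volumes_attached(instances, expected, include_non_running=True)
print(before_fix)  # {'node-a': {'vol-1': False}}
print(after_fix)   # {'node-a': {'vol-1': True}}
```

In the sketch, the only difference between the two calls is whether the stopped instance is counted as a node. With it excluded, the volume is reported as detached even though it is still attached, which is the condition that prevents a later detach.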
Tested on OpenShift v3.5.5.23.
1. Create a PVC/PV and an rc.
2. Stop the node service on the node where the pod is scheduled.
3. A replacement pod is scheduled to a new node, but it is stuck in the 'ContainerCreating' state.
oc get pods
NAME        READY   STATUS              RESTARTS   AGE
ebs-dpz5p   1/1     Unknown             0          14m
ebs-s2h7b   0/1     ContainerCreating   0          7m
4. Bring back the node service; the pod ebs-dpz5p is deleted.
5. The volume is not unmounted and detached successfully, so the new pod cannot become 'Running'.
oc get pods
NAME        READY   STATUS              RESTARTS   AGE
ebs-s2h7b   0/1     ContainerCreating   0          29m
# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553 20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun 5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120 20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```
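When reading an excerpt like the one above, it can help to pull out just the failing operation, the retry backoff, and the underlying error. The following is a rough sketch that assumes the failure lines keep the exact shape shown in this log; the `FAILURE` pattern and the `parse_failure` helper are hypothetical names, not part of any OpenShift tooling.

```python
import re

# Matches failure lines shaped like the nestedpendingoperations.go entries
# in the excerpt. The pattern is an assumption based on those lines only.
FAILURE = re.compile(
    r"No retries permitted until (?P<until>\S+ \S+) .*"
    r"\(durationBeforeRetry (?P<backoff>[^)]+)\)\. "
    r"Error: (?P<op>\S+) failed for volume .* with: .*: (?P<reason>[^:]+)$"
)


def parse_failure(line):
    """Return the retry time, backoff, operation and reason, or None if no match."""
    match = FAILURE.search(line)
    return match.groupdict() if match else None


# The first error line from the excerpt, split across literals for readability.
SAMPLE = (
    r'Jun 5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100 20830 '
    r'nestedpendingoperations.go:262] Operation for '
    r'"\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" '
    r'(\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. '
    r'No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT '
    r'(durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume '
    r'"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" '
    r'(volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" '
    r'(UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove '
    r'/var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50'
    r'/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: '
    r'device or resource busy'
)

parsed = parse_failure(SAMPLE)
print(parsed)
```

For the sample line, the extracted fields show that `UnmountVolume.TearDown` failed with "device or resource busy" and that the next retry was allowed after a 2m0s backoff, which matches the two-minute spacing of the attempts in the log.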
Yes - I agree. If the failure was because of atomic-openshift-node running in a container, we might have to double check several different code paths to fix that.
If this bug is fixed as is in non-containerized environments, we should go ahead and accept it as VERIFIED. @jianwei - would you agree to that?
I agree. I have verified that this is fixed on an RPM-installed OCP cluster, v3.5.5.23. The EBS volume was unmounted and detached from the old node, then attached and mounted to the new node.
Could you please move it to on_qa status?
I'll open a new bug against containerized OCP.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:1425