Bug 1455675 - [3.5] Volume unmounted from node but not detached - no unmount request in logs
Summary: [3.5] Volume unmounted from node but not detached - no unmount request in logs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.5.z
Assignee: Matthew Wong
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 1457510
 
Reported: 2017-05-25 18:27 UTC by Hemant Kumar
Modified: 2017-06-15 18:40 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Volumes attached to non-running AWS instances get incorrectly marked as detached by the periodic 'verify volumes are attached' routine, because non-running AWS instances are not considered nodes by the routine.
Consequence: Volumes that are incorrectly marked as detached will never be detached if or when they need to be later.
Fix: Consider non-running AWS instances to be nodes in the 'verify volumes are attached' routine (see the sketch below the metadata fields).
Result: Volumes attached to non-running AWS instances are correctly tracked as attached and will be detached when they need to be later.
Clone Of:
: 1457510 (view as bug list)
Environment:
Last Closed: 2017-06-15 18:40:59 UTC
Target Upstream Version:
Embargoed:
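
For context, here is a minimal, hypothetical sketch of the 'verify volumes are attached' idea described in the Doc Text above. The type and function names (instance, verifyVolumesAttached, includeStopped) are illustrative assumptions only, not the real Kubernetes attach/detach controller or AWS cloud-provider code.

```
// Hypothetical sketch: how skipping non-running instances makes their
// volumes look detached, and how including them keeps the volumes tracked.
package main

import "fmt"

type instance struct {
	id      string
	state   string   // e.g. "running", "stopped"
	volumes []string // EBS volume IDs currently attached to this instance
}

// verifyVolumesAttached reports which volumes should still be considered
// attached. The buggy behaviour corresponds to includeStopped=false:
// non-running instances are skipped, so their volumes are wrongly marked
// detached and never get a detach later. The fix corresponds to
// includeStopped=true: stopped instances are still treated as nodes.
func verifyVolumesAttached(instances []instance, includeStopped bool) map[string]bool {
	attached := make(map[string]bool)
	for _, inst := range instances {
		if inst.state != "running" && !includeStopped {
			continue // old behaviour: instance dropped, its volumes "forgotten"
		}
		for _, vol := range inst.volumes {
			attached[vol] = true
		}
	}
	return attached
}

func main() {
	instances := []instance{
		{id: "i-aaa", state: "running", volumes: []string{"vol-1"}},
		{id: "i-bbb", state: "stopped", volumes: []string{"vol-0ed79e9051f8d1d4d"}},
	}
	fmt.Println("before fix:", verifyVolumesAttached(instances, false))
	fmt.Println("after fix: ", verifyVolumesAttached(instances, true))
}
```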


Attachments
grep vol-0ed79e9051f8d1d4d /var/log/messages (46.02 KB, text/plain), 2017-06-05 12:06 UTC, Jianwei Hou


Links
Red Hat Product Errata RHBA-2017:1425 - normal - SHIPPED_LIVE - OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update - Last Updated: 2017-06-15 22:35:53 UTC

Comment 7 Jianwei Hou 2017-06-05 12:05:11 UTC
Tested on openshift v3.5.5.23

1. Create a PVC, PV, and rc.
2. Stop the node service on the node the pod is scheduled to.
3. The pod is rescheduled to a new node, but is stuck in the 'ContainerCreating' state:
# oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m

4. Bring the node service back up; the pod ebs-dpz5p is deleted.
5. The volume is not unmounted and detached successfully; the new pod cannot become 'Running':
# oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-s2h7b   0/1       ContainerCreating   0          29m

# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```
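
The repeated "No retries permitted until ... (durationBeforeRetry 2m0s)" messages above come from the node backing off between failed unmount attempts. Below is a minimal, hypothetical Go sketch of that retry-with-backoff pattern; pendingOp and tryUnmount are made-up names, not the real nestedpendingoperations implementation.

```
package main

import (
	"errors"
	"fmt"
	"time"
)

// pendingOp tracks one volume operation and when it may be retried,
// mirroring the "No retries permitted until <time>" messages in the log.
type pendingOp struct {
	notBefore time.Time     // no retries permitted until this time
	backoff   time.Duration // e.g. the 2m0s durationBeforeRetry seen above
}

// tryUnmount runs the unmount unless we are still inside the backoff window;
// on failure it pushes the next allowed attempt out by the backoff duration.
func (p *pendingOp) tryUnmount(now time.Time, unmount func() error) {
	if now.Before(p.notBefore) {
		return // still backing off, skip this reconciler pass
	}
	if err := unmount(); err != nil {
		p.notBefore = now.Add(p.backoff)
		fmt.Printf("unmount failed, no retries permitted until %s: %v\n",
			p.notBefore.Format(time.RFC3339), err)
	}
}

func main() {
	op := &pendingOp{backoff: 2 * time.Minute}
	busy := func() error { return errors.New("device or resource busy") }

	op.tryUnmount(time.Now(), busy)                     // fails and starts the backoff window
	op.tryUnmount(time.Now().Add(30*time.Second), busy) // skipped: still inside the window
}
```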

Comment 8 Jianwei Hou 2017-06-05 12:06:53 UTC
Created attachment 1285030 [details]
grep vol-0ed79e9051f8d1d4d /var/log/messages

Comment 10 Hemant Kumar 2017-06-05 13:59:09 UTC
No, the original cluster wasn't containerized. None of the clusters in openshift.io are containerized. Is the cluster Jianwei is using containerized?

Comment 11 Matthew Wong 2017-06-05 15:15:47 UTC
Yes, nsenter_mount is doing the mounting. IMO we should test this bug against a non-containerized env and open a new bug against containerized.

Comment 12 Hemant Kumar 2017-06-05 15:18:58 UTC
Yes - I agree. If the failure was because of atomic-openshift-node running in a container, we might have to double check several different code paths to fix that. 

If this bug is fixed as-is in non-containerized environments, we should go ahead and accept it as VERIFIED. @jianwei - would you agree to that?

Comment 13 Jianwei Hou 2017-06-06 05:49:48 UTC
I agree. I have verified this is fixed on an RPM-installed OCP cluster, v3.5.5.23. The EBS volume was unmounted and detached from the old node, then attached and mounted to the new node.
Could you please move it to ON_QA status?
I'll open a new bug against containerized OCP.

Comment 14 Jianwei Hou 2017-06-06 06:12:46 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1459006 to track the containerized env issue.

Comment 16 errata-xmlrpc 2017-06-15 18:40:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1425

