1455675 – [3.5] Volume unmounted from node but not detached - no unmount request in logs

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	3.5.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.5.z
Assignee:	Matthew Wong
QA Contact:	Chao Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1457510

Reported:	2017-05-25 18:27 UTC by Hemant Kumar
Modified:	2017-06-15 18:40 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Volumes attached to non-running AWS instances are incorrectly marked as detached by the periodic 'verify volumes are attached' routine, because the routine does not consider non-running AWS instances to be nodes.
Consequence: Volumes that are incorrectly marked as detached are never detached when they later need to be.
Fix: Consider non-running AWS instances to be nodes in the 'verify volumes are attached' routine.
Result: Volumes attached to non-running AWS instances are correctly tracked as attached and are detached when they later need to be.
Clone Of:
Clones:	1457510 (view as bug list)
Environment:
Last Closed:	2017-06-15 18:40:59 UTC
Target Upstream Version:
Embargoed:
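
The cause and fix described in the Doc Text can be modelled with a small sketch. This is a simplified illustration only, not the actual Kubernetes/OpenShift code: the names `Instance`, `verify_volumes_attached`, and `include_non_running` are hypothetical, and the real routine works against the AWS API and the controller's internal state rather than plain Python sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical stand-in for an AWS instance backing a node.
    node_name: str
    state: str  # e.g. "running" or "stopped"
    attached_volumes: frozenset


def verify_volumes_attached(instances, tracked, include_non_running):
    """Return the (node, volume) pairs that stay marked as attached.

    tracked: set of (node_name, volume_id) pairs believed to be attached.
    include_non_running: False models the reported bug, True models the fix.
    """
    # Instances the routine is willing to treat as nodes.
    visible = {
        inst.node_name: inst
        for inst in instances
        if include_non_running or inst.state == "running"
    }
    still_attached = set()
    for node_name, volume_id in tracked:
        inst = visible.get(node_name)
        # A pair is kept only if its node is visible and reports the volume.
        if inst is not None and volume_id in inst.attached_volumes:
            still_attached.add((node_name, volume_id))
    return still_attached


instances = [Instance("node-a", "stopped", frozenset({"vol-1"}))]
tracked = {("node-a", "vol-1")}

# Bug: the stopped instance is invisible, so the volume is marked detached.
buggy = verify_volumes_attached(instances, tracked, include_non_running=False)
# Fix: the stopped instance counts as a node, so the volume stays tracked.
fixed = verify_volumes_attached(instances, tracked, include_non_running=True)
```

In this model, a volume dropped from the tracked set is never scheduled for detach later, which matches the consequence stated in the Doc Text.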

Attachments
grep vol-0ed79e9051f8d1d4d /var/log/messages (46.02 KB, text/plain) 2017-06-05 12:06 UTC, Jianwei Hou

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:1425	0	normal	SHIPPED_LIVE	OpenShift Container Platform 3.5, 3.4, 3.3, and 3.2 bug fix update	2017-06-15 22:35:53 UTC

Comment 7 Jianwei Hou 2017-06-05 12:05:11 UTC

Tested on openshift v3.5.5.23

1. Create a PVC/PV and an rc.
2. Stop the node service on the node where the pod is scheduled.
3. The pod is rescheduled to a new node but is stuck in the 'ContainerCreating' state:
oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-dpz5p   1/1       Unknown             0          14m
ebs-s2h7b   0/1       ContainerCreating   0          7m

4. Bring the node service back; the pod ebs-dpz5p is deleted.
5. The volume is not unmounted and detached successfully, and the new pod never becomes 'Running'.
oc get pods
NAME        READY     STATUS              RESTARTS   AGE
ebs-s2h7b   0/1       ContainerCreating   0          29m

# grep vol-0ed79e9051f8d1d4d /var/log/messages
```
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:57:12.280650   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 07:57:12 ip-172-18-15-42 atomic-openshift-node: E0605 07:57:12.285100   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 07:59:12 ip-172-18-15-42 atomic-openshift-node: I0605 07:59:12.309581   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 journal: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: I0605 08:01:12.413553   20830 reconciler.go:189] UnmountVolume operation started for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (spec.Name: "pvol") from pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50").
Jun  5 08:01:12 ip-172-18-15-42 atomic-openshift-node: E0605 08:01:12.418120   20830 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d\" (\"a87c9660-49e1-11e7-b5c0-0e14d6e6ec50\")" failed. No retries permitted until 2017-06-05 08:03:12.418098061 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" (volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" (UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove /var/lib/origin/openshift.local.volumes/pods/a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy
```
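
To confirm that every retry in a log like the one above fails for the same reason, the failure lines can be summarized with a short script. This is a rough sketch: the regular expression is inferred only from the lines shown in this comment and may not match other log formats, and `summarize_failures` is a hypothetical helper name. The sample string is the tail of the first failure line above.

```python
import re
from collections import Counter

# Pattern inferred from the failure lines in this report (an assumption,
# not a documented log format).
FAILURE_RE = re.compile(
    r"\(durationBeforeRetry (?P<backoff>[^)]+)\)\. "
    r"Error: (?P<operation>[\w.]+) failed for volume .*: (?P<reason>[^:]+)$"
)


def summarize_failures(lines):
    """Count failures grouped by (operation, reason, back-off)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line.rstrip())
        if match:
            key = (match["operation"], match["reason"], match["backoff"])
            counts[key] += 1
    return counts


sample = (
    'No retries permitted until 2017-06-05 07:59:12.285077064 -0400 EDT '
    '(durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume '
    '"kubernetes.io/aws-ebs/aws://us-east-1d/vol-0ed79e9051f8d1d4d" '
    '(volume.spec.Name: "pvol") pod "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50" '
    '(UID: "a87c9660-49e1-11e7-b5c0-0e14d6e6ec50") with: remove '
    '/var/lib/origin/openshift.local.volumes/pods/'
    'a87c9660-49e1-11e7-b5c0-0e14d6e6ec50/volumes/kubernetes.io~aws-ebs/'
    'pvc-a050a061-49e1-11e7-b5c0-0e14d6e6ec50: device or resource busy'
)

summary = summarize_failures([sample])
for (operation, reason, backoff), count in summary.items():
    print(f"{operation}: {reason} (back-off {backoff}) x{count}")
```

Lines that only report "UnmountVolume operation started" do not match the pattern and are ignored.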

Comment 8 Jianwei Hou 2017-06-05 12:06:53 UTC

Created attachment 1285030
grep vol-0ed79e9051f8d1d4d /var/log/messages

Comment 10 Hemant Kumar 2017-06-05 13:59:09 UTC

No, the original cluster wasn't containerized. None of the clusters in openshift.io are containerized. Is the cluster Jianwei is using containerized?

Comment 11 Matthew Wong 2017-06-05 15:15:47 UTC

Yes, nsenter_mount is doing the mounting. IMO we should test this bug against a non-containerized env and open a new bug against containerized.

Comment 12 Hemant Kumar 2017-06-05 15:18:58 UTC

Yes - I agree. If the failure was because of atomic-openshift-node running in a container, we might have to double check several different code paths to fix that. 

If this bug is fixed as is in non-containerized environments, we should go ahead and accept it as VERIFIED. @jianwei - would you agree to that?

Comment 13 Jianwei Hou 2017-06-06 05:49:48 UTC

I agree. I have verified that this is fixed on an RPM-installed OCP cluster, v3.5.5.23. The EBS volume was unmounted and detached from the old node, then attached and mounted to the new node.
Could you please move it to on_qa status?
I'll open a new bug against containerized OCP.

Comment 14 Jianwei Hou 2017-06-06 06:12:46 UTC

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1459006 to track the containerized env issue.

Comment 16 errata-xmlrpc 2017-06-15 18:40:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1425
