Bug 1419577

Summary: [3.3] Unmount device fails in certain cases
Product: OKD
Component: Storage
Version: 3.x
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: Hemant Kumar <hekumar>
Assignee: Hemant Kumar <hekumar>
QA Contact: Chao Yang <chaoyang>
CC: aos-bugs, aos-storage-staff, eparis, jhou
Type: Bug
Last Closed: 2017-05-30 12:47:58 UTC

Description Hemant Kumar 2017-02-06 14:50:42 UTC
Description of problem:

Sometimes when a pod is moved, as a result of a drain or something else, its bind mount is unmounted but the attached AWS EBS device never gets unmounted from the node. Because the device mount remains, the volume cannot be detached, so the pod never starts on another node.


How reproducible:

Sometimes


Steps to Reproduce:
1. Create a multi-node cluster and create 5-6 deployments (not bare pods) on one node. Make sure these pods write to the mounted EBS PV (e.g. a busybox loop that keeps writing to the volume).
2. Now drain that node so that all pods running on it move.
3. Check whether all moved pods are running successfully. I had to do this several times to reproduce the bug; see the sketch after these steps.
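A minimal reproduction sketch, assuming an OKD 3.3 cluster, the oc/oadm clients, per-deployment EBS-backed claims named ebs-claim-1..6 that already exist, and a node named node1 (all names illustrative):

# Create several deployments whose pods continuously write to an EBS PV
for i in $(seq 1 6); do
  oc run writer-$i --image=busybox --restart=Always \
    -- sh -c 'while true; do date >> /data/out.log; sleep 1; done'
  oc volume dc/writer-$i --add --type=persistentVolumeClaim \
    --mount-path=/data --claim-name=ebs-claim-$i
done

# Evacuate the node so all of its pods are rescheduled elsewhere
oadm manage-node node1 --evacuate

# Repeat the evacuation and watch for pods stuck in ContainerCreating
oc get pods -o wide -w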

Actual results:

One or more pods can get stuck in ContainerCreating because the EBS device never gets attached to the new node.
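When it happens, the stuck pod's events show attach/mount timeouts; a hypothetical example (pod and project names illustrative):

oc describe pod writer-3-1-x2x9p
...
  FailedMount   Unable to mount volumes for pod "writer-3-1-x2x9p": timeout expired
                waiting for volumes to attach/mount for pod "writer-3-1-x2x9p"/"myproject"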


Expected results:

All pods should move successfully.


Additional info:

If the device a pod is using is "busy" (i.e. being written to), the first unmount fails with a "device is busy" error. Eventually the container is deleted and the device becomes "unbusy", but the error-handling code in volumemanager doesn't kick in, and it deletes the device from the actual state of world, thinking the device is unmounted. In other words, the current code silently swallows the unmount error, and because the error is not propagated, volumemanager believes the device was successfully unmounted.

https://github.com/openshift/ose/pull/602/files#diff-f7240ab860b1b30388948da95bb5b02aR237
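The mount-level behavior described above can be reproduced by hand on the node; a sketch, with the device mount path illustrative:

# While a container is still writing to the volume, unmounting the
# global device mount fails:
umount /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/aws-ebs/mounts/vol-0abc123
umount: /var/lib/origin/.../vol-0abc123: target is busy.

# Once the container is deleted the same umount would succeed, but by
# then volumemanager has already removed the device from its actual
# state of world, so the unmount is never retried and the volume stays
# attached to the node.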

Comment 2 Chao Yang 2017-02-15 10:53:51 UTC
The unmount issue passed verification on
openshift v3.3.1.13
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

Create 5 apps like below:
oc new-app php:5.6~https://github.com/openshift/sti-php --context-dir='5.6/test/test-app'
Create a dynamic PVC:
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "ebsc2",
    "annotations": {
        "volume.alpha.kubernetes.io/storage-class": "foo"
    },
    "labels": {
        "name": "dynamic-pvc"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    }
  }
}
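Save the claim to a file and create it (file name assumed):
oc create -f ebsc2-pvc.json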
oc volume dc/sti-php --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc2 --overwrite
oadm manage-node ip-172-18-5-95.ec2.internal --evacuate --pod-selector="app=sti-php"
Check that the EBS volume is unmounted from node ip-172-18-5-95.ec2.internal.
No error messages when running grep "device is busy error" /var/log/messages.
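A quick way to double-check on the evacuated node (device names illustrative):

# No EBS PV mounts should remain:
mount | grep aws-ebs          # expect no output
# The extra EBS block device (e.g. /dev/xvdf) should be detached:
lsblk
# And the evacuated pods should all be Running elsewhere:
oc get pods -o wide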