Bug 1298284

Summary: Races in persistent volume attachment when starting a pod.
Product: OKD Reporter: Jan Safranek <jsafrane>
Component: Storage Assignee: Jan Safranek <jsafrane>
Status: CLOSED CURRENTRELEASE QA Contact: Liang Xia <lxia>
Severity: high Docs Contact:
Priority: high    
Version: 3.x CC: aos-bugs, bchilds, jhou, lxia, mturansk, pep
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-12 17:10:25 UTC Type: Bug

Description Jan Safranek 2016-01-13 16:19:40 UTC
Description of problem:
When a pod uses a PersistentVolumeClaim, attaching the persistent volume can race with pod startup, so the pod's containers fail to start on the first try.

Version-Release number of selected component (if applicable):
origin-3.1.0

How reproducible:
~30%-50%

Steps to Reproduce:
1. Create a PV and a claim (I use Cinder volumes, but I saw it on AWS and GCE too)
2. Define a pod that uses the claim
3. In a loop:
  3.1 create the pod
  3.2 wait until it's running
  3.3 run 'kubectl describe pods'
  3.4 delete it
  3.5 wait until the volume is unmounted and detached from the node (this is important!)
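
The loop in step 3 can be scripted. The following is only a rough sketch, not the script used for this report: the pod name, the pod definition file, and the `volume_detached` callback are hypothetical and must be supplied by the caller (for example, a check against the cloud provider's volume status for step 3.5). It uses only standard kubectl subcommands (create, get, describe, delete) and counts the iterations in which the sync error appears.

```python
import subprocess
import time

# Substring of the error reported by 'kubectl describe pods' in this bug.
ERROR_MARKER = "not all containers have started"


def run_kubectl(*args):
    """Run kubectl with the given arguments and return its stdout."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def wait_until(predicate, timeout=300, interval=2, sleep=time.sleep):
    """Poll predicate until it returns True; raise if the timeout expires."""
    waited = 0
    while not predicate():
        if waited >= timeout:
            raise TimeoutError("condition not met in time")
        sleep(interval)
        waited += interval


def reproduce(iterations, pod_name, pod_file, volume_detached,
              kubectl=run_kubectl, sleep=time.sleep):
    """Run the create/describe/delete loop; return how many runs hit the error.

    volume_detached is a caller-supplied (hypothetical) function that returns
    True once the volume is unmounted and detached from the node.
    """
    failures = 0
    for _ in range(iterations):
        # 3.1 create the pod
        kubectl("create", "-f", pod_file)
        # 3.2 wait until it is running
        wait_until(
            lambda: kubectl(
                "get", "pod", pod_name, "-o", "jsonpath={.status.phase}"
            ) == "Running",
            sleep=sleep,
        )
        # 3.3 look for the sync error in the pod description
        if ERROR_MARKER in kubectl("describe", "pods", pod_name):
            failures += 1
        # 3.4 delete the pod
        kubectl("delete", "pod", pod_name)
        # 3.5 wait until the volume is unmounted and detached (important!)
        wait_until(volume_detached, sleep=sleep)
    return failures
```

Calling `reproduce(20, "mypod", "pod.yaml", my_detach_check)` (all three arguments being placeholders) would return the number of iterations out of 20 that showed the error.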

Actual results:
At step 3.3, the output shows that the pod's container(s) had to be restarted; "kubectl describe pods" reports something like:

Error syncing pod, skipping: not all containers have started: 0 != 1

Expected results:
The containers start on the first try.

Additional info:
The pod does start eventually, just more slowly (about 1 minute in my OpenStack setup, where attaching a volume is slow).

Fix: https://github.com/kubernetes/kubernetes/pull/19600

Comment 1 Mark Turansky 2016-01-13 16:24:56 UTC
In a conversation with Jan, we determined this is not a blocker because kubelet and the volume eventually reach the correct state.  It might take a few minutes to reconcile, making it a bad UX but not a blocker.

Comment 2 Mark Turansky 2016-02-04 13:11:40 UTC
Upstream PR is merged.  Awaiting rebase into Origin.

Comment 3 Jan Safranek 2016-02-08 11:10:21 UTC
In case there is no rebase, I filed an Origin PR: https://github.com/openshift/origin/pull/7107

Comment 4 Jan Safranek 2016-02-09 14:38:54 UTC
Origin PR merged

Comment 5 Jianwei Hou 2016-02-16 02:45:55 UTC
Verified with
openshift v1.1.2-260-gf556adc
kubernetes v1.2.0-origin
etcd 2.2.2+git

The issue described here is no longer reproducible; moving to verified.