Bug 1313210
| Summary: | Cinder volume could not be attached to disk before the '60s' timeout duration on containerized openshift | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jianwei Hou <jhou> |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.2.0 | CC: | agoldste, aos-bugs, jhou, jkrieger, jokerman, jsafrane, mmccomas, mwysocki, pmorie, sdodson, tdawson |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-12 16:30:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Jianwei Hou, 2016-03-01 08:21:56 UTC)
I failed to set up OpenShift in OpenStack running as containers. The internal OS1 cloud is slow as hell and the Ansible script fails at various stages. Can you please give me access to a machine where it is reproducible so I can take a look? Or teach me how to provision one; I heard you have some scripting around it.

Finally, I am able to reproduce it. It is indeed caused by the containerized openshift-node. When it attaches a Cinder (or any other) volume to the host, it expects the appropriate device to be created in /dev. Since OpenShift runs in a container, it does not see the real /dev but only the container's own, and it times out waiting for the attached device. As a solution, I would propose running openshift-node with "docker run -v /dev:/dev". Alternatively, OpenShift/Kubernetes must be changed to look for devices in a configurable directory rather than a hardcoded /dev. The same should also happen on GCE, or with containerized OpenShift attaching iSCSI, Ceph RBD, or any other block device. AWS might be protected from this error because device names are assigned by the kubelet (and thus do not need to be read from /dev) - I did not check this, as running containerized OpenShift is quite painful. Jianwei, I am still very interested in some automated way to run containerized OpenShift on OpenStack and/or AWS, especially with nightly builds.

So, can anyone add "-v /dev/:/dev" to /etc/systemd/system/openshift-node.service when running the node as a container? Is it a good idea? (A rough sketch of such a change is shown below.) Reassigning to the Containers component.

Experimental patch: https://github.com/openshift/origin/pull/8119

Adding Scott to cc:. You're the last one who updated the .service files - can you please look at it?

The mountpoint check should be running in the host mount namespace; why do we need to mount in /dev?

Because volume plugins are not aware of running in a container. Only the mounter is.

Why is this assigned to me? Jon? This looks to be specific to the container.

Fixed by https://github.com/openshift/origin/pull/8182. Assigning to Scott, he did the fix. No development work should be needed now; we are just waiting for QE to confirm the bug is really fixed.

This should be in OSE v3.2.0.7, which was built and pushed to QE today.
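For illustration only, a minimal sketch of the "-v /dev:/dev" idea proposed above, assuming the containerized node is started by a docker run invocation in the unit's ExecStart (the container name and exact option placement are assumptions, not the fix that was actually merged):

```
# Sketch: add a bind mount of the host /dev to the docker run line in
# /etc/systemd/system/openshift-node.service, for example (flags abbreviated,
# container name assumed):
#
#   ExecStart=/usr/bin/docker run --name openshift-node ... -v /dev:/dev ...
#
# then reload systemd and restart the node service:
systemctl daemon-reload
systemctl restart openshift-node
```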
Tested on:
openshift v3.2.0.7
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
Steps were the same as in the bug description. The pod status remained in 'ContainerCreating':
# oc get pods
NAME READY STATUS RESTARTS AGE
cinderpd 0/1 ContainerCreating 0 10m
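To see why the pod is stuck, its events can be checked first. A quick way to do that (pod name and project taken from this report; the exact output is not shown here) is:

```
# Describe the stuck pod; the Events section typically shows the same
# FailedMount/FailedSync warnings that appear in the node log below.
oc describe pod cinderpd -n jhou
```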
Running docker exec interactively in the node container and tailing /var/log/messages showed the following errors (a sketch of the commands follows the log excerpt):
```
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904016    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cinderpd_jhou(8d277d8b-f1ad-11e5-af25-fa163e4f5f19)": exit status 32
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904103    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: exit status 32
```
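For reference, the inspection described above might look roughly like this on a containerized node (the container name and filter are assumptions, not taken from this report):

```
# Find the node container, then follow the system log from inside it.
docker ps --filter name=node
docker exec -it atomic-openshift-node tail -f /var/log/messages
```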
In the OpenStack console -> Volumes, the UI showed that the volume was in-use and attached to my node 'openshift-111.lab.eng.nay.redhat.com', but OpenShift still considered the mount failed.
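The attachment state seen in the console can usually also be confirmed from the command line with the Cinder client (the volume ID below is a placeholder; no CLI output is part of this report):

```
# List volumes and inspect the one backing the PV; its status should be
# "in-use" and attached to the node. <volume-id> is hypothetical.
cinder list
cinder show <volume-id>
```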
Furthermore, the PV I created was dynamically provisioned. Since the pod was stuck in 'ContainerCreating', I deleted the pod and the PVC, but the provisioned PV and the Cinder volume were left behind; they were not deleted. (If this is later considered a separate issue, I will open another bug to track it.)

There is something wrong with the OpenShift nsenter mounter, I'll look at it. Fixed mounter: https://github.com/kubernetes/kubernetes/pull/23435. Waiting for review.
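To illustrate the general technique behind the nsenter mounter (a sketch only, not the code from the linked PR): a containerized kubelet can run mount(8) in the host's mount namespace via nsenter, assuming the host root filesystem is bind-mounted into the node container at /rootfs. The device and target paths below are made up for illustration:

```
# Run mount in the host mount namespace from inside the node container.
# /rootfs, /dev/vdb and /mnt/cinder are assumptions for this sketch.
nsenter --mount=/rootfs/proc/1/ns/mnt -- /bin/mount -t ext4 /dev/vdb /mnt/cinder
```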
Merged as origin PR: https://github.com/openshift/origin/pull/8501

Merge failed. Please try merging again.

8501 just merged.

Should be in atomic-openshift-3.2.0.18-1.git.0.c3ac515.el7. This has been built and staged for QE.

Verified on a containerized setup of openshift v3.2.0.18, kubernetes v1.2.0-36-g4a3f9c5, etcd 2.2.5. I ran the reproduction steps and the bug is no longer reproducible, so I am marking this bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064