Bug 1313210 - Cinder volume could not be attached to disk before the '60s' timeout duration on containerized openshift
Summary: Cinder volume could not be attached to disk before the '60s' timeout duration on containerized openshift
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jan Safranek
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-01 08:21 UTC by Jianwei Hou
Modified: 2016-05-12 16:30 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:30:51 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHSA-2016:1064 (Private: 0, Priority: normal, Status: SHIPPED_LIVE)
Summary: Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update
Last Updated: 2016-05-12 20:19:17 UTC

Description Jianwei Hou 2016-03-01 08:21:56 UTC
Description of problem:
On a containerized AEP installation, Cinder volumes for pods cannot be attached to the disk within the 60s timeout. This could not be reproduced on RPM-installed environments on the same OpenStack.

Version-Release number of selected component (if applicable):
openshift v3.1.1.908
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Have an environment with OpenStack properly configured as the cloud provider

2. Create a PVC which dynamically creates a PV
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/cinder/dynamic-provisioning/pvc.json

3. After PV and PVC are bound, create a pod
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/cinder/pod.json

4. oc get pods; oc get events
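
For reference, the contents of the linked pvc.json and pod.json are not reproduced here; a minimal sketch of an equivalent claim and pod (the names, size, image, and the alpha storage-class annotation from the Kubernetes 1.2 era are illustrative assumptions, not the test files themselves) might look roughly like this:

```
# Sketch only -- not the contents of the linked test files.
# Claim that should be dynamically provisioned as a Cinder-backed PV:
oc create -f - <<'EOF'
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "cinder-claim",
    "annotations": { "volume.alpha.kubernetes.io/storage-class": "foo" }
  },
  "spec": {
    "accessModes": [ "ReadWriteOnce" ],
    "resources": { "requests": { "storage": "1Gi" } }
  }
}
EOF

# Pod consuming the claim once it is Bound:
oc create -f - <<'EOF'
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": { "name": "cinderpd" },
  "spec": {
    "containers": [{
      "name": "cinderpd",
      "image": "registry.access.redhat.com/rhel7",
      "command": ["sleep", "3600"],
      "volumeMounts": [{ "name": "pvol", "mountPath": "/mnt/cinder" }]
    }],
    "volumes": [{
      "name": "pvol",
      "persistentVolumeClaim": { "claimName": "cinder-claim" }
    }]
  }
}
EOF
```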

Actual results:
After step 4, the pod is stuck in status 'ContainerCreating' and the events show a timeout error:

FIRSTSEEN   LASTSEEN   COUNT     NAME       KIND      SUBOBJECT   REASON        SOURCE                                           MESSAGE
11m         11m        1         cinderpd   Pod                   Scheduled     {default-scheduler }                             Successfully assigned cinderpd to openshift-163.lab.eng.nay.redhat.com
10m         29s        10        cinderpd   Pod                   FailedMount   {kubelet openshift-163.lab.eng.nay.redhat.com}   Unable to mount volumes for pod "cinderpd_jhou(5fd1dab6-df7f-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s
10m         29s        10        cinderpd   Pod                   FailedSync    {kubelet openshift-163.lab.eng.nay.redhat.com}   Error syncing pod, skipping: Could not attach disk: Timeout after 60s

The node logs showed:
```
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: E0301 15:55:33.049298   18326 kubelet.go:1716] Unable to mount volumes for pod "cinderpd_jhou(b0fa55bb-df82-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s; skipping pod
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: E0301 15:55:33.049317   18326 pod_workers.go:138] Error syncing pod b0fa55bb-df82-11e5-9f0d-fa163e554a2b, skipping: Could not attach disk: Timeout after 60s
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: I0301 15:55:33.049382   18326 server.go:577] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"b0fa55bb-df82-11e5-9f0d-fa163e554a2b", APIVersion:"v1", ResourceVersion:"3198", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cinderpd_jhou(b0fa55bb-df82-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s
```
Expected results:
The volume should be attached and mounted before the timeout occurs.

Additional info:

Comment 1 Jan Safranek 2016-03-16 16:21:41 UTC
I failed to set up OpenShift running as containers in OpenStack. The internal OS1 cloud is slow as hell and the Ansible script fails at various stages.

Can you please give me access to a machine where it is reproducible, so I can take a look? Or teach me how to provision one; I heard you have some scripting around it.

Comment 2 Jan Safranek 2016-03-17 16:00:03 UTC
Finally, I am able to reproduce it. It is indeed caused by the containerized openshift-node. When it attaches a Cinder (or any other) volume to the host, it expects the corresponding device to appear in /dev. Since OpenShift runs in a container, it does not see the real /dev but the container's own, and it times out waiting for the attached device.
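
A quick way to see the mismatch (a sketch only; the container name and the exact device link names depend on the setup, hypervisor, and Cinder volume ID):

```
# On the node host, the attached Cinder volume shows up among the device links:
ls -l /dev/disk/by-id/        # contains a link for the newly attached device

# Inside the containerized openshift-node (which has its own /dev), the kubelet
# never sees that device, so the 60s wait for the attach expires:
docker exec -it <node-container> ls -l /dev/disk/by-id/   # link missing (or the
                                                          # directory absent entirely)
```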


As a solution, I would propose running openshift-node with "docker run -v /dev:/dev". Alternatively, OpenShift/Kubernetes would have to be changed to look for devices in a configurable directory instead of the hardcoded /dev.

The same should also happen on GCE, or with containerized OpenShift attaching iSCSI, Ceph RBD, or any other block device. AWS might be protected from this error because device names are assigned by the kubelet (and thus do not need to be read from /dev); I did not check this, as running containerized OpenShift is quite painful.


Jianwei, I am still very interested in an automated way to run containerized OpenShift on OpenStack and/or AWS, especially with nightly builds.

Comment 3 Jan Safranek 2016-03-17 17:14:30 UTC
So, can anyone add "-v /dev/:/dev" to /etc/systemd/system/openshift-node.service when running the node as a container? Is that a good idea?
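
For illustration only: the unit's ExecStart effectively runs a "docker run" command with more options (and a real image name) than shown here; the relevant change is the extra "-v /dev:/dev" bind mount so the node container sees the host's device nodes. A rough sketch:

```
# Illustrative sketch -- not the actual unit contents for any given version.
docker run --name atomic-openshift-node \
  --privileged --net=host --pid=host \
  -v /:/rootfs:ro \
  -v /var/lib/origin:/var/lib/origin \
  -v /dev:/dev \
  <node-image>

# After editing the unit:
systemctl daemon-reload
systemctl restart openshift-node
```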

Reassigning to the Containers component.

Comment 4 Jan Safranek 2016-03-18 11:59:25 UTC
experimental patch: https://github.com/openshift/origin/pull/8119

Comment 5 Jan Safranek 2016-03-18 12:00:59 UTC
Adding Scott to CC. You're the last one who updated the .service files; can you please take a look?

Comment 6 Paul Morie 2016-03-18 17:25:10 UTC
The mountpoint check should be running in the host mount namespace, so why do we need to bind-mount /dev?

Comment 7 Jan Safranek 2016-03-18 17:39:48 UTC
Because the volume plugins are not aware that they are running in a container; only the mounter is.

Comment 8 Daniel Walsh 2016-03-21 19:53:46 UTC
Why is this assigned to me?  Jon?  This looks to be specific to the container.

Comment 9 Jan Safranek 2016-03-23 11:03:18 UTC
Fixed by https://github.com/openshift/origin/pull/8182

Assigning to Scott, who did the fix. No development work should be needed now; we're just waiting for QE to verify the bug is really fixed.

Comment 10 Troy Dawson 2016-03-23 21:33:19 UTC
This should be in OSE v3.2.0.7, which was built and pushed to QE today.

Comment 12 Jianwei Hou 2016-03-24 11:05:44 UTC
Tested on
openshift v3.2.0.7
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

The steps were the same as in the bug description.

The pod status remained 'ContainerCreating':
# oc get pods
NAME       READY     STATUS              RESTARTS   AGE
cinderpd   0/1       ContainerCreating   0          10m

Ran docker exec interactively in the node container, tailed /var/log/messages, and found the following errors:

```
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904016    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cinderpd_jhou(8d277d8b-f1ad-11e5-af25-fa163e4f5f19)": exit status 32
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904103    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: exit status 32
```

In the OpenStack console under Volumes, the UI showed that the volume was in-use and attached to my node 'openshift-111.lab.eng.nay.redhat.com', but OpenShift reported a failed mount.
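
For context, "exit status 32" is mount(8)'s generic "mount failure" return code, which matches the observation above: the Cinder attach succeeded, but mounting the device from the node container failed. A sketch of checking the state from the node host (device names are illustrative):

```
# Did the attached Cinder device actually appear on the host?
lsblk                      # e.g. an extra /dev/vdb for the attached volume
ls -l /dev/disk/by-id/     # device links for the attached volume
# What did the failed mount attempt report at the kernel level?
dmesg | tail -n 50
```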

Furthermore, the PV I created was dynamically provisioned. Since the pod was stuck in 'ContainerCreating', I deleted the pod and the PVC, but the provisioned PV and the Cinder volume were left behind; they were not deleted. (If this is later considered a separate issue, I will open another bug to track it.)

Comment 14 Jan Safranek 2016-03-24 15:35:15 UTC
There is something wrong with the OpenShift nsenter mounter; I'll look at it.

Comment 15 Jan Safranek 2016-03-24 16:33:02 UTC
Fixed mounter: https://github.com/kubernetes/kubernetes/pull/23435.
Waiting for review.
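
For background, the nsenter mounter runs mount in the host's mount namespace (reached through the host root bind-mounted into the node container) instead of in the container's own namespace. A rough sketch of the idea, with illustrative paths, device name, and placeholders, not the exact kubelet code:

```
# What the containerized kubelet roughly does via its nsenter mounter:
# enter the host's mount namespace through the bind-mounted host root and run
# the real mount there, so the mount is visible on the host as well.
nsenter --mount=/rootfs/proc/1/ns/mnt -- \
  /bin/mount -t ext4 /dev/vdb \
  /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~cinder/<pv-name>
```

Per comments 14 and 15, the "exit status 32" failure in comment 12 came from this mounter code path, and the Kubernetes PR referenced above adjusts it.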

Comment 19 Jan Safranek 2016-04-14 07:35:32 UTC
Merged as origin PR: https://github.com/openshift/origin/pull/8501

Comment 20 Troy Dawson 2016-04-15 16:37:30 UTC
Merge failed.  Please try merging again.

Comment 21 Andy Goldstein 2016-04-18 15:37:24 UTC
8501 just merged

Comment 22 Troy Dawson 2016-04-20 16:37:53 UTC
This should be in atomic-openshift-3.2.0.18-1.git.0.c3ac515.el7, which has been built and staged for QE.

Comment 23 Jianwei Hou 2016-04-22 06:51:33 UTC
Verified on containerized setup of
openshift v3.2.0.18
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Ran the reproduction steps; the bug is no longer reproducible. Marking this bug as verified.

Comment 25 errata-xmlrpc 2016-05-12 16:30:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

