Description of problem:
On a containerized AEP install, pods using a cinder volume fail because the disk cannot be attached within the 60s timeout. This could not be reproduced on RPM-installed environments on the same OpenStack.

Version-Release number of selected component (if applicable):
openshift v3.1.1.908
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1. Have an environment with OpenStack properly configured as the cloud provider
2. Create a PVC, which dynamically provisions a PV:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/cinder/dynamic-provisioning/pvc.json
3. After the PV and PVC are bound, create a pod:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/cinder/pod.json
4. oc get pods; oc get events

Actual results:
After step 4, the pod stays in status 'ContainerCreating' and the events show a timeout error:

FIRSTSEEN   LASTSEEN   COUNT   NAME       KIND   SUBOBJECT   REASON        SOURCE                                           MESSAGE
11m         11m        1       cinderpd   Pod                Scheduled     {default-scheduler }                             Successfully assigned cinderpd to openshift-163.lab.eng.nay.redhat.com
10m         29s        10      cinderpd   Pod                FailedMount   {kubelet openshift-163.lab.eng.nay.redhat.com}   Unable to mount volumes for pod "cinderpd_jhou(5fd1dab6-df7f-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s
10m         29s        10      cinderpd   Pod                FailedSync    {kubelet openshift-163.lab.eng.nay.redhat.com}   Error syncing pod, skipping: Could not attach disk: Timeout after 60s

The node logs showed:
```
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: E0301 15:55:33.049298   18326 kubelet.go:1716] Unable to mount volumes for pod "cinderpd_jhou(b0fa55bb-df82-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s; skipping pod
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: E0301 15:55:33.049317   18326 pod_workers.go:138] Error syncing pod b0fa55bb-df82-11e5-9f0d-fa163e554a2b, skipping: Could not attach disk: Timeout after 60s
Mar 01 15:55:33 openshift-163.lab.eng.nay.redhat.com docker[18282]: I0301 15:55:33.049382   18326 server.go:577] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"b0fa55bb-df82-11e5-9f0d-fa163e554a2b", APIVersion:"v1", ResourceVersion:"3198", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cinderpd_jhou(b0fa55bb-df82-11e5-9f0d-fa163e554a2b)": Could not attach disk: Timeout after 60s
```

Expected results:
The volume should be attached and mounted before the timeout occurs.

Additional info:
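For triage, it helps to check whether the attach actually succeeded on the OpenStack side while the kubelet is still waiting. A sketch only, assuming the OpenStack CLI is configured for the same tenant; the volume ID and device name are placeholders:

```
# Check the volume state from the OpenStack side:
cinder list                 # status should move from 'available' to 'in-use'
cinder show <volume-id>     # 'attachments' names the instance and device, e.g. /dev/vdb

# On the node host, check whether the block device actually appeared:
ls -l /dev/vd*
```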
I failed to set up OpenShift in OpenStack running as containers. The internal OS1 cloud is slow as hell and the ansible script fails at various stages. Can you please give me access to a machine where it is reproducible so I can take a look? Or teach me how to provision one; I heard you have some scripting around it.
Finally, I am able to reproduce it. It is indeed caused by the containerized openshift-node. When it attaches a cinder (or any other) volume to the host, it expects the appropriate device to be created in /dev. Since openshift runs in a container, it does not see the real /dev, only the container's /dev, and it times out waiting for the attached device. As a solution, I would propose running openshift-node with "docker run -v /dev:/dev". Alternatively, OpenShift/Kubernetes must be changed to look for devices in a configurable directory instead of the hardcoded /dev. The same should happen on GCE or with containerized OpenShift attaching iSCSI, Ceph RBD, or any other block device. AWS might be protected from this error because device names are assigned by the kubelet (and thus do not need to be read from /dev); I did not check this because running containerized OpenShift is quite painful. Jianwei, I am still very interested in an automated way to run containerized OpenShift on OpenStack and/or AWS, especially with nightly builds.
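To illustrate the root cause: once OpenStack reports the volume as attached, the device node is visible in the host's /dev but not inside the node container. A sketch only; the container name and device path are assumptions:

```
# On the node host:
ls -l /dev/vdb                                  # the attached device is present

# Inside the containerized node (container name is an assumption):
docker exec -it openshift-node ls -l /dev/vdb   # fails with "No such file or directory"
                                                # because the container has its own /dev
```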
So, can anyone add "-v /dev/:/dev" to /etc/systemd/system/openshift-node.service when running the node as a container? Is it a good idea? Reassigning to the Containers component.
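A sketch of what the proposed change to the unit file would look like; the existing docker run options and image are deliberately elided, and the only relevant addition is the -v /dev:/dev bind mount:

```
# /etc/systemd/system/openshift-node.service (fragment, illustrative only)
ExecStart=/usr/bin/docker run ... \
    -v /dev:/dev \
    ...
```

After editing the unit, a systemctl daemon-reload and a restart of openshift-node are needed for the change to take effect.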
experimental patch: https://github.com/openshift/origin/pull/8119
Adding Scott to cc:. You're the last one who updated the .service files - can you please look at it?
The mountpoint check should be running in the host mount namespace -- why do we need to bind-mount /dev?
Because the volume plugins are not aware of running in a container; only the mounter is.
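For context, this is roughly what the nsenter-based mounter does: rather than calling mount in the container's own namespace, it re-enters the host's mount namespace so the mount is performed against the real host. A sketch, assuming the host filesystem is bind-mounted at /rootfs inside the node container; the device and target path are placeholders:

```
# Run from inside the node container; /rootfs/proc/1/ns/mnt is the mount
# namespace of PID 1 on the host.
nsenter --mount=/rootfs/proc/1/ns/mnt -- \
    /bin/mount -t ext4 /dev/vdb /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~cinder/<volume-name>
```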
Why is this assigned to me? Jon? This looks to be specific to the container.
Fixed by https://github.com/openshift/origin/pull/8182
Assigning to Scott, he did the fix. No development work should be needed now; we're just waiting for QE to confirm the bug is really fixed.
This should be in OSE v3.2.0.7, which was built and pushed to QE today.
Tested on:
openshift v3.2.0.7
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Steps were the same as in the bug description. The pod status remained 'ContainerCreating':

# oc get pods
NAME       READY     STATUS              RESTARTS   AGE
cinderpd   0/1       ContainerCreating   0          10m

Ran docker exec interactively in the node container and tailed /var/log/messages; it showed the following errors:
```
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904016    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "cinderpd_jhou(8d277d8b-f1ad-11e5-af25-fa163e4f5f19)": exit status 32
Mar 24 06:56:06 openshift-111 atomic-openshift-node: I0324 06:56:06.904103    7383 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"jhou", Name:"cinderpd", UID:"8d277d8b-f1ad-11e5-af25-fa163e4f5f19", APIVersion:"v1", ResourceVersion:"7283", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: exit status 32
```

In the OpenStack console -> Volumes, the UI showed the volume was in-use and attached to my node 'openshift-111.lab.eng.nay.redhat.com', but openshift treated it as a failed mount.

Furthermore, the PV I created was dynamically provisioned. Because the pod was stuck in 'ContainerCreating', I deleted the pod and PVC; the provisioned PV and cinder volume were left behind and were not deleted. (If this is later considered a separate issue, I will open another bug to track it.)
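Note that mount(8) exits with status 32 on any mount failure, so "exit status 32" only says the mount failed, not why. One way to surface the underlying error is to repeat the mount by hand through the same nsenter path the mounter uses. A sketch only; the device and mount point are placeholders:

```
# From inside the node container, enter the host mount namespace and mount manually:
nsenter --mount=/rootfs/proc/1/ns/mnt -- mkdir -p /mnt/cinder-test
nsenter --mount=/rootfs/proc/1/ns/mnt -- /bin/mount -t ext4 /dev/vdb /mnt/cinder-test
# The error printed here (wrong fs type, special device does not exist,
# bad superblock, ...) is what gets collapsed into "exit status 32".
```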
There is something wrong with the OpenShift nsenter mounter; I'll look at it.
Fixed mounter: https://github.com/kubernetes/kubernetes/pull/23435. Waiting for review.
Merged as origin PR: https://github.com/openshift/origin/pull/8501
Merge failed. Please try merging again.
8501 just merged
Should be in atomic-openshift-3.2.0.18-1.git.0.c3ac515.el7. This has been built and staged for qe.
Verified on a containerized setup of:
openshift v3.2.0.18
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Ran the reproduction steps; the bug is no longer reproducible. Marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064