Description of problem:
Pods are not starting any more; they sit in "Pending" state for several minutes.

Version-Release number of selected component (if applicable):
Master: 1.5.0
Nodes: 1.4.1 (because of this critical bug: https://github.com/openshift/origin/issues/14092)

How reproducible:
Deploy or remove an existing pod.

Steps to Reproduce:
1. Create a pod
2. Remove the pod
3. Wait for the new pod to leave "Pending" (for minutes)

Actual results:
The pod eventually starts, after a while (timeout). In the logs / events:
Unable to mount volumes for pod "xxxxxx": timeout expired waiting for volumes to attach/mount for pod "xxxx"/"xxxx". list of unattached/unmounted volumes=[xxx-secrets yyyy-secrets zzzz-secrets ssl-certs xxxx-secrets default-token-n6pbo]

Expected results:
The pod starts after a few seconds.

Additional info:
Mailing-list thread: https://lists.openshift.redhat.com/openshift-archives/users/2017-July/msg00036.html
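For completeness, this is how we reproduce and watch it (pod names below are placeholders; any pod backed by a deployment config shows the behaviour):

# delete a pod that is managed by a deployment config, so it gets recreated immediately
oc delete pod <some-pod>

# the replacement pod then sits in Pending for minutes
oc get pods -w

# the FailedMount / timeout events show up in the pod description
oc describe pod <replacement-pod>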
Restarting origin-node seems to make the issue go away (let's see for how long)
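For the record, "restarting" here just means the systemd unit (assuming the standard origin RPM layout; the unit may be named differently on other installs):

systemctl restart origin-node
journalctl -u origin-node -f     # then watch whether the mount timeouts come back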
Just a few minutes, apparently; it's back again on the same node.
Can you attach node logs from when this symptom occurs? In the mailing-list thread you appear to isolate the issue to the docker storage driver, but in this BZ you link to https://github.com/openshift/origin/issues/14092. That issue is fixed in 1.5 via https://github.com/openshift/origin/pull/14801. As a result, this issue appears to be localized to a devicemapper problem, but attaching docker and openshift node logs would help us track the problem down further.
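Something along these lines should capture what we need (assuming systemd journald on the nodes; adjust the unit names if yours differ):

# openshift node logs from around the time a pod got stuck
journalctl -u origin-node --since "1 hour ago" > origin-node.log

# docker daemon logs from the same window
journalctl -u docker --since "1 hour ago" > docker.log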
Reducing priority pending availability of logs.
As mentioned today on the mailing-list:

Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266 65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925 65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570 65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998 65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory

We are also seeing:

Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.128033 65220 container_gc.go:252] Failed to remove container "749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f: remove /var/lib/docker/devicemapper/mnt/b3b59435140f06c8b7f924ef6da564bc861ee37fec635366941e3ae0907cf8c2: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.130684 65220 container_gc.go:252] Failed to remove container "8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647: remove /var/lib/docker/devicemapper/mnt/59c8d8f21f287fd63c7783c1044f925ef2b2f3c00845869fa1e1b412682bdfea: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.133156 65220 container_gc.go:252] Failed to remove container "bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d: remove /var/lib/docker/devicemapper/mnt/0084ba2284507894e953ad84800f87faa6ddc5bd826a4112f21624c8a843a6e1: device or resource busy"}

(We stopped docker on this node last week, completely removed /var/lib/docker, and recreated the logical volume; less than an hour later, similar errors were back in the logs.)

When we deploy, we see a LOT of these lines (before the "device or resource busy" ones):

Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193302 65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-drip-secrets" (spec.Name: "drip-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193424 65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-smtp-secrets" (spec.Name: "smtp-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403001 65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/2f808fd2-690e-11e7-9452-005056b1755a-airbrake-secrets" (spec.Name: "airbrake-secrets") pod "2f808fd2-690e-11e7-9452-005056b1755a" (UID: "2f808fd2-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403684 65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-access-token-secrets" (spec.Name: "access-token-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").

Finally, this certainly explains the minutes we're waiting on deploys:

Jul 17 11:11:49 prod-node-rbx-2.example.com origin-node[65154]: E0717 11:11:49.501517 65220 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets\" (\"901e222b-68cf-11e7-9452-005056b1755a\")" failed. No retries permitted until 2017-07-17 11:13:49.501492583 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets" (volume.spec.Name: "access-token-secrets") pod "901e222b-68cf-11e7-9452-005056b1755a" (UID: "901e222b-68cf-11e7-9452-005056b1755a") with: rename /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-9452-005056b1755a/volumes/kubernetes.io~secret/access-token-secrets /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-9452-005056b1755a/volumes/kubernetes.io~secret/wrapped_access-token-secrets.deleting~208587724: device or resource busy

Restarting origin-node probably resets this retry counter.
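In case it helps the investigation: one way we have been trying to see what keeps these mounts busy (our own guess at a procedure, not something from the docs) is to grep the per-process mount tables for the devicemapper mount id or the secret volume name, and then look at what those PIDs are:

# which processes still hold the devicemapper mount from the first "resource busy" error
grep -l b3b59435140f06c8b7f924ef6da564bc861ee37fec635366941e3ae0907cf8c2 /proc/*/mountinfo

# same idea for the secret tmpfs that fails to unmount/rename
grep -l access-token-secrets /proc/*/mountinfo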
We have rolled back origin-master to 1.4.1, and the problem seems to be gone (at least so far; let's wait a full 24h before concluding anything).
I wonder whether this is a duplicate of bug #1444096.
We have restarted all nodes, using a 1.4 master; it worked for 1-2 hours and then started to hang again. All the logical volumes on the nodes were reset, and we no longer see "device or resource busy" messages. The only errors we still see:

Events:
  FirstSeen  LastSeen  Count  From                                   SubObjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                   -------------  ----     ------       -------
  5m         5m        1      {default-scheduler }                                  Normal   Scheduled    Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  3m         1m        2      {kubelet prod-node-rbx-3.example.com}                 Warning  FailedMount  Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  3m         1m        2      {kubelet prod-node-rbx-3.example.com}                 Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]

and a few minutes later:

Events:
  FirstSeen  LastSeen  Count  From                                   SubObjectPath                Type     Reason       Message
  ---------  --------  -----  ----                                   -------------                ----     ------       -------
  6m         6m        1      {default-scheduler }                                                Normal   Scheduled    Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  4m         2m        2      {kubelet prod-node-rbx-3.example.com}                               Warning  FailedMount  Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  4m         2m        2      {kubelet prod-node-rbx-3.example.com}                               Warning  FailedSync   Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  46s        46s       1      {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}  Normal   Pulled       Container image "openshift/origin-deployer:v1.5.1" already present on machine
  46s        46s       1      {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}  Normal   Created      Created container with docker id a4c41949d0fe; Security:[seccomp=unconfined]
  46s        46s       1      {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}  Normal   Started      Started container with docker id a4c41949d0fe

I have checked our git repo; the only change made on the day we started to have this issue is the use of env vars set from secrets (env[x].valueFrom.secretKeyRef).
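To double-check that observation, a crude way to see which running pods actually use env vars sourced from secrets (just a grep over the pod specs, nothing more sophisticated):

oc get pods --all-namespaces -o yaml | grep -B3 secretKeyRef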
All of this was related to https://github.com/kubernetes/kubernetes/issues/38498. We had a lot of orphaned pod volume directories on the nodes. Once they were removed with:

find /var/lib/origin/openshift.local.volumes/pods/ -maxdepth 1 -type d -mtime +30 -exec rm -rf \{\} \;

all our issues disappeared, and deploys now take less than 5s. We can close this issue now.
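For anyone hitting the same thing: before deleting, it may be worth listing the candidates and cross-checking them against the pod UIDs the cluster still knows about. A sketch on our side, using the same origin paths as above (the -mtime threshold is arbitrary):

# dry run: list the old per-pod volume directories without deleting anything
find /var/lib/origin/openshift.local.volumes/pods/ -mindepth 1 -maxdepth 1 -type d -mtime +30

# directory names are pod UIDs; compare them with the UIDs of pods the cluster still has
oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}' | tr ' ' '\n' | sort > /tmp/live-uids
find /var/lib/origin/openshift.local.volumes/pods/ -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort > /tmp/on-disk-uids
comm -23 /tmp/on-disk-uids /tmp/live-uids    # on-disk directories with no matching pod = orphans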
See comment #9