1471235 – Pods (even deploy pods) are stuck in pending state

Bug 1471235 - Pods (even deploy pods) are stuck in pending state

Summary: Pods (even deploy pods) are stuck in pending state

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Derek Carr
QA Contact:	Xiaoli Tian
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-07-14 19:06 UTC by Philippe Lafoucriere
Modified:	2019-07-03 14:53 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-03 14:53:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Philippe Lafoucriere 2017-07-14 19:06:29 UTC

Description of problem:

Pod are not starting anymore, waiting in "pending" state for several minutes.

Version-Release number of selected component (if applicable):

Master : 1.5.0
Nodes : 1.4.1 (because of this critical bug https://github.com/openshift/origin/issues/14092)

How reproducible:

Deploy or remove an existing pod 


Steps to Reproduce:
1. Create a pod
2. Remove the pod
3. Wait for pending new pod (for minutes)

Actual results:

Pod is finally starting, after a while (timeout).

In logs / events:

Unable to mount volumes for pod "xxxxxx": timeout expired waiting for volumes to attach/mount for pod "xxxx"/"xxxx". list of unattached/unmounted volumes=[xxx-secrets yyyy-secrets zzzz-secrets ssl-certs xxxx-secrets default-token-n6pbo]


Expected results:

Pod starting after a few seconds

Additional info:

Mailing-list thread: https://lists.openshift.redhat.com/openshift-archives/users/2017-July/msg00036.html

Comment 1 Philippe Lafoucriere 2017-07-14 19:50:47 UTC

Restarting origin-node seems to make the issue go away (let's see for how long)

Comment 2 Philippe Lafoucriere 2017-07-14 19:54:21 UTC

Just a few minutes apparently, it's back again (same node)

Comment 3 Derek Carr 2017-07-17 14:52:15 UTC

Can you attach node logs for when this symptom occurs?

From the mailing list thread, you appear to isolate the issue to the docker storage driver, but in the BZ you link to https://github.com/openshift/origin/issues/14092. 
 That issue is fixed in 1.5 via https://github.com/openshift/origin/pull/14801

As a result, it appears this issue is localized to a devicemapper problem, but attaching docker and openshift node logs would help us track the problem down further.

Comment 4 Derek Carr 2017-07-17 14:52:40 UTC

reducing priority pending availability of logs.

Comment 5 Philippe Lafoucriere 2017-07-17 15:13:04 UTC

As mentioned today on the mailing-list:

Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266   65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925   65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570   65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998   65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink  /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory


We are also seeing:

Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.128033   65220 container_gc.go:252] Failed to remove container "749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f: remove /var/lib/docker/devicemapper/mnt/b3b59435140f06c8b7f924ef6da564bc861ee37fec635366941e3ae0907cf8c2: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.130684   65220 container_gc.go:252] Failed to remove container "8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647: remove /var/lib/docker/devicemapper/mnt/59c8d8f21f287fd63c7783c1044f925ef2b2f3c00845869fa1e1b412682bdfea: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.133156   65220 container_gc.go:252] Failed to remove container "bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d: remove /var/lib/docker/devicemapper/mnt/0084ba2284507894e953ad84800f87faa6ddc5bd826a4112f21624c8a843a6e1: device or resource busy"}

(we have stopped docker on this node last week, removed completely /var/lib/docker, recreated the logical volume, and less than 1 hour later, similar issues were in the logs).

When we deploy, we're seeing a LOT of these lines (before the "device or resource busy" ones):

Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193302   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-drip-secrets" (spec.Name: "drip-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193424   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-smtp-secrets" (spec.Name: "smtp-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403001   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/2f808fd2-690e-11e7-9452-005056b1755a-airbrake-secrets" (spec.Name: "airbrake-secrets") pod "2f808fd2-690e-11e7-9452-005056b1755a" (UID: "2f808fd2-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403684   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-access-token-secrets" (spec.Name: "access-token-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").


Finally, this is certainly explaining the minutes we're waiting for deploys:

Jul 17 11:11:49 prod-node-rbx-2.example.com origin-node[65154]: E0717 11:11:49.501517   65220 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets\" (\"901e222b-68cf-11e7-9452-005056b1755a\")" failed. No retries permitted until 2017-07-17 11:13:49.501492583 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets" (volume.spec.Name: "access-token-secrets") pod "901e222b-68cf-11e7-9452-005056b1755a" (UID: "901e222b-68cf-11e7-9452-005056b1755a") with: rename /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-9452-005056b1755a/volumes/kubernetes.io~secret/access-token-secrets /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-
9452-005056b1755a/volumes/kubernetes.io~secret/wrapped_access-token-secrets.deleting~208587724: device or resource busy

Restarting origin-node probably resets this counter

Comment 6 Philippe Lafoucriere 2017-07-17 22:48:14 UTC

We have rolled back origin-master to 1.4.1, and the problem seem to be gone (at least so far, let's wait a complete 24h to conclude something).

Comment 7 Philippe Lafoucriere 2017-07-18 00:46:25 UTC

I wonder if it's not a duplicate of #1444096

Comment 8 Philippe Lafoucriere 2017-07-18 19:53:37 UTC

We have restarted all nodes, using a 1.4 master, and it worked for 1-2 hours, and started to hang again. All the logical volumes on the nodes were reset, we don't see 
"device or resource busy" messages any more.


The only errors we still see:

Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------   --------        ------          -------
  5m            5m              1       {default-scheduler }                                            Normal          Scheduled       Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  3m            1m              2       {kubelet prod-node-rbx-3.example.com}                  Warning         FailedMount     Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  3m            1m              2       {kubelet prod-node-rbx-3.example.com}                  Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]



and a few minutes later:

Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------                   --------        ------          -------
  6m            6m              1       {default-scheduler }                                                            Normal          Scheduled       Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  4m            2m              2       {kubelet prod-node-rbx-3.example.com}                                  Warning         FailedMount     Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  4m            2m              2       {kubelet prod-node-rbx-3.example.com}                                  Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Pulled          Container image "openshift/origin-deployer:v1.5.1" already present on machine
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Created         Created container with docker id a4c41949d0fe; Security:[seccomp=unconfined]
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Started         Started container with docker id a4c41949d0fe


I have checked out git repo, the only change made the day we started to have this issue is the use of env vars set from secrets (env[x].valueFrom.secretKeyRef).

Comment 9 Philippe Lafoucriere 2017-07-25 23:52:18 UTC

All of this was related to https://github.com/kubernetes/kubernetes/issues/38498
We had a lot of orphans on the nodes.
Once removed with:

find /var/lib/origin/openshift.local.volumes/pods/ -maxdepth 1 -type d -mtime +30 -exec rm -rf \{\} \;

all our issues disappeared, and deploys are now taking less than 5s.

We can close this issue now.

Comment 11 Greg Blomquist 2019-07-03 14:53:28 UTC

See comment #9

Note You need to log in before you can comment on or make changes to this bug.