Bug 1471235 - Pods (even deploy pods) are stuck in pending state
Pods (even deploy pods) are stuck in pending state
Status: NEW
Product: OpenShift Origin
Classification: Red Hat
Component: Kubernetes (Show other bugs)
3.x
x86_64 Linux
unspecified Severity low
: ---
: ---
Assigned To: Derek Carr
DeShuai Ma
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-14 15:06 EDT by Philippe Lafoucriere
Modified: 2017-07-25 19:52 EDT (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Philippe Lafoucriere 2017-07-14 15:06:29 EDT
Description of problem:

Pod are not starting anymore, waiting in "pending" state for several minutes.

Version-Release number of selected component (if applicable):

Master : 1.5.0
Nodes : 1.4.1 (because of this critical bug https://github.com/openshift/origin/issues/14092)

How reproducible:

Deploy or remove an existing pod 


Steps to Reproduce:
1. Create a pod
2. Remove the pod
3. Wait for pending new pod (for minutes)

Actual results:

Pod is finally starting, after a while (timeout).

In logs / events:

Unable to mount volumes for pod "xxxxxx": timeout expired waiting for volumes to attach/mount for pod "xxxx"/"xxxx". list of unattached/unmounted volumes=[xxx-secrets yyyy-secrets zzzz-secrets ssl-certs xxxx-secrets default-token-n6pbo]


Expected results:

Pod starting after a few seconds

Additional info:

Mailing-list thread: https://lists.openshift.redhat.com/openshift-archives/users/2017-July/msg00036.html
Comment 1 Philippe Lafoucriere 2017-07-14 15:50:47 EDT
Restarting origin-node seems to make the issue go away (let's see for how long)
Comment 2 Philippe Lafoucriere 2017-07-14 15:54:21 EDT
Just a few minutes apparently, it's back again (same node)
Comment 3 Derek Carr 2017-07-17 10:52:15 EDT
Can you attach node logs for when this symptom occurs?

From the mailing list thread, you appear to isolate the issue to the docker storage driver, but in the BZ you link to https://github.com/openshift/origin/issues/14092. 
 That issue is fixed in 1.5 via https://github.com/openshift/origin/pull/14801

As a result, it appears this issue is localized to a devicemapper problem, but attaching docker and openshift node logs would help us track the problem down further.
Comment 4 Derek Carr 2017-07-17 10:52:40 EDT
reducing priority pending availability of logs.
Comment 5 Philippe Lafoucriere 2017-07-17 11:13:04 EDT
As mentioned today on the mailing-list:

Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266   65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925   65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570   65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998   65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink  /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory


We are also seeing:

Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.128033   65220 container_gc.go:252] Failed to remove container "749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 749ba6cb1fd2c275442b0f43df35166c596a4717af97841a6a7a047c85a6e02f: remove /var/lib/docker/devicemapper/mnt/b3b59435140f06c8b7f924ef6da564bc861ee37fec635366941e3ae0907cf8c2: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.130684   65220 container_gc.go:252] Failed to remove container "8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8d7876cb67327f1caef7db3d3b858832f37c1c67c200806a02cfd54c32ca6647: remove /var/lib/docker/devicemapper/mnt/59c8d8f21f287fd63c7783c1044f925ef2b2f3c00845869fa1e1b412682bdfea: device or resource busy"}
Jul 17 11:07:02 prod-node-rbx-2.example.com origin-node[65154]: W0717 11:07:02.133156   65220 container_gc.go:252] Failed to remove container "bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem bd32dbeca3500469bc09add7b5bb713426ff986b22eff941c5e7a6151285193d: remove /var/lib/docker/devicemapper/mnt/0084ba2284507894e953ad84800f87faa6ddc5bd826a4112f21624c8a843a6e1: device or resource busy"}

(we have stopped docker on this node last week, removed completely /var/lib/docker, recreated the logical volume, and less than 1 hour later, similar issues were in the logs).

When we deploy, we're seeing a LOT of these lines (before the "device or resource busy" ones):

Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193302   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-drip-secrets" (spec.Name: "drip-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.193424   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-smtp-secrets" (spec.Name: "smtp-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403001   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/2f808fd2-690e-11e7-9452-005056b1755a-airbrake-secrets" (spec.Name: "airbrake-secrets") pod "2f808fd2-690e-11e7-9452-005056b1755a" (UID: "2f808fd2-690e-11e7-9452-005056b1755a").
Jul 17 11:09:56 prod-node-rbx-2.example.com origin-node[65154]: I0717 11:09:56.403684   65220 operation_executor.go:803] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/30d75ae3-690e-11e7-9452-005056b1755a-access-token-secrets" (spec.Name: "access-token-secrets") pod "30d75ae3-690e-11e7-9452-005056b1755a" (UID: "30d75ae3-690e-11e7-9452-005056b1755a").


Finally, this is certainly explaining the minutes we're waiting for deploys:

Jul 17 11:11:49 prod-node-rbx-2.example.com origin-node[65154]: E0717 11:11:49.501517   65220 nestedpendingoperations.go:253] Operation for "\"kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets\" (\"901e222b-68cf-11e7-9452-005056b1755a\")" failed. No retries permitted until 2017-07-17 11:13:49.501492583 -0400 EDT (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/901e222b-68cf-11e7-9452-005056b1755a-access-token-secrets" (volume.spec.Name: "access-token-secrets") pod "901e222b-68cf-11e7-9452-005056b1755a" (UID: "901e222b-68cf-11e7-9452-005056b1755a") with: rename /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-9452-005056b1755a/volumes/kubernetes.io~secret/access-token-secrets /var/lib/origin/openshift.local.volumes/pods/901e222b-68cf-11e7-
9452-005056b1755a/volumes/kubernetes.io~secret/wrapped_access-token-secrets.deleting~208587724: device or resource busy

Restarting origin-node probably resets this counter
Comment 6 Philippe Lafoucriere 2017-07-17 18:48:14 EDT
We have rolled back origin-master to 1.4.1, and the problem seem to be gone (at least so far, let's wait a complete 24h to conclude something).
Comment 7 Philippe Lafoucriere 2017-07-17 20:46:25 EDT
I wonder if it's not a duplicate of #1444096
Comment 8 Philippe Lafoucriere 2017-07-18 15:53:37 EDT
We have restarted all nodes, using a 1.4 master, and it worked for 1-2 hours, and started to hang again. All the logical volumes on the nodes were reset, we don't see 
"device or resource busy" messages any more.


The only errors we still see:

Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------   --------        ------          -------
  5m            5m              1       {default-scheduler }                                            Normal          Scheduled       Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  3m            1m              2       {kubelet prod-node-rbx-3.example.com}                  Warning         FailedMount     Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  3m            1m              2       {kubelet prod-node-rbx-3.example.com}                  Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]



and a few minutes later:

Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                                            -------------                   --------        ------          -------
  6m            6m              1       {default-scheduler }                                                            Normal          Scheduled       Successfully assigned drip-sync-21-deploy to prod-node-rbx-3.example.com
  4m            2m              2       {kubelet prod-node-rbx-3.example.com}                                  Warning         FailedMount     Unable to mount volumes for pod "drip-sync-21-deploy_gemnasium-production(a8b9406d-6bf1-11e7-bdbc-005056b1755a)": timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  4m            2m              2       {kubelet prod-node-rbx-3.example.com}                                  Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "drip-sync-21-deploy"/"gemnasium-production". list of unattached/unmounted volumes=[deployer-token-vroi2]
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Pulled          Container image "openshift/origin-deployer:v1.5.1" already present on machine
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Created         Created container with docker id a4c41949d0fe; Security:[seccomp=unconfined]
  46s           46s             1       {kubelet prod-node-rbx-3.example.com}  spec.containers{deployment}     Normal          Started         Started container with docker id a4c41949d0fe


I have checked out git repo, the only change made the day we started to have this issue is the use of env vars set from secrets (env[x].valueFrom.secretKeyRef).
Comment 9 Philippe Lafoucriere 2017-07-25 19:52:18 EDT
All of this was related to https://github.com/kubernetes/kubernetes/issues/38498
We had a lot of orphans on the nodes.
Once removed with:

find /var/lib/origin/openshift.local.volumes/pods/ -maxdepth 1 -type d -mtime +30 -exec rm -rf \{\} \;

all our issues disappeared, and deploys are now taking less than 5s.

We can close this issue now.

Note You need to log in before you can comment on or make changes to this bug.