1677120 – OCP 4.0: Unable to deploy more than 50 nodejs/mongodb quickstart apps per worker node due to MountVolume.SetUp failed - couldn't propagate object cache errors

Bug 1677120 - OCP 4.0: Unable to deploy more than 50 nodejs/mongodb quickstart apps per worker node due to MountVolume.SetUp failed - couldn't propagate object cache errors

Summary: OCP 4.0: Unable to deploy more than 50 nodejs/mongodb quickstart apps per wor...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.1.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Robert Krawitz
QA Contact:	Walid A.
Docs Contact:
URL:
Whiteboard:	aos-scalability-40
Duplicates (1):	1679585 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-02-14 05:55 UTC by Walid A.
Modified:	2019-06-04 10:44 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:44:00 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:44:07 UTC

Description Walid A. 2019-02-14 05:55:54 UTC

Description:

Description of problem:
When trying to scale nodejs/mongodb quickstart app deployment past 250 pods per worker node, we are getting pod volume mount errors around 50 pods per node pod density.  This is causing the nodejs-mogodb-example pods to terminate, and eventually worker nodes go NodeNotReady as we try to deploy more quickstart apps.  This is on AWS cluster with 3 masters (m4.xlarge) and 3 worker nodes (m4.16xlarge).  KubeletArgument maxPods was set to 500 (from default 250) in the jopes of being able to scale pod deployment past 250 per node.

Events from namespace where the pod volume mount errors are observed:

# oc get events -n svt-5-njsmdb-7
LAST SEEN   TYPE      REASON                        KIND                    MESSAGE
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/mongodb-1-deploy to ip-10-0-144-70.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "deployer-token-cjwz8" : couldn't propagate object cache: timed out waiting for the condition
28m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b25ce4783fabc70f42a42d920f2ea94fcb37aaf054a6eeb28e016e7b33ef45f" already present on machine
28m         Normal    Created                       Pod                     Created container
28m         Normal    Started                       Pod                     Started container
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/mongodb-1-fqw5h to ip-10-0-161-37.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
27m         Normal    Pulled                        Pod                     Container image "image-registry.openshift-image-registry.svc:5000/openshift/mongodb@sha256:5c6e8d47d47dda6e8bf02335fe1d4cbe1c95b6dd0245a6b62fe2b2ee5d417fa4" already present on machine
27m         Normal    Created                       Pod                     Created container
27m         Normal    Started                       Pod                     Started container
28m         Normal    SuccessfulCreate              ReplicationController   Created pod: mongodb-1-fqw5h
28m         Normal    DeploymentCreated             DeploymentConfig        Created new replication controller "mongodb-1" for version 1
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-54h6w to ip-10-0-161-37.us-east-2.compute.internal
26m         Normal    Pulling                       Pod                     pulling image "image-registry.openshift-image-registry.svc:5000/svt-5-njsmdb-7/nodejs-mongodb-example@sha256:28df5cd72749253a9197b760d41da04334166b0754d88fe9564fcb9ebba7fdee"
26m         Normal    Pulled                        Pod                     Successfully pulled image "image-registry.openshift-image-registry.svc:5000/svt-5-njsmdb-7/nodejs-mongodb-example@sha256:28df5cd72749253a9197b760d41da04334166b0754d88fe9564fcb9ebba7fdee"
26m         Normal    Created                       Pod                     Created container
26m         Normal    Started                       Pod                     Started container
12m         Normal    Killing                       Pod                     Killing container with id cri-o://nodejs-mongodb-example:Need to kill Pod
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-6kg45 to ip-10-0-144-70.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-build to ip-10-0-161-37.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-c9hwd-pull" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-ca-bundles" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-token-nqbd6" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-c9hwd-push" : couldn't propagate object cache: timed out waiting for the condition
27m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c45bc57db718723861f83f43d49e8e0719367d23b32e18770f90f2376b63eb68" already present on machine
27m         Normal    Created                       Pod                     Created container
27m         Normal    Started                       Pod                     Started container
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-deploy to ip-10-0-161-37.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "deployer-token-cjwz8" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b25ce4783fabc70f42a42d920f2ea94fcb37aaf054a6eeb28e016e7b33ef45f" already present on machine
26m         Normal    Created                       Pod                     Created container
26m         Normal    Started                       Pod                     Started container
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-jlt44 to ip-10-0-128-227.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-slbpx to ip-10-0-128-227.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-jlt44
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-6kg45
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-54h6w
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-slbpx
24m         Normal    BuildCompleted                Build                   Build svt-5-njsmdb-7/nodejs-mongodb-example-1 completed successfully
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-54h6w
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-jlt44
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-slbpx
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-6kg45
26m         Normal    DeploymentCreated             DeploymentConfig        Created new replication controller "nodejs-mongodb-example-1" for version 1
12m         Normal    ReplicationControllerScaled   DeploymentConfig        Scaled replication controller "nodejs-mongodb-example-1" from 4 to 0
# 


We are also seeing hyperkube error:
Feb 12 16:20:02 ip-10-0-144-70 hyperkube[14763]: E0212 16:20:02.137651   14763 reflector.go:134] object-"svt-5-njsmdb-7"/"deployer-dockercfg-pf8f7": Failed to list *v1.Secret: secrets "deployer-dockercfg-pf8f7" is forbidden: User "system:node:ip-10-0-144-70.us-east-2.compute.internal" cannot list resource "secrets" in API group "" in the namespace "svt-5-njsmdb-7": no path found to object



Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-30-145955   True        False         22h     Cluster version is 4.0.0-0.nightly-2019-01-30-145955
"image": "registry.svc.ci.openshift.org/ocp/release@sha256:d03ce0ef85540a1fff8bfc1c408253404aaecb2b958d7c3f24896f3597c3715b",
"version": "4.0.0-0.nightly-2019-01-30-145955"

# ./openshift-install version
./openshift-install v4.0.0-0.150.0.0-dirty


How reproducible:
Reproduced several times.

Steps to Reproduce:
1. Create a new OCP 4.0 cluster with nightly payload:
  a. export BUILD_VERSION=4.0.0-0.nightly-2019-01-30-145955
  b. oc adm release info --pullspecs registry.svc.ci.openshift.org/ocp/release:${BUILD_VERSION} | grep installer
  c. export CONTAINER_ID=$(docker create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ae9b43dac5708dae29a2c0899aa21e99f56d54540222f5b7da383261af0ba741)
  d. mkdir installer_2019-01-30-145955, 
  e. cd installer_2019-01-30-145955
  f. docker cp $CONTAINER_ID:/usr/bin/openshift-install .
  g. export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-30-145955
  h. export _OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-30-145955
  i. ./openshift-install --dir=$(pwd) create cluster --log-level=debug

2. git clone the openshift/svt repo:
   git clone https://github.com/openshift/svt.git
   
3. cd svt/openshift_scalability/
4. Run python cluster-loader with config file to deploy 10 projects each with one nodejs-mongodb quickstart build, deploymentconfig, resulting in one mongodb pod and 4 replicas of nodejs-mongodb-example pods.  Total 50 pods.  Config file and template is in the attached Logs.
5. Wait 10 minutes
6. Deploy next 10 projects with 50 more mongo/nodejs pods
7. Repeat steps 5-6 three more times 

Actual results:
After about 50 mongo/nodejs running pods per worker node, we started getting errors "MountVolume.SetUp failed for volume ... couldn't propagate object cache: timed out waiting for the condition" and the nodejs-mongodb-example started to get deleted.  Each subsequent project with new quickstart mongodb/nodejs deployment was not successful, as we kep getting these MountVolume.Setup failed errors.  Eventually the worker node go NotReady.

Expected results:
All mongodb and nodejs-mongodb-example pods should running in all the projects

Additional info:
Links to journal logs from worker nodes, events, and oc command outputs are provided in next comment.

Comment 2 Robert Krawitz 2019-02-18 16:23:19 UTC

It looks like there are some other dependencies (I'm getting a dependency on rbd, for instance).  Running

pip install rbd

gives

$ pip install rbd
Collecting rbd
  Could not find a version that satisfies the requirement rbd (from versions: )
No matching distribution found for rbd

Also, what is the exact conmmand required for step 4?

Comment 9 Robert Krawitz 2019-02-19 16:54:48 UTC

Reproduced in testing.  Note additional error of "MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found".

$ oc get events --all-namespaces |grep svt |grep FailedMount
svt-1-njsmdb-0                               2m30s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "deployer-token-ppjdx" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1                               2m31s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-token-cqclw" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1                               2m31s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-ca-bundles" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1                               2m31s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-system-configs" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-2                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found
svt-1-njsmdb-2                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-ca-bundles" : configmap "nodejs-mongodb-example-1-ca" not found
svt-1-njsmdb-4                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-xsh89-pull" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-system-configs" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-token-k9xkq" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4                               2m32s       Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-xsh89-push" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-5   2m33s   Warning   FailedMount                Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-smlzd-pull" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-5   2m33s   Warning   FailedMount                Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-smlzd-push" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-8   2m32s   Warning   FailedMount                Pod                     MountVolume.SetUp failed for volume "deployer-token-8z5l8" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-9   2m34s   Warning   FailedMount                Pod                     MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found

Comment 29 Ryan Phillips 2019-03-08 22:24:49 UTC

*** Bug 1679585 has been marked as a duplicate of this bug. ***

Comment 34 Walid A. 2019-03-19 00:29:23 UTC

Verified on OCP 4.0 cluster on AWS with 3 m5.xlarge masters and 3 m5.12xlarge worker nodes.

# ./openshift-install version
./openshift-install v4.0.22-201903141706-dirty
built from commit 61cc5c968744a41a7e2f78a8bad14c35101e83dd
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-15-063749   True        False         6h5m    Cluster version is 4.0.0-0.nightly-2019-03-15-063749
# oc version
Client Version: version.Info{Major:"4", Minor:"0+", GitVersion:"v4.0.22", GitCommit:"219bbe2f0c", GitTreeState:"", BuildDate:"2019-03-10T22:23:11Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4+befe71b", GitCommit:"befe71b", GitTreeState:"clean", BuildDate:"2019-03-15T05:57:43Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
# oc version --short
Client Version: v4.0.22
Server Version: v1.12.4+befe71b


Ran 150 projects, each with 5 nodejs/mongodb apps deployed.  The last 20 pods went into pending state as we reached the max 250 pods per worker node.
Value of kubeletArgument maxPods was the default 250.  I did not see any nodejs app pods terminating after being started.

This is the first time on OCP 4.0 that we were able to scale nodejs quickstart app deployment up to the maxPods 250 pods per node value.

########### svt pods Running: 
709
########### svt pods NOT Running: 
449
########### svt pods NOT Running and NOT Completed: 
20
########### svt pods NOT Running and NOT Completed are: 
svt-15-njsmdb-0                                         mongodb-1-qjvjh                                                       0/1     Pending     0          5m13s
svt-15-njsmdb-0                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          3m32s
svt-15-njsmdb-1                                         mongodb-1-dcvj8                                                       0/1     Pending     0          6m10s
svt-15-njsmdb-1                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          5m24s
svt-15-njsmdb-2                                         mongodb-1-9k2ll                                                       0/1     Pending     0          6m8s
svt-15-njsmdb-2                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          90s
svt-15-njsmdb-3                                         mongodb-1-deploy                                                      0/1     Pending     0          6m14s
svt-15-njsmdb-3                                         nodejs-mongodb-example-1-build                                        0/1     Pending     0          6m14s
svt-15-njsmdb-4                                         mongodb-1-gknfd                                                       0/1     Pending     0          5m12s
svt-15-njsmdb-4                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          4m34s
svt-15-njsmdb-5                                         mongodb-1-s7594                                                       0/1     Pending     0          6m10s
svt-15-njsmdb-5                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          5m28s
svt-15-njsmdb-6                                         mongodb-1-p5m25                                                       0/1     Pending     0          21s
svt-15-njsmdb-6                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          2m27s
svt-15-njsmdb-7                                         mongodb-1-5jb8p                                                       0/1     Pending     0          6m9s
svt-15-njsmdb-7                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          5m5s
svt-15-njsmdb-8                                         mongodb-1-x7gq9                                                       0/1     Pending     0          4m55s
svt-15-njsmdb-8                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          30s
svt-15-njsmdb-9                                         mongodb-1-bqjlb                                                       0/1     Pending     0          6m11s
svt-15-njsmdb-9                                         nodejs-mongodb-example-1-deploy                                       0/1     Pending     0          5m27s

############ Pods per worker node:
Non-terminated Pods:                      (250 in total)
Non-terminated Pods:                      (250 in total)
Non-terminated Pods:                      (250 in total)

Comment 36 errata-xmlrpc 2019-06-04 10:44:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Note You need to log in before you can comment on or make changes to this bug.