Description of problem:
When trying to scale the nodejs/mongodb quickstart app deployment past 250 pods per worker node, we are getting pod volume mount errors at around 50 pods per node. This is causing the nodejs-mongodb-example pods to terminate, and eventually the worker nodes go NotReady as we try to deploy more quickstart apps. This is on an AWS cluster with 3 masters (m4.xlarge) and 3 worker nodes (m4.16xlarge). The kubeletArguments maxPods value was set to 500 (from the default 250) in the hopes of being able to scale pod deployment past 250 per node (a sketch of one way to set this is at the end of this comment).

Events from the namespace where the pod volume mount errors are observed:

# oc get events -n svt-5-njsmdb-7
LAST SEEN   TYPE      REASON                        KIND                    MESSAGE
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/mongodb-1-deploy to ip-10-0-144-70.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "deployer-token-cjwz8" : couldn't propagate object cache: timed out waiting for the condition
28m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b25ce4783fabc70f42a42d920f2ea94fcb37aaf054a6eeb28e016e7b33ef45f" already present on machine
28m         Normal    Created                       Pod                     Created container
28m         Normal    Started                       Pod                     Started container
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/mongodb-1-fqw5h to ip-10-0-161-37.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
27m         Normal    Pulled                        Pod                     Container image "image-registry.openshift-image-registry.svc:5000/openshift/mongodb@sha256:5c6e8d47d47dda6e8bf02335fe1d4cbe1c95b6dd0245a6b62fe2b2ee5d417fa4" already present on machine
27m         Normal    Created                       Pod                     Created container
27m         Normal    Started                       Pod                     Started container
28m         Normal    SuccessfulCreate              ReplicationController   Created pod: mongodb-1-fqw5h
28m         Normal    DeploymentCreated             DeploymentConfig        Created new replication controller "mongodb-1" for version 1
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-54h6w to ip-10-0-161-37.us-east-2.compute.internal
26m         Normal    Pulling                       Pod                     pulling image "image-registry.openshift-image-registry.svc:5000/svt-5-njsmdb-7/nodejs-mongodb-example@sha256:28df5cd72749253a9197b760d41da04334166b0754d88fe9564fcb9ebba7fdee"
26m         Normal    Pulled                        Pod                     Successfully pulled image "image-registry.openshift-image-registry.svc:5000/svt-5-njsmdb-7/nodejs-mongodb-example@sha256:28df5cd72749253a9197b760d41da04334166b0754d88fe9564fcb9ebba7fdee"
26m         Normal    Created                       Pod                     Created container
26m         Normal    Started                       Pod                     Started container
12m         Normal    Killing                       Pod                     Killing container with id cri-o://nodejs-mongodb-example:Need to kill Pod
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-6kg45 to ip-10-0-144-70.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
28m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-build to ip-10-0-161-37.us-east-2.compute.internal
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-c9hwd-pull" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "build-ca-bundles" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-token-nqbd6" : couldn't propagate object cache: timed out waiting for the condition
28m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "builder-dockercfg-c9hwd-push" : couldn't propagate object cache: timed out waiting for the condition
27m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c45bc57db718723861f83f43d49e8e0719367d23b32e18770f90f2376b63eb68" already present on machine
27m         Normal    Created                       Pod                     Created container
27m         Normal    Started                       Pod                     Started container
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-deploy to ip-10-0-161-37.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "deployer-token-cjwz8" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    Pulled                        Pod                     Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2b25ce4783fabc70f42a42d920f2ea94fcb37aaf054a6eeb28e016e7b33ef45f" already present on machine
26m         Normal    Created                       Pod                     Created container
26m         Normal    Started                       Pod                     Started container
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-jlt44 to ip-10-0-128-227.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    Scheduled                     Pod                     Successfully assigned svt-5-njsmdb-7/nodejs-mongodb-example-1-slbpx to ip-10-0-128-227.us-east-2.compute.internal
26m         Warning   FailedMount                   Pod                     MountVolume.SetUp failed for volume "default-token-msjpm" : couldn't propagate object cache: timed out waiting for the condition
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-jlt44
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-6kg45
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-54h6w
26m         Normal    SuccessfulCreate              ReplicationController   Created pod: nodejs-mongodb-example-1-slbpx
24m         Normal    BuildCompleted                Build                   Build svt-5-njsmdb-7/nodejs-mongodb-example-1 completed successfully
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-54h6w
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-jlt44
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-slbpx
12m         Normal    SuccessfulDelete              ReplicationController   Deleted pod: nodejs-mongodb-example-1-6kg45
26m         Normal    DeploymentCreated             DeploymentConfig        Created new replication controller "nodejs-mongodb-example-1" for version 1
12m         Normal    ReplicationControllerScaled   DeploymentConfig        Scaled replication controller "nodejs-mongodb-example-1" from 4 to 0

We are also seeing this hyperkube error:

Feb 12 16:20:02 ip-10-0-144-70 hyperkube[14763]: E0212 16:20:02.137651   14763 reflector.go:134] object-"svt-5-njsmdb-7"/"deployer-dockercfg-pf8f7": Failed to list *v1.Secret: secrets "deployer-dockercfg-pf8f7" is forbidden: User "system:node:ip-10-0-144-70.us-east-2.compute.internal" cannot list resource "secrets" in API group "" in the namespace "svt-5-njsmdb-7": no path found to object

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-30-145955   True        False         22h     Cluster version is 4.0.0-0.nightly-2019-01-30-145955

"image": "registry.svc.ci.openshift.org/ocp/release@sha256:d03ce0ef85540a1fff8bfc1c408253404aaecb2b958d7c3f24896f3597c3715b",
"version": "4.0.0-0.nightly-2019-01-30-145955"

# ./openshift-install version
./openshift-install v4.0.0-0.150.0.0-dirty

How reproducible:
Reproduced several times.

Steps to Reproduce:
1. Create a new OCP 4.0 cluster with the nightly payload:
   a. export BUILD_VERSION=4.0.0-0.nightly-2019-01-30-145955
   b. oc adm release info --pullspecs registry.svc.ci.openshift.org/ocp/release:${BUILD_VERSION} | grep installer
   c. export CONTAINER_ID=$(docker create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ae9b43dac5708dae29a2c0899aa21e99f56d54540222f5b7da383261af0ba741)
   d. mkdir installer_2019-01-30-145955
   e. cd installer_2019-01-30-145955
   f. docker cp $CONTAINER_ID:/usr/bin/openshift-install .
   g. export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-30-145955
   h. export _OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-30-145955
   i. ./openshift-install --dir=$(pwd) create cluster --log-level=debug
2. git clone the openshift/svt repo: git clone https://github.com/openshift/svt.git
3. cd svt/openshift_scalability/
4. Run the Python cluster-loader with a config file to deploy 10 projects, each with one nodejs-mongodb quickstart build and deployment config, resulting in one mongodb pod and 4 replicas of the nodejs-mongodb-example pod (50 pods in total). The config file and template are in the attached logs.
5. Wait 10 minutes.
6. Deploy the next 10 projects with 50 more mongo/nodejs pods.
7. Repeat steps 5-6 three more times.

Actual results:
After about 50 running mongo/nodejs pods per worker node, we started getting errors "MountVolume.SetUp failed for volume ... couldn't propagate object cache: timed out waiting for the condition" and the nodejs-mongodb-example pods started to get deleted. Each subsequent project with a new quickstart mongodb/nodejs deployment was unsuccessful, as we kept getting these MountVolume.SetUp failed errors. Eventually the worker nodes go NotReady.

Expected results:
All mongodb and nodejs-mongodb-example pods should be running in all the projects.

Additional info:
Links to journal logs from worker nodes, events, and oc command outputs are provided in the next comment.
It looks like there are some other dependencies (I'm getting a dependency on rbd, for instance). Running pip install rbd gives:

$ pip install rbd
Collecting rbd
  Could not find a version that satisfies the requirement rbd (from versions: )
No matching distribution found for rbd

Also, what is the exact command required for step 4?
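As a sketch of the step 4 invocation, assuming the cluster-loader config schema documented in the svt repo's openshift_scalability directory: the file name and template path below are placeholders, and the actual config file and template are in the attached logs.

$ cd svt/openshift_scalability
$ cat njsmdb.yaml                 # placeholder name; the real config is attached to this bug
projects:
  - num: 10                       # projects per batch
    basename: svt-njsmdb-         # placeholder project basename
    templates:
      - num: 1
        file: ./nodejs-mongodb-template.json   # placeholder template path
$ python cluster-loader.py -f njsmdb.yaml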
Reproduced in testing. Note the additional error "MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found".

$ oc get events --all-namespaces | grep svt | grep FailedMount
svt-1-njsmdb-0   2m30s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "deployer-token-ppjdx" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1   2m31s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-token-cqclw" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1   2m31s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-ca-bundles" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-1   2m31s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-system-configs" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-2   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found
svt-1-njsmdb-2   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-ca-bundles" : configmap "nodejs-mongodb-example-1-ca" not found
svt-1-njsmdb-4   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-dockercfg-xsh89-pull" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-system-configs" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-token-k9xkq" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-4   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-dockercfg-xsh89-push" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-5   2m33s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-dockercfg-smlzd-pull" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-5   2m33s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "builder-dockercfg-smlzd-push" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-8   2m32s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "deployer-token-8z5l8" : couldn't propagate object cache: timed out waiting for the condition
svt-1-njsmdb-9   2m34s   Warning   FailedMount   Pod   MountVolume.SetUp failed for volume "build-system-configs" : configmap "nodejs-mongodb-example-1-sys-config" not found
*** Bug 1679585 has been marked as a duplicate of this bug. ***
Verified on an OCP 4.0 cluster on AWS with 3 m5.xlarge masters and 3 m5.12xlarge worker nodes.

# ./openshift-install version
./openshift-install v4.0.22-201903141706-dirty
built from commit 61cc5c968744a41a7e2f78a8bad14c35101e83dd

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-15-063749   True        False         6h5m    Cluster version is 4.0.0-0.nightly-2019-03-15-063749

# oc version
Client Version: version.Info{Major:"4", Minor:"0+", GitVersion:"v4.0.22", GitCommit:"219bbe2f0c", GitTreeState:"", BuildDate:"2019-03-10T22:23:11Z", GoVersion:"", Compiler:"", Platform:""}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.4+befe71b", GitCommit:"befe71b", GitTreeState:"clean", BuildDate:"2019-03-15T05:57:43Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

# oc version --short
Client Version: v4.0.22
Server Version: v1.12.4+befe71b

Ran 150 projects, each with 5 nodejs/mongodb apps deployed. The last 20 pods went into Pending state as we reached the maximum of 250 pods per worker node. The kubeletArguments maxPods value was left at the default of 250. I did not see any nodejs app pods terminating after being started. This is the first time on OCP 4.0 that we were able to scale the nodejs quickstart app deployment up to the maxPods limit of 250 pods per node.

########### svt pods Running: 709
########### svt pods NOT Running: 449
########### svt pods NOT Running and NOT Completed: 20
########### svt pods NOT Running and NOT Completed are:
svt-15-njsmdb-0   mongodb-1-qjvjh                   0/1   Pending   0   5m13s
svt-15-njsmdb-0   nodejs-mongodb-example-1-deploy   0/1   Pending   0   3m32s
svt-15-njsmdb-1   mongodb-1-dcvj8                   0/1   Pending   0   6m10s
svt-15-njsmdb-1   nodejs-mongodb-example-1-deploy   0/1   Pending   0   5m24s
svt-15-njsmdb-2   mongodb-1-9k2ll                   0/1   Pending   0   6m8s
svt-15-njsmdb-2   nodejs-mongodb-example-1-deploy   0/1   Pending   0   90s
svt-15-njsmdb-3   mongodb-1-deploy                  0/1   Pending   0   6m14s
svt-15-njsmdb-3   nodejs-mongodb-example-1-build    0/1   Pending   0   6m14s
svt-15-njsmdb-4   mongodb-1-gknfd                   0/1   Pending   0   5m12s
svt-15-njsmdb-4   nodejs-mongodb-example-1-deploy   0/1   Pending   0   4m34s
svt-15-njsmdb-5   mongodb-1-s7594                   0/1   Pending   0   6m10s
svt-15-njsmdb-5   nodejs-mongodb-example-1-deploy   0/1   Pending   0   5m28s
svt-15-njsmdb-6   mongodb-1-p5m25                   0/1   Pending   0   21s
svt-15-njsmdb-6   nodejs-mongodb-example-1-deploy   0/1   Pending   0   2m27s
svt-15-njsmdb-7   mongodb-1-5jb8p                   0/1   Pending   0   6m9s
svt-15-njsmdb-7   nodejs-mongodb-example-1-deploy   0/1   Pending   0   5m5s
svt-15-njsmdb-8   mongodb-1-x7gq9                   0/1   Pending   0   4m55s
svt-15-njsmdb-8   nodejs-mongodb-example-1-deploy   0/1   Pending   0   30s
svt-15-njsmdb-9   mongodb-1-bqjlb                   0/1   Pending   0   6m11s
svt-15-njsmdb-9   nodejs-mongodb-example-1-deploy   0/1   Pending   0   5m27s

############ Pods per worker node (one line per worker node):
Non-terminated Pods:  (250 in total)
Non-terminated Pods:  (250 in total)
Non-terminated Pods:  (250 in total)
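For reference, summary counts like the ones above can be gathered with simple oc one-liners (a sketch only; the exact script that produced the summary is not shown here):

# oc get pods --all-namespaces --no-headers | grep '^svt-' | grep -c ' Running '
# oc get pods --all-namespaces --no-headers | grep '^svt-' | grep -cv ' Running '
# oc get pods --all-namespaces --no-headers | grep '^svt-' | grep -v ' Running ' | grep -v ' Completed '

The last command lists the pods that are neither Running nor Completed (the Pending pods shown above).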
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758