Description of problem:
During Node Vertical OpenShift Scalability testing, where we deploy up to 250 pause-pods (image "gcr.io/google_containers/pause-amd64:3.0") per OCP application node, the docker-registry pod goes into CrashLoopBackOff, keeps restarting, and never returns to Ready status. This is on an AWS OCP 3.6.76 environment with 1 master/etcd node, 1 infra node, and 2 application nodes. docker-registry starts out in Running state and begins restarting after roughly 100-125 pause-pods have been deployed during the test.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Always reproducible.

Steps to Reproduce:
1. Install an OCP 3.6.76 cluster of 1 master/etcd node, 1 infra node, and 2 application nodes with the openshift-ansible BYO config.yml playbook.
2. On the master node, set defaultNodeSelector: "region=primary" in /etc/origin/master/master-config.yaml (a sketch of the relevant stanza follows this report), then run:
# oc annotate namespace default openshift.io/node-selector='region=infra' --overwrite=true
# systemctl restart atomic-openshift-master
3. This build had the registry-console pod failing to deploy. I tried to redeploy it with "oc rollout latest dc/registry-console", but it failed to redeploy.
4. Start our cluster-loader test tool, which deploys the pause-pods sequentially in batches of 40, pausing 3 minutes between batches. It uses "oc create -f <image json template>" (a minimal example template is sketched below).
5. Wait until about 100+ pause-pods have been deployed on each application node, then check the status of pods in the default project with "oc get pods".

Actual results:
# oc get pods
NAME                      READY     STATUS             RESTARTS   AGE
docker-registry-1-wcwgb   0/1       CrashLoopBackOff   223        13h
router-1-bcj7n            1/1       Running            0          13h

The pause-pods deploy successfully, but docker-registry goes into CrashLoopBackOff.

Expected results:
The docker-registry pod should remain READY 1/1 and its Status should be Running.

Additional info:
Pod logs and links to journal messages from the master and infra node are in the next private comment.
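Two sketches for the steps above. Step 2 does not quote the surrounding YAML; in a 3.x master-config.yaml the default node selector lives under projectConfig, so the edit presumably looks like:

projectConfig:
  # assumption: stanza placement only; the selector value is as given in step 2
  defaultNodeSelector: "region=primary"

The cluster-loader template from step 4 is likewise not attached here; a minimal pause-pod definition along these lines (the pod name and label are illustrative, not the tool's actual template) generates the same kind of load:

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "pausepods0",
    "labels": {
      "name": "pausepods"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "pause",
        "image": "gcr.io/google_containers/pause-amd64:3.0"
      }
    ]
  }
}

Each batch would then be created with something like "oc create -f pausepod.json -n clusterproject0", where pausepod.json is the hypothetical file above.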
Does scaling up the registry fix this problem?
Scaling docker-registry to 3 replicas does not seem to help (the scale-up command is sketched after the listing below). I am hitting the same issue, with docker-registry going into CrashLoopBackOff and restarting during the test. This is on the latest OCP version, 3.6.79-1.

# oc get pods --all-namespaces
NAMESPACE         NAME                       READY     STATUS             RESTARTS   AGE
clusterproject0   pausepods0                 1/1       Running            0          2m
clusterproject0   pausepods1                 1/1       Running            0          2m
clusterproject0   pausepods10                1/1       Running            0          2m
clusterproject0   pausepods11                1/1       Running            0          2m
clusterproject0   pausepods12                1/1       Running            0          2m
clusterproject0   pausepods13                1/1       Running            0          2m
clusterproject0   pausepods14                1/1       Running            0          2m
clusterproject0   pausepods15                1/1       Running            0          1m
clusterproject0   pausepods16                1/1       Running            0          1m
clusterproject0   pausepods17                1/1       Running            0          1m
clusterproject0   pausepods18                1/1       Running            0          1m
clusterproject0   pausepods19                1/1       Running            0          1m
clusterproject0   pausepods2                 1/1       Running            0          2m
clusterproject0   pausepods20                1/1       Running            0          1m
clusterproject0   pausepods21                1/1       Running            0          1m
clusterproject0   pausepods22                1/1       Running            0          1m
clusterproject0   pausepods23                1/1       Running            0          1m
clusterproject0   pausepods24                1/1       Running            0          1m
clusterproject0   pausepods25                1/1       Running            0          1m
clusterproject0   pausepods26                1/1       Running            0          1m
clusterproject0   pausepods27                1/1       Running            0          1m
clusterproject0   pausepods28                1/1       Running            0          1m
clusterproject0   pausepods29                1/1       Running            0          1m
clusterproject0   pausepods3                 1/1       Running            0          2m
clusterproject0   pausepods30                1/1       Running            0          1m
clusterproject0   pausepods31                1/1       Running            0          1m
clusterproject0   pausepods32                1/1       Running            0          1m
clusterproject0   pausepods33                1/1       Running            0          1m
clusterproject0   pausepods34                1/1       Running            0          1m
clusterproject0   pausepods35                1/1       Running            0          1m
clusterproject0   pausepods36                1/1       Running            0          1m
clusterproject0   pausepods37                1/1       Running            0          1m
clusterproject0   pausepods38                1/1       Running            0          1m
clusterproject0   pausepods39                1/1       Running            0          1m
clusterproject0   pausepods4                 1/1       Running            0          2m
clusterproject0   pausepods5                 1/1       Running            0          2m
clusterproject0   pausepods6                 1/1       Running            0          2m
clusterproject0   pausepods7                 1/1       Running            0          2m
clusterproject0   pausepods8                 1/1       Running            0          2m
clusterproject0   pausepods9                 1/1       Running            0          2m
default           docker-registry-1-6wm34    0/1       CrashLoopBackOff   6          10h
default           docker-registry-1-6xsc8    0/1       CrashLoopBackOff   6          17m
default           docker-registry-1-w2kj9    0/1       CrashLoopBackOff   6          17m
default           registry-console-3-7j9q7   0/1       Running            4          26m
default           router-1-g1zz7             1/1       Running            0          10h

Attaching latest logs.
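For reference, the scale-up mentioned above would be done with the standard oc scale command; a sketch, assuming the registry's deploymentconfig is named docker-registry and lives in the default namespace (both match the listing above):

oc scale dc/docker-registry --replicas=3 -n default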
My hunch is that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1454948
*** This bug has been marked as a duplicate of bug 1454948 ***