Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1452225

Summary: OCP 3.6 - docker-registry keeps restarting and goes into CrashLoopBackOff when deploying 100+ pause-pods
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Networking
Assignee: Ben Bennett <bbennett>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, jeder, mfojtik, mifiedle, wabouham
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: aos-scalability-36
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-20 18:42:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Walid A. 2017-05-18 15:45:41 UTC
Description of problem:

During Node Vertical OpenShift scalability testing, where we deploy up to 250 pause-pods (image "gcr.io/google_containers/pause-amd64:3.0") per OCP application node, the docker-registry pod goes into CrashLoopBackOff and keeps restarting, never returning to ready status.  This is on an AWS OCP 3.6.76 environment with 1 master/etcd node, 1 infra node, and 2 application nodes.  docker-registry starts out in the Running state and begins restarting after we deploy about 100-125 pause-pods during the test.


Version-Release number of selected component (if applicable):

# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Always reproducible

Steps to Reproduce:
1. Install an OCP 3.6.76 cluster with 1 master/etcd node, 1 infra node, and 2 application nodes using the openshift-ansible BYO config.yml playbook.
2. On the master node, set defaultNodeSelector: "region=primary" in /etc/origin/master/master-config.yaml, then run:
oc annotate namespace default openshift.io/node-selector='region=infra' --overwrite=true
systemctl restart atomic-openshift-master
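For reference, the defaultNodeSelector setting mentioned in step 2 lives under the projectConfig stanza of master-config.yaml. A minimal fragment (surrounding keys omitted; exact layout may differ slightly between 3.6 builds) looks like:

```yaml
projectConfig:
  # All new projects schedule pods onto region=primary nodes by default;
  # the "default" namespace is overridden to region=infra via the
  # oc annotate command above.
  defaultNodeSelector: "region=primary"
```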

3. This build had the registry-console pod failing to deploy.  I tried to redeploy it with oc rollout latest dc/registry-console, but it failed to re-deploy.

4. Start our cluster-loader test tool, which deploys the pause-pods sequentially in batches of 40, pausing 3 minutes between batches.  It uses oc create -f <image json template>.

5. Wait until 100+ pause-pods have been deployed on each application node, then check the status of the pods in the default project with oc get pods.
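The batch-and-pause behavior of cluster-loader in step 4 can be sketched as a small shell function. This is a hypothetical sketch, not cluster-loader's actual implementation: the function name, the way the create command receives the pod name, and the template filename in the usage comment are all assumptions for illustration.

```shell
# deploy_in_batches: create pods in batches, pausing between batches.
# $1 = total pod count, $2 = batch size, $3 = pause in seconds,
# remaining args = the command used to create each pod (hypothetical
# interface: the pod name is appended as the last argument).
deploy_in_batches() {
    total=$1; batch=$2; pause=$3; shift 3
    i=0
    while [ "$i" -lt "$total" ]; do
        end=$(( i + batch ))
        [ "$end" -gt "$total" ] && end=$total
        # Deploy one batch of pause-pods sequentially.
        while [ "$i" -lt "$end" ]; do
            "$@" "pausepods$i"
            i=$(( i + 1 ))
        done
        # Pause before resuming with the next batch.
        [ "$i" -lt "$total" ] && sleep "$pause"
    done
    return 0
}

# The test run described above would correspond roughly to:
#   deploy_in_batches 250 40 180 oc create -f pausepod-template.json
# (template filename is an assumption; the bug does not name it)
```

With batch=40 and pause=180 this matches the reported cadence: 40 sequential creates, a 3-minute wait, then the next 40.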

Actual results:
# oc get pods
NAME                      READY     STATUS             RESTARTS   AGE
docker-registry-1-wcwgb   0/1       CrashLoopBackOff   223        13h
router-1-bcj7n            1/1       Running            0          13h
The pause-pods deploy successfully, but the docker-registry pod goes into CrashLoopBackOff.

Expected results:
The docker-registry pod should remain READY 1/1 with Status Running.

Additional info:

Pod logs and links to journal messages from the master and infra node are in the next private comment.

Comment 2 Michal Fojtik 2017-05-25 10:15:28 UTC
Does scaling up the registry fix this problem?

Comment 3 Walid A. 2017-05-25 20:03:00 UTC
Scaling docker-registry to 3 replicas does not seem to help.  I am hitting the same issue, with docker-registry going into CrashLoopBackOff and restarting during the test.
This is on the latest OCP version, 3.6.79-1.


# oc get pods --all-namespaces
NAMESPACE         NAME                       READY     STATUS             RESTARTS   AGE
clusterproject0   pausepods0                 1/1       Running            0          2m
clusterproject0   pausepods1                 1/1       Running            0          2m
clusterproject0   pausepods10                1/1       Running            0          2m
clusterproject0   pausepods11                1/1       Running            0          2m
clusterproject0   pausepods12                1/1       Running            0          2m
clusterproject0   pausepods13                1/1       Running            0          2m
clusterproject0   pausepods14                1/1       Running            0          2m
clusterproject0   pausepods15                1/1       Running            0          1m
clusterproject0   pausepods16                1/1       Running            0          1m
clusterproject0   pausepods17                1/1       Running            0          1m
clusterproject0   pausepods18                1/1       Running            0          1m
clusterproject0   pausepods19                1/1       Running            0          1m
clusterproject0   pausepods2                 1/1       Running            0          2m
clusterproject0   pausepods20                1/1       Running            0          1m
clusterproject0   pausepods21                1/1       Running            0          1m
clusterproject0   pausepods22                1/1       Running            0          1m
clusterproject0   pausepods23                1/1       Running            0          1m
clusterproject0   pausepods24                1/1       Running            0          1m
clusterproject0   pausepods25                1/1       Running            0          1m
clusterproject0   pausepods26                1/1       Running            0          1m
clusterproject0   pausepods27                1/1       Running            0          1m
clusterproject0   pausepods28                1/1       Running            0          1m
clusterproject0   pausepods29                1/1       Running            0          1m
clusterproject0   pausepods3                 1/1       Running            0          2m
clusterproject0   pausepods30                1/1       Running            0          1m
clusterproject0   pausepods31                1/1       Running            0          1m
clusterproject0   pausepods32                1/1       Running            0          1m
clusterproject0   pausepods33                1/1       Running            0          1m
clusterproject0   pausepods34                1/1       Running            0          1m
clusterproject0   pausepods35                1/1       Running            0          1m
clusterproject0   pausepods36                1/1       Running            0          1m
clusterproject0   pausepods37                1/1       Running            0          1m
clusterproject0   pausepods38                1/1       Running            0          1m
clusterproject0   pausepods39                1/1       Running            0          1m
clusterproject0   pausepods4                 1/1       Running            0          2m
clusterproject0   pausepods5                 1/1       Running            0          2m
clusterproject0   pausepods6                 1/1       Running            0          2m
clusterproject0   pausepods7                 1/1       Running            0          2m
clusterproject0   pausepods8                 1/1       Running            0          2m
clusterproject0   pausepods9                 1/1       Running            0          2m
default           docker-registry-1-6wm34    0/1       CrashLoopBackOff   6          10h
default           docker-registry-1-6xsc8    0/1       CrashLoopBackOff   6          17m
default           docker-registry-1-w2kj9    0/1       CrashLoopBackOff   6          17m
default           registry-console-3-7j9q7   0/1       Running            4          26m
default           router-1-g1zz7             1/1       Running            0          10h

attaching latest logs

Comment 6 Ben Bennett 2017-06-02 18:37:29 UTC
My hunch is that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1454948

Comment 7 Ben Bennett 2017-06-20 18:42:43 UTC

*** This bug has been marked as a duplicate of bug 1454948 ***