Bug 1452225 - OCP 3.6 - docker-registry keeps restarting and goes into CrashLoopBackOff when deploying 100+ pause-pods
Keywords:
Status: CLOSED DUPLICATE of bug 1454948
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard: aos-scalability-36
Depends On:
Blocks:
 
Reported: 2017-05-18 15:45 UTC by Walid A.
Modified: 2017-06-20 18:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-20 18:42:43 UTC
Target Upstream Version:
Embargoed:



Description Walid A. 2017-05-18 15:45:41 UTC
Description of problem:

During Node Vertical OpenShift scalability testing, where we deploy up to 250 pause-pods (image "gcr.io/google_containers/pause-amd64:3.0") per OCP application node, the docker-registry pod goes into CrashLoopBackOff, keeps restarting, and never returns to ready status. This is on an AWS OCP 3.6.76 environment with 1 master/etcd node, 1 infra node, and 2 application nodes. The docker-registry pod starts out in Running state and begins restarting after we deploy about 100-125 pause-pods during the test.


Version-Release number of selected component (if applicable):

# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Always Reproducible

Steps to Reproduce:
1. Install an OCP 3.6.76 cluster with 1 master/etcd node, 1 infra node, and 2 application nodes using the openshift-ansible BYO config.yml playbook.
2. On the master node, set defaultNodeSelector: "region=primary" in /etc/origin/master/master-config.yaml (see the config sketch after these steps), then run:
oc annotate namespace default openshift.io/node-selector='region=infra' --overwrite=true
systemctl restart atomic-openshift-master

3. This build had the registry-console pod failing to deploy. I tried to redeploy it with oc rollout latest dc/registry-console, but it failed to redeploy.

4. Start our cluster-loader test tool, which deploys the pause-pods sequentially in batches of 40, pausing 3 minutes between batches before resuming. It uses oc create -f <image json template> (an illustrative pod definition is sketched after these steps).

5. Wait until 100+ pause-pods have been deployed on each application node, then check the status of the pods in the default project with oc get pods.
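
For reference (step 2), a minimal sketch of the relevant stanza in /etc/origin/master/master-config.yaml, assuming an otherwise default configuration; defaultNodeSelector sits under projectConfig:

projectConfig:
  defaultNodeSelector: "region=primary"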
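
Likewise for step 4, each pause-pod is roughly equivalent to a pod definition like the one below. This is an illustrative sketch only, not the exact cluster-loader template; the pod name follows the pausepods<N> pattern seen in the pod listing further down:

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "pausepods0"
  },
  "spec": {
    "containers": [
      {
        "name": "pause",
        "image": "gcr.io/google_containers/pause-amd64:3.0"
      }
    ]
  }
}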

Actual results:
# oc get pods
NAME                      READY     STATUS             RESTARTS   AGE
docker-registry-1-wcwgb   0/1       CrashLoopBackOff   223        13h
router-1-bcj7n            1/1       Running            0          13h
The "pause-pods" get deployed successfully but registry console goes into CrashLoopBackOff

Expected results:
The docker-registry pod should remain READY 1/1 with Status Running.

Additional info:

Pod logs and links to journal messages from the master and infra node are in the next private comment.
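
(For anyone reproducing: the same data can be captured with commands along these lines; the exact invocations used here are not recorded in this comment, and the pod name comes from the output above.)

# previous (crashed) container logs from the registry pod
oc logs --previous docker-registry-1-wcwgb -n default
# journal from the master service (run on the master)
journalctl -u atomic-openshift-master --since "-1h"
# journal from the node service (run on the infra node)
journalctl -u atomic-openshift-node --since "-1h"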

Comment 2 Michal Fojtik 2017-05-25 10:15:28 UTC
Does scaling up registry fix this problem?

Comment 3 Walid A. 2017-05-25 20:03:00 UTC
Scaling docker-registry to 3 replicas does not seem to help. I am hitting the same issue, with docker-registry going into CrashLoopBackOff and restarting during the test.
This is on the latest OCP version, 3.6.79-1.
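
(For reference, the scale-up above would be done with something like the following; the exact invocation is not captured in this comment.)

oc scale dc/docker-registry --replicas=3 -n default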


# oc get pods --all-namespaces
NAMESPACE         NAME                       READY     STATUS             RESTARTS   AGE
clusterproject0   pausepods0                 1/1       Running            0          2m
clusterproject0   pausepods1                 1/1       Running            0          2m
clusterproject0   pausepods10                1/1       Running            0          2m
clusterproject0   pausepods11                1/1       Running            0          2m
clusterproject0   pausepods12                1/1       Running            0          2m
clusterproject0   pausepods13                1/1       Running            0          2m
clusterproject0   pausepods14                1/1       Running            0          2m
clusterproject0   pausepods15                1/1       Running            0          1m
clusterproject0   pausepods16                1/1       Running            0          1m
clusterproject0   pausepods17                1/1       Running            0          1m
clusterproject0   pausepods18                1/1       Running            0          1m
clusterproject0   pausepods19                1/1       Running            0          1m
clusterproject0   pausepods2                 1/1       Running            0          2m
clusterproject0   pausepods20                1/1       Running            0          1m
clusterproject0   pausepods21                1/1       Running            0          1m
clusterproject0   pausepods22                1/1       Running            0          1m
clusterproject0   pausepods23                1/1       Running            0          1m
clusterproject0   pausepods24                1/1       Running            0          1m
clusterproject0   pausepods25                1/1       Running            0          1m
clusterproject0   pausepods26                1/1       Running            0          1m
clusterproject0   pausepods27                1/1       Running            0          1m
clusterproject0   pausepods28                1/1       Running            0          1m
clusterproject0   pausepods29                1/1       Running            0          1m
clusterproject0   pausepods3                 1/1       Running            0          2m
clusterproject0   pausepods30                1/1       Running            0          1m
clusterproject0   pausepods31                1/1       Running            0          1m
clusterproject0   pausepods32                1/1       Running            0          1m
clusterproject0   pausepods33                1/1       Running            0          1m
clusterproject0   pausepods34                1/1       Running            0          1m
clusterproject0   pausepods35                1/1       Running            0          1m
clusterproject0   pausepods36                1/1       Running            0          1m
clusterproject0   pausepods37                1/1       Running            0          1m
clusterproject0   pausepods38                1/1       Running            0          1m
clusterproject0   pausepods39                1/1       Running            0          1m
clusterproject0   pausepods4                 1/1       Running            0          2m
clusterproject0   pausepods5                 1/1       Running            0          2m
clusterproject0   pausepods6                 1/1       Running            0          2m
clusterproject0   pausepods7                 1/1       Running            0          2m
clusterproject0   pausepods8                 1/1       Running            0          2m
clusterproject0   pausepods9                 1/1       Running            0          2m
default           docker-registry-1-6wm34    0/1       CrashLoopBackOff   6          10h
default           docker-registry-1-6xsc8    0/1       CrashLoopBackOff   6          17m
default           docker-registry-1-w2kj9    0/1       CrashLoopBackOff   6          17m
default           registry-console-3-7j9q7   0/1       Running            4          26m
default           router-1-g1zz7             1/1       Running            0          10h

Attaching latest logs.

Comment 6 Ben Bennett 2017-06-02 18:37:29 UTC
My hunch is that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1454948

Comment 7 Ben Bennett 2017-06-20 18:42:43 UTC

*** This bug has been marked as a duplicate of bug 1454948 ***

