Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1452225

Summary: OCP 3.6 - docker-registry keeps restarting and goes into CrashLoopBackOff when deploying 100+ pause-pods
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Networking
Assignee: Ben Bennett <bbennett>
Status: CLOSED DUPLICATE
QA Contact: Meng Bo <bmeng>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, jeder, mfojtik, mifiedle, wabouham
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: aos-scalability-36
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-20 18:42:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Walid A. 2017-05-18 15:45:41 UTC
Description of problem:

During Node Vertical OpenShift scalability testing, where we deploy up to 250 pause-pods (image "gcr.io/google_containers/pause-amd64:3.0") per OCP application node, the docker-registry pod goes into CrashLoopBackOff and keeps restarting, never returning to ready status.  This is on an AWS OCP 3.6.76 environment with 1 master/etcd node, 1 infra node, and 2 application nodes.  docker-registry starts out in the Running state and begins restarting after we deploy about 100-125 pause-pods during the test.


Version-Release number of selected component (if applicable):

# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Always reproducible

Steps to Reproduce:
1. Install an OCP 3.6.76 cluster with 1 master/etcd node, 1 infra node, and 2 application nodes using the openshift-ansible BYO config.yml playbook.
2. On the master node, set defaultNodeSelector: "region=primary" in /etc/origin/master/master-config.yaml, then run:
oc annotate namespace default openshift.io/node-selector='region=infra' --overwrite=true
systemctl restart atomic-openshift-master
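For reference, the defaultNodeSelector setting mentioned in step 2 lives under the projectConfig stanza of master-config.yaml. A minimal fragment (surrounding keys omitted; exact layout may differ slightly between 3.6 builds) looks like:

```yaml
projectConfig:
  # All new projects schedule pods onto region=primary nodes by default;
  # the "default" namespace is overridden to region=infra via the
  # oc annotate command above.
  defaultNodeSelector: "region=primary"
```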

3. This build had the registry-console pod failing to deploy.  I tried to redeploy it with oc rollout latest dc/registry-console, but it failed to re-deploy.

4. Start our cluster-loader test tool, which deploys the pause-pods sequentially in batches of 40, pausing 3 minutes between batches.  It uses oc create -f <image json template>.

5. Wait until 100+ pause-pods have been deployed on each application node, then check the status of the pods in the default project with oc get pods.
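The batch-and-pause behavior of cluster-loader in step 4 can be sketched as a small shell function. This is a hypothetical sketch, not cluster-loader's actual implementation: the function name, the way the create command receives the pod name, and the template filename in the usage comment are all assumptions for illustration.

```shell
# deploy_in_batches: create pods in batches, pausing between batches.
# $1 = total pod count, $2 = batch size, $3 = pause in seconds,
# remaining args = the command used to create each pod (hypothetical
# interface: the pod name is appended as the last argument).
deploy_in_batches() {
    total=$1; batch=$2; pause=$3; shift 3
    i=0
    while [ "$i" -lt "$total" ]; do
        end=$(( i + batch ))
        [ "$end" -gt "$total" ] && end=$total
        # Deploy one batch of pause-pods sequentially.
        while [ "$i" -lt "$end" ]; do
            "$@" "pausepods$i"
            i=$(( i + 1 ))
        done
        # Pause before resuming with the next batch.
        [ "$i" -lt "$total" ] && sleep "$pause"
    done
    return 0
}

# The test run described above would correspond roughly to:
#   deploy_in_batches 250 40 180 oc create -f pausepod-template.json
# (template filename is an assumption; the bug does not name it)
```

With batch=40 and pause=180 this matches the reported cadence: 40 sequential creates, a 3-minute wait, then the next 40.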

Actual results:
# oc get pods
NAME                      READY     STATUS             RESTARTS   AGE
docker-registry-1-wcwgb   0/1       CrashLoopBackOff   223        13h
router-1-bcj7n            1/1       Running            0          13h
The pause-pods deploy successfully, but the docker-registry pod goes into CrashLoopBackOff.

Expected results:
The docker-registry pod should remain READY 1/1 with Status Running.

Additional info:

Pod logs and links to journal messages from the master and infra node are in the next private comment.

Comment 2 Michal Fojtik 2017-05-25 10:15:28 UTC
Does scaling up the registry fix this problem?

Comment 3 Walid A. 2017-05-25 20:03:00 UTC
Scaling docker-registry to 3 replicas does not seem to help.  I am hitting the same issue, with docker-registry going into CrashLoopBackOff and restarting during the test.
This is on the latest OCP version, 3.6.79-1.


# oc get pods --all-namespaces
NAMESPACE         NAME                       READY     STATUS             RESTARTS   AGE
clusterproject0   pausepods0                 1/1       Running            0          2m
clusterproject0   pausepods1                 1/1       Running            0          2m
clusterproject0   pausepods10                1/1       Running            0          2m
clusterproject0   pausepods11                1/1       Running            0          2m
clusterproject0   pausepods12                1/1       Running            0          2m
clusterproject0   pausepods13                1/1       Running            0          2m
clusterproject0   pausepods14                1/1       Running            0          2m
clusterproject0   pausepods15                1/1       Running            0          1m
clusterproject0   pausepods16                1/1       Running            0          1m
clusterproject0   pausepods17                1/1       Running            0          1m
clusterproject0   pausepods18                1/1       Running            0          1m
clusterproject0   pausepods19                1/1       Running            0          1m
clusterproject0   pausepods2                 1/1       Running            0          2m
clusterproject0   pausepods20                1/1       Running            0          1m
clusterproject0   pausepods21                1/1       Running            0          1m
clusterproject0   pausepods22                1/1       Running            0          1m
clusterproject0   pausepods23                1/1       Running            0          1m
clusterproject0   pausepods24                1/1       Running            0          1m
clusterproject0   pausepods25                1/1       Running            0          1m
clusterproject0   pausepods26                1/1       Running            0          1m
clusterproject0   pausepods27                1/1       Running            0          1m
clusterproject0   pausepods28                1/1       Running            0          1m
clusterproject0   pausepods29                1/1       Running            0          1m
clusterproject0   pausepods3                 1/1       Running            0          2m
clusterproject0   pausepods30                1/1       Running            0          1m
clusterproject0   pausepods31                1/1       Running            0          1m
clusterproject0   pausepods32                1/1       Running            0          1m
clusterproject0   pausepods33                1/1       Running            0          1m
clusterproject0   pausepods34                1/1       Running            0          1m
clusterproject0   pausepods35                1/1       Running            0          1m
clusterproject0   pausepods36                1/1       Running            0          1m
clusterproject0   pausepods37                1/1       Running            0          1m
clusterproject0   pausepods38                1/1       Running            0          1m
clusterproject0   pausepods39                1/1       Running            0          1m
clusterproject0   pausepods4                 1/1       Running            0          2m
clusterproject0   pausepods5                 1/1       Running            0          2m
clusterproject0   pausepods6                 1/1       Running            0          2m
clusterproject0   pausepods7                 1/1       Running            0          2m
clusterproject0   pausepods8                 1/1       Running            0          2m
clusterproject0   pausepods9                 1/1       Running            0          2m
default           docker-registry-1-6wm34    0/1       CrashLoopBackOff   6          10h
default           docker-registry-1-6xsc8    0/1       CrashLoopBackOff   6          17m
default           docker-registry-1-w2kj9    0/1       CrashLoopBackOff   6          17m
default           registry-console-3-7j9q7   0/1       Running            4          26m
default           router-1-g1zz7             1/1       Running            0          10h

attaching latest logs

Comment 6 Ben Bennett 2017-06-02 18:37:29 UTC
My hunch is that this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1454948

Comment 7 Ben Bennett 2017-06-20 18:42:43 UTC

*** This bug has been marked as a duplicate of bug 1454948 ***