Bug 1370587

Summary: It is taking 30+ seconds for build pods to start running
Product: OpenShift Online Reporter: Steve Speicher <sspeiche>
Component: BuildAssignee: Stefanie Forrester <dakini>
Status: CLOSED CURRENTRELEASE QA Contact: Wang Haoran <haowang>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.xCC: aos-bugs, bparees, cewong, dakini, sspeiche
Target Milestone: ---Flags: sspeiche: needinfo+
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-08 16:58:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Steve Speicher 2016-08-26 17:10:40 UTC
Description of problem:
On preview (prod) system, whenever I start a build (either automatically when adding to project or select "start build"), the build stays in "Pending" state for over 30 seconds. This happens fairly consistently, close to 100% of the time.

Version-Release number of selected component (if applicable):
OpenShift Master:
v3.2.1.13-5-gddf7d17
Kubernetes Master:
v1.2.0-36-g4a3f9c5
https://console.preview.openshift.com/console/about


How reproducible:
Create a

Steps to Reproduce:
1. Create a project
2. Add to project node.js builder image, v4 with sample is fine
3. Watch how long it takes to start
4. once finished, selected 'start build' and observe time for build pod to be running.



Expected results:


Additional info:

Comment 1 Cesar Wong 2016-08-26 18:01:04 UTC
Every time a build pod is created, it is taking 20 to 30 secs to pull the builder image:

Events:
  FirstSeen    	LastSeen       	Count  	From   					SubobjectPath  			Type   		Reason 		Message
  ---------    	--------       	-----  	----   					-------------  			--------       	------ 		-------
  43s  		43s    		1      	{default-scheduler }   							Normal 		Scheduled      	Successfully assigned simple-ruby-6-build to ip-172-31-9-165.ec2.internal
  41s  		41s    		1      	{kubelet ip-172-31-9-165.ec2.internal} 	spec.containers{sti-build}     	Normal 		Pulling		pulling image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.1.13"
  11s  		11s    		1      	{kubelet ip-172-31-9-165.ec2.internal} 	spec.containers{sti-build}     	Normal 		Pulled 		Successfully pulled image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.1.13"
  10s  		10s    		1      	{kubelet ip-172-31-9-165.ec2.internal} 	spec.containers{sti-build}     	Normal 		Created		Created container with docker id 4cfc652667fb
  9s   		9s     		1      	{kubelet ip-172-31-9-165.ec2.internal} 	spec.containers{sti-build}     	Normal 		Started		Started container with docker id 4cfc652667fb

I don't see an imagePullPolicy of always set on the pod, but that's what it looks like is in effect.

Comment 2 Cesar Wong 2016-08-26 18:54:23 UTC
registry.qe.openshift.com was taking a long time to respond.

Comment 3 Cesar Wong 2016-08-26 18:54:50 UTC
registry.qe.openshift.com was taking a long time to respond.

Comment 4 Steve Speicher 2016-08-26 19:11:55 UTC
May I ask why we are pulling from registry.qe.openshift.com and why the pull times should affect dev preview users? Why aren't the builder images already pre-pulled/cached on the builder nodes?

We should be monitoring for this situation and kicking the registry again when it happens

Comment 5 Ben Parees 2016-08-26 19:13:00 UTC
they are pre-pulled, it was a no-op pull that still took 20-30s to compare the layers even though they already existed.  (after kicking the registry it's down to 0.5s)

Comment 6 Steve Speicher 2016-08-26 19:20:10 UTC
Just took 30s again, 2 attempts in a row :(

Comment 7 Cesar Wong 2016-08-26 20:12:18 UTC
Yup, it's slow for me as well again.

Comment 8 Steve Speicher 2016-08-29 14:57:33 UTC
What is the plan/process to get the registry slowness fixed? Is it just replacing the qe registry with the ops one? If so, when do we plan to have this rolled out or a shorter term solution.

Comment 9 Stefanie Forrester 2016-08-29 15:45:39 UTC
Replacing the QE registry is our best option. We were waiting until the prod release, but I can probably get it done today instead.

Comment 10 Stefanie Forrester 2016-08-29 23:10:32 UTC
The issue should be resolved since we're using the Ops Registry now. If not, let me know and I'll scale it up.

Comment 11 Steve Speicher 2016-08-30 00:52:16 UTC
Ack - builds starting much faster now. Averages about 6 seconds from time build is started until it is running. It would be nicer if it was less than 6, 3.3 might give us that lightning speed. Are pulls going through registry API calls or directly from S3?

Comment 12 Steve Speicher 2016-08-30 12:49:44 UTC
Got one build that took over a minute to start running, he's the build pod details

oc describe pod php-32-build
Name:		php-32-build
Namespace:	sspeiche-test3
Node:		ip-172-31-8-200.ec2.internal/172.31.8.200
Start Time:	Tue, 30 Aug 2016 08:46:42 -0400
Labels:		openshift.io/build.name=php-32
Status:		Running
IP:		10.1.58.9
Controllers:	<none>
Containers:
  sti-build:
    Container ID:	docker://c06e3c817e9a405765ba93f97e379e61496959fb3eb9d32cb9e453d62e5b783c
    Image:		registry.ops.openshift.com/openshift3/ose-sti-builder:v3.2.1.13
    Image ID:		docker://de13933fd4e9bdbe127dd0fdc1bac3a5a3acbe73faa96c56aaf539d93c432645
    Port:		
    Args:
      --loglevel=2
    QoS Tier:
      memory:	Burstable
      cpu:	Burstable
    Limits:
      cpu:	1
      memory:	512Mi
    Requests:
      cpu:		60m
      memory:		307Mi
    State:		Running
      Started:		Tue, 30 Aug 2016 08:46:45 -0400
    Ready:		True
    Restart Count:	0
    Environment Variables:
      BUILD:	{"kind":"Build","apiVersion":"v1","metadata":{"name":"php-32","namespace":"sspeiche-test3","selfLink":"/oapi/v1/namespaces/sspeiche-test3/builds/php-32","uid":"c2b394b6-6eaf-11e6-ad33-0e3d364e19a5","resourceVersion":"138275425","creationTimestamp":"2016-08-30T12:46:24Z","labels":{"app":"php","buildconfig":"php","openshift.io/build-config.name":"php"},"annotations":{"openshift.io/build-config.name":"php","openshift.io/build.number":"32"}},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/openshift/cakephp-ex.git","ref":"master"},"secrets":null},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/rhscl/php-56-rhel7:latest"}}},"output":{"to":{"kind":"DockerImage","name":"172.30.47.227:5000/sspeiche-test3/php:latest"},"pushSecret":{"name":"builder-dockercfg-y42jt"}},"resources":{},"postCommit":{}},"status":{"phase":"New","outputDockerImageReference":"172.30.47.227:5000/sspeiche-test3/php:latest","config":{"kind":"BuildConfig","namespace":"sspeiche-test3","name":"php"}}}

      BUILD_LOGLEVEL:		2
      SOURCE_REPOSITORY:	https://github.com/openshift/cakephp-ex.git
      SOURCE_URI:		https://github.com/openshift/cakephp-ex.git
      SOURCE_REF:		master
      ORIGIN_VERSION:		v3.2.1.13-5-gddf7d17
      ALLOWED_UIDS:		1-
      DROP_CAPS:		KILL,MKNOD,SETGID,SETUID,SYS_CHROOT
      PUSH_DOCKERCFG_PATH:	/var/run/secrets/openshift.io/push
Conditions:
  Type		Status
  Ready 	True 
Volumes:
  docker-socket:
    Type:	HostPath (bare host directory volume)
    Path:	/var/run/docker.sock
  builder-dockercfg-y42jt-push:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	builder-dockercfg-y42jt
  builder-token-8v11g:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	builder-token-8v11g
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  50s		50s		1	{kubelet ip-172-31-8-200.ec2.internal}	spec.containers{sti-build}	Normal		Pulling		pulling image "registry.ops.openshift.com/openshift3/ose-sti-builder:v3.2.1.13"
  49s		49s		1	{kubelet ip-172-31-8-200.ec2.internal}	spec.containers{sti-build}	Normal		Pulled		Successfully pulled image "registry.ops.openshift.com/openshift3/ose-sti-builder:v3.2.1.13"
  48s		48s		1	{kubelet ip-172-31-8-200.ec2.internal}	spec.containers{sti-build}	Normal		Created		Created container with docker id c06e3c817e9a
  48s		48s		1	{kubelet ip-172-31-8-200.ec2.internal}	spec.containers{sti-build}	Normal		Started		Started container with docker id c06e3c817e9a
  39s		39s		1	{default-scheduler }							Normal		FailedScheduling	Binding rejected: Timeout: request did not complete within allowed duration

Comment 13 Cesar Wong 2016-08-30 12:51:36 UTC
This latest one doesn't look like a registry issue anymore. It looks more like an issue on the node with Docker:

  39s		39s		1	{default-scheduler }							Normal		FailedScheduling	Binding rejected: Timeout: request did not complete within allowed duration

Comment 14 Stefanie Forrester 2016-09-06 20:51:23 UTC
NOC looked into this today, but they didn't see any issues with docker on the node, and they didn't see any pods in FailedScheduling.

Comment 15 Stefanie Forrester 2016-09-08 16:24:41 UTC
I'm unable to reproduce the issue. I tried 7 new builds in a row and they all started up within about 5-6 seconds. I didn't hit any timeouts.

This is what I used:

oadm new-project dakinitest
oc new-app https://github.com/openshift/cakephp-ex.git -n dakinitest

[root@preview-master-e69da ~]# oc get pods -n dakinitest
NAME                 READY     STATUS    RESTARTS   AGE
cakephp-ex-1-build   1/1       Running   0          5s

I looked at events across all namespaces and I don't see any mention of "timeout" in the logs. So I think the original issue is no longer occurring. Steve, can you confirm?

Comment 16 Steve Speicher 2016-09-08 16:56:38 UTC
Stefanie, I have not seen this issue since my comment #12. Perhaps the iptables sync interval change fixed it. This can be closed.