Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1400132

Summary: Node reports as Ready, but it takes about 15 mins to schedule a build
Product: OpenShift Container Platform Reporter: Vadim Rutkovsky <vrutkovs>
Component: Build    Assignee: Cesar Wong <cewong>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.3.1    CC: acarter, adellape, bparees, cewong, haowang, jokerman, knakayam, lucarval, mgoldman, mifiedle, mmccomas, mmilata, mrobson, pweil, rhowe, sten, trankin, twaugh, wzheng
Target Milestone: ---   
Target Release: 3.4.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The builds controller iterates through all builds in the system and processes completed builds to determine whether other builds need to be started, regardless of when each build completed. Consequence: Scheduling a new build can take a long time when there is a large number of completed builds (> 1000) for the controller to process. Fix: Process each build only once, on completion, to decide whether other builds should be started, and ignore completed builds in the regular build controller loop. Result: Builds start quickly regardless of how many completed builds exist in the system.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-31 20:19:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  Description             Flags
  Logs from 10.3.8.111    none
  CPU Profiling           none
  Stacktrace              none

Description Vadim Rutkovsky 2016-11-30 13:56:10 UTC
Description of problem:

OSBS team has upgraded rcm-img-docker02.build.eng.bos.redhat.com to OCP 3.3.1, and now it takes about 15 mins for a build to move from New status to Pending and onward

Version-Release number of selected component (if applicable):
oc v3.1.1.6-33-g81eabcc
kubernetes v1.1.0-origin-1107-g4c8e6f4

A server with both master and node services running

How reproducible:
Always

Steps to Reproduce:
1. Start a build via brew:

brew container-build extras-rhel-7.2-docker-candidate --repo-url="http://file.brq.redhat.com/~vrutkovs/buildroot.repo" git://pkgs.devel.redhat.com/rpms/osbs-buildroot-docker#f24f4e47dcb71fcbb9cd6bdab6b756265d8cf4fb --git-branch=private-vrutkovs-all-from-git --scratch --channel-override="testing"

2. Wait until a new build, 'scratch-<timestamp>' in the 'default' namespace, appears on that machine

Actual results:
The build is stuck in the New state and doesn't move to Pending or beyond

Expected results:
Build moves to Pending and goes on

Additional info:
A node restart seems to fix it: after ~15 mins the queued builds are scheduled correctly.

The symptoms are similar to bug #1374407; however, the build gets stuck in the New state, not Pending.

Comment 2 Avesh Agarwal 2016-12-01 02:12:04 UTC
OK, will look into it today.

Comment 3 Avesh Agarwal 2016-12-01 10:42:56 UTC
I tried it with 3.3.1 but could not reproduce. 

Could you provide master and node logs (with log level at least 5) when the issue happened and also any more information about your environment?

Comment 5 Avesh Agarwal 2016-12-01 13:07:10 UTC
Vadim, I can select weekly or daily logs in Kibana in the links you sent above, but it only shows the first 500 events. Do you know how to save complete daily or weekly logs locally?

Comment 6 Luiz Carvalho 2016-12-01 16:50:44 UTC
We noticed delays longer than 15 minutes, in some cases over 6 hours.
It looks like the culprit is that this cluster had too many builds, 2937.
Once we deleted them, the issue went away.

This is an all-in-one OSE cluster (v3.3.1.5).

Comment 8 Avesh Agarwal 2017-01-06 22:04:21 UTC
Luiz,

Could you attach logs from the time when the problem occurred, or give me access to the cluster to look at? I will look into this first thing on Monday.

Comment 19 Vadim Rutkovsky 2017-01-10 14:11:28 UTC
Created attachment 1239118 [details]
Logs from 10.3.8.111

Extracted a part of the logs from the 10.3.8.111 machine for yesterday: osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40 but was not scheduled until 15:40.

Comment 20 Avesh Agarwal 2017-01-10 14:19:37 UTC
(In reply to Vadim Rutkovsky from comment #19)
> Created attachment 1239118 [details]
> Logs from 10.3.8.111
> 
> Extracted a part of the logs from the 10.3.8.111 machine for yesterday:
> osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40
> but was not scheduled until 15:40.

I think it's because I had restarted the master around 15:40 to increase the log level to 5 (see my previous comments), and then the build was scheduled, so these logs seem old and, at log level 2, do not provide much help. I am waiting for the problem to happen again so that I can investigate with the increased log level.

Comment 24 Vadim Rutkovsky 2017-01-13 20:00:20 UTC
Created attachment 1240503 [details]
CPU Profiling

Added CPU profiling data for the cluster when no builds are running (but top still shows 140%+ CPU usage).

Comment 25 Vadim Rutkovsky 2017-01-16 22:06:47 UTC
Created attachment 1241431 [details]
Stacktrace

Generated a stacktrace on the cluster using
'curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2'

It seems goroutine 4143 is stuck while applying a policy to a build:

...
github.com/openshift/origin/pkg/client.(*builds).List(0xc8415abda0, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/client/builds.go:48 +0x19f
github.com/openshift/origin/pkg/build/client.OSClientBuildClient.List(0x7fa412ddb838, 0xc822cc1fe8, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/client/clients.go:68 +0xab
github.com/openshift/origin/pkg/build/client.(*OSClientBuildClient).List(0xc82a519f30, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, ...)
	<autogenerated>:4 +0xe7
github.com/openshift/origin/pkg/build/util.BuildConfigBuilds(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0xc82c1b92f0, 0xc849e786d8, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/util/util.go:124 +0xd6
github.com/openshift/origin/pkg/build/controller/policy.GetNextConfigBuild(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0x0, 0x0, 0x0, 0xc8582b1500, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:96 +0xea
github.com/openshift/origin/pkg/build/controller/policy.handleComplete(0x7fa409ede2d0, 0xc82a519f30, 0x7fa409ede2a8, 0xc82a519f20, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:135 +0xf2
github.com/openshift/origin/pkg/build/controller/policy.(*SerialPolicy).OnComplete(0xc8292ba260, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/serial.go:34 +0x5a
github.com/openshift/origin/pkg/build/controller.(*BuildController).HandleBuild(0xc8292b8800, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/controller.go:99 +0x694
github.com/openshift/origin/pkg/build/controller/factory.(*BuildControllerFactory).Create.func1(0x4a01ea0, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/factory/factory.go:106 +0x8b
...
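
As a side note, long-blocked goroutines can be isolated from a dump like this with a simple filter. Go includes a wait duration in each goroutine header once the goroutine has been blocked for more than a minute. A hypothetical sketch follows; the sample dump below stands in for the real stack.txt, and the duration shown is made up:

```shell
# Tiny sample dump in the format of the real stack.txt saved above
# (goroutine IDs and the "182 minutes" duration are hypothetical):
printf '%s\n' \
  'goroutine 1 [running]:' \
  'main.main()' \
  'goroutine 4143 [select, 182 minutes]:' \
  'github.com/openshift/origin/pkg/client.(*builds).List(...)' \
  > stack.txt

# keep only goroutines blocked for 10+ minutes
grep -E 'goroutine [0-9]+ \[[^]]+, [0-9]{2,} minutes\]' stack.txt
```

Running the same grep over the full dump narrows thousands of goroutines down to the handful worth reading.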

Comment 26 Vadim Rutkovsky 2017-01-17 17:05:37 UTC
I think I found a workaround for this:

$ cat go_template_file
{{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
$ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
$ sudo systemctl restart atomic-openshift-master

The issue seems to be caused by builds created in OSE 3.2. Those didn't have the 'build.start-policy' label set, so atomic-openshift-master logged a lot of warnings and tried to set a policy for each of them.

This wasted a lot of time, and the changes were never written back to the build objects. Since the build scheduler is limited to 30 minutes, with a sufficient number of builds the process eventually gets stuck.
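
The filtering step of this workaround can be sanity-checked without a cluster. Here is a toy run of the same grep/cut pipeline on sample lines in the shape the go-template produces (the build names below are made up):

```shell
# Sample lines in the "<name> - <label value>" shape produced by the
# go-template above; "<no value>" marks builds missing the
# openshift.io/build.start-policy label. Build names are hypothetical.
printf '%s\n' \
  'scratch-1484000000 - <no value>' \
  'myapp-42 - Serial' \
  'scratch-1484000500 - <no value>' > sample_builds

# same filter as the workaround: keep only unlabeled builds, then
# extract the name column to feed into "oc patch"
grep '<no value>' sample_builds | cut -d ' ' -f 1
```

Only the two unlabeled scratch builds survive the filter, which is exactly the set the workaround goes on to patch.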

Comment 27 Avesh Agarwal 2017-01-17 17:24:36 UTC
Hi Vadim,

Thanks, I am going through the information you provided and will get back to you.

Comment 28 Avesh Agarwal 2017-01-19 18:53:24 UTC
(In reply to Vadim Rutkovsky from comment #26)
> I think I found a workaround for this:
> 
> $ cat go_template_file
> {{.metadata.name}} - {{index .metadata.labels
> "openshift.io/build.start-policy"}}
> $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o
> go-template-file=go_template_file | grep "<no value>" > builds
> $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata":
> {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> $ sudo systemctl restart atomic-openshift-master
> 
> The issue seems to be caused by builds created in OSE 3.2. Those didn't
> have the 'build.start-policy' label set, so atomic-openshift-master
> logged a lot of warnings and tried to set a policy for each of them.
> 
> This wasted a lot of time, and the changes were never written back to
> the build objects. Since the build scheduler is limited to 30 minutes,
> with a sufficient number of builds the process eventually gets stuck.

So are you able to reproduce the issue after this workaround?

Comment 30 Vadim Rutkovsky 2017-01-19 19:52:14 UTC
(In reply to Avesh Agarwal from comment #28)
> (In reply to Vadim Rutkovsky from comment #26)
> > I think I found a workaround for this:
> > 
> > $ cat go_template_file
> > {{.metadata.name}} - {{index .metadata.labels
> > "openshift.io/build.start-policy"}}
> > $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o
> > go-template-file=go_template_file | grep "<no value>" > builds
> > $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata":
> > {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> > $ sudo systemctl restart atomic-openshift-master
> > 
> > The issue seems to be caused by builds created in OSE 3.2. Those didn't
> > have the 'build.start-policy' label set, so atomic-openshift-master
> > logged a lot of warnings and tried to set a policy for each of them.
> > 
> > This wasted a lot of time, and the changes were never written back to
> > the build objects. Since the build scheduler is limited to 30 minutes,
> > with a sufficient number of builds the process eventually gets stuck.
> 
> So are you able to reproduce the issue after this workaround?

Yes, this worked for our dev cluster: after the master restart the builds were scheduled again and haven't gotten stuck for several days now.

Comment 42 Wenjing Zheng 2017-01-24 09:12:09 UTC
Could anyone provide more info on how QE can verify this bug?

Comment 43 Vadim Rutkovsky 2017-01-24 11:25:03 UTC
(In reply to Wenjing Zheng from comment #42)
> Could anyone provide more info on how QE can verify this bug?

I think the only way to reproduce this is to create a lot of builds (~3000) and check their state; builds shouldn't get stuck in New.
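
One way to script that, assuming the project already has a BuildConfig to start builds from (the name "test" below is made up): generate the oc start-build invocations into a file, then run the file against the cluster.

```shell
# Hypothetical load generator for the verification scenario: emit
# ~3000 "oc start-build" invocations into a script. Run it with
# "sh start_builds.sh" in a project that has a BuildConfig named
# "test" (the BuildConfig name is an assumption).
for i in $(seq 1 3000); do
  echo "oc start-build test"
done > start_builds.sh

# sanity check: count the generated invocations
wc -l < start_builds.sh
```

After the script finishes, starting one more build and watching how long it sits in New shows whether the fix holds.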

Comment 50 Mike Fiedler 2017-01-25 18:46:15 UTC
Verified on 3.4.1.2.

Reproduced by creating 3K Completed builds on 3.4.0.40 and then starting a new build with oc start-build. It took > 20 min for the build to actually run.

Updated to 3.4.1.2 and:
   a) started a new build on top of the existing builds. It started immediately.
   b) deleted the project and retested with a project with 5K completed builds. New builds started immediately.

Comment 53 errata-xmlrpc 2017-01-31 20:19:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

Comment 54 Ryan Howe 2017-02-07 14:55:40 UTC
*** Bug 1418906 has been marked as a duplicate of this bug. ***